Quick commands for manipulating sequence files
This section first starts off with some quick useful commands that will be helpful to assess and manipulate your fasta and fastq file. They can be done locally in your command line and don’t require any clumsy dependancies.
Credit to some helpful inspiration
These examples simply use sample.fa
or sample.fq
or similar format to represent the file you want to investigate. Remember you will need to navigate to the the path of the file or ensure you are in the correct working directory. The following use some basic sed, awk and pearl commands - often these are already installed if working on a macOS terminal environment
Convert a fastq file to a fasta file
sed '/^@/!d;s//>/;N' sample.fq > sample1.fa
Count number of sequences
Fastq file
grep -c '^@' sample.fq
Fasta file
grep -c '>' sample.fasta
Get some basic info on your file(s)
perl -ne 'if(/^>(\S+)/){print "$1\n"}' sample.fa
You may want to print them to a text file
perl -ne 'if(/^>(\S+)/){print "$1\n"}' sample.fa > sample_names.txt
Get the sequence name and length for every individual sequence within a fasta file
cat sample.fasta | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'
Calculate the mean length of reads in a fastq file:
awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' sample.fastq
Add something to the end of all header lines
sed 's/>.*/&WHATEVERYOUWANT/' file.fa > outfile.fa
Clean up a fasta file so that only the first column of the header is outputted
awk '{print $1}' file.fa > output.fa
Remove dupilcate sequences from your fasta file
sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2 | sed -e 's/^/>/' -e 's/\t/\n/'
Get A T (or U) G C counts for all sequences in a fasta file
echo -e "seq_id\tA\tU\tG\tC"; while read line; do echo $line | grep ">" | sed 's/>//g'; for i in A U G C;do echo $line | grep -v ">" | grep -o $i | wc -l | grep -v "^0"; done; done < your_fasta_file.fa | paste - - - - -
Split a multifasta in to separate files (with arbitrary names).
splitfa(){
i=1;
while read line ; do
if [ ${line:0:1} == ">" ] ; then
header="$line"
echo "$header" >> seq"${i}".fasta
else
seq="$line"
echo "$seq" >> seq"${i}".fasta
((i++))
fi
done < $1
}
Extract sequences by their ID
Call just certain sequences of interest - in this case we are interested in sequences named seq1
, seq2
and seq3
and put them in a file call new.fasta
perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(seq1 seq2 seq3)}print if $c' sample.fa > new.fasta
Say you have a text file with a list of sequences you want to further investigate you can use this text file (called seqs_of_interest.txt
) to call out the sequences in the sample.fasta
file and put them in a few file seqs_of_interest.fasta
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' seqs_of_interest.txt sample.fasta > seqs_of_interest.fasta
seqmagick
is a simple yet powerful utility that helps with basic manipulation and conversion/formatting of sequence files. Super easy to install and use with minimal requirements (likely already installed) - you just need Python >=3.5 and biopython >=1.70.
Link to github here and additional documentation here.
Install
Open a new terminal window and type the following
pip install seqmagick
Convert a .fasta
file to a .nexus
format.
seqmagick convert --output-format nexus --alphabet dna input.fasta output.nex
Convert a .fasta
file to a .phylip
format.
seqmagick convert --output-format phylip --alphabet dna input.fasta output.phy
Reverse complement a sequence
seqmagick mogrify --reverse-complement sequence.fasta
Remove duplicate sequences
seqmagick mogrify --deduplicate-sequences sequence.fasta
There are a lot of programs out that. Some use the command line interface, there is also plenty of GUI and web-based programs too. Below are just two on the most useful ones that I like (I tend to like the command line interface ones, as they can easily be upscaled and used on your VM/supercomputer if needed).
A cross platform and ultrafast toolkit for fasta and fastq file manipulation
The above commands are good for some basic editing and information but if you want to do more with your fasta and fastq files your going to need some dedicated programs.
seqkit
is a super useful program for any bioinformatician follow this link for offical downloads and documenation.
We’ll run through the basic below but always refer to the link above for any issues and more details.
Install (Linux-like systems)
Just download compressed executable file of your operating system. Navigate to the file in your command line and decompress it with tar -zxvf *.tar.gz
command. * If you have root privilege simply copy it to /usr/local/bin:
sudo cp seqkit /usr/local/bin/`` * Or copy to anywhere in the environment variable PATH:
mkdir -p $HOME/bin/; cp seqkit $HOME/bin/`
This is a super useful addition to the awk range of commands specific to bioinformatics - see github page here
conda install Easy install using terminal, as long as you have conda installed https://anaconda.org/bioconda/bioawk
conda install -c bioconda bioawk
conda install -c bioconda/label/cf201901 bioawk
Useful tutorials
Detailed bioinformatics workbook from the Genome Informatics Facility - tutorial available here, and github account here.
Work by Siobhon Egan
siobhon.egan@murdoch.edu.au