Sequence file formats and basics

Quick commands for manipulating sequence files

Basic information and links

This section first starts off with some quick useful commands that will be helpful to assess and manipulate your fasta and fastq file. They can be done locally in your command line and don’t require any clumsy dependancies.

Credit to some helpful inspiration

Some simple commands

These examples simply use sample.fa or sample.fq or similar format to represent the file you want to investigate. Remember you will need to navigate to the the path of the file or ensure you are in the correct working directory. The following use some basic sed, awk and pearl commands - often these are already installed if working on a macOS terminal environment

Convert a fastq file to a fasta file

sed '/^@/!d;s//>/;N' sample.fq > sample1.fa

Count number of sequences

Fastq file

grep -c '^@' sample.fq

Fasta file

grep -c '>' sample.fasta

Get some basic info on your file(s)

perl -ne 'if(/^>(\S+)/){print "$1\n"}' sample.fa

You may want to print them to a text file

perl -ne 'if(/^>(\S+)/){print "$1\n"}' sample.fa > sample_names.txt

Get the sequence name and length for every individual sequence within a fasta file

cat sample.fasta | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'

Calculate the mean length of reads in a fastq file:

awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' sample.fastq

Add something to the end of all header lines

sed 's/>.*/&WHATEVERYOUWANT/' file.fa > outfile.fa

Clean up a fasta file so that only the first column of the header is outputted

awk '{print $1}' file.fa > output.fa

Remove dupilcate sequences from your fasta file

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2  | sed -e 's/^/>/' -e 's/\t/\n/'

Get A T (or U) G C counts for all sequences in a fasta file

echo -e "seq_id\tA\tU\tG\tC"; while read line; do echo $line | grep ">" | sed 's/>//g'; for i in A U G C;do echo $line | grep -v ">" | grep -o $i | wc -l | grep -v "^0"; done; done < your_fasta_file.fa | paste - - - - -

Split a multifasta in to separate files (with arbitrary names).

splitfa(){
i=1;
while read line ; do
  if [ ${line:0:1} == ">" ] ; then
    header="$line"
    echo "$header" >> seq"${i}".fasta
  else
    seq="$line"
    echo "$seq" >> seq"${i}".fasta
    ((i++))
  fi
done < $1
}

Extract sequences by their ID

Call just certain sequences of interest - in this case we are interested in sequences named seq1, seq2 and seq3 and put them in a file call new.fasta

perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(seq1 seq2 seq3)}print if $c' sample.fa > new.fasta

Say you have a text file with a list of sequences you want to further investigate you can use this text file (called seqs_of_interest.txt) to call out the sequences in the sample.fasta file and put them in a few file seqs_of_interest.fasta

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' seqs_of_interest.txt sample.fasta > seqs_of_interest.fasta

Seqmagick

seqmagick is a simple yet powerful utility that helps with basic manipulation and conversion/formatting of sequence files. Super easy to install and use with minimal requirements (likely already installed) - you just need Python >=3.5 and biopython >=1.70.

Link to github here and additional documentation here.

Install

Open a new terminal window and type the following

pip install seqmagick

Convert a .fasta file to a .nexus format.

seqmagick convert --output-format nexus --alphabet dna input.fasta output.nex

Convert a .fasta file to a .phylip format.

seqmagick convert --output-format phylip --alphabet dna input.fasta output.phy

Reverse complement a sequence

seqmagick mogrify --reverse-complement sequence.fasta

Remove duplicate sequences

seqmagick mogrify --deduplicate-sequences sequence.fasta

Other useful programs

There are a lot of programs out that. Some use the command line interface, there is also plenty of GUI and web-based programs too. Below are just two on the most useful ones that I like (I tend to like the command line interface ones, as they can easily be upscaled and used on your VM/supercomputer if needed).

SeqKit

A cross platform and ultrafast toolkit for fasta and fastq file manipulation

The above commands are good for some basic editing and information but if you want to do more with your fasta and fastq files your going to need some dedicated programs.

seqkit is a super useful program for any bioinformatician follow this link for offical downloads and documenation.

We’ll run through the basic below but always refer to the link above for any issues and more details.

Install (Linux-like systems)

Just download compressed executable file of your operating system. Navigate to the file in your command line and decompress it with tar -zxvf *.tar.gz command. * If you have root privilege simply copy it to /usr/local/bin: sudo cp seqkit /usr/local/bin/`` * Or copy to anywhere in the environment variable PATH:mkdir -p $HOME/bin/; cp seqkit $HOME/bin/`

Bioawk

This is a super useful addition to the awk range of commands specific to bioinformatics - see github page here

conda install Easy install using terminal, as long as you have conda installed https://anaconda.org/bioconda/bioawk

conda install -c bioconda bioawk
conda install -c bioconda/label/cf201901 bioawk

Useful tutorials

Detailed bioinformatics workbook from the Genome Informatics Facility - tutorial available here, and github account here.

Work by Siobhon Egan

siobhon.egan@murdoch.edu.au