Sequences

genomics

sequences

Dealing with sequences.

Author

Siobhon Egan

Published

June 23, 2021

Quick tips in working with sequence files

Viewing

Viewing your nucleotide sequences

When you receive your raw sanger sequences you will usually receive files with the following extensions .ab1 and .fasta. The .ab1 file is the most useful as it contains the cromatogram which contains information on quality of each base call.

🔗 There are lots of programs that you can use to visualise and edit these files.

FinchTV - a free program, that allows you to quickly and easily visualise your ab1 files. It also allows you to do some basic edits on your sequences.
4 peaks - another free program, that works well on macOS. It works in a similar fashion to FinchTV and also has a good short cut to BLAST sequences.
UGENE - is a free program that allows has a lot more features than the first two programs. It is a popular, widely used program. In additional to basic visualisation and sequence editing it has some useful features to design primers (integrated with primer3), sequence alignment (including short reads), contig assembly, phylogenetic trees and more.
GeneStudio - is a collection of free programs that allow you to do everything from viewing & editing sequences to building phylogenies and editing trees.
Geneious - a paid program, that is one of the most widely used by molecular biologists. It has a range of features and plugins that allow you to do a wide range of phylogenetic and bioinformatic analysis.

Alignments

Sequence alignment with MUSCLE

MUSCLE is one of the most common tools to align sequences. In most cases, only a few command line options are needed.

Free download available here

🔗 Muscle documentation

Make an alignment and save to a file in FASTA format:

muscle -in seqs.fa -out seqs_muscle.afa

Write alignment to the console in CLUSTALW format (more readable than FASTA):

muscle -in seqs.fa -clw

If you have an existing alignment, you can add a new sequence using the following:

muscle -profile -in1 existing_msa.afa -in2 new_seq.fa -out combined.afa

If you have more than one new sequence you need to align them first, and then can merge the two alignments:

muscle -in new_seqs.fa -out new_seqs.afa
muscle -profile -in1 existing_aln.afa -in2 new_seqs.afa -out combined.afas

Trees

Commandline muscle is very helpful for making quick trees. Depending on the depth of phylogenetics needed, a simple NJ tree maybe all you need.

Make a UPGMA tree from a multiple alignment:

muscle -maketree -in seqs_muscle.afa -out seqs_muscle_tree.phy

Make a Neighbor-Joining tree from a multiple alignment:

muscle -maketree -in seqs_muscle.afa -out seqs_muscle_tree.phy -cluster neighborjoining

Sequence quick bits

Quick commands for manipulating sequence files

This is to cover off on some basic quick command line tools to work with sequence data (mainly written for fasta and fastq format but generally useful for most types of sequence files. There are a lot of programs out that. Some use the command line interface, there is also plenty of GUI and web-based programs too. Below are just two on the most useful ones that I like (I tend to like the command line interface ones, as they can easily be upscaled and used on your VM/supercomputer if needed).

Contents

Some simple commands
Seqmagick
SeqKit
Bioawk

Basic information and links

This section first starts off with some quick useful commands that will be helpful to assess and manipulate your fasta and fastq file. They can be done locally in your command line and don’t require any clumsy dependancies.

🔗 Credit to some helpful inspiration

Some simple commands

These examples simply use sample.fa or sample.fq or similar format to represent the file you want to investigate. Remember you will need to navigate to the the path of the file or ensure you are in the correct working directory. The following use some basic sed, awk and pearl commands - often these are already installed if working on a macOS terminal environment

Convert a fastq file to a fasta file

sed '/^@/!d;s//>/;N' sample.fq > sample1.fa

Count number of sequences

Fastq file

grep -c '^@' sample.fq

Fasta file

grep -c '>' sample.fasta

Get some basic info on your file(s)

perl -ne 'if(/^>(\S+)/){print "$1\n"}' sample.fa

You may want to print them to a text file

perl -ne 'if(/^>(\S+)/){print "$1\n"}' sample.fa > sample_names.txt

Get the sequence name and length for every individual sequence within a fasta file

cat sample.fasta | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'

Calculate the mean length of reads in a fastq file:

awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' sample.fastq

Add something to the end of all header lines

sed 's/>.*/&WHATEVERYOUWANT/' file.fa > outfile.fa

Clean up a fasta file so that only the first column of the header is outputted

awk '{print $1}' file.fa > output.fa

Remove dupilcate sequences from your fasta file

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2  | sed -e 's/^/>/' -e 's/\t/\n/'

Get A T (or U) G C counts for all sequences in a fasta file

echo -e "seq_id\tA\tU\tG\tC"; while read line; do echo $line | grep ">" | sed 's/>//g'; for i in A U G C;do echo $line | grep -v ">" | grep -o $i | wc -l | grep -v "^0"; done; done < your_fasta_file.fa | paste - - - - -

Split a multifasta in to separate files (with arbitrary names).

splitfa(){
i=1;
while read line ; do
  if [ ${line:0:1} == ">" ] ; then
    header="$line"
    echo "$header" >> seq"${i}".fasta
  else
    seq="$line"
    echo "$seq" >> seq"${i}".fasta
    ((i++))
  fi
done < $1
}

Extract sequences by their ID

Call just certain sequences of interest - in this case we are interested in sequences named seq1, seq2 and seq3 and put them in a file call new.fasta

perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(seq1 seq2 seq3)}print if $c' sample.fa > new.fasta

Say you have a text file with a list of sequences you want to further investigate you can use this text file (called seqs_of_interest.txt) to call out the sequences in the sample.fasta file and put them in a few file seqs_of_interest.fasta

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' seqs_of_interest.txt sample.fasta > seqs_of_interest.fasta

Usefull command line (bash) programs

Seqmagick

seqmagick is a simple yet powerful utility that helps with basic manipulation and conversion/formatting of sequence files. Super easy to install and use with minimal requirements (likely already installed) - you just need Python >=3.5 and biopython >=1.70.

Link to github here and additional documentation here.

Install

Open a new terminal window and type the following

pip install seqmagick

Convert a .fasta file to a .nexus format.

seqmagick convert --output-format nexus --alphabet dna input.fasta output.nex

Convert a .fasta file to a .phylip format.

seqmagick convert --output-format phylip --alphabet dna input.fasta output.phy

Reverse complement a sequence

seqmagick mogrify --reverse-complement sequence.fasta

Remove duplicate sequences

seqmagick mogrify --deduplicate-sequences sequence.fasta

SeqKit

A cross platform and ultrafast toolkit for fasta and fastq file manipulation

The above commands are good for some basic editing and information but if you want to do more with your fasta and fastq files your going to need some dedicated programs.

seqkit is a super useful program for any bioinformatician follow this link for offical downloads and documenation.

We’ll run through the basic below but always refer to the link above for any issues and more details.

Install (Linux-like systems)

Just download compressed executable file of your operating system. Navigate to the file in your command line and decompress it with tar -zxvf *.tar.gz command. * If you have root privilege simply copy it to /usr/local/bin: sudo cp seqkit /usr/local/bin/`` * Or copy to anywhere in the environment variable PATH:mkdir -p $HOME/bin/; cp seqkit $HOME/bin/`

Extract the first 100 sequences in a fasta file to new file

seqkit head -n 100 reads_all.fasta > reads_001-100.fasta

Extract a range of sequences, e.g. sequences 5-15 (inclusive) in a fasta file to new file

seqkit range -r 5:15 reads_all.fasta > reads_005-015.fasta

Bioawk

This is a super useful addition to the awk range of commands specific to bioinformatics - see github page here

conda install Easy install using terminal, as long as you have conda installed https://anaconda.org/bioconda/bioawk

conda install -c bioconda bioawk
conda install -c bioconda/label/cf201901 bioawk

Useful tutorials

Detailed bioinformatics workbook from the Genome Informatics Facility - tutorial available here, and github account here.

Blastn command line

blastn -db ~/db/16S_ribosomal_RNA_BLAST/16S_ribosomal_RNA -query taxa_asvs.fasta -out blastn_16SrRNA.txt -outfmt 6

Getting taxonomic info

1. `Accession numbers` to `taxid`

These next bits of code are to get taxonomic and description information from nucleotide accession numbers.

First if you have a list of nucleotide accession numbers get the taxid reference for each. Code copied from this thread here. (Note: don’t need to download any databases or programs for this!)

for ACC in A00002 X53307 BB145968 CAA42669 V00181  AH002406  HQ844023
  do
    echo -n -e "$ACC\t"
    curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&rettype=fasta&retmode=xml" | \
    grep TSeq_taxid | \
    cut -d '>' -f 2 | \
    cut -d '<' -f 1 | \
    tr -d "\n"
    echo
  done

To get sequence description information use the following:

for ACC in A00002 X53307 BB145968 CAA42669 V00181  AH002406  HQ844023
  do
    echo -n -e "$ACC\t"
    curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&rettype=fasta&retmode=xml" | \
    grep TSeq_defline | \
    cut -d '>' -f 2 | \
    cut -d '<' -f 1 | \
    tr -d "\n"
    echo
  done

2. `taxid` to lineage information

Use TaxonKit to get lineage information. Documentation here, GitHub repo here

Install - for linux, other platforms available here

wget https://github.com/shenwei356/taxonkit/releases/download/v0.7.1/taxonkit_linux_amd64.tar.gz
wget https://github.com/shenwei356/csvtk/releases/download/v0.22.0/csvtk_linux_amd64.tar.gz
#unzip
tar -zxvf *.tar.gz

Download and decompress taxdump

wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz

mkdir -p $HOME/.taxonkit
cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit

Example data - taxids.txt

    cat taxids.txt
    863246
    1826986
    312471
    268293
    5696
    75058
    351716
    5696
    266814
    10117

Show name and rank only

taxonkit lineage -r -n -L taxids.txt \
    | csvtk pretty -t

Show lineage consisting of taxids:

taxonkit lineage -t  taxids2.txt \
    | csvtk pretty -t

Write lineage information to text file

taxonkit lineage taxids2.txt | awk '$2!=""' > lineage2.txt

Reformat lineage file to the following output: {k};{p};{c};{o};{f};{g};{s}

taxonkit reformat lineage2.txt | tee lineage.txt.reformat
cut -f 1,3 lineage.txt.reformat

Align output to show taxonomic classification clearly

lineage2.txt \
    | taxonkit reformat \
    | csvtk -H -t cut -f 1,3 \
    | csvtk -H -t sep -f 2 -s ';' -R \
    | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \
    | csvtk pretty -t

3. `taxize` to go from accession to lineage

The R package taxize does the job of the two sections above.

You can go from accession number, to taxid and to lineage all in R.

First you need some blastn output (see command line blastn for example), -fmtout 6 is good option.

Then in RStudio we read in the file and find lineage.

blastn_taxa <-
  read.table("../data/blastn_16SrRNA.txt",
             sep = "\t",
             header = FALSE)

blastn_fmt6 <-
  c(
    "qseqid",
    "sseqid",
    "pident",
    "length",
    "mismatch",
    "gapopen",
    "qstart",
    "qend",
    "sstart",
    "send",
    "evalue",
    "bitscore"
  )
colnames(blastn_taxa) <- blastn_fmt6

blastn_taxa_sort <-
  data.table::setorder(blastn_taxa, qseqid,-pident)

blastn_taxa_top <-
  blastn_taxa_sort[match(unique(blastn_taxa_sort$qseqid), blastn_taxa_sort$qseqid), ]

Formatting to get taxid and lineage for each unique genbank accession number

unique_sseqid <- unique(blastn_taxa_top$sseqid)
ncbi_lineage <-
  classification(genbank2uid(id = unique_sseqid), db = 'ncbi')

ncbi_lineage_df <- cbind(ncbi_lineage)
ncbi_lineage_df$sseqid <- unique_sseqid

ncbi_lineage_df <-
  dplyr::select(
    ncbi_lineage_df,
    sseqid,
    query,
    superkingdom,
    phylum,
    class,
    order,
    family,
    genus,
    species
  )

Check matches and then join blast top hit table with lineage

idx <- which(blastn_taxa_top$sseqid %in% ncbi_lineage_df2$sseqid)
blastn_taxa_top_lineage <-
      blastn_taxa_top %>% left_join(ncbi_lineage_df2,
                                    by = "sseqid")

Getting sequences

The rentrez R package provides functions that work with the NCBI Eutils API to search, download data from, and otherwise interact with NCBI databases.

https://vbaliga.github.io/download-genbank-dna-protein-sequences-in-R/