BIO514

Bioinformatics - sequence processing

As mentioned there are lots of options for processing sequence data, this work flow uses the QIIME2 pipeline. While you can also perform statistical analysis and visualize your data in QIIME2, as it is a web based platform it is restricted in terms of analysis options, customising figures, cleaning & subsetting data and integrating other data.

My approach/recommendations:

You need to find a balance that works for you and your study question(s).

QIIME2

Official QIIME2 docs, and view objects via QIIME2 view.

Customised scripts available at https://github.com/siobhon-egan/qiime2_analysis

Pipeline created with QIIME2-2020.11, see QIIME2 documentation for install based on your platform.

Sequence data input

Paired end sequence data in directory raw_data/. Sample data named in the following format: SAMPLEID_L001_R1_001.fastq.gz (forward) or SAMPLEID_L001_R2_001.fastq.gz (reverse).

Activate QIIME2

conda activate qiime2-2020.11

Import data

Import .fastq.gz data into QIIME2 format using Casava 1.8 demultiplexed (paired-end) option. Remember assumes raw data is in directory labeled raw_data/ and file naming format as above.

qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path raw_data \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path 16S_demux_seqs.qza

# create visualization file
qiime demux summarize \
  --i-data 16S_demux_seqs.qza \
  --o-visualization 16S_demux_seqs.qzv

Inspect 16S_demux_seqs.qzv artifact for quality scores. This will help decide on QC parameters.

Denoising

Based on quality plot in the above output 16S_demux_seqs.qza adjust trim length to where quality falls.

Then you can also trim primers. In this case working with 16S V1-2 data.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs 16S_demux_seqs.qza \
  --p-trim-left-f 20 \
  --p-trim-left-r 19 \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 250 \
  --o-table 16S_denoise_table.qza \
  --o-representative-sequences 16S_denoise_rep-seqs.qza \
  --o-denoising-stats 16S_denoise-stats.qza

At this stage, you will have artifacts containing the feature table, corresponding feature sequences, and DADA2 denoising stats. You can generate summaries of these as follows.

qiime feature-table summarize \
  --i-table 16S_denoise_table.qza \
  --o-visualization 16S_denoise_table.qzv \
  --m-sample-metadata-file sample-metadata.tsv # Can skip this bit if needed.

qiime feature-table tabulate-seqs \
  --i-data 16S_denoise_rep-seqs.qza \
  --o-visualization 16S_denoise_rep-seqs.qzv

qiime metadata tabulate \
  --m-input-file 16S_denoise-stats.qza \
  --o-visualization 16S_denoise-stats.qzv

Export ASV table

To produce an ASV table with number of each ASV reads per sample that you can open in excel.

Need to make biom file first

qiime tools export \
--input-path 16S_denoise_table.qza \
--output-path feature-table

biom convert \
-i feature-table/feature-table.biom \
-o feature-table/feature-table.tsv \
--to-tsv

Phylogeny

Several downstream diversity metrics require that a phylogenetic tree be constructed using the Operational Taxonomic Units (OTUs) or Amplicon Sequence Variants (ASVs) being investigated.

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza

Export

Covert unrooted tree output to newick formatted file

qiime tools export \
  --input-path unrooted-tree.qza \
  --output-path exported-tree

Taxonomy

Assign taxonomy to denoised sequences using a pre-trained naive bayes classifier and the q2-feature-classifier plugin. Details on how to create a classifier are available here.

I am using a pre-training classifier for the 16S V1-2 with reference a SILVA database version 138.1.

Note that taxonomic classifiers perform best when they are trained based on your specific sample preparation and sequencing parameters, including the primers that were used for amplification and the length of your sequence reads.

qiime feature-classifier classify-sklearn \
--i-classifier /Taxonomy/QIIME2_classifiers_v2020.11/Silva_99_Otus/27F-388Y/classifier.qza \
--i-reads 16S_denoise_rep-seqs.qza \
--o-classification qiime2-taxa-silva/taxonomy.qza

qiime metadata tabulate \
--m-input-file qiime2-taxa-silva/taxonomy.qza \
--o-visualization qiime2-taxa-silva/taxonomy.qzv

Data output

Lets take a took at something prepared earlier.

There are two major outputs from the process above.

pkgs <- c("readr", "rmarkdown")
lapply(pkgs, require, character.only = TRUE)

1. The count data

This is the data contains the list of ASVs and number of sequences per sample.
In this example each row is a sample and a column is the OTU/ASV.

# read data
otu_table <- read_csv("data/otu_table.csv")
paged_table(otu_table)

2. The taxonomy data

This contains the taxonomy of the count (or OTU) data. Each row is a unique OTU/ASV and column reflect Kingdom, Phylum, Class, Order, Family, Genus, Species.

# read data
tax_table <- read_csv("data/tax_table.csv")
# To make it easier viewing remove sequence name in first column
drop <- c("X1")
df = tax_table[,!(names(tax_table) %in% drop)]
paged_table(df)

3. The sample data

You’ll also need to have some sample metadata as well - this contains the variables and information related to our samples. Lets take a quick look at the for the example data set we are about to use.

# read data
sam_data <- read_csv("data/sam_data.csv")
paged_table(sam_data)