Bioinformatics - sequence processing
As mentioned there are lots of options for processing sequence data, this work flow uses the QIIME2 pipeline. While you can also perform statistical analysis and visualize your data in QIIME2, as it is a web based platform it is restricted in terms of analysis options, customising figures, cleaning & subsetting data and integrating other data.
My approach/recommendations:
You need to find a balance that works for you and your study question(s).
Official QIIME2 docs, and view objects via QIIME2 view.
Customised scripts available at https://github.com/siobhon-egan/qiime2_analysis
Pipeline created with QIIME2-2020.11, see QIIME2 documentation for install based on your platform.
Sequence data input
Paired end sequence data in directory raw_data/
. Sample data named in the following format: SAMPLEID_L001_R1_001.fastq.gz
(forward) or SAMPLEID_L001_R2_001.fastq.gz
(reverse).
Activate QIIME2
conda activate qiime2-2020.11
Import .fastq.gz
data into QIIME2 format using Casava 1.8 demultiplexed (paired-end) option. Remember assumes raw data is in directory labeled raw_data/
and file naming format as above.
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path raw_data \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path 16S_demux_seqs.qza
# create visualization file
qiime demux summarize \
--i-data 16S_demux_seqs.qza \
--o-visualization 16S_demux_seqs.qzv
Inspect 16S_demux_seqs.qzv
artifact for quality scores. This will help decide on QC parameters.
Based on quality plot in the above output 16S_demux_seqs.qza
adjust trim length to where quality falls.
Then you can also trim primers. In this case working with 16S V1-2 data.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs 16S_demux_seqs.qza \
--p-trim-left-f 20 \
--p-trim-left-r 19 \
--p-trunc-len-f 250 \
--p-trunc-len-r 250 \
--o-table 16S_denoise_table.qza \
--o-representative-sequences 16S_denoise_rep-seqs.qza \
--o-denoising-stats 16S_denoise-stats.qza
At this stage, you will have artifacts containing the feature table, corresponding feature sequences, and DADA2 denoising stats. You can generate summaries of these as follows.
qiime feature-table summarize \
--i-table 16S_denoise_table.qza \
--o-visualization 16S_denoise_table.qzv \
--m-sample-metadata-file sample-metadata.tsv # Can skip this bit if needed.
qiime feature-table tabulate-seqs \
--i-data 16S_denoise_rep-seqs.qza \
--o-visualization 16S_denoise_rep-seqs.qzv
qiime metadata tabulate \
--m-input-file 16S_denoise-stats.qza \
--o-visualization 16S_denoise-stats.qzv
Export ASV table
To produce an ASV table with number of each ASV reads per sample that you can open in excel.
Need to make biom file first
qiime tools export \
--input-path 16S_denoise_table.qza \
--output-path feature-table
biom convert \
-i feature-table/feature-table.biom \
-o feature-table/feature-table.tsv \
--to-tsv
Several downstream diversity metrics require that a phylogenetic tree be constructed using the Operational Taxonomic Units (OTUs) or Amplicon Sequence Variants (ASVs) being investigated.
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences rep-seqs.qza \
--o-alignment aligned-rep-seqs.qza \
--o-masked-alignment masked-aligned-rep-seqs.qza \
--o-tree unrooted-tree.qza \
--o-rooted-tree rooted-tree.qza
Export
Covert unrooted tree output to newick formatted file
qiime tools export \
--input-path unrooted-tree.qza \
--output-path exported-tree
Assign taxonomy to denoised sequences using a pre-trained naive bayes classifier and the q2-feature-classifier plugin. Details on how to create a classifier are available here.
I am using a pre-training classifier for the 16S V1-2 with reference a SILVA database version 138.1.
Note that taxonomic classifiers perform best when they are trained based on your specific sample preparation and sequencing parameters, including the primers that were used for amplification and the length of your sequence reads.
qiime feature-classifier classify-sklearn \
--i-classifier /Taxonomy/QIIME2_classifiers_v2020.11/Silva_99_Otus/27F-388Y/classifier.qza \
--i-reads 16S_denoise_rep-seqs.qza \
--o-classification qiime2-taxa-silva/taxonomy.qza
qiime metadata tabulate \
--m-input-file qiime2-taxa-silva/taxonomy.qza \
--o-visualization qiime2-taxa-silva/taxonomy.qzv
Lets take a took at something prepared earlier.
There are two major outputs from the process above.
1. The count data
This is the data contains the list of ASVs and number of sequences per sample.
In this example each row is a sample and a column is the OTU/ASV.
# read data
otu_table <- read_csv("data/otu_table.csv")
paged_table(otu_table)
2. The taxonomy data
This contains the taxonomy of the count (or OTU) data. Each row is a unique OTU/ASV and column reflect Kingdom, Phylum, Class, Order, Family, Genus, Species.
# read data
tax_table <- read_csv("data/tax_table.csv")
# To make it easier viewing remove sequence name in first column
drop <- c("X1")
df = tax_table[,!(names(tax_table) %in% drop)]
paged_table(df)
3. The sample data
You’ll also need to have some sample metadata as well - this contains the variables and information related to our samples. Lets take a quick look at the for the example data set we are about to use.
# read data
sam_data <- read_csv("data/sam_data.csv")
paged_table(sam_data)