Workshops and tutorials on microbiome and genomics
Today we will be doing through some microbiome bioinformatics.
Workshop notes and data available:
GitHub code repository siobhon-egan/2022-systMed-genomics
Website with information & tutorials siobhonlegan.com/2022-systMed-genomics
REQUIRED
We will be using RStudio to analyse the data set. It is recommend you have the following installed: RStudio version 1.4 or later and R version 4.0 or later. Further details on getting started in RStudio here.
Optional (not needed for today's workshop)
We will not be doing the sequence pre-processing steps today but if you did want to do this you will need to download conda and QIIME2.
If you are you are interested in genomic bioinformatics try install/set up this in the breaks or come make a time to see me if any issues.
Raw amplicon 16S sequence data from West et al. (2020) Gut 69, 1452-1459. doi: 10.1136/gutjnl-2019-319620.
Download raw data from NCBI Sequence Read Archive. Project number PRJNA493625
from https://sra-explorer.info/.
You will not be required to download this for today's tutorial but if you wanted you could use this data and follow the sequence processing page.
I have also uploaded pre-processed QIIME2 sequence data as outlined in sequence processing. This is available for download on FigShare. Download files and you can view them using QIIME2 view.
The easiest way to follow along with this tutorial is to download this GitHub repository using either option A or B below:
A. Go to https://github.com/siobhon-egan/2022-systMed-genomics and click on the green Code button. Select Download ZIP, open/unzip the file. Open the .Rmd
files in RStudio you will be able to follow along for the data analysis.
B. Use terminal and clone the GitHub repo.
git clone https://github.com/siobhon-egan/2022-systMed-genomics.git
Alternatively if you are using the RStudio cloud server as part of the BIO513 unit I have include the data we will need for today's lesson in the directory data/BIO514-microbiome directory.
You can then just create your own Rmarkdown file and save in the home/ directory. You can copy the code chunks in the script on the webpage dataViz or find a copy of the .Rmd
file here.
In the data/BIO514-microbiome/ directory you will find:
.csv
files which contain the output from the QIIME2 pipeline. The files are: (1) otu_tabe (count data), (2) tax_table (taxonomy) and (3) sam_data (sample meatdata)..Rdata
file which we will load into R for the analysis - really this is just a file that contains all three of the spreadsheets above in an R format that is already formatted and ready to go for analysis.Exert direct from West et al. 2020. Gut. doi: 10.1136/gutjnl-2019-319620.
Stool samples were randomised for processing and DNA was extracted (see online supplementary methods) using the PowerLyzer PowerSoil DNA Isolation Kit (Mo Bio). 16S rRNA gene amplicon sequencing targeting the V1-V2 regions was performed on the Illumina MiSeq platform as previously described21. Raw reads were processed in the R software environmentt19 following a published workflow22 which includes amplicon denoising implemented in ‘DADA2’23. See (online supplementary methods) for full details. Functions in the 'vegan' R package were used to calculate Shannon Diversity Indices (alpha-diversity) on data rarefied to the minimum sequencing depth and Bray-Curtis dissimilarity (beta-diversity) on log-transformed data (pseudocount of 1 added to each value). Significance of group separation in beta-diversity was assessed by permutational multivariate analysis of variance. Changes in relative abundance were tested at each taxonomic rank from phylum to genus using the Mann-Whitney U test while differentially abundant 16S rRNA gene sequences were identified using 'DESeq2'24. For 'DESeq2' analysis, data were pooled for each individual rather than analysing distinct time points.
[19] R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2017. https://www.R-project.org/. [21] Mullish BH , Pechlivanis A , Barker GF , et al . Functional microbiomics: evaluation of gut microbiota-bile acid metabolism interactions in health and disease. Methods 2018;149:49–58. doi:10.1016/j.ymeth.2018.04.028. [22] Callahan BJ , Sankaran K , Fukuyama JA , et al . Bioconductor workflow for microbiome data analysis: from raw reads to community analyses. Version 2. F1000Res 2016;5:1492. doi:10.12688/f1000research.8986.1 [23] Callahan BJ , McMurdie PJ , Rosen MJ , et al . DADA2: high-resolution sample inference from illumina amplicon data. Nat Methods 2016;13:581–3. doi:10.1038/nmeth.3869 [24] Love MI , Huber W , Anders S . Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biol 2014;15:550. doi:10.1186/s13059-014-0550-8
Microbiome, metagenomics and bioinformatics is a huge area of study so we certainly wont be covering all aspects of it here.
Targeted amplicon and metagenomic sequencing approaches1.
[1] Ref: Bharti And Grimm (2019) Briefings in Bioinformatics 22(1) doi: 10.1093/bib/bbz155.
Today there are two main molecular approaches that we use for microbiome studies.
1. Metagenomics = DNA
2. Metatranscriptomics = messenger RNA
Pros of amplicon over shotgun
Cheaper
Less data intensive
Easier to make sense of...e.g. good reference databases available.
More sensitive at detecting lower abundant bacteria (shot gun sequencing = mainly host DNA)
A schematic overview outlining various experimental and computational challenges associated with 16S rRNA-based and shotgun metagenomic sequencing1.
[1] Bharti And Grimm (2019) Briefings in Bioinformatics 22(1) doi: 10.1093/bib/bbz155.
Terminology note
We will only briefly go through these steps to give you an idea of what is involved. There are various programs and databases required for these steps - so you won't be performing all of these on your machines today.
Instead I'll go through the main steps and give you access to some scripts. Then I'll share with you the output files that we will use for the data visualization part.
There is a wealth of information and different pipelines available but generally most use very similar algorithms under the hood.
Merge - optional
Depending on sequence platform/pipeline if you have forward and reverse reads you may first need to merge these. Most pipelines have built in merge function so you can avoid using a separate program. In the case of QIIME2 you do not need to merge reads. This step is fairly straight forward and not much difference between programs. PEAR is a popular stand alone program.
Trim
Depending on pipeline this can be done along side filtering.
Filter
Comments as above. Depending on your samples and design you may need more stringent filtering. Many pipelines have additional filtering options i.e. removing low abundant sequences etc.
Terminology: The data produced from the clustering/denoising step is referred to a either "Operational Taxonomic Units (OTUs)" or "Amplicon Sequence Variants (ASVs)". Unfortunately terminology in genomics is not always consistent. But as a general rule of thumb OTUs refer to data produced via clustering and ASVs refers to data produced by denoising (however unoise3 in USEARCH refers to these as Zero-radius taxonomic units (ZOTUs) in this case ZOTU = ASV).
There are a number of different analysis and visualization options that you can use depending on your data and questions.
Some common examples include:
Overview of statistical and visualization methods for feature tables. Downstream analysis of microbiome feature tables, including alpha/beta-diversity (A/B), taxonomic composition (C), difference comparison (D), correlation analysis (E), network analysis (F), classification of machine learning (G), and phylogenetic tree (H)1.
[1] Liu, YX., Qin, Y., Chen, T. et al. A practical guide to amplicon and metagenomic analysis of microbiome data. Protein Cell (2020). 10.1007/s13238-020-00724-8.
In this part of the workshop we will go through some different ways you can visualize the data and some statistical analysis. We will do this in RStudio. Just like the bioinformatic sites above there is a wealth of options for this. My personal preference is RStudio as it is easily reproducible (VERY important for bioinformatics) and is easy to upscale. In addition with the ever increasing data being produced RStudio provides the best platform to integrate different data types and create custom pipelines.
Working within RStudio environment is not limited to just running code locally on your machine. RShiny allows you to make custom apps and web interface programs..
Further detail on cleaning data after processing sequences is covered here
Useful links for microbial genomics analysis
As mentioned there are lots of options for processing sequence data, this work flow uses the QIIME2 pipeline. While you can also perform statistical analysis and visualize your data in QIIME2, as it is a web based platform it is restricted in terms of analysis options, customising figures, cleaning & subsetting data and integrating other data.
You need to find a balance that works for you and your study question(s).
Before we begin let's just go over some different file format terminology.
FASTQ
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>:<UMI> <read>:<is filtered>:<control number>:<index>
FASTA
>
character followed by a code identifying the sequence (and description). The header line is followed by one or more lines containing the sequence itself. FASTA files may contain one or more sequences. open in text editor.BIOM
Trees
Trees can be encoded in a number of different formats, all of which must represent the nested structure of a tree. They may or may not encode branch lengths and other features. Standardized formats are critical for distributing and sharing trees without relying on graphics output that is hard to import into existing software. Commonly used formats are
Spreadsheets containing data
Open in excel/google sheets or text editor
R files
.R
- R scripts.Rmd
- R markdown file. Contain a mix of text and "code chunks".RData
or .rda
- for storing a complete R workspace or selected "objects" from a workspace in a form that can be loaded back by ROfficial QIIME2 docs, and view objects via QIIME2 view.
Customised scripts available at https://github.com/siobhon-egan/qiime2_analysis
Pipeline created with QIIME2-2020.11, see QIIME2 documentation for install based on your platform.
QIIME2 introduces its own files formats known as .qza.
and .qzv
files. This are unique to QIIME2 and you will likely need to convert these to some other readable format.
Import .fastq.gz
data into QIIME2 format using Casava 1.8 demultiplexed (paired-end) option. Remember assumes raw data is in directory labeled raw_data/
and file naming format as above.
qiime tools import \--type 'SampleData[PairedEndSequencesWithQuality]' \--input-path raw_data \--input-format CasavaOneEightSingleLanePerSampleDirFmt \--output-path 16S_demux_seqs.qza# create visualization fileqiime demux summarize \ --i-data 16S_demux_seqs.qza \ --o-visualization 16S_demux_seqs.qzv
Inspect 16S_demux_seqs.qzv
artifact for quality scores. This will help decide on QC parameters.
Based on quality plot in the above output 16S_demux_seqs.qza
adjust trim length to where quality falls.
Then you can also trim primers. In this case working with 16S V1-2 data.
qiime dada2 denoise-paired \ --i-demultiplexed-seqs 16S_demux_seqs.qza \ --p-trim-left-f 20 \ --p-trim-left-r 19 \ --p-trunc-len-f 250 \ --p-trunc-len-r 250 \ --o-table 16S_denoise_table.qza \ --o-representative-sequences 16S_denoise_rep-seqs.qza \ --o-denoising-stats 16S_denoise-stats.qza
At this stage, you will have artifacts containing the feature table, corresponding feature sequences, and DADA2 denoising stats. You can generate summaries of these as follows.
qiime feature-table summarize \ --i-table 16S_denoise_table.qza \ --o-visualization 16S_denoise_table.qzv \ --m-sample-metadata-file sample-metadata.tsv # Can skip this bit if needed.qiime feature-table tabulate-seqs \ --i-data 16S_denoise_rep-seqs.qza \ --o-visualization 16S_denoise_rep-seqs.qzvqiime metadata tabulate \ --m-input-file 16S_denoise-stats.qza \ --o-visualization 16S_denoise-stats.qzv
Export ASV table
To produce an ASV table with number of each ASV reads per sample that you can open in excel.
Need to make biom file first
qiime tools export \--input-path 16S_denoise_table.qza \--output-path feature-tablebiom convert \-i feature-table/feature-table.biom \-o feature-table/feature-table.tsv \--to-tsv
Several downstream diversity metrics require that a phylogenetic tree be constructed using the Operational Taxonomic Units (OTUs) or Amplicon Sequence Variants (ASVs) being investigated.
qiime phylogeny align-to-tree-mafft-fasttree \ --i-sequences rep-seqs.qza \ --o-alignment aligned-rep-seqs.qza \ --o-masked-alignment masked-aligned-rep-seqs.qza \ --o-tree unrooted-tree.qza \ --o-rooted-tree rooted-tree.qza
Export
Covert unrooted tree output to newick formatted file
qiime tools export \ --input-path unrooted-tree.qza \ --output-path exported-tree
Assign taxonomy to denoised sequences using a pre-trained naive bayes classifier and the q2-feature-classifier plugin. Details on how to create a classifier are available here.
I am using a pre-training classifier for the 16S V1-2 with reference a SILVA database version 138.1.
Note that taxonomic classifiers perform best when they are trained based on your specific sample preparation and sequencing parameters, including the primers that were used for amplification and the length of your sequence reads.
Classifier
qiime feature-classifier classify-sklearn \--i-classifier /Taxonomy/QIIME2_classifiers_v2020.11/Silva_99_Otus/27F-388Y/classifier.qza \--i-reads 16S_denoise_rep-seqs.qza \--o-classification qiime2-taxa-silva/taxonomy.qzaqiime metadata tabulate \--m-input-file qiime2-taxa-silva/taxonomy.qza \--o-visualization qiime2-taxa-silva/taxonomy.qzv
Lets take a took at something prepared earlier.
There are two major outputs from the process above.
pkgs <- c("readr", "rmarkdown")library("DT"); library("dplyr")lapply(pkgs, require, character.only = TRUE)otu_table <- read_csv("../data/BIO514-microbiome/otu_table.csv", skip = 1, col_names=FALSE)otu_table <- rename(otu_table, Sample_name = X1)otu_table_sub <- select(otu_table, Sample_name, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20)
1. The count data
This is the data contains the list of ASVs and number of sequences per sample.
In this example each row is a sample and a column is the OTU/ASV.
Count data - showing first 20 OTUs only
otu_table_sub %>% DT::datatable(class = "compact", extensions = "Buttons", options = list(dom = 'tBp', buttons = c("csv","excel"), pageLength = 8))
Sample_name | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1_4 | 199 | 1595 | 1 | 24 | 31 | 0 | 19 | 14 | 0 | 4 | 17 | 1 | 50 | 0 | 0 | 0 | 1 | 19 | 0 |
2 | 10_1 | 5 | 1289 | 0 | 247 | 358 | 0 | 0 | 48 | 110 | 0 | 1 | 6 | 1 | 0 | 1 | 1 | 0 | 224 | 0 |
3 | 10_2 | 4 | 1434 | 1 | 1583 | 682 | 0 | 0 | 471 | 41 | 0 | 0 | 12 | 0 | 86 | 0 | 1 | 0 | 56 | 0 |
4 | 10_4 | 204 | 917 | 2017 | 0 | 1450 | 398 | 59 | 0 | 93 | 0 | 176 | 0 | 0 | 326 | 13 | 0 | 0 | 5 | 0 |
5 | 100_4 | 2485 | 34 | 0 | 1 | 1 | 1386 | 1102 | 0 | 1 | 0 | 532 | 0 | 0 | 0 | 1 | 1 | 0 | 528 | 0 |
6 | 11_1 | 969 | 359 | 57 | 0 | 314 | 0 | 0 | 0 | 2 | 5 | 127 | 34 | 3 | 0 | 67 | 2474 | 0 | 0 | 0 |
7 | 11_2 | 1356 | 487 | 109 | 2 | 438 | 1 | 1 | 14 | 0 | 16 | 154 | 4 | 0 | 0 | 60 | 2559 | 0 | 0 | 1 |
8 | 11_4 | 748 | 188 | 41 | 0 | 138 | 1 | 1 | 0 | 0 | 3 | 35 | 24 | 0 | 31 | 19 | 2508 | 0 | 0 | 0 |
Taxonomy
df_tax %>% DT::datatable(class = "compact", extensions = "Buttons", options = list(dom = 'tBp', buttons = c("csv","excel"), pageLength = 8))
Warning in instance$preRenderHook(instance): It seems your data is too bigfor client-side DataTables. You may consider server-side processing: https://rstudio.github.io/DT/server.html
...1 | Kingdom | Phylum | Class | Order | Family | Genus | |
---|---|---|---|---|---|---|---|
1 | GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGGTCTTAGCTTGCTAAGGCCGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGTCTACTCTTGGACAGCCTTCTGAAAGGAAGATTAATACAAGATGGCATCATGAGTCCGCATGTTCACATGATTAAAGGTATTCCGGTAGACGATGGGGATGCGTTCCATTAGATAGTAGGCGGGGTAACGGCCCACCTAGTCTTCGATGGATAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACTGAGACACGGTCCAA | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides |
2 | GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGAACTTAGCTTGCTAAGTTTGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGATGACTCGGGGATAGCCTTTCGAAAGAAAGATTAATACCCGATGGCATAGTTCTTCCGCATGGTAGAACTATTAAAGAATTTCGGTCATCGATGGGGATGCGTTCCATTAGGTTGTTGGCGGGGTAACGGCCCACCAAGCCTTCGATGGATAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACTGAGACACGGTCCAA | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides |
3 | GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGGTCTTAGCTTGCTAAGGCTGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGTCTACTCTTGGCCAGCCTTCTGAAAGGAAGATTAATCCAGGATGGGATCATGAGTTCACATGTCCGCATGATTAAAGGTATTTTCCGGTAGACGATGGGGATGCGTTCCATTAGATAGTAGGCGGGGTAACGGCCCACCTAGTCAACGATGGATAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACTGAGACACGGTCCAA | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides |
4 | ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAG | Bacteria | Proteobacteria | Gammaproteobacteria | Enterobacteriales | Enterobacteriaceae | Escherichia/Shigella |
5 | GATGAACGCTAGCGGCAGGCTTAACACATGCAAGTCGAGGGGCAGCATAATGGATAGCAATATCTATGGTGGCGACCGGCGCACGGGTGCGTAACGCGTATGCAACCTACCTTTAACAGGGGGATAACACTGAGAAATTGGTACTAATACCCCATAATATCATAGAAGGCATCTTTTATGGTTGAAAATTCCGATGGTTAGAGATGGGCATGCGTTGTATTAGCTAGTTGGTGGGGTAACGGCTCACCAAGGCGACGATACATAGGGGGACTGAGAGGTTAACCCCCCACACTGGTACTGAGACACGGACCAG | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Rikenellaceae | Alistipes |
6 | GATGAACGCTAGCGACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGATTTGTAGCAATACAGATTGATGGCGACCGGCGCACGGGTGAGTAACGCGTATGCAACTTACCTATCAGAGGGGGATAGCCCGGCGAAAGTCGGATTAATACCCCATAAAACAGGGGTCCCGCATGGGAATATTTGTTAAAGATTCATCGCTGATAGATAGGCATGCGTTCCATTAGGCAGTTGGCGGGGTAACGGCCCACCAAACCGACGATGGATAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGTACTGAGACACGGACCAA | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Parabacteroides |
7 | GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGGTCTTAGCTTGCTAAGACTGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGTCTACTCTTGGACAGCCTTCTGAAAGGAAGATTAATACAAGATGGCATCATGAGTCCGCATGTTCACATGATTAAAGGTATTCCGGTAGACGATGGGGATGCGTTCCATTAGATAGTAGGCGGGGTAACGGCCCACCTAGTCTTCGATGGATAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACTGAGACACGGTCCAA | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides |
8 | ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAACAGCTTGCTGTTTCGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAG | Bacteria | Proteobacteria | Gammaproteobacteria | Enterobacteriales | Enterobacteriaceae | Escherichia/Shigella |
While we will not perform these steps today but we'll spend a few minutes just talking about some data processing steps after sequence clustering/denoising.
Depending on your aims the importance of these steps will differ.
Two steps/parameters to consider for taxonomy assignment
1. Assignment algorithms. 2. Reference databases.
Generally pipelines use curated bacteria 16S databases but if your sequence does not match any bacteria it will classify as unknown. Most people use NCBI BLAST against all sequence data to confirm identity for any taxa that require further confirmation. Curated database are used because they provide better accuracy, especially if you have a trained classifier as used in QIIME2. They also take up much less computer space (easier to store).
The more accurate your taxonomic assignment the better downstream analysis is - in particular functional assignment and 16S gene copy number prediction (see below).
Use controls to subtract that "background" bacteria noise.
Additional bonus points for
The Decontam
R Package
decontam with detailed tutorial
Reference: Davis et al. (2018) Microbiome, 6, 226 doi: 10.1186/s40168-018-0605-2.
Simple code for identifying contaminant taxa based on prevalence data.
Phyloseq object is ps
and the sample data has a column called Sample_or_Control
where the control samples are Control Sample
. Default prevalence threshold is set to 0.1
. I recommend you check what taxa is identified as "contaminant" and see if it makes sense (this where its good to know something about your samples/microbiology). You may need to adjust threshold as needed, e.g. for more aggressive classification threshold rather than the default try 0.5
.
Define control samples and identify taxa present in them using prevalence method.
sample_data(ps)$is.neg <- sample_data(ps)$Sample_or_Control == "Control Sample"contamdf.prev <- isContaminant(ps, method="prevalence", neg="is.neg")# Identify how many contaminantshead(which(contamdf.prev$contaminant))# Identify what the contaminants aretable(contamdf.prev$contaminant)
Set contaminant threshold (default is 0.1).
contamdf.prev01 <- isContaminant(ps, method="prevalence", neg="is.neg", threshold=0.1)table(contamdf.prev01$contaminant)
Then you "subtract" these taxa from your data set. Raw data is phyloseq object ps
and will create new phyloseq object for downstream analysis ps.decon
ps.decon <- prune_taxa(!contamdf.freq$contaminant, ps)ps.decon
Alternative: a very similar R package called microDecon
also available GitHub repo.
Reference: McKnight et al. (2019) Environmental DNA, 1, 14-25 doi: 10.1002/edn3.11.
Bacterial profiling based on 16S rRNA-based surveys gives a "who’s there?" answer. However as our knowledge improves more questions arise and now we are moving to answer question about "what can they do?".
Just like with taxonomy databases there are functional databases that group taxa into functional groups. The polypeptides predicted from these sequences are annotated by homology to gene function databases.
A word of caution
"...inference with the default database is likely limited outside of human samples and that development of tools for gene prediction specific to different non-human and environmental samples is warranted." - Quote from Sun et al. (2020) Microbiome 8, 45 doi: 10.1186/s40168-020-00815-y
Popular databases:
Number of copies of the 16S rRNA gene in bacteria varies (1-15). Still not widely used and so far databases/tools are not worthwhile.
Summary of findings in Louca et al. (2018). Microbiome 6, 41 doi: 10.1186/s40168-018-0420-9.
Some other references:
Make sure you have R installed for when we come back.
This data is phyloseq
format. This is the most commonly used data format for amplicon data in RStudio.
As we have skipped over getting our data into R, here are some help links on this matter phyloseq and customised tutorial here.
Essentially we need at least three bits of data that talk to each other:
Optional data
.fasta
format)The easiest way to follow along with this tutorial is to download this GitHub repository using either option 1 or 2 below:
.Rmd
files in RStudio you will be able to follow along for the data analysis.git clone https://github.com/siobhon-egan/2022-systMed-genomics.git
pkgs <- c("tidyverse", "santaR", "phyloseq", "ggpubr", "ggplot2", "vegan", "DESeq2", "mixOmics", "Hmisc", "igraph", "ppcor", "reshape2", "plotly", "microbiomeutilities", "ampvis2", "MicrobiotaProcess", "microbiome", "DirichletMultinomial", "magrittr")lapply(pkgs, require, character.only = TRUE)
Load data in RData - downloaded from https://github.com/ka-west/PBS_manuscript
load("../data/BIO514-microbiome/PBS_data.Rdata")# Quick glance at phyloseq objectps_M
phyloseq-class experiment-level objectotu_table() OTU Table: [ 4318 taxa and 68 samples ]sample_data() Sample Data: [ 68 samples by 15 sample variables ]tax_table() Taxonomy Table: [ 4318 taxa by 6 taxonomic ranks ]phy_tree() Phylogenetic Tree: [ 4318 tips and 4316 internal nodes ]
Number of reads
This will give you an overview of the number of reads per sample and per OTU. Important to know the 'depth' of sequencing. Generally for amplicon 16S microbiome you want many 10's of thousands of (good reads) per sample. The more complex the sample the more reads you need (but there is a very large variation in studies and not set rule).
readsumsdf = data.frame(nreads = sort(taxa_sums(ps_M), TRUE), sorted = 1:ntaxa(ps_M), type = "OTUs")readsumsdf = rbind(readsumsdf, data.frame(nreads = sort(sample_sums(ps_M), TRUE), sorted = 1:nsamples(ps_M), type = "Samples"))title = "Total number of reads"nreads = ggplot(readsumsdf, aes(x = sorted, y = nreads)) + geom_bar(stat = "identity")nreads = nreads + ggtitle(title) + scale_y_log10() + facet_wrap(~type, 1, scales = "free")nreads
Read density plot
Useful for QC purposes. This will show you the distribution of sequencing depth among samples. Ideally you want an even number of reads per sample. If you see lots of variation then library preparation needs to be optimised and you will need to perform more thorough data cleaning (i.e. rarefy reads - but this is not ideal.
Ref: McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol. 2014 3;10(4):e1003531. doi: https://doi.org/10.1371/journal.pcbi.1003531.
read_distrib <- plot_read_distribution(ps_M, groups = "Group", plot.type = "density")read_distrib
Rarefaction is a technique to assess species richness from the results of sampling - mainly used in ecology. This curve is a plot of the number of species as a function of the number of samples. Rarefaction curves generally grow rapidly at first, as the most common species are found, but the curves plateau as only the rarest species remain to be sampled. We use this plot to see if we have reached an adequate level of sequencing depth for our samples.
# set seedset.seed(1)# set subsamplesubsamples = c(10, 5000, 10000, 20000, 30000)rarecurve <- plot_alpha_rcurve(ps_M, index="observed", subsamples = subsamples, lower.conf = 0.025, upper.conf = 0.975, group="Group_label", label.color = "brown3", label.size = 3, label.min = TRUE)
Warning in vegan::rrarefy(t(abundances(ps)), sample.size): some row sums <'sample' and are not rarefiedWarning in vegan::rrarefy(t(abundances(ps)), sample.size): some row sums <'sample' and are not rarefied
# change color of line mycols <- c("brown3", "steelblue")rarecurve <- rarecurve + scale_color_manual(values = mycols) + scale_fill_manual(values = mycols)
Alpha diversity is the mean species diversity within a sample. There are different measurements/indexes. The most simplest being how many ASV/OTUs in each sample.
Today we will produce alpha diversity plots using 6 measures:
The reason we produce so many measures is that we might want to compare with various papers which might report on different measures.
Statistical analysis with wilcoxon pair-wise test
alpha_meas = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson")p <- plot_richness(ps_M, "Group_label", measures = alpha_meas)alphadiv <- p + geom_boxplot(data = p$data, aes(x = Group_label, y = value, fill = Group_label), alpha = 0.1) + theme_bw() + theme(axis.text.x = element_text(size = 8, angle = 90)) + theme(axis.title.x = element_blank()) + scale_fill_manual(values=c("brown3", "steelblue"))
Save your figures directly from R for bonus points on quality data reproducibility! This line with save your combined alpha diversity plots into a directory called plots/
ggsave("alphadiv.pdf", plot = alphadiv, path = "plots", width = 30, height = 30, units = "cm")
This plot is good to give you an idea of the how taxa are distribution within the data. It will give you an idea about general trends in the data and help guide how further analysis.
# Bariatric SurgeryNBS_ps <- subset_samples(ps_M, Group_label=="NBS")
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
NBS_ps_dis <- taxa_distribution(NBS_ps) + theme_biome_utils() + labs(title = "No Bariatric Surgery")
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
# MalabsorptiveMAL_ps <- subset_samples(ps_M, Group_label=="MAL")
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
MAL_ps_dis <- taxa_distribution(MAL_ps) + theme_biome_utils() + labs(title = "Malabsorptive")
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
# Merge the plots together for publication ready figures!distrib = ggarrange(NBS_ps_dis, MAL_ps_dis, ncol = 1, nrow = 2)
Warning: Transformation introduced infinite values in continuous x-axis
Warning: Removed 582 rows containing non-finite values (stat_density).
Warning: Transformation introduced infinite values in continuous x-axis
Warning: Removed 1054 rows containing non-finite values (stat_density).
Warning: Groups with fewer than two data points have been dropped.
Warning: Removed 1 row(s) containing missing values (geom_path).
Create a bar plot of phyla - showing difference in two groups No bariatric surgery
vs Malabsorptive
.
Note that depending on your study present a barplot might a quick way to see patterns in your data generally they are not used to represent the community composition in your final figures.
Use these for visualizing at higher taxonomic levels (mostly phylum level). Remember also that you need to be careful when looking at relative microbiome abundance!
mycols <- c("brown3", "steelblue")grp_abund <- get_group_abundances(ps_M, level = "Phylum", group="Group", transform = "compositional")
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
An additonal column Sam_rep with sample names is created for reference purpose
mean.plot.phy <- grp_abund %>% # input data ggplot(aes(x= reorder(OTUID, mean_abundance), # reroder based on mean abundance y= mean_abundance, fill=Group)) + # x and y axis geom_bar(stat = "identity", position=position_dodge()) + scale_fill_manual("Group", values=mycols) + # manually specify colors theme_bw() + # add a widely used ggplot2 theme ylab("Mean Relative Abundance") + # label y axis xlab("Phylum") + # label x axis coord_flip() # rotate plot
Create a bar plot of order - showing difference in two groups No bariatric surgery
vs Malabsorptive
mycols <- c("brown3", "steelblue")grp_abund <- get_group_abundances(ps_M, level = "Order", group="Group", transform = "compositional")
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
An additonal column Sam_rep with sample names is created for reference purpose
mean.plot.ord <- grp_abund %>% # input data ggplot(aes(x= reorder(OTUID, mean_abundance), # reroder based on mean abundance y= mean_abundance, fill=Group)) + # x and y axis geom_bar(stat = "identity", position=position_dodge()) + scale_fill_manual("Group", values=mycols) + # manually specify colors theme_bw() + # add a widely used ggplot2 theme ylab("Mean Relative Abundance") + # label y axis xlab("Order") + # label x axis coord_flip() # rotate plot
Composition barplot
To quickly visualize comparison in taxa between make a relative abundance barplot of taxa between sample group (aggregate taxa at family level).
# Get relative abundance and remove low abundant taxaps1.rel <- microbiome::transform(ps_M, "compositional")
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
ps1.fam.rel <-aggregate_rare(ps1.rel, level = "Family", detection = 0.005, prevalence = 0.5)
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
comp_plot <- plot_composition(ps1.fam.rel, average_by = "Group_label") + guides(fill = guide_legend(ncol = 1)) + labs(x = "Samples", y = "Relative abundance", title = "Relative abundance data") + scale_fill_brewer("Family", palette = "Paired")
Now we'll just take the top 5 family taxa. Lets plot the abundance between the two groups. Make some comments on the value of this type of analysis and how you might interpret the data (tell me also about the type of bacteria that were identified as well).
mycols <- c("brown3", "steelblue")top_tax <- plot_taxa_boxplot(ps_M, taxonomic.level = "Family", top.otu = 6, group = "Group_label", add.violin= TRUE, group.colors = mycols, title = "Top six family", keep.other = FALSE, dot.size = 1)
For plotting purpuses the phy_tree will be removed
Warning in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :non-uniform 'Rounding' sampler used
Rather than a barplot heatmaps are much better at presenting the microbiome composition in samples. These are commonly used in publications!
Create heatmap of core microbiome tutorial
Keep only taxa with count above zero and transform to compositional (relative abundance).
ps.prune <- prune_taxa(taxa_sums(ps_M) > 0, ps_M)
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
pseq.rel <- microbiome::transform(ps.prune, "compositional")
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Aggregate data to genus level and make heat map of the most prevalent taxa.
library(RColorBrewer)ps.m3.rel.gen <- aggregate_taxa(pseq.rel, "Genus")
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
prevalences <- seq(.05, 1, .05)detections <- 10^seq(log10(1e-4), log10(.2), length = 10)core_heatmap <- plot_core(ps.m3.rel.gen, plot.type = "heatmap", colours = rev(brewer.pal(5, "RdBu")), prevalences = prevalences, detections = detections, min.prevalence = .5) + xlab("Detection Threshold (Relative Abundance (%))")
[1] "0.01%" "0.0232691816877636%" "0.0541454816418154%" [4] "0.125992104989487%" "0.293173318222417%" "0.682190320772197%" [7] "1.5874010519682%" "3.69375234895952%" "8.59505945175427%" [10] "20%"
Make a heat map of all samples - this can get a bit messy when you have a lot of samples but helpful to quickly see how different samples compare.
ps1.rel <-aggregate_rare(ps_M, level = "Family", detection = 10, prevalence = 0.5)
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
pseq.famlog <- microbiome::transform(ps1.rel, "log10")
Warning in microbiome::transform(ps1.rel, "log10"): OTU table contains zeroes.Using log10(1 + x) transform.
p.famrel.heatmap <- plot_composition(pseq.famlog, sample.sort = NULL, otu.sort = NULL, x.label = "Group_label", plot.type = "heatmap", verbose = FALSE)
To view interactively execute the following R code.
ggplotly(p.famrel.heatmap)
Variation of microbial communities between samples. Beta diversity shows the different between microbial communities from different environments.
Distance measures
Bray–Curtis dissimilarity
Jaccard distance
UniFrac
Ordination methods are used to highlight differences between samples based on their microbial community composition - also referred to as distance- or (dis)similarity measures.
These techniques reduce the dimensionality of microbiome data sets so that a summary of the beta diversity relationships can be visualized in 2D or 3D plots. The principal coordinates (axis) each explains a certain fraction of the variability (formally called inertia). This creates a visual representation of the microbial community compositional differences among samples. Observations based on ordination plots can be substantiated with statistical analyses that assess the clusters.
There are many options for ordination. Broadly they can be broken into:
1. Implicit and Unconstrained (exploratory)
2. Implicit and Constrained (explanatory)
3. Explicit and Unconstrained (exploratory)
Some extra explanatory notes on PCoA and nMDS
PCoA is very similar to PCA, RDA, CA, and CCA in that they are all based on eigenan alysis: each of the resulting axes is an eigen vector associated with an eigen value, and all axes are orthogonal to each other. This means that all axes reveal unique information about the inertia in the data, and exactly how much inertia is indicated by the eigenvalue. When plotting the ordination result in an x/y scatterplot, the axis with the largest eigenvalue is plotted on the first axis, and the one with the second largest on the second axis.
Some extra explanatory notes on PCoA and nMDS
NMDS attempts to represent the pairwise dissimilarity between objects in a low-dimensional space. Can use any dissimilarity coefficient or distance measure. NMDS is a rank-based approach based on an iterative algorithm. While information about the magnitude of distances is lost, rank-based methods are generally more robust to data which do not have an identifiable distribution. NMDS routines often begin by random placement of data objects in ordination space. The algorithm then begins to refine this placement by an iterative process, attempting to find an ordination in which ordinated object distances closely match the order of object dissimilarities in the original distance matrix. The stress value reflects how well the ordination summarizes the observed distances among the samples.
Implicit and Unconstrained (exploratory)
Ordination of samples using DCA. Leave distance blank, so default is chi-square.
# Ordinate the dataset.seed(4235421)mycols <- c("brown3", "steelblue")# proj <- get_ordination(pseq, "MDS", "bray")ord.dca <- ordinate(ps_M, "DCA")ord_DCA = plot_ordination(ps_M, ord.dca, color = "Group_label") + geom_point(size = 5) + scale_color_manual(values=mycols) + stat_ellipse() + theme_biome_utils()
Implicit and Constrained (explanatory)
Ordination of samples using CCA methods using Pearson chi-squared. Constrained variable used as Group_label
.
mycols <- c("brown3", "steelblue")pseq.cca <- ordinate(ps_M, "CCA", cca = "Group_label")ord_CCA <- plot_ordination(ps_M, pseq.cca, color = "Group_label")ord_CCA <- ord_CCA + geom_point(size = 4) + scale_color_manual(values=mycols) + stat_ellipse() + theme_biome_utils()
Implicit and Constrained (explanatory)
Ordination of samples using RDA methods using Euclidean distance. Constrained variable used as Group_label
.
mycols <- c("brown3", "steelblue")pseq.rda <- ordinate(ps_M, "RDA", cca = "Group_label")ord_RDA <- plot_ordination(ps_M, pseq.rda, color = "Group_label")ord_RDA <- ord_RDA + geom_point(size = 4) + scale_color_manual(values=mycols) + stat_ellipse() + theme_biome_utils()
PCoA is very similar to PCA, RDA, CA, and CCA in that they are all based on eigenanalysis: each of the resulting axes is an eigenvector associated with an eigenvalue, and all axes are orthogonal to each other. This means that all axes reveal unique information about the inertia in the data, and exactly how much inertia is indicated by the eigenvalue.
Ordination of samples using PCoA methods and jaccard (presence/absence) distance measure
# Ordinate the dataset.seed(4235421)mycols <- c("brown3", "steelblue")# proj <- get_ordination(pseq, "MDS", "bray")ord.pcoa.jac <- ordinate(ps_M, "PCoA", "jaccard")ord_PCoA_jac = plot_ordination(ps_M, ord.pcoa.jac, color = "Group_label") + geom_point(size = 5) + scale_color_manual(values=mycols) + stat_ellipse() + theme_biome_utils()
Ordination of samples using PCoA methods and bray curtis (abundance) distance measure.
# Ordinate the dataset.seed(4235421)mycols <- c("brown3", "steelblue")ord.pcoa.bray <- ordinate(ps_M, "PCoA", "bray")ord_PCoA_bray = plot_ordination(ps_M, ord.pcoa.bray, color = "Group_label") + geom_point(size = 5) + scale_color_manual(values=mycols) + stat_ellipse() + theme_biome_utils()
Unifrac analysis takes into account not only the differences in OTUs/ASVs but also takes into account the phylogeny of the taxa. I.e. how closely related are the taxa.
We can perform unweighted (using presence/absence abundance like jaccard) or weighted (incorporating abundance data - like bray curtis).
Unweighted unifrac
# Ordinate the dataset.seed(4235421)mycols <- c("brown3", "steelblue")ord_pcoa_ufuw <- ordinate(ps_M, "PCoA", "unifrac", weighted=FALSE)
Warning in UniFrac(physeq, ...): Randomly assigning root as --GATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAACTCCTATGAACGAGGTTTCGGCCAAGTGAATAGGATGTTTAGTGGCGGACGGGTGAGTAACGCGTGAGTAACCTGCCTTGGAGTGGGGAATAACACAGTGAAAATTGTGCTAATACCGCATAATGCATTTAGGTCGCATGACTTTGAATGCCAAAGATTTATCGCTCTGAGATGGACTCGCGTCTGATTAGCTAGTTGGCGGGGCAACGGCCCACCAAGGCGACGATCAGTAGCCGGACTGAGAGGTTGGCCGGCCACATTGGGACTGAGACACGGCCCAG-- in the phylogenetic tree in the data you provided.
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
ord_PCoA_ufuw = plot_ordination(ps_M, ord_pcoa_ufuw, color = "Group_label", shape="Time_point") + geom_point(size = 5) + scale_color_manual(values=mycols) + stat_ellipse() + theme_biome_utils()
Weighted unifrac
mycols <- c("brown3", "steelblue")ord_pcoa_ufw = ordinate(ps_M, "PCoA", "unifrac", weighted=TRUE)
Warning in UniFrac(physeq, ...): Randomly assigning root as --GATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAAGCACTTATCTTTGATTCTTCGGATGAAGAGATTTGTGACTGAGTGGCGGACGGGTGAGTAACGCGTGGGTAACCTGCCTCATACAGGGGGATAACAGTTAGAAATGACTGCTAATACCGCATAAGACCACAGAGCCGCATGGCTCGGTGGGAAAAACTCCGGTGGTATGAGATGGACCCGCGTCTGATTAGGTAGTTGGTGGGGTAACGGCCTACCAAGCCAACGATCAGTAGCCGACCTGAGAGGGTGACCGGCCACATTGGGACTGAGACACGGCCCAA-- in the phylogenetic tree in the data you provided.
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
ord_PCoA_ufw = plot_ordination(ps_M, ord_pcoa_ufw, color="Group_label", shape="Time_point")ord_PCoA_ufw <- ord_PCoA_ufw + geom_point(size = 4) + scale_color_manual(values=mycols) + stat_ellipse() + theme_biome_utils()
Finally lets perform ordination using Non-metric Multidimensional Scaling. We'll use the unifrac distance measure which takes into account phylogeny and also the WEIGHTED
option.
# Ordinate the dataset.seed(4235421)mycols <- c("brown3", "steelblue")ord_nmds_ufw <- ordinate(ps_M, "NMDS", "unifrac", weighted=TRUE)
Warning in UniFrac(physeq, ...): Randomly assigning root as --GATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAACTCCTATGAACGAGGTTTCGGCCAAGTGAATAGGATGTTTAGTGGCGGACGGGTGAGTAACGCGTGAGTAACCTGCCTTGGAGTGGGGAATAACACAGTGAAAATTGTGCTAATACCGCATAATGCATTTAGGTCGCATGACTTTGAATGCCAAAGATTTATCGCTCTGAGATGGACTCGCGTCTGATTAGCTAGTTGGCGGGGCAACGGCCCACCAAGGCGACGATCAGTAGCCGGACTGAGAGGTTGGCCGGCCACATTGGGACTGAGACACGGCCCAG-- in the phylogenetic tree in the data you provided.
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Run 0 stress 0.1465875 Run 1 stress 0.1697601 Run 2 stress 0.1466427 ... Procrustes: rmse 0.01918481 max resid 0.1376383 Run 3 stress 0.177508 Run 4 stress 0.1697601 Run 5 stress 0.2015363 Run 6 stress 0.1465878 ... Procrustes: rmse 0.000134178 max resid 0.001004719 ... Similar to previous bestRun 7 stress 0.1466427 ... Procrustes: rmse 0.01926118 max resid 0.1382319 Run 8 stress 0.1466427 ... Procrustes: rmse 0.01925229 max resid 0.1381674 Run 9 stress 0.1465874 ... New best solution... Procrustes: rmse 0.0002624536 max resid 0.001966408 ... Similar to previous bestRun 10 stress 0.1465875 ... Procrustes: rmse 0.00003078451 max resid 0.0002298629 ... Similar to previous bestRun 11 stress 0.1818785 Run 12 stress 0.1869773 Run 13 stress 0.1474098 Run 14 stress 0.1465873 ... New best solution... Procrustes: rmse 0.00009571418 max resid 0.0007180497 ... Similar to previous bestRun 15 stress 0.1474098 Run 16 stress 0.1474098 Run 17 stress 0.177508 Run 18 stress 0.1474098 Run 19 stress 0.2004568 Run 20 stress 0.1465875 ... Procrustes: rmse 0.0001422936 max resid 0.001067589 ... Similar to previous best** Solution reached
ord_NMDS_ufw = plot_ordination(ps_M, ord_nmds_ufw, color = "Group_label", shape="Time_point") + geom_point(size = 5) + scale_color_manual(values=mycols) + stat_ellipse() + theme_biome_utils()
Here we'll perform a statistical analysis on beta diversity.
See tutorial here.
Differences by Group_label
using ANOVA
# Transform data to hellingerpseq.rel <- microbiome::transform(ps_M, "hellinger")
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
Also defined by 'tidytree'
# Pick relative abundances (compositional) and sample metadata otu <- abundances(pseq.rel)meta <- meta(pseq.rel)# samples x SampleCategory as inputpermanova <- adonis(t(otu) ~ Group_label, data = meta, permutations=999, method = "bray")## statisticsprint(as.data.frame(permanova$aov.tab)["Group_label", "Pr(>F)"])
[1] 0.001
dist <- vegdist(t(otu))mod <- betadisper(dist, meta$Group_label)### ANOVA - are groups different anova(betadisper(dist, meta$Group_label))
Analysis of Variance TableResponse: Distances Df Sum Sq Mean Sq F value Pr(>F) Groups 1 0.009817 0.0098173 6.9784 0.01029 *Residuals 66 0.092850 0.0014068 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Beta diversity metrics can assess the differences between microbial communities. It can be visualized with PCA or PCoA, this can also be visualized with hierarchical clustering.
Function from MicrobiotaProcess using analysis based on ggtree.
## All samples - detailed, include species and SampleCategoryclust_all <- get_clust(obj=ps_M, distmethod="euclidean", method="hellinger", hclustmethod="average")mycols <- c("brown3", "steelblue")# circular layoutclust_all_plot <- ggclust(obj=clust_all , layout = "circular", pointsize=3, fontsize=0, factorNames=c("Group_label", "Time_point_label")) + scale_color_manual(values=mycols) + scale_shape_manual(values=c(17, 15, 16)) + ggtitle("Hierarchical Cluster of All Samples (euclidean)")
This workshop was intended to give you a flavour of what microbiome analysis can involve.
You are encouraged to explore this topic and expand on the analysis we have done here.
The internet is a wealth of options.
Here is one of my favourite links for all things microbiome and R.
R environment based tools for microbiome data
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |