BIO514

Bioinformatics

Workshop - Microbiome bioinformatic analysis

Lesson Link
Set up and resources link
Sequence processing link
Data cleaning link
Data visualization link

About the data

Today we will be using 16S rRNA amplicon data from West et al. (2020) Gut 69, 1452-1459. doi: 10.1136/gutjnl-2019-319620. The .Rdata file we will use is available from the author's GitHub repository (link here). For this workshop, the relevant .Rdata for today’s session has also been made available within this repository.

Methods

Excerpt taken directly from West et al. (2020)

Stool samples were randomised for processing and DNA was extracted (see online supplementary methods) using the PowerLyzer PowerSoil DNA Isolation Kit (Mo Bio). 16S rRNA gene amplicon sequencing targeting the V1-V2 regions was performed on the Illumina MiSeq platform as previously described [21]. Raw reads were processed in the R software environment [19] following a published workflow [22] which includes amplicon denoising implemented in ‘DADA2’ [23]. See (online supplementary methods) for full details. Functions in the ‘vegan’ R package were used to calculate Shannon Diversity Indices (\(\alpha\)-diversity) on data rarefied to the minimum sequencing depth and Bray-Curtis dissimilarity (\(\beta\)-diversity) on log-transformed data (pseudocount of 1 added to each value). Significance of group separation in \(\beta\)-diversity was assessed by permutational multivariate analysis of variance. Changes in relative abundance were tested at each taxonomic rank from phylum to genus using the Mann-Whitney U test while differentially abundant 16S rRNA gene sequences were identified using ‘DESeq2’ [24]. For ‘DESeq2’ analysis, data were pooled for each individual rather than analysing distinct time points.
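To make the diversity measures mentioned in the excerpt more concrete, here is a minimal sketch in plain Python. It is not the ‘vegan’ implementation used by the authors; the function names, the two samples and their counts are all made up for illustration. It rarefies counts to the minimum sequencing depth, computes the Shannon index (natural log), and computes Bray-Curtis dissimilarity on log(x + 1) transformed counts.

```python
import math
import random

def rarefy(counts, depth, seed=0):
    """Subsample per-taxon counts to `depth` reads, without replacement."""
    pool = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    picked = random.Random(seed).sample(pool, depth)
    out = [0] * len(counts)
    for taxon in picked:
        out[taxon] += 1
    return out

def shannon(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

def log_transform(counts):
    """log(x + 1), i.e. a pseudocount of 1 added to each value."""
    return [math.log(n + 1) for n in counts]

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum(|a_i - b_i|) / sum(a_i + b_i)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den

# Hypothetical read counts for two samples across four taxa.
sample_a = [50, 30, 20, 0]
sample_b = [10, 10, 10, 10]

# Rarefy both samples to the minimum sequencing depth (40 reads here).
depth = min(sum(sample_a), sum(sample_b))
rare_a = rarefy(sample_a, depth)
rare_b = rarefy(sample_b, depth)

h_a = shannon(rare_a)  # alpha-diversity of sample A
h_b = shannon(rare_b)  # alpha-diversity of sample B
bc = bray_curtis(log_transform(sample_a), log_transform(sample_b))  # beta-diversity
```

Sample B is perfectly even across four taxa, so its Shannon index equals ln(4), the maximum possible for four taxa. Sample A is dominated by one taxon and scores lower.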

Extra reference for microbiome sequencing
Mullish BH, Pechlivanis A, Barker GF, et al. Functional microbiomics evaluation of gut microbiota-bile acid metabolism interactions in health and disease. Methods 2018;149:49–58. doi: 10.1016/j.ymeth.2018.04.028


Introduction

Microbiome research, metagenomics and bioinformatics are huge areas of study, so we certainly won't be covering all aspects of them here.

Targeted amplicon and metagenomic sequencing approaches. Ref: Bharti and Grimm (2019) Briefings in Bioinformatics 22(1) doi: 10.1093/bib/bbz155

Today there are two main molecular approaches that we use for microbiome studies:

1. Metagenomics = DNA

2. Metatranscriptomics = messenger RNA

    • Gene expression and regulation

    • Used for functional potential

    • Better for relative abundance comparison

    • No PCR bias

Pros of amplicon NGS

A schematic overview outlining various experimental and computational challenges associated with 16S rRNA-based and shotgun metagenomic sequencing. Ref: Bharti and Grimm (2019) Briefings in Bioinformatics 22(1) doi: 10.1093/bib/bbz155

Terminology note


Bioinformatics

We will only briefly go through these steps to give you an idea of what is involved. There are various programs and databases required for these steps - so you won’t be performing all of these on your machines today.

Instead I’ll go through the main steps and give you access to some scripts. Then I’ll share with you the output files that we will use for the data visualization part.

There is a wealth of information and different pipelines available but generally most use very similar algorithms under the hood.

The most widely used pipelines include:

Note that the list above is not mutually exclusive. For example, the popular QIIME2 uses dada2 or vsearch for clustering/denoising.

Main steps of processing 16S amplicon sequencing

Optional first step - depending on the sequencing platform, if you have forward and reverse reads you will first need to merge these. Most pipelines have a built-in merge function, so you can avoid using a separate program. This step is fairly straightforward and there is not much difference between programs. PEAR is a popular standalone program.
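As a rough illustration of what merging involves, the sketch below reverse-complements the reverse read and looks for the longest overlap with the end of the forward read. This is a simplified toy, not how PEAR or any pipeline implements merging: real tools score overlaps using base qualities and statistical tests. The function names, the `min_overlap` and `max_mismatch` parameters and the example sequence are all assumptions made for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))

def merge_pair(forward, reverse, min_overlap=5, max_mismatch=0):
    """Merge a read pair on the longest acceptable overlap, or return None."""
    rc = reverse_complement(reverse)
    # Try the longest possible overlap first and shrink until one fits.
    for overlap in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        tail = forward[-overlap:]
        head = rc[:overlap]
        mismatches = sum(1 for x, y in zip(tail, head) if x != y)
        if mismatches <= max_mismatch:
            return forward + rc[overlap:]
    return None

# Hypothetical 22 bp amplicon, read from both ends with a 7 bp overlap.
amplicon = "ACGTTGCAAGGCTTACCGATGG"
forward_read = amplicon[:15]
reverse_read = reverse_complement(amplicon[8:])

merged = merge_pair(forward_read, reverse_read)
```

Here `merged` reconstructs the full amplicon. A pair with no acceptable overlap returns `None` and would normally be discarded or kept unmerged, depending on the pipeline.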

  1. Demultiplex.

    • Use of barcodes (i.e. sequence of 6-8 nucleotides added to primers to identify individual samples).

    • Depending on library prep used and sequencing platform this might be automated.

    • E.g. Illumina and Nextera indexes are automatically demultiplexed on the sequencing machine.

  2. Trim primers and distal bases - this will also depend on QC (quality) scores.

    • Lots of options are available. Again, I try to keep the number of programs to a minimum. Most pipelines will have some sort of trimming/QC function built in.
    • FastQC is popular for viewing sequence files and automating QC reports.
  3. Cluster or denoise

    • Group related sequences.

    • Traditional approaches relied on clustering.

      • Sequences that were at least 97% similar were grouped together, i.e. sequences were grouped at approximately the species level.

      • Common tools = vsearch (use standalone or within QIIME2 pipeline) and uparse (used within USEARCH pipeline).

    • Newer approaches use denoising methods.

      • More accurate method to correct sequencing errors and determine real biological sequences at single nucleotide resolution by generating amplicon sequence variants (ASVs).

      • Common tools = dada2 (use standalone or within QIIME2 pipeline) and unoise3 (used within USEARCH pipeline).
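Steps 1 and 2 above can be sketched in a few lines of Python. This is a toy illustration only: the barcodes, the primer, the reads and the quality scores are invented, and real tools also handle mismatches, reverse reads and sliding-window quality filters.

```python
def demultiplex(reads, barcodes):
    """Assign each read to a sample by an exact match on its leading barcode."""
    by_sample = {sample: [] for sample in barcodes.values()}
    unassigned = []
    for seq in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                # Strip the barcode once the read has been assigned.
                by_sample[sample].append(seq[len(barcode):])
                break
        else:
            unassigned.append(seq)
    return by_sample, unassigned

def trim_primer(seq, primer):
    """Remove the primer if the read starts with it."""
    return seq[len(primer):] if seq.startswith(primer) else seq

def quality_trim(seq, quals, min_q=20):
    """Cut the read at the first base whose Phred score drops below min_q."""
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i]
    return seq

# Hypothetical 6-nucleotide barcodes and primer (placeholders, not real kits).
barcodes = {"ACGTAC": "sample1", "TGCATG": "sample2"}
primer = "AGAGTTTG"
reads = [
    "ACGTACAGAGTTTGCCTAGG",  # barcode for sample1
    "TGCATGAGAGTTTGGGATCC",  # barcode for sample2
    "NNNNNNAGAGTTTGAAAA",    # unreadable barcode
]

by_sample, unassigned = demultiplex(reads, barcodes)
insert = trim_primer(by_sample["sample1"][0], primer)    # "CCTAGG"
trimmed = quality_trim(insert, [38, 37, 35, 30, 12, 8])  # "CCTA"
```

The last read cannot be matched to a sample and ends up in `unassigned`, which is roughly what happens to reads with sequencing errors in their barcode.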

Terminology: The data produced from the clustering/denoising step are referred to as either “Operational Taxonomic Units (OTUs)” or “Amplicon Sequence Variants (ASVs)”. Unfortunately, terminology in genomics is not always consistent. As a general rule of thumb, OTUs refer to data produced via clustering and ASVs refer to data produced by denoising. However, unoise3 in USEARCH refers to these as zero-radius OTUs (ZOTUs); in this case ZOTU = ASV.
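To illustrate the difference, here is a toy sketch of greedy clustering at 97% identity. It assumes equal-length sequences compared position by position, which is a strong simplification: vsearch and uparse use alignments, and denoisers such as dada2 use an error model rather than simply keeping every unique sequence. The sequences and read counts are invented.

```python
from collections import Counter

def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(reads, threshold=0.97):
    """Greedy centroid clustering: the most abundant sequences seed clusters."""
    centroids = {}  # centroid sequence -> number of reads assigned to it
    for seq, n in Counter(reads).most_common():
        for centroid in centroids:
            if identity(seq, centroid) >= threshold:
                centroids[centroid] += n
                break
        else:
            # No existing centroid is similar enough, so start a new cluster.
            centroids[seq] = n
    return centroids

# Hypothetical 40 bp sequences: `variant` differs from `base` at one position
# (39/40 = 97.5% identity), `distant` is unrelated.
base = "ACGT" * 10
variant = base[:10] + "T" + base[11:]
distant = "TTGCA" * 8

reads = [base] * 5 + [variant] * 2 + [distant] * 3

otus = greedy_cluster(reads)        # 2 clusters: base (7 reads), distant (3)
unique_seqs = len(Counter(reads))   # 3 distinct sequences before clustering
```

Clustering folds the single-nucleotide variant into the `base` OTU. A denoising approach would instead try to decide whether that variant is a sequencing error or a real biological sequence, and keep it as its own ASV in the latter case.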

  4. Assign taxonomy

    • Algorithms for taxonomic assignment, down to a given classification level (e.g. genus, family etc.). Accurate species-level assignment is rarely obtained with 16S amplicons, but this depends on the region of the 16S gene targeted, the amplicon size and the taxonomic group.

      • q2-feature-classifier - used in QIIME2 pipeline (one of the best options currently available).

      • SINTAX - used within USEARCH pipeline.

    • Curated databases with representative sequences for each taxon. For a comparison of the main databases see Balvociute and Huson (2017) SILVA, RDP, Greengenes, NCBI and OTT - how do these taxonomies compare? BMC Genomics, 18(2), 114. doi: 10.1186/s12864-017-3501-4.
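The general idea of assigning taxonomy against a reference database can be sketched as a k-mer lookup. This is only a toy: it is not the algorithm used by q2-feature-classifier or SINTAX, and the reference sequences, lineage labels, `k` and `min_score` values below are all placeholders rather than real 16S data.

```python
def kmers(seq, k=8):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def assign_taxonomy(query, reference, k=8, min_score=0.5):
    """Return (lineage, score) for the reference sharing most k-mers with the query."""
    q = kmers(query, k)
    if not q:  # query shorter than k
        return "Unassigned", 0.0
    best_lineage, best_score = "Unassigned", 0.0
    for seq, lineage in reference.items():
        score = len(q & kmers(seq, k)) / len(q)
        if score > best_score:
            best_lineage, best_score = lineage, score
    if best_score < min_score:
        return "Unassigned", best_score
    return best_lineage, best_score

# Hypothetical reference database: placeholder sequences and lineages.
reference = {
    "ACGTTGCAAGGCTTACCGATGGCATTAGC": "Bacteria;PhylumX;GenusA",
    "TTGACCGGTAACCTTGGAACCGGTTAACC": "Bacteria;PhylumY;GenusB",
}

query = "TTGCAAGGCTTACCGATGGCAT"  # a fragment of the first reference sequence
lineage, score = assign_taxonomy(query, reference)

unknown_lineage, unknown_score = assign_taxonomy("GGGGGGGGGGGGGGGG", reference)
```

The first query matches the GenusA reference on every k-mer, while the second shares none with the database and is left unassigned. Real classifiers also report a confidence at each rank, which is why assignments are often truncated at genus or family level.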


Data cleaning and visualization

There are a number of different analysis and visualization options that you can use depending on your data and questions.

Some common examples include:

Overview of statistical and visualization methods for feature tables. Downstream analysis of microbiome feature tables, including alpha/beta-diversity (A/B), taxonomic composition (C), difference comparison (D), correlation analysis (E), network analysis (F), classification of machine learning (G), and phylogenetic tree (H). Ref: Liu, YX., Qin, Y., Chen, T. et al. A practical guide to amplicon and metagenomic analysis of microbiome data. Protein Cell (2020). 10.1007/s13238-020-00724-8

In this part of the workshop we will go through some different ways you can visualize the data and carry out some statistical analysis. We will do this in RStudio. Just as with the bioinformatic steps above, there is a wealth of options for this. My personal preference is RStudio, as analyses are easily reproducible (VERY important for bioinformatics) and easy to scale up. In addition, with ever-increasing amounts of data being produced, I find RStudio the best platform to integrate different data types and create custom pipelines.

Working within the RStudio environment is not limited to just running code locally on your machine. RShiny allows you to make custom apps and web interface programs.

Further detail on cleaning data after processing sequences is covered here


Mainly aimed at amplicon sequence methods