Let's inspect the output from QIIME2 and check whether any further processing is required.

pkgs <- c("readr", "rmarkdown", "tidyverse")
lapply(pkgs, require, character.only = TRUE)
## Loading required package: readr
## Loading required package: rmarkdown
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ dplyr   1.0.7
## ✓ tibble  3.1.4     ✓ stringr 1.4.0
## ✓ tidyr   1.1.3     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x ggplot2::%+%()  masks crayon::%+%()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

1. The count data

This data contains the list of ASVs and the number of sequences per sample.
In this example each row is a sample and each column is an OTU/ASV.

# read data
otu_table <- read_tsv(file = "data/qiime2/feature-table/feature-table.tsv", skip = 1, col_names = TRUE)
paged_table(otu_table)

2. The taxonomy data

This contains the taxonomy of the count (or OTU) data. Each row is a unique OTU/ASV and the columns reflect the taxonomic ranks: Kingdom (Domain), Phylum, Class, Order, Family, Genus, Species.

# read data
tax_table <- read_tsv(file = "data/qiime2/exports/taxonomy.tsv", col_names = TRUE)
tax_table <- tax_table %>% separate(Taxon, sep = ";", into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"))
tax_table <- tax_table %>%
  rename(ASV = `Feature ID`)
# View table
paged_table(tax_table)

You may have noticed some Unassigned or *d__Eukaryota* taxonomic classifications. It is important to investigate these, along with any classifications that have a low confidence value (a low value indicates that the algorithm used to assign taxonomy was not confident in the classification).

How accurate the taxonomic classification needs to be depends on your study question, but as a guide I would recommend investigating any ASVs that have a Confidence value <0.7 or those that have the highest number of sequences.
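One way to shortlist ASVs for closer inspection is sketched below. This is a minimal example on made-up data, not the exact code used to prepare this workshop: the toy tables `tax_example` and `totals_example`, their ASV identifiers and all values are hypothetical. It assumes the taxonomy table carries a `Confidence` column and that the number of sequences per ASV has already been summed across samples into a `total_reads` column.

```r
library(dplyr)
library(tibble)

# Toy taxonomy table (hypothetical ASV IDs and values, for illustration only)
tax_example <- tribble(
  ~ASV,   ~Domain,        ~Confidence,
  "asv1", "d__Bacteria",  0.99,
  "asv2", "Unassigned",   0.45,
  "asv3", "d__Eukaryota", 0.82,
  "asv4", "d__Bacteria",  0.61
)

# Toy per-ASV totals, assumed to be already summed across all samples
totals_example <- tribble(
  ~ASV,   ~total_reads,
  "asv1", 52000,
  "asv2", 15000,
  "asv3", 300,
  "asv4", 12000
)

# Keep ASVs that are Unassigned, Eukaryota or below the 0.7 confidence guide,
# then list the most abundant ones first
to_check <- tax_example %>%
  left_join(totals_example, by = "ASV") %>%
  filter(Domain %in% c("Unassigned", "d__Eukaryota") | Confidence < 0.7) %>%
  arrange(desc(total_reads))

print(to_check)
```

Sorting by `total_reads` puts the ASVs that contribute most to the data set at the top, so these can be checked first.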

I have already done this for you for the Unassigned or *d__Eukaryota* ASVs that have more than 10,000 sequences.

To do this I extracted the sequences for these ASVs and then ran a BLAST search.

>595a1ab1fbf6fb7a154b169604e0e280
TTCGAGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCCTTCGCCACTGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACTCTCCTCTTCTGCACTCAAGTCTCCCAGTTTCCAATGACCCTCCCCGGTTGAGCCGGGGGCTTTCACATCAGACTTAAGAAACCGCCTGCGCTCGCTTTACGCCCAATAAATCCGGACAACGCTTGCCACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTGGCTTTCTGGTTAGATACCGTCAGGGGACGTTCAGTTACTAACGTCCTTGTTCTTCTCTAACAACAGAGTTTTACGATCCGAAAACCTTCTTCACTCACGCGGCGTTGCTCGGTCAGACTTTCGT
>8e2da535041fca7c61c1636ca42da156
TTCGAGTCTCAGCGTCAGTTGCAGACCAGGTAGCCGCCTTCGCCACTGGTGTTCTTCCATATATCTACGCATTCCACCGCTACACATGGAGTTCCACTACCCTCTTCTGCACTCAAGTTATCCAGTTTCCGATGCACTTCTCCGGTTAAGCCGAAGGCTTTCACATCAGACTTAGAAAACCGCCTGCACTCTCTTTACGCCCAATAAATCCGGATAACGCTTGCCACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTGACTTTCTGGTTAAATACCGTCAACGTATGAACAGTTACTCTCATACGTGTTCTTCTTTAACAACAGAGCTTTACGAGCCGAAACCCTTCTTCACTCACGCGGTGTTGCTCCATCAGGCTTGCGC
>ac76d102caf414f3e8dff1da3b840171
TTCGCACCTGAGCGTCAGTCTTCGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACGAGACTCAAGCTTGCCAGTATCAGATGCAGTTCCCAGGTTGAGCCCGGGGATTTCACATCTGACTTAACAAACCGCCTGCGTGCGCTTTACGCCCAGTAATTCCGATTAACGCTTGCACCCTCCGTATTACCGCGGCTGCTGGCACGGAGTTAGCCGGTGCTTCTTCTGCGGGTAACGTCAATGAGCAAAGGTATTAACTTTACTCCCTTCCTCCCCGCTGAAAGTACTTTACAACCCGAAGGCCTTCTTCATACACGCGGCATGGCTGCATCAGGCTTGCGC
>f6b3e02ddb51771a55387ebe7c0b7f5d
TTCGCACCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCAAGCCTGCCAGTTTCGAATGCAGTTCCCAGGTTGAGCCCGGGGATTTCACATCCGACTTGACAGACCGCCTGCGTGCGCTTTACGCCCAGTAATTCCGATTAACGCTTGCACCCTCCGTATTACCGCGGCTGCTGGCACGGAGTTAGCCGGTGCTTCTTCTGCGGGTAACGTCAATTGCTGCGGTTATTAACCACAACACCTTCCTCCCCGCTGAAAGTACTTTACAACCCGAAGGCCTTCTTCATACACGCGGCATGGCTGCATCAGGCTTGCGC
>ce3ab90dada5391ca3cb9dd6b1401dd5
TTCGAGCATCAGCGTCAGTTACAATCCAGTAAGCTGCCTTCGCAATCGGAGTTCTTCGTGATATCTAAGCATTTCACCGCTACACCACGAATTCCGCCTACCTCTGTTGCACTCAAGGTCGCCAGTATCAACTGCAATTTTACGGTTGAGCCGCAAACTTTCACAACTGACTTAACAACCCGCCTACGCTCCCTTTAAACCCAATAAATCCGGATAACGCTCGGATCCTCCGTATTACCGCGGCTGCTGGCACGGAGTTAGCCGATCCTTATTCATACGGTACATACAAAAAGCCACACGTGGCTCACTTTATTCCCGTATAAAAGAAGTTTACAACCCATAGGGCAGTCATCCTTCACGCTACTTGGCTGGTTCAGACTCTCGT
>930227652a7112b1e8e0d1d2eb10eba7
TTCGCACATCAGCGTCAGTTACAGACCAGAAAGTCGCCTTCGCCACTGGTGTTCCTCCATATCTCTGCGCATTTCACCGCTACACATGGAATTCCACTTTCCTCTTCTGCACTCAAGTTTTCCAGTTTCCAATGACCCTCCACGGTTGAGCCGTGGGCTTTCACATCAGACTTAAAAAACCGCCTACGCGCGCTTTACGCCCAATAATTCCGGATAACGCTTGCCACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTGGCTTTCTGATTAGGTACCGTCAAGATGTGCACAGTTACTTACACATATGTTCTTCCCTAATAACAGAGTTTTACGATCCGAAGACCTTCATCACTCACGCGGCGTTGCTCCGTCAGGCTTTCGC
>b899e2ff8ff4cd80c0bd71d67f85e166
TTCGAGCCTCAGCGTCAGTTATCGTCCAGTAAGCCGCCTTCGCCACTGGTGTTCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCTTACCCCTCCGACACTCTAGTACGACAGTTTCCAATGCAGTACCGGGGTTGAGCCCCGGGCTTTCACATCAGACTTGCCGCACCGCCTGCGCTCCCTTTACACCCAGTAAATCCGGATAACGCTTGCACCATACGTATTACCGCGGCTGCTGGCACGTATTTAGCCGGTGCTTCTTAGTCAGGTACCGTCATTATCTTCCCTGCTGATAGAGCTTTACATACCGAAATACTTCTTCGCTCACGCGGCGTCGCTGCATCAGGCTTTCGC
>9eec757b839600ae29e99f0f5013ad90
TTCGAGCCTCAATGTCAGTTGCAGCTTAGCAGGCTGCCTTCGCAATCGGAGTTCTTCGTGATATCTAAGCATTTCACCGCTACACCACGAATTCCGCCTGCCTCAACTGCACTCAAGATATCCAGTATCAACTGCAATTTTACGGTTGAGCCGCAAACTTTCACAACTGACTTAAACATCCATCTACGCTCCCTTTAAACCCAATAAATCCGGATAACGCTCGGATCCTCCGTATTACCGCGGCTGCTGGCACGGAGTTAGCCGATCCTTATTCATAAAGTACATGCAAACGGGTATACATACCCGACTTTATTCCTTTATAAAAGAAGTTTACAACCCATAGGGCAGTCATCCTTCACGCTACTTGGCTGGTTCAGGCCTGCGC
>e83c2018e5ac6c0f01b6c3086b7bbbd8
TTCGCACCTCAGTGTCAGTATCAGTCCAGGTGGTCGCCTTCGCCACTGGTGTTCCTTCCTATATCTACGCATTTCACCGCTACACAGGAAATTCCACCACCCTCTACCGTACTCTAGCTCAGTAGTTTTGGATGCAGTTCCCAGGTTGAGCCCGGGGATTTCACATCCAACTTGCTGAACCACCTACGCGCGCTTTACGCCCAGTAATTCCGATTAACGCTTGCACCCTTCGTATTACCGCGGCTGCTGGCACGAAGTTAGCCGGTGCTTATTCTGTTGGTAACGTCAAAACAGCAAGGTATTAACTTACTGCCCTTCCTCCCAACTTAAAGTGCTTTACAATCCGAAGACCTTCTTCACACACGCGGCATGGCTGGATCAGGCTTTCGC
>e127b1a3b3a29436b9196bbcae09634c
TTCGAGCATCAGCGTCAGTTACAGTCCAGCAGGCTGCCTTCGCAATCGGAGTTCTTCGTGATATCTAAGCATTTCACCGCTACACCACGAATTCCGCCTGCCTCTACTGTACTCAAGACACCCAGTATCAACTGCAATTTTACGGTTGAGCCGCAAACTTTCACAACTGACTTAAGCGTCCGCCTACGCTCCCTTTAAACCCAATAAATCCGGATAACGCTCGGATCCTCCGTATTACCGCGGCTGCTGGCACGGAGTTAGCCGATCCTTATTCATACGGTACATACAAAAAGGCACACGTGCCTCACTTTATTCCCGTATAAAAGAAGTTTACAACCCATAGGGCAGTCATCCTTCACGCTACTTGGCTGGTTCAGACTCTCGT
>b55cc19d9e50de5ada6128d604771abe
TTCGCTCCTCAGCGTCAGTTACAGACCAGAGAGTCGCCTTCGCCACTGGTGTTCCTCCACATCTCTACGCATTTCACCGCTACACGTGGAATTCCACTCTCCTCTTCTGCACTCAAGTTCCCCAGTTTCCAATGACCCTCCCCGGTTGAGCCGGGGGCTTTCACATCAGACTTAAGGAACCGCCTGCGAGCCCTTTACGCCCAATAATTCCGGACAACGCTTGCCACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTGGCTTTCTGGTTAGGTACCGTCAAGGTACCGCCCTATTCGAACGGTACTTGTTCTTCCCTAACAACAGAGCTTTACGATCCGAAAACCTTCATCACTCACGCGGCGTTGCTCCGTCAGACTTTCGT
>50409fb67393939844207dac154a17e7
TTCGAGCCTCAATGTCAGTTGCAGCTTAGCAGGCTGCCTTCGCAATCGGAGTTCTTCGTGATATCTAAGCATTTCACCGCTACACCACGAATTCCGCCTGCCTCAACTGCACTCAAGATATCCAGTATCAACTGCAATTTTACGGTTGAGCCGCAAACTTTCACAACTGACTTAAACATCCATCTACGCTCCCTTTAAACCCAATAAATCCGGATAACGCTCGGATCCTCCGTATTACCGCGGCTGCTGGCACGGAGTTAGCCGATCCTTATTCATAAAGTACATGCAAACGGGTATGCATACCCGACTTTATTCCTTTATAAAAGAAGTTTACAACCCATAGGGCAGTCATCCTTCACGCTACTTGGCTGGTTCAGGCTCTCGC
>e18aba69c1319dae7fd64a5a21c1a993
TTCGAGCATCAGCGTCAGTTACAATCCAGTAAGCTGCCTTCGCAATCGGAGTTCTTCGTGATATCTAAGCATTTCACCGCTACACCACGAATTCCGCCTACCTCTGTTGCACTCAAGGTCGCCAGTATCAACTGCAATTTTACGGTTGAGCCGCAAACTTTCACAACTGACTTAACAACCCGCCTACGCTCCCTTTAAACCCAATAAATCCGGATAACGCTCGGATCCTCCGTATTACCGCGGCTGCTGGCACGGAGTTAGCCGATCCTTATTCATACGGTACATACAAAAAACCACACGTGGCTAACTTTATTCCCGTATAAAAGAAGTTTACAACCCATAGGGCAGTCATCCTTCACGCTACTTGGCTGGTTCAGACTCTCGT
>2ea1496304b6132cc8cabe14e8da73ea
TTCGTGCCTCAACGTCAGATATAGTTTGGTAAGCTGCCTTCGCAATCGGTGTTCTGTATGATCTCTAAGCATTTCACCGCTACACCATACATTCCGCCTACCGCAACTACTCTCTAGTCAAACAGTATTAGAGGCAATTTCGGAGTTAAGCCCCGGGATTTCACCTCTAACTTATCTAACCGCCTACGCACCCTTTAAACCCAATAAATCCGGATAACGCTTGAATCCTCCGTATTACCGCGGCTGCTGGCACGGAGTTAGCCGATCCTTATTCGTACGATACTTTCAGTCACCTACACGTAGGTGAGTTTACCCTCGTACAAAAGCAGTTTACAACTCATAGAGCCGTCATCCTGCACGCGGCATGGCTGGTTCAGACTTGCGT
>123bcfbfc6b0cc94038679309034c98f
TTCGAGCCTCAATGTCAGTTGCAGCTTAGCAGGCTGCCTTCGCAATCGGAGTTCTTCGTGATATCTAAGCATTTCACCGCTACACCACGAATTCCGCCTGCCTCAACTGCACTCAAGACATCCAGTATCAACTGCAATTTTACGGTTGAGCCGCAAACTTTCACAACTGACTTAAACATCCATCTACGCTCCCTTTAAACCCAATAAATCCGGATAACGCTCGGATCCTCCGTATTACCGCGGCTGCTGGCACGGAGTTAGCCGATCCTTATTCATAAAGTACATGCAAACGGGTATGCATACCCGACTTTATTCCTTTATAAAAGAAGTTTACAACCCATAGGGCAGTCATCCTTCACGCTACTTGGCTGGTTCAGGCCATCGC
>0aafa3e0ce506d942c4e064b96502b5e
TTCGCGCCTCAGCGTCAGTTACAGACCAGAGAGTCGCCTTCGCCACTGGTGTTCCTCCACATATCTACGCATTTCACCGCTACACGTGGAATTCCACTCTCCTCTTCTGCACTCCAGTCTTCCAGTTTCCAATGACCCTCCCCGGTTAAGCCGGGGGCTTTCACATCAGACTTAAAAGACCGCCTGCGCGCGCTTTACGCCCAATAAATCCGGACAACGCTTGCCACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTGGCTTTCTGGTTAGATACCGTCAAGGGACAAGCAGTTACTCTTATCCTTGTTCTTCTCTAACAACAGTACTTTACGATCCGAAAACCTTCTTCATACACGCGGCGTTGCTCCGTCAGACTTTCGT
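
The extraction step described above can be done in several ways, and QIIME2 or dedicated sequence packages offer their own tools for it. As a rough base R sketch only, the code below assumes a FASTA file with one header line followed by one sequence line per record, as in the listing above. The helper `read_simple_fasta()`, the shortened toy sequences, the ASV identifiers and the temporary files are all hypothetical.

```r
# Read a two-line-per-record FASTA file into a named character vector.
# Assumes each sequence sits on a single line directly after its ">" header.
read_simple_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]              # drop empty lines
  is_header <- startsWith(lines, ">")
  seqs <- lines[!is_header]
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

# Toy input file (made-up IDs and shortened sequences)
fasta_in <- tempfile(fileext = ".fasta")
writeLines(
  c(">asv1", "TTCGAGCC", ">asv2", "TTCGCACC", ">asv3", "TTCGTGCC"),
  fasta_in
)

all_seqs <- read_simple_fasta(fasta_in)

# IDs of the ASVs selected for a BLAST search (hypothetical)
ids_to_blast <- c("asv1", "asv3")
selected <- all_seqs[ids_to_blast]

# Interleave headers and sequences, then write them out in FASTA format
fasta_out <- tempfile(fileext = ".fasta")
writeLines(
  as.vector(rbind(paste0(">", names(selected)), selected)),
  fasta_out
)
```

The resulting file holds only the selected records and can be pasted or uploaded into a BLAST search.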

YOUR TURN: Navigate to the NCBI BLAST website https://blast.ncbi.nlm.nih.gov/Blast.cgi. Select the Nucleotide BLAST (nt) option. Paste the first sequence (seqID: 595a1ab1fbf6fb7a154b169604e0e280) into the search box. Leave parameters as default (make sure megablast is selected at the bottom) and click BLAST.

Here is the curated version of the taxonomy file that I have updated (I recommend saving a curated file separately so that you keep the raw data for comparison).

# read data
tax_table <- read_tsv(file = "data/qiime2/exports/taxonomy_curated.tsv", col_names = TRUE)
tax_table <- tax_table %>% separate(Taxon, sep = ";", into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"))
tax_table <- tax_table %>%
  rename(ASV = `Feature ID`)
# View table
paged_table(tax_table)

3. The sample data

You’ll also need some sample metadata, which contains the variables and information related to our samples. Let’s take a quick look at it for the example data set we are about to use.

# read data
sam_data <- read_csv("data/metadata.csv")
paged_table(sam_data)

The positive control used in this sample set was the Zymo mock community.

The theoretical composition based on gDNA:

  • Listeria monocytogenes - 12% [Gram positive]
  • Pseudomonas aeruginosa - 12% [Gram negative]
  • Bacillus subtilis - 12% [Gram positive]
  • Escherichia coli - 12% [Gram negative]
  • Salmonella enterica - 12% [Gram negative]
  • Lactobacillus fermentum - 12% [Gram positive]
  • Enterococcus faecalis - 12% [Gram positive]
  • Staphylococcus aureus - 12% [Gram positive]
  • Saccharomyces cerevisiae - 2% [yeast]
  • Cryptococcus neoformans - 2% [yeast]
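
For later comparison against the observed profile of the positive control, the theoretical composition can be stored as a small table. This is a minimal sketch: the object and column names (`mock_theoretical`, `percent_gdna`, `type`, `by_type`) are my own choices for illustration, and the values are copied from the list above.

```r
library(dplyr)
library(tibble)

# Theoretical composition of the mock community based on gDNA (from the list above)
mock_theoretical <- tribble(
  ~species,                   ~percent_gdna, ~type,
  "Listeria monocytogenes",   12,            "Gram positive",
  "Pseudomonas aeruginosa",   12,            "Gram negative",
  "Bacillus subtilis",        12,            "Gram positive",
  "Escherichia coli",         12,            "Gram negative",
  "Salmonella enterica",      12,            "Gram negative",
  "Lactobacillus fermentum",  12,            "Gram positive",
  "Enterococcus faecalis",    12,            "Gram positive",
  "Staphylococcus aureus",    12,            "Gram positive",
  "Saccharomyces cerevisiae", 2,             "yeast",
  "Cryptococcus neoformans",  2,             "yeast"
)

# Sanity check: the listed percentages should account for the whole community
total_percent <- sum(mock_theoretical$percent_gdna)

# Share of gDNA contributed by each organism type
by_type <- mock_theoretical %>%
  group_by(type) %>%
  summarise(percent_gdna = sum(percent_gdna))

print(total_percent)
print(by_type)
```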



Copyright, Siobhon Egan, 2021.