This document describes how to understand the bioinformatics report generated by the Aladdin shotgun pipeline. Most of the plots are taken from the sample report, which was generated using a small number of samples from this paper. The plots in your report might look a little different.
The bioinformatics report is generated using MultiQC. There are general instructions on how to use a MultiQC report on the MultiQC website. The report itself also includes a link to an instructional video at the top of the report. In general, the report has a navigation bar on the left, which allows you to quickly navigate to one of many sections in the report. On the right side, there is a toolbox that allows you to customize the appearance of your report and export figures and/or data. Most sections of the report are interactive. The plots will show you the sample name and values when you mouse over them.
The general statistics table gives an overview of some important stats of your samples. They include some statistics of the input files, such as how many reads are in each sample and the lengths, duplication rates, and GC contents of the reads. They also tell you how many of your reads passed read quality QC (% Pass QC Reads and No. Processed Reads), the percentage of reads that are of low complexity and therefore filtered (% Low-complexity Reads), and most importantly, the percentage of reads/kmers that have assigned taxonomy (% Kmers w/ Taxonomy). The default sourmash method assigns taxonomy to kmers instead of reads. But since kmers are generated from reads, the percentage of kmers with taxonomy can be used as an approximation of the percentage of reads with assigned taxonomy.
A good shotgun sequencing library should have most reads pass QC, with few low-complexity reads. Most well-studied bacterial communities, such as human gut microbiome samples, should have most reads with taxonomy assigned. A low taxonomy assignment rate could suggest possible host DNA contamination, which you will be able to investigate in a later chart. Some poorly understood microbial communities could contain unknown microbes without representation in the database, which could also lead to a low taxonomy assignment rate.
There are two FastQC/Falco sections in the report, a pre-trimming one and a post-trimming one. They contain charts and stats about the quality of the reads before and after the read trimming step. If the library and sequencing qualities are good, these two sections are usually highly similar. In short, they contain the following subsections:
You can find more detailed explanations of FastQC reports here.
The fastp section of the report contains stats of the read trimming step and some plots of read quality. Since the read quality portion is already covered by the FastQC/Falco sections, you could simply skip it (we may elect to not include it in a future version). The most important plot here is the “Filtered Reads” barplot. You should expect most reads to be “Passed Filter”.
The BBDuk section summarizes how many reads are filtered out because of low complexity or “low entropy”. These reads are highly likely to be derived from technical artifacts. You should expect the percentages of such reads to be very low.
This section summarizes the percentages of kmers, or by approximation, the percentages of reads, with assigned taxonomy. Those without any taxonomy assignment are labeled as “Unidentified”. Those coming from host sources or common eukaryotic pathogens are labeled by their species name, for example, Homo sapiens. All reads assigned to microbes are combined and labeled “Microbes”. This is meant to give you quick answers to a few questions:
sourmash has a reporting threshold (default 50kb). Only species with enough reads to reach this threshold will have a reported kmer percentage. Low-abundance species have fewer reads, which may result in their reads not reaching the threshold.
In the sample report, there are no host or pathogen sequences. If they are present in your samples, they will also show up in this plot. The samples here are all gut microbiome samples, which contain mostly well-known organisms, therefore most of the kmers have assigned taxonomy.
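To make the reporting threshold described above more concrete, here is a minimal sketch of how such a cutoff behaves. It assumes the overlap with a reference genome is estimated as the number of shared kmer hashes multiplied by the sketch's scaled value; the function names and the scaled value of 1000 are illustrative assumptions, not settings taken from this pipeline.

```python
def estimated_overlap_bp(shared_hashes: int, scaled: int) -> int:
    """Roughly estimate the overlap in base pairs from shared kmer hashes.

    Assumption: each retained hash stands for about `scaled` base pairs.
    """
    return shared_hashes * scaled


def is_reported(shared_hashes: int, scaled: int = 1000,
                threshold_bp: int = 50_000) -> bool:
    """Return True if the estimated overlap reaches the reporting threshold."""
    return estimated_overlap_bp(shared_hashes, scaled) >= threshold_bp


# A well-covered species versus a low-abundance one (made-up numbers).
print(is_reported(120))  # estimated overlap of 120,000 bp
print(is_reported(30))   # estimated overlap of 30,000 bp
```

Under these assumptions, the second species would be missing from the plot even though some of its reads are present in the sample.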
This section allows you to investigate the composition of every sample at different taxonomy levels. Please note that this composition barplot only accounts for reads/kmers assigned to microbes; those from hosts or without assigned taxonomy have already been excluded. Therefore, the percentages presented here are “percentage among known microbial reads”, not “percentage of all reads”. The plot starts with a species-level presentation that includes all species. It can be very busy and not very informative. That is why we’ve made the plot interactive, so you can switch to different taxonomy levels and group less abundant taxa into an “Others” category. Please click the buttons at the top left corner of the plot to customize your view. For example, below is a presentation at Genus level with genera below 5% grouped into “Others”.
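If you export the composition data and want to reproduce the “Others” grouping yourself, the idea can be sketched as follows. This is only an illustration; the function name, the genus names, and the abundance values are made up and are not taken from the sample report.

```python
def group_minor_taxa(abundances, min_percent=5.0):
    """Convert abundances to percentages and merge taxa below `min_percent`."""
    total = sum(abundances.values())
    grouped = {}
    others = 0.0
    for taxon, value in abundances.items():
        percent = 100.0 * value / total
        if percent < min_percent:
            others += percent          # too rare to show on its own
        else:
            grouped[taxon] = percent
    if others > 0:
        grouped["Others"] = others
    return grouped


# Hypothetical genus-level abundances for one sample.
sample = {"Bacteroides": 60, "Faecalibacterium": 25, "Blautia": 11,
          "Roseburia": 3, "Dorea": 1}
print(group_minor_taxa(sample))
```

Because host and unidentified reads are excluded before this step, the percentages always sum to 100 among known microbial reads.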
This section presents the taxonomy abundance data in a heatmap format. This allows you to see the differences between the samples better. Please note that the heatmap only includes up to the top 20 most abundant taxa at a given taxonomy level. There are two different views of this heatmap. One is of relative abundance, in other words, the same data as presented in the “Qiime composition barplot” section. This is useful when you want to know the actual percentages of a given taxon. The other is of log-transformed and centered data, which contrasts the changes between samples and taxa better. You can also customize the view at different taxonomy levels.
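The exact transformation behind the log-transformed and centered view is not spelled out here, so the following is only one plausible sketch: add a small pseudocount, take the base-10 logarithm, and subtract each taxon's mean across samples. The pseudocount, the choice of base 10, and centering per taxon are all assumptions.

```python
import math


def log_center(rows, pseudocount=1e-6):
    """Log-transform abundances and center each taxon around zero.

    `rows` maps a taxon name to its relative abundances across samples.
    """
    centered = {}
    for taxon, values in rows.items():
        logs = [math.log10(v + pseudocount) for v in values]
        mean = sum(logs) / len(logs)
        centered[taxon] = [x - mean for x in logs]
    return centered


# Hypothetical relative abundances for two taxa in three samples.
table = {"Taxon A": [0.50, 0.05, 0.005], "Taxon B": [0.10, 0.10, 0.10]}
for taxon, values in log_center(table).items():
    print(taxon, [round(v, 2) for v in values])
```

After centering, a taxon with the same abundance in every sample sits at zero, while positive and negative values show in which samples a taxon is above or below its own average.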
By default, these analyses are carried out at species level, unless specified otherwise when the pipeline was run.
This section shows a box plot of the alpha diversities of samples in each group, if group labels are provided. There are three different metrics for alpha diversity, which can be chosen with the dropdown box in the top left corner. This plot is intended to show whether one group of samples is more diverse than others.
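As a rough illustration of what an alpha diversity metric measures, the sketch below computes three commonly used ones from a list of per-species counts. These are examples only; they may not be the same three metrics offered in the report's dropdown box.

```python
import math


def observed_richness(counts):
    """Number of species with at least one count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index (natural log); higher means more diverse."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson index as 1 - sum(p^2); higher means more diverse."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


even = [25, 25, 25, 25]      # four equally abundant species
skewed = [97, 1, 1, 1]       # one dominant species
for name, counts in (("even", even), ("skewed", skewed)):
    print(name, observed_richness(counts),
          round(shannon(counts), 3), round(simpson(counts), 3))
```

Both hypothetical samples contain four species, but the evenly distributed one scores higher on the Shannon and Simpson indices, which is why several metrics are usually considered together.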
This section shows how the diversity in each group would change if there were fewer reads. This is only about species diversity, but shotgun sequencing is more than that; it often tries to detect the genomic contents of the species as well. This section may be removed in a future version.
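The idea of asking how diversity would look with fewer reads can be sketched as random subsampling followed by counting the species that remain. The function name, counts, and depths below are made up for illustration, and the pipeline's own procedure may differ.

```python
import random


def rarefied_richness(counts, depth, seed=0):
    """Subsample `depth` reads without replacement and count species seen."""
    rng = random.Random(seed)
    # Expand the counts into one entry per read.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of available reads")
    return len(set(rng.sample(pool, depth)))


# Hypothetical per-species read counts for one sample.
sample = {"sp1": 500, "sp2": 300, "sp3": 150, "sp4": 45, "sp5": 5}
for depth in (10, 100, 1000):
    print(depth, rarefied_richness(sample, depth))
```

If the number of observed species stops increasing as the depth grows, sequencing deeper is unlikely to reveal many more species.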
This 3-D plot displays principal coordinate analysis results calculated using the beta diversity distances between samples. There are two different beta diversity metrics. You can spin the plot to view it from different angles. This plot is intended to show you how similar or dissimilar samples of different groups are. You may expect samples of the same group to be more similar to each other, but that may not always be true.
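For intuition about the distances that feed into the principal coordinate analysis, here is a small sketch of two widely used beta diversity measures, Bray-Curtis and Jaccard. They are shown only as examples; the two metrics used in the report may be different ones.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles (0 to 1)."""
    taxa = set(a) | set(b)
    numerator = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    denominator = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return numerator / denominator if denominator else 0.0


def jaccard_distance(a, b):
    """Jaccard distance based only on presence or absence of taxa."""
    present_a = {t for t, v in a.items() if v > 0}
    present_b = {t for t, v in b.items() if v > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


# Hypothetical abundance profiles of two samples.
s1 = {"A": 6, "B": 4}
s2 = {"A": 2, "B": 4, "C": 4}
print(bray_curtis(s1, s2), jaccard_distance(s1, s2))
```

A value of 0 means two samples are identical under that measure and a value of 1 means they share nothing, so samples that sit close together in the 3-D plot have small pairwise distances.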
This section is experimental and under development!
This section is very similar to the “Diversity analysis” section, except that in addition to user-provided samples, a large cohort of reference samples is added to the diversity analysis. In this sample report, the reference dataset is a collection of healthy human gut microbiome samples from both public data and private data from Zymo Research. The hope is that it can provide a larger and therefore more complete picture of the healthy human gut microbiome than a typical “control” group in a study. This function can be useful when users have a limited number of samples but want something to compare them to. The comparison here is crude, considering there may be batch effects, protocol biases, and other factors in the reference dataset. We welcome suggestions to improve this part of the pipeline/report.
This is the same plot as the alpha diversity plot above, but with the reference samples added. As you can see, the reference dataset has a wide range of alpha diversity, but is consistent in showing that most healthy samples have higher alpha diversity than the samples with Crohn's disease.
This is the same plot as the beta diversity plot above, but with the reference samples added. The diseased samples are not placed outside of all healthy samples, but you can see that they are closer to the edge, while the control samples are closer to the center and to the majority of the healthy reference samples. It may be that the subtle differences between the disease and healthy samples are not obvious in this specific method of plotting.
This pipeline uses AMR++ v3.0 to produce antimicrobial resistance statistics from read alignments to the resistome database MEGARes. MEGARes contains entries for SNP, insertion, and deletion mutations documented to confer antimicrobial resistance, as well as gene sequences that confer antimicrobial resistance by presence. AMR++ compiles counts for MEGARes entries into four hierarchical levels: first, the gene sequences and mutations that confer antimicrobial resistance; second, the gene groups under which MEGARes entries are classified; third, the resistance mechanisms under which gene groups are classified; and fourth, the classes under which resistance mechanisms fall. As an example, accession MEG_3779 documents a specific mutation within the MECA gene group, which, along with other groups, belongs to the mechanism Penicillin-binding protein, which in turn falls under the class betalactams. This pipeline currently produces output for class, mechanism, and gene level data.
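The way entry-level counts roll up into higher levels of this hierarchy can be sketched as below. Only the MEG_3779 annotation comes from the example above; the other accessions, their labels, and all read counts are hypothetical placeholders, and the lookup table is a stand-in for the real MEGARes annotations rather than AMR++ code.

```python
from collections import Counter

# accession -> (class, mechanism, group)
# Only MEG_3779 reflects the example in the text; the rest are hypothetical.
ANNOTATIONS = {
    "MEG_3779": ("betalactams", "Penicillin-binding protein", "MECA"),
    "EXAMPLE_1": ("betalactams", "Penicillin-binding protein", "GROUP_X"),
    "EXAMPLE_2": ("betalactams", "MECHANISM_Y", "GROUP_Y"),
    "EXAMPLE_3": ("CLASS_Z", "MECHANISM_Z", "GROUP_Z"),
}
LEVELS = {"class": 0, "mechanism": 1, "group": 2}


def aggregate(entry_counts, level):
    """Sum read counts of individual entries at the requested level."""
    index = LEVELS[level]
    totals = Counter()
    for accession, count in entry_counts.items():
        totals[ANNOTATIONS[accession][index]] += count
    return dict(totals)


# Hypothetical read counts per entry for one sample.
counts = {"MEG_3779": 40, "EXAMPLE_1": 10, "EXAMPLE_2": 25, "EXAMPLE_3": 5}
print(aggregate(counts, "mechanism"))
print(aggregate(counts, "class"))
```

Each step up the hierarchy merges more entries, so class-level plots are the easiest to read, while entry-level and gene-level data retain the most detail.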
This plot depicts the read composition of antimicrobial resistance classes and contains two tabs. The ‘Read counts’ tab gives you sample-level read counts for all classes. To facilitate comparison between samples that may have different numbers of total reads, read counts have been normalized to counts per one million reads that passed trimming filters for each sample. The ‘Relative percentage’ tab contains the read counts of individual classes divided by the total number of antimicrobial gene reads within each sample.
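The two normalizations described above amount to simple arithmetic, sketched here with made-up numbers. The function names and the class labels other than betalactams are illustrative and do not come from the pipeline code.

```python
def counts_per_million(class_counts, reads_passing_trimming):
    """Normalize raw counts to counts per one million trimmed reads."""
    return {name: count * 1_000_000 / reads_passing_trimming
            for name, count in class_counts.items()}


def relative_percentage(class_counts):
    """Express each class as a percentage of all antimicrobial gene reads."""
    total = sum(class_counts.values())
    return {name: 100.0 * count / total
            for name, count in class_counts.items()}


# Hypothetical sample: 2,000,000 reads passed trimming, 200 hit the database.
raw = {"betalactams": 150, "CLASS_Z": 50}
print(counts_per_million(raw, 2_000_000))  # {'betalactams': 75.0, 'CLASS_Z': 25.0}
print(relative_percentage(raw))            # {'betalactams': 75.0, 'CLASS_Z': 25.0}
```

The first view depends on sequencing depth and tells you how common resistance reads are overall, while the second only describes how the resistance reads are split among classes.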
This bar chart displays the top 20 MEGARes entries ordered by normalized read count for each sample. All other matches to the MEGARes database outside of the top 20 are grouped into the category “Other”. MEGARes entries include both SNP mutations documented to confer antimicrobial resistance and gene sequences that confer antimicrobial resistance by presence. The ‘Read counts’ tab contains absolute counts that have been normalized to counts per one million reads that passed trimming. The ‘Relative percentage’ tab contains the number of reads per MEGARes entry divided by the total number of reads aligned successfully to the MEGARes database.
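Keeping the most abundant entries and merging the rest can be sketched for a single sample as follows. How the report selects the top 20 when samples differ is not described here, so this is only an approximation, and the entry names and counts are hypothetical.

```python
def top_n_with_other(counts, n=20):
    """Keep the `n` largest entries and sum the remainder into "Other"."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    kept = dict(ranked[:n])
    remainder = sum(value for _, value in ranked[n:])
    if remainder > 0:
        kept["Other"] = remainder
    return kept


# Hypothetical normalized counts; n=2 keeps the example short.
entries = {"ENTRY_A": 12.0, "ENTRY_B": 30.0, "ENTRY_C": 3.0, "ENTRY_D": 5.0}
print(top_n_with_other(entries, n=2))
```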
This section lists the versions of software used in this bioinformatics pipeline. This should help you write the methods section of your publication or carry out some of the analysis on your own.
This section lists any parameters that were different from the default values. For default values, please refer to the pipeline code.