This document describes how to understand the bioinformatics report generated by Aladdin ampliseq pipeline. Most of the plots are taken from the sample report. These plots are just an example and may look different in your report.
The bioinformatics report is generated using MultiQC
. There are general instructions on how to use a MultiQC report on MultiQC website. The report itself also includes a link to an instructional video at the top of the report. In general, the report has a navigation bar to the left, which allows you to quickly navigate to one of many sections in the report. On the right side, there is a toolbox that enables customization of the report appearance and the export of figures and/or data. Most sections of the report are interactive. The plots will show you the sample name and values when you mouse over them.
The general statistics table gives an overview of some important sample information. From left to right, this table contains:
Seqs
).% GC
).% bp Trimmed
).DADA2 Inputs
).Chimera Error Rate
). Chimeric reads are sequencing artifacts resulting from the spurious joining of two or more independent biological sequences, which can be misinterpreted as novel organisms.Reads Passed DADA2
).Retained Taxa Filtered
and % Retained Taxa Filtered
, respectively).These statistics are collected from different parts of the pipeline workflow to give you a snapshot of the results and are a quick way to evaluate how your ampliseq experiment went. Here are a few important considerations when reading this table:
DADA2 inputs
) may result from selecting the incorrect sequencing primers at the pipeline run set-up stage. A dropout of reads during DADA2 processing (column Reads Passed DADA2
) can result from a high chimera rate (see below) and failure to merge paired-end reads due to insufficient overlapChimera Error Rate
). A high percentage (>5%) of chimeric reads can result from suboptimal PCR reactions when performing mixed template PCR.Retained Taxa Filtered
). Removal of a high percentage of read counts at this stage can indicate a high portion of reads amplified plastid sequences (chloroplast and mitochondria) and that your experimental design may need to be revised.Cutadapt
is a bioinformatic tool which is used to trim adapter and primer sequences from your raw sequence reads. The bar plot in the Filtered Reads
section of the report shows the numbers/percentages of reads that were retained following adapter and primer removal (buttons in the top left will toggle between these two metrics). A good sample should retain the majority of reads (“Pairs passing filters”) following Cutadapt
filtering (<5% reads removed). A high percentage of reads dropped may indicate errors in upstream library preparation. The Trimmed Sequence Lengths
section shows numbers of reads with certain lengths of adapters trimmed.
Additional FastQC
summary statistics are shown at the end of the report.
The ` Composition Bar Plot gives a summary of the taxonomic profile of the samples. The Y-axis represents each sample, and the X-axis shows the relative abundance of taxa within each sample. Hovering the mouse cursor over each bar will create a pop out window listing the taxa ID and relative abundance within each sample.
This bar plot contains several features to enhance visualization. The taxonomic rank in taxa are grouped can be selected by clicking the buttons above the bar plot (from Kingdom to Species). To view all taxa present at the taxonomic rank Genus, click on the
Genus button and select
No grouping. In the chart below (displayed at the Species level), there are many low abundant taxa, which make the plot difficult to interpret. To simplify this plot, you can click on the
Species button (or any other taxonomic ranking) and select
Grouping taxa < N% into ‘Others’` to group taxa <0.1%, <5% or <10% abundance into a group called ‘Others’. All taxa below the threshold relative abundance will now appear as one bar, titled ‘Others’. A list of all taxa present can be found under the bar plot, with additional information accessed by clicking the blue arrow at the bottom of the plot.
The Taxonomy Abundance Heatmap
demonstrates the magnitude of taxa abundance in each sample as a color scale. Each column represents a sample, and each row represents each taxon. The color of each cell (tile) in the heat map represents the relative abundance of each taxa, with the key scale shown on the far right of the plot. The number of rows displayed can be expanded by dragging the bar at the bottom of the plot.
The plot can be toggled between different taxonomic ranks, and between the relative abundance and the Z-score by clicking on the buttons above the plot. The Z-score is a data transformation method often used to visualize gene expression data. Instead of reporting the relative abundance of each taxa, the deviation from the mean relative abundance for all samples for that specific taxa is reported. This can enhance visualization of the heatmap and the ability to spot interesting trends of taxa abundance in the data. The ordering of samples and taxa is determined by hierarchical clustering.
Note that complete taxa name may be obscured from view to the right of the plot, however the complete name can be observed interactively by hovering over each tile in the plot.
The first plot in the diversity analysis section shows the rarefaction curves, plotted by metadata grouping (if this information is provided). Rarefaction is a statistical method used to adjust for differences in sequencing library sizes (i.e. uneven numbers of reads) across samples, aiding diversity comparisons. Samples with greater library sizes could demonstrate higher species diversity, simply due to there being higher numbers of microbial reads and not a truly higher diversity, leading us to misinterpret the data. Rarefaction involves randomly subsampling each sample to a specified number of reads, determined by the minimum read count of the sample set, in order to standardize library sizes prior to diversity analyses. The rarefaction curve is generated by repeatedly, randomly re-sampling the number of reads for each sample and calculating three alpha diversity metrics at each library size:
Observed features
, counting the number of different taxa presentShannon diversity
index, which considers both the number and proportion of taxaFaith pd
(Faith’s phylogenetic diversity index), which considers the phylogenetic distance amongst taxa
Rarefaction curves plot the accumulation of species (Y-axis) with increasing sequencing depth (X-axis). Curves should demonstrate a steep initial gradient, which flattens out with increasing sequencing depth, indicating that most of the diversity in the population has been sampled. If the curve fails to plateau, this indicates that a sample has not been sequenced to a sufficient depth to discover all taxa present in the community, as increasing the sequencing depth is still uncovering new taxa. In this case, you may decide to omit these samples from the analyses.Alpha diversity describes the species richness within a community. Alpha diversity (following rarefaction) is plotted as a boxplot, in which samples are grouped according to metadata (if this information is provided). Boxplots for four different measures can be visualized by clicking the button in the top left of the plot:
Observed features
, counting the number of different taxa presentShannon diversity
index, which considers both the number and proportion of taxaFaith PD
(Faith’s phylogenetic diversity index), which considers the phylogenetic distance amongst taxaEvenness
, which considers the distribution of abundance of taxa in a community
Here, you can determine if the species richness differs between your samples, or sample groupings. A greater number of observed features, for example, indicates a greater species richness. You can also hover the mouse cursor over each box to list a range of metrics, including the minimum diversity of samples in a given group (min), max diversity (max), median, 25th percentile (q1) and 75th percentile (q3).Whilst alpha diversity measures the species richness in a single sample (or community), beta diversity provides a measure of similarity (or dissimilarity) of microbial compositions between different samples. Four different beta diversity statistics are calculated in the report:
Bray-Curtis
. Considers the abundance of taxa shared between two samples and the total number of taxa in each sample when calculating dissimilarity between samples.Jaccard
. Considers taxa presence/absence, but not abundance.Weighted Unifrac
. Considers the phylogenetic relationships between taxa found in samples and their relative abundance.Unweighted Unifrac
. Considers the phylogenetic relationships between taxa found in samples but does not consider their relative abundance.
Following the calculation of beta diversity, a dissimilarity matrix is generated which is visualized as an interactive, 3D principal coordinate plot (PCoA), displaying the first three principal coordinates (PC1-3). Each point in the plot represents a sample. Samples clustered closely together indicate more similar microbial compositions than points which are further apart.With this interactive plot, you can toggle between the different beta diversity measures by clicking the button in the top left corner of the plot. The 3D plot can be rotated using the mouse cursor. Viewing the plot from different perspectives can aid in data interpretation.
The objective of differential abundance analysis is to identify taxa which differ statistically between samples and sample groups (if this information is provided). The method employed is ANCOM-BC (Analysis of Compositions of Microbiomes with Bias Correction). ANCOM-BC estimates unknown sampling fractions (ratio of the observed taxa abundance in a sample to its actual abundance in the ecosystem) and uses this estimate to correct for sampling bias through a log-linear regression model. Results can be visualized in the ANCOM-BC bar plot
.
In the ANCOM-BC bar plot
, the Y-axis represents each taxa and the X-axis represents the log fold change (LFC). Each bar represents the LFC of the absolute abundance of a taxon between treatment groups. A plot is produced for each treatment group pairwise comparison, which can be viewed by clicking the buttons above the plot. Only the top 20 significantly differentially abundant taxa are shown (multiple testing adjusted P value (q value) < 0.05).
A positive LFC (LFC > 0) between Group A vs Group B indicates that relative abundance is significantly higher in Group A relative to Group B, whilst a negative LFC (LFC < 0) between Group A vs Group B indicates that relative abundance is significantly lower in group A. For example, in a ‘Control_vs_Treatment-A’ comparison, a negative LFC for taxa Escherichia coli indicates that the abundance of this taxa is significantly lower in the control group relative to the Treatment-A group (in other words, the abundance of this taxa is higher in the Treatment-A group).
In addition to performing analyses on your own samples, users have the opportunity to perform a comparative analysis with selected publicly available datasets. When this option is selected, an additional section will appear in the report: Comparative diversity analysis
. In this section, diversity analyses have been recalculated, incorporating the reference dataset. Alpha Rarefaction
, Alpha Diversity
and Beta Diversity
plots can be viewed here. Note: sample groupings can be hidden from a plot to simplify visualization by clicking on the group in the figure legend key, to the right of the plot.
Additional sequencing read quality metrics are provided at the end of the report, performed by the bioinformatic tool FastQC
. Here, information on the quality score distribution across reads, per base sequences content (% of bases A, T, C and G) and much more is provided. For further information on these metrics, see the FastQC help pages.
This section lists the versions of software used in this bioinformatic pipeline. This should help you in writing the methods section of your publication or if you wish to carry out some of the analysis on your own.