This document describes how to understand the bioinformatics report generated by Aladdin ampliseq pipeline. Most of the plots are taken from the sample report. These plots are just an example and may look different in your report.
The bioinformatics report is generated using MultiQC
. There are general instructions on how to use a MultiQC report on MultiQC website. The report itself also includes a link to an instructional video at the top of the report. In general, the report has a navigation bar to the left, which allows you to quickly navigate to one of many sections in the report. On the right side, there is a toolbox that enables customization of the report appearance and the export of figures and/or data. Most sections of the report are interactive. The plots will show you the sample name and values when you mouse over them.
The general statistics table gives an overview of some important sample information. From left to right, this table contains:
No. Seqs
).% GC
).% Passing trimming filter
).DADA2 Input
).(DADA2 qual filter
)DADA2 merge pairs
)DADA2 non-chimera
). Chimeric reads are sequencing artifacts resulting from the spurious joining of two or more independent biological sequences, which can be misinterpreted as novel organisms.DADA2 filter tax
).These statistics are collected from different parts of the pipeline workflow to give you a snapshot of the results and are a quick way to evaluate how your ampliseq experiment went. Here are a few important considerations when reading this table:
The Composition Barplot
gives a summary of the taxonomic profile of the samples. The Y-axis represents each sample, and the X-axis shows the relative abundance of taxa within each sample. Hovering the mouse cursor over each bar will create a pop out window listing the taxa ID and relative abundance within each sample.
This barplot contains several features to enhance visualization. The taxonomic rank in taxa are grouped can be selected by clicking the buttons above the barplot (from Kingdom to Species). To view all taxa present at the taxonomic rank Genus, click on the Genus
button and select No grouping
. In the chart below (displayed at the Species level), there are many low abundant taxa, which make the plot difficult to interpret. To simplify this plot, you can click on the Species
button (or any other taxonomic ranking) and select Grouping taxa < N% into ‘Others’
to group taxa <0.1%, <5% or <10% abundance into a group called ‘Others’. All taxa below the threshold relative abundance will now appear as one bar, titled ‘Others’. A list of all taxa present can be found under the barplot, with additional information accessed by clicking the blue arrow at the bottom of the plot.
In some cases, numerous taxa identified in one sample can contribute to overcrowding in the composition barplots. If a more detailed format is desired, the pipeline also produces downloadable krona reports for each sample. Krona reports display the taxa composition of each sample as a pie chart. Use the table on the left hand side of the screen to select a sample. Mouse over a desired taxa and click on the highlighted section to expand it.
The Taxonomy Abundance Heatmap
demonstrates the magnitude of taxa abundance in each sample as a color scale. Each column represents a sample, and each row represents each taxon. The color of each cell (tile) in the heat map represents the relative abundance of each taxa, with the key scale shown on the far right of the plot. The number of rows displayed can be expanded by dragging the bar at the bottom of the plot.
The plot can be toggled between different taxonomic ranks, and between the relative abundance and the Z-score by clicking on the buttons above the plot. The Z-score is a data transformation method often used to visualize gene expression data. Instead of reporting the relative abundance of each taxa, the deviation from the mean relative abundance for all samples for that specific taxa is reported. This can enhance visualization of the heatmap and the ability to spot interesting trends of taxa abundance in the data. The ordering of samples and taxa is determined by hierarchical clustering.
Note that complete taxa name may be obscured from view to the right of the plot, however the complete name can be observed interactively by hovering over each tile in the plot.
The first plot in the diversity analysis section shows the rarefaction curves, plotted by metadata grouping (if this information is provided). Rarefaction is a statistical method used to adjust for differences in sequencing library sizes (i.e. uneven numbers of reads) across samples, aiding diversity comparisons. Samples with greater library sizes could demonstrate higher species diversity, simply due to there being higher numbers of microbial reads and not a truly higher diversity, leading us to misinterpret the data. Rarefaction involves randomly subsampling each sample to a specified number of reads, determined by the minimum read count of the sample set, in order to standardize library sizes prior to diversity analyses. The rarefaction curve is generated by repeatedly, randomly re-sampling the number of reads for each sample and calculating three alpha diversity metrics at each library size:
Observed features
, counting the number of different taxa presentShannon diversity
index, which considers both the number and proportion of taxaFaith pd
(Faith’s phylogenetic diversity index), which considers the phylogenetic distance amongst taxa
Rarefaction curves plot the accumulation of species (Y-axis) with increasing sequencing depth (X-axis). Curves should demonstrate a steep initial gradient, which flattens out with increasing sequencing depth, indicating that most of the diversity in the population has been sampled. If the curve fails to plateau, this indicates that a sample has not been sequenced to a sufficient depth to discover all taxa present in the community, as increasing the sequencing depth is still uncovering new taxa. In this case, you may decide to omit these samples from the analyses.Alpha diversity describes the species richness within a community. Alpha diversity (following rarefaction) is plotted as a boxplot, in which samples are grouped according to metadata (if this information is provided). Boxplots for four different measures can be visualized by clicking the button in the top left of the plot:
Observed features
, counting the number of different taxa presentShannon diversity
index, which considers both the number and proportion of taxaFaith PD
(Faith’s phylogenetic diversity index), which considers the phylogenetic distance amongst taxaEvenness
, which considers the distribution of abundance of taxa in a community
Here, you can determine if the species richness differs between your samples, or sample groupings. A greater number of observed features, for example, indicates a greater species richness. You can also hover the mouse cursor over each box to list a range of metrics, including the minimum diversity of samples in a given group (min), max diversity (max), median, 25th percentile (q1) and 75th percentile (q3).Whilst alpha diversity measures the species richness in a single sample (or community), beta diversity provides a measure of similarity (or dissimilarity) of microbial compositions between different samples. Four different beta diversity statistics are calculated in the report:
Bray-Curtis
. Considers the abundance of taxa shared between two samples and the total number of taxa in each sample when calculating dissimilarity between samples.Jaccard
. Considers taxa presence/absence, but not abundance.Weighted Unifrac
. Considers the phylogenetic relationships between taxa found in samples and their relative abundance.Unweighted Unifrac
. Considers the phylogenetic relationships between taxa found in samples but does not consider their relative abundance.
Following the calculation of beta diversity, a dissimilarity matrix is generated which is visualized as an interactive, 3D principal coordinate plot (PCoA), displaying the first three principal coordinates (PC1-3). Each point in the plot represents a sample. Samples clustered closely together indicate more similar microbial compositions than points which are further apart.With this interactive plot, you can toggle between the different beta diversity measures by clicking the button in the top left corner of the plot. The 3D plot can be rotated using the mouse cursor. Viewing the plot from different perspectives can aid in data interpretation.
The objective of differential abundance analysis is to identify taxa which differ statistically between samples and sample groups (if this information is provided). The method employed is ANCOM-BC (Analysis of Compositions of Microbiomes with Bias Correction). ANCOM-BC estimates unknown sampling fractions (ratio of the observed taxa abundance in a sample to its actual abundance in the ecosystem) and uses this estimate to correct for sampling bias through a log-linear regression model. Results can be visualized in the ANCOM-BC barplot
.
In the ANCOM-BC barplot, the Y-axis represents each taxa and the X-axis represents the log fold change (LFC). Each bar represents the LFC of the absolute abundance of a taxon between treatment groups. Only the top 50 significantly differentially abundant taxa are shown (multiple testing adjusted P value (q value) < 0.05). In the subheading above each plot in the format of “Group A vs Group B”, “Group B” is the reference group. A positive LFC (LFC > 0) between Group A vs Group B indicates that relative abundance is significantly higher in Group A relative to Group B, whilst a negative LFC (LFC < 0) between Group A vs Group B indicates that relative abundance is significantly lower in group A. For example, in a ‘Treatment-A vs Control’ comparison, a negative LFC for taxa Escherichia coli indicates that the abundance of this taxa is significantly lower in the Treatment-A group relative to the control group (in other words, the abundance of this taxa is higher in the Control group).
Additional sequencing read quality metrics are provided at the end of the report, performed by the bioinformatic tool FastQC
. Here, information on the quality score distribution across reads, per base sequences content (% of bases A, T, C and G) and much more is provided. For further information on these metrics, see the FastQC help pages.
Cutadapt
is a bioinformatic tool which is used to trim adapter and primer sequences from your raw sequence reads. The barplot in the Filtered Reads
section of the report shows the numbers/percentages of reads that were retained following adapter and primer removal (buttons in the top left will toggle between these two metrics). A good sample should retain the majority of reads (“Pairs passing filters”) following Cutadapt
filtering (<5% reads removed). A high percentage of reads dropped may indicate errors in upstream library preparation. The Trimmed Sequence Lengths
section shows numbers of reads with certain lengths of adapters trimmed.
Additional FastQC
summary statistics are shown at the end of the report.
This section lists the versions of software used in this bioinformatic pipeline. This should help you in writing the methods section of your publication or if you wish to carry out some of the analysis on your own.