This document describes how to understand the bioinformatics report generated by MGscan Metatranscriptome pipeline. Most of the plots are taken from the sample report. The sample report was generated using a small number of samples downsampled to 100 million reads per sample from this paper. The plots in your report might look a little different.
The bioinformatics report is generated using MultiQC
. There are general instructions on how to use a MultiQC report on MultiQC website. The report itself also includes a link to a instructional video at the top of the report. In general, the report has a navigation bar to the left, which allows you to quickly navigate to one of many sections in the report. On the right side, there is a toolbox that allows to customize the appearance of your report and export figures and/or data. Most sections of the report are interactive. The plots will show you the sample name and values when you mouse over them.
The general statistics table gives an overview of some important stats of your samples. They include some statistics of the input files, such as how many reads are in each sample, duplication rate, and GC contents of the reads. They also tell you how much of the base-pairs were removed during the trimming step (% BP Trimmed
) and percentages of reads removed during low-complexity filtering step (Reads Removed(%)
). One important stat is how much of your reads were of rRNA origin (% rRNA reads
). These reads are not used in the assembly step or any analysis that follows.
A good metatranscriptome sequencing library should have most reads pass QC, with few low-complexity reads. The percentages of rRNA should be as low as possible, but given the complexity of the sample, sometimes a small amount of rRNA reads cannot be avoided.
FastQC is a read quality control tool that gives you various plots and stats about your sequencing quality. In this report, FastQC results on your raw reads are displayed. In short, they contain the following subsections:
You can find more detailed explanations of FastQC reports here.
This pipeline uses TrimGalore
, which is a wrapper of Cutadapt
, to trim sequencing adapters and low quality bases. You should expect to see a plot that starts high on low number of bp trimmed (left side) and quickly flattens. This means most of your reads have very few base pairs being removed. If you see a plot with high number of reads being trimmed at high bp, it could mean your library insert is really short or your sequencing quality is really poor.
The BBDuk section summarizes how many reads are filtered out because of low complexity or “low entropy”. These reads are highly likely to be derived from technical artifacts. You should expect the percentages of such reads to be very low.
By default, the pipeline uses Megahit
to conduct de novo assembly of metatranscriptome. This section gives you some stats of each assembly. They include, from left to right:
This stats help you quickly evaluate and compare assemblies. The total number and length of the contigs tell you the biological complexity of your samples. If two samples have similar sequencing depth and quality, but one with much larger assembly, that samples is likely more complex and diverse. The N50 number is a good evaluator of how fragmented your data is. A number of factors affects this, such as sequencing depth, sample complexity, assembly methods, etc. A good number for a bacterial metatranscriptome sample is 600-1000bp. If you see N50 significantly below that, it means your assembly contain a large number of very short contigs, which are less informative in taxonomy and function. You might want to consider increase sequencing depth, troubleshoot library prep, or both.
The section summarizes what happens to your reads after they are preprocessed, i.e., after the low complexity read removal step. There are four categories:
RiboDetector
and removed from downstream analysis including assembly.This plot is not commonly included in metatranscriptome pipelines, but we believe it could be very informative. For example, high rRNA reads obviously points to a rRNA depletion problem in library prep. High percentage of unassembled reads indicates the sequencing depth is inadequate to allow assemblies for low abundance organisms. Such bias against low abundance organism should be considered when evaluating a sample’s taxonomic and functional profile. Of course, you want the percentage of annotated reads to be as high as possible. The information in this plot, combined with stats in the “MEGAHIT” section, help you understand your samples/assemblies better. For example, if a sample has low N50 number, high percentage of unassembled reads, increasing sequencing depth is likely advised to achieve better assembly.
This section allows you to investigate the composition of every sample on different taxonomy levels. Please note, this section can be slow to load. Please be patient. The plot starts with a species-level presentation that include all species. It could be very busy and not very informative. That is why we’ve made the plot interactive where you could switch to different taxonomy levels, and group less abundant taxa into an “Others” category. Please click the buttons at the top left corner of the plot to customize your view. For example, below is a presentation at Genus level with genera below 1% grouped into “Others”. Taxa labelled with “g__”, “f__” means we couldn’t confidently assign a genus, a family, and so on. Taxa “d__;p__;c__;o__;f__;g__” means the pipeline couldn’t assign any taxonomy to those sequences. Often they don’t have any hits in the entire NCBI NR database.
The pipeline also prosent the same information in a different format using Krona, which is easier to browse and navigate. You can download the Krona report on Aladdin platform. A screenshot of that report is shown here.
This section lists the versions of software used in this bioinformatic pipeline. This should help you in writing the methods section of your publication or if you wish to carry out some of the analysis on your own.
This section lists any parameter that were different from the default values. For default values, please refer to the pipeline code.