This document describes how to understand the small RNAseq bioinformatics report generated by Aladdin small RNAseq pipeline. Most of the plots are taken from the sample report. The sample report was generated using public data from this paper. The plots in your report might look a little different.
The small RNAseq bioinformatics pipeline is built using Nextflow and originally adapted from nf-core/smrnaseq pipeline version 1.0.0. A lot of changes have been made. Please refer to the pipeline repo for details. A brief summary of the pipeline:
FastQC
)Trim Galore!
)mirtrace
)seqcsluter
)Bowtie1
)mirtop
)isomiRs
)Bowtie1
)RSEM
)tximport
)DESeq2
)MultiQC
)The bioinformatics report is generated using MultiQC
. There are general instructions on how to use a MultiQC report on MultiQC website. The report itself also includes a link to a instructional video at the top of the report. In general, the report has a navigation bar to the left, which allows you to quickly navigate to one of many sections in the report. On the right side, there is a toolbox that allows to customize the appearance of your report and export figures and/or data. Most sections of the report are interactive. The plots will show you the sample name and values when you mouse over them.
The general statistics table gives an overview of some important stats of your samples. For example, how many reads were in each sample, how many reads passed filter, and how many reads were miRNA reads, etc. These stats are collected from different sections of the report to give you a snapshot. This is usually the quickest way for you to evaluate how your small RNAseq experiment went. Here are a few important things you should look for when reading this table:
Workflow Summary
section). While it is rare that the number of reads passing filter approaches 100%, one would hope most samples have more than 50% reads that pass filters.Estimated RNA Type Counts
section.Other information you can get from this tables are (from left to right):
FastQC gives general quality metrics about your reads. It provides information about the quality of your reads (in section Sequnce Quality Histograms
, section Per Sequence Quality Scores
, and section Per Base N Content
). Section Adapter Content
shows you how many reads have adapters in them. As you can see the sample report, in miRNAseq libraries, most if not all reads have adapters at the 3’ end because of the short inserts. This would usually produce warnings in FastQC
, however we have disabled that here because this is perfectly normal in miRNAseq.
Trim Galore!
apply quality and adapter trimming to FASTQ files. There are two sections in the report. The Filtered Reads
section show numbers and percentages of reads that passed or failed filtering criteria for various reasons. Most of your reads should pass filters. If this is not the case, this section can tell you why. You can find the definition of reads being “too short” or “too long” in the Workflow Summary
section of the report.
The Trimmed Sequence Lengths
section show numbers of reads with certain lengths of adapters trimmed. Most of your reads would have bps trimmed off. In the sample report, you can see there is a peak at 29 bp which suggest majority of the small RNA inserts are 51 - 29 = 22
bp long. There are also peaks at 51 bp for some samples. Those are likely reads derived from primer-dimers and other library artifcats.
miRTrace
generates several useful quality control plots for miRNAseq data.
This section shows you the distribution of read lengths post-trimming and filtering. In the sample report, you can see majority of the reads are around 22 bp long, which is consistent with the expected length of miRNAs.
This section shows you whether any of your samples might have contaminations from a foreign organism. Please know that this only works if the contaminating organism is distantly related to the organism of interest, for example, insect RNA in a human sample. It won’t detect contaminations from closely related organisms, for example, non-human primate RNA in a human sample. In the sample report, all samples are clean.
This section plots how many miRNA hairpins are detected at different sequencing depths. This would help you understand whether you have sequenced enough for your samples, and whether/how much more sequencing would help discover more miRNA hairpins. In the sample report, you can see all curves are either starting to or have already flattened. Therefore, there will be diminishing returns for more sequencing in most if not all of them.
RSEM
quantifies the number of reads derived from miRNA, tRNA, rRNA, lncRNA, miscRNA, scaRNA, snoRNA, and snRNA. The “unmapped” category most likely encompasses other small RNAs such as piRNA, siRNA, etc. You can use the button on the top left to toggle between numbers and percentages of reads.
High confidence mature tRNA sequences are downloaded from GtRNAdb. An example GtRNAdb access link for GRCh38 can be viewed here. miRBase hairpin sequences are used as reference for miRNAs. miRNA hairpin sequences for all species in miRBase are available in a single FASTA file. seqkit was used to extract species-specific sequences from the miRBase hairpin file, as well as convert GtRNAdb and miRBase reference sequences from RNA to DNA format. For other RNA types, Aladdin uses an Ensembl noncoding RNA file with mitochondrial tRNA, lncRNA, snoRNA, scaRNA, snRNA, miscellaneous RNA, and rRNA sequences. An example noncoding RNA file for GRCh38 can be viewed here. Sequences with desired gene_biotypes were all extracted with seqkit from the noncoding RNA file. Transcripts that did not belong to the primary assembly GTF were removed. rRNA reference sequences were supplemented with additional Repeatmasker sequences from UCSC’s Table Browser with identifiers LSU-rRNA_Hsa, 5S, and SSU-rRNA_Hsa.
Bowtie
alignments to the above reference sequences are quantified with RSEM
. Resulting counts are used to generate the Estimated RNA Type Counts table.
When there are more than 2 samples in the study, the pipeline will generate two sets of overview plots for all samples. The first set is generated using normalized read counts of all mature miRNAs with isomiRs
, which uses DESeq2
under the hood. The second set is generated with DESeq2
using normalized read counts of several RNA Types (tRNA, lncRNA, miscRNA, scaRNA, snoRNA, snRNA), excluding miRNA and rRNA.
This plot shows the Pearson correlation coefficient between pairs of samples. You can mouse over the colored blocks to see correlation coefficients. This pipeline will generate two sample similarities plots, one of which will be made using all mature miRNAs, and the other using read counts of 6 non-miRNA RNA types (tRNA, lncRNA, miscRNA, scaRNA, snoRNA, snRNA). In the miRNA plot below, you can clearly notice the difference between groups of samples (different cell types).
This plot shows the distances between all samples. Each dot represents a sample, and distances between dots represent differences in RNA gene expression patterns between samples. This pipeline will generate two MDS plots, one of which will be made using mature miRNAs, and the other using read counts of 6 non-miRNA RNA types (tRNA, lncRNA, miscRNA, scaRNA, snoRNA, snRNA). Only the top 500 RNAs with highest variances are used to make both plots. You should expect replicate samples to be close to each other, and hope to see clear distances between your comparison groups, as is the case in the sample report.
This pipeline will generate two plots to represent top gene expression patterns. One will show results for only mature miRNA, and the other will show results for the non-miRNA RNA types tRNA, lncRNA, miscRNA, scaRNA, snoRNA, and snRNA. These plots show a heatmap of expression patterns of the top 100 mature miRNAs or non-miRNA RNAs with highest variance. Both miRNA and non-miRNA graphs plot log2 fold change values against mean RNA expression levels from their respective categories. In both graphs, positive numbers (red) indicate higher expression levels, while negative numbers (blue) indicate lower expression levels. The RNA results are clustered using hierarchical clustering. This plot gives you an overview of expression patterns among RNAs with most significant changes in their expression. You can also mouse over to see values for specifc RNA and sample. In the sample report featuring an miRNA heatmap below, you can clearly see different miRNAs being most highly expressed in different cell types. A static version of this plot can also be downloaded on the Aladdin platform.
If your experiment was conducted with replication, this pipeline runs differential gene expression analysis twice, once for miRNA with isomiRs
, which uses DESeq2
under the hood, and again for genes from 6 RNA types (tRNA, lncRNA, miscRNA, scaRNA, snoRNA, snRNA) with DESeq2
.
Workflow summary
section of the report.
This section lists the versions of softwares used in this bioinformatic pipeline. This should help you in writing the methods section of your publication or if you wish to carry out some of the analysis on your own.
This section lists some important parameters of this particular study. This often includes which reference genome is used, how the trimming was done, and false discovery rate (FDR) and fold change thresholds used in differential gene expression analysis.