pipeline-resources

How to interpret the Genomics report

This document describes how to understand the bioinformatics report generated by Aladdin Genomics pipeline. Most of the plots are taken from the sample report. The sample report was generated using public data from this paper. The plots in your report might look slightly different.

Table of contents

  1. Report overview
  2. General statistics table
  3. Sample processing
  4. Variant Calling
  5. Variant information and potential impact (VEP)
  6. Pipeline information

Report overview

The bioinformatics report is generated using MultiQC. There are general instructions on how to use a MultiQC report on MultiQC website. The report itself also includes a link to an instructional video at the top of the report. In general, the report has a navigation bar to the left, which allows you to quickly navigate to one of many sections in the report. On the right side, there is a toolbox that allows you to customize the appearance of your report and export figures and/or data. Most sections of the report are interactive. The plots will show you the sample name and values when you mouse over them.

Report overview

General statistics table

The general statistics table gives an overview of important stats for your samples. For example, it includes the GC content after filtering, the duplicate percentage of reads, the error rate (mismatches per bases mapped), and the fraction of the genome with at least 30X coverage. These stats are collected from different sections of the report to give you a snapshot. This is usually the quickest way to evaluate how your experiment went. Here are a few important things to look for when reading this table:

Other information you can get from this table includes (from left to right):

General statistics

Sample processing

FastQC

FastQC provides several quality metrics: the quality score distribution across your reads (in section Sequence Quality Histograms), the per base sequence content (\%A/C/G/T)(in section Per Base Sequence Content). You get information about adapter contamination (in section Adapter Content) and other overrepresented sequences (in section Overrepresented sequences). Some sections will sometimes issue warnings that your samples failed QC. It is important to remember that these QC metrics are from the raw reads, and there are often reasonable explanations why the raw reads failed these QC. One frequent warning you might see is in the Per Base Sequence Content section (see below). The presence of adapter sequences at 5’ or 3’ end could trigger these warnings. These adapter sequences are trimmed off before alignment, so there is no need to worry about them in the raw reads.

Per Sequence Quality Score

Per Sequence GC Content

Sequence Duplication Levels

FastP

FastP is used for data pre-processing, including quality filtering and adapter trimming. Important metrics include:

Sequence Quality R1

Sequence Quality R2

GATK MarkDuplicates

GATK MarkDuplicates is used to identify and flag duplicate reads. It improves the accuracy of downstream analyses such as variant calling by reducing the impact of PCR duplicates and sequencing artifacts:

Duplicate Reads

Samtools Flagstat

Samtools Flagstat generates a statistical summary of various attributes of sequence alignment files, including the total number of reads, number of properly paired reads, number of singletons, and mapping quality.

Samtools Metrics

Mosdepth

Mosdepth performs fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing. It provides coverage statistics:

Coverage Depth

Coverage Breadth

Variant Calling

This section describes the processing and results of identifying variants (e.g., Single Nucleotide Polymorphisms - SNPs, Insertions - Deletions - INDELs) in the genome. Variant calling is performed using GATK(broadinstitute.org), a popular toolkit offering a wide variety of tools with a primary focus on variant discovery and genotyping.

GATK BQSR

GATK BQSR (Base Quality Score Recalibration) involves recalibrating the base quality scores to improve the accuracy of variant calls produced by tools like GATK.

GATK BQSR

BCFtools

BCFtools contains utilities for variant calling and manipulating VCFs and BCFs. It offers functions to detect genetic variants, including SNPs, INDEL, and Transition/Transversion (Ts/Tv) from aligned sequencing data (in BAM format). By allowing the comparison and integration of variant call sets across multiple samples, it facilitates the identification of shared and unique variants among different datasets.

Variant Count

Variant Quality Count

VCFtools

VCFtools is a program for working with and reporting on VCF files. It is used for further analysis of variant calls and enables the filtering of Ts/Tv based on quality and counts, which helps in selecting high-confidence variants and removing potential false positives or artifacts. Researchers often empirically determine the threshold based on the quality distribution of the SNPs in their dataset. Commonly used thresholds are Q20, Q30, etc., corresponding to error probabilities of 1 in 100 and 1 in 1000, respectively. Bioinformatics tools like VCFtools allow users to specify a minimum quality threshold for filtering SNPs. For instance, setting a threshold of Q30 would filter out all SNPs with a Phred quality score below 30. The expected Ts/Tv ratio for a typical human genome is around 2:1 to 3:1 in coding regions and approximately 1:1 in non-coding regions.

TsTv by Quality

Variant information and potential Impact (VEP)

Variant Effect Predictor (VEP) determines variants’ effects on genes, transcripts, protein sequences, and regulatory regions. VEP is utilized to annotate variants in VCF files and predict their functional consequences. It provides variant information, including chromosome position, reference and alternate alleles. VEP also predicts the potential impact of genetic variants on genes and proteins by classifying them into different functional categories.

Variant Consequences

Predicted Impacts by SIFT

Pipeline Information

This section details the methods and software used in the analysis:

Methods Description

Describes the overall workflow and methodologies applied in the analysis, ensuring reproducibility and understanding of the process.

Software Versions

This section lists the versions of software used in this bioinformatic pipeline. This should help you in writing the methods section of your publication or if you wish to carry out some of the analysis on your own. We have written a template to help you with the method section of your manuscript.

Workflow Summary

This section lists some important parameters of this particular study. This often includes which reference genome is used, how the preprocessing was done, and also PCGR subworkflow option.