pipeline-resources

About MGscan Metagenomics pipeline

This pipeline is designed to conduct taxonomy profiling and diversity analysis on metagenome shotgun sequencing data. Please find the sample report here.

Source of the pipeline

This pipeline is originally adapted from community-developed nf-core/taxprofiler pipeline version 1.0.0. Zymo Research made significant contributions in the adaption effort. This mainly include the addition of downstream analyses such as diversity analysis, addition of sourmash as the profiler, addition of antimicrobial resistance gene identification and the improvement of the report.

What is in the pipeline

This pipeline is built using Nextflow. A brief summary of pipeline:

Read QC (FastQC or falco as an alternative option)
Performs optional read pre-processing (code for long-read inherited from nf-core/taxprofiler, but not separately tested by us yet)
- Adapter clipping and merging (short-read: fastp, AdapterRemoval2; long-read: porechop)
- Low complexity and quality filtering (short-read: bbduk, PRINSEQ++; long-read: Filtlong)
Perform Host-read removal
- Host-read removal (short-read: BowTie2; long-read: Minimap2). This is not performed when sourmash-zymo is selected as the database because it already contains host sequences.
- Statistics for host-read removal (Samtools)
Run merging when applicable
Identifies antimicrobial resistance genes in samples from database MEGARes version 3
- Reads are aligned to MEGARes reference (bwa mem)
- Resistome statistics are quantified and compiled for each sample (AMRplusplus)
Performs taxonomic profiling using one of: (nf-core/taxprofiler has more choices for this step, if there are tools you’d like for this step, please let us know.)
- sourmash
- MetaPhlAn4
Merge all taxonomic profiling results into one table (Qiime2) and generate composition plot (Krona)
Perform alpha/beta diversity analysis (Qiime2) and beta diversity differential expression (ANCOM-BC)
Present all results in above steps in a report (MultiQC)

For details, please find the source code here.

Pipeline flowchart

Default pipeline parameters

Trimming (fastp)

fastp automatically detects and trims adapter sequences. Reads shorter than 15bp after trimming are discarded.

Low complexity filtering (BBDuk)

BBDuk discards any read that has less than 0.3 in average entropy in a sliding window of 50bp.

Taxonomic classification (sourmash)

sourmash sketch is run with the parameter of k=31,scaled=1000,abund. sourmash gather is run with the parameter of --ksize 31 --threshold-bp 5000. This means that k-mers of length 31 are profiled at 1:1000 downsampling with abundance of each k-mer consdiered. A match/genome will only be reported if the estimated overlap between a sample and the reference genome is at least 5000bp.

Sourmash database

By default, the database option sourmash-zymo-2024 include the following reference genomes.

GTDB R08-RS214 genomic representatives
Genbank viral genomes(2022)
Genbank archael genomes(2022)
Genbank protozoa genomes(2022), with one genome GCA_002893375.1 removed due to contamination
Genbank fungi genomes(2022)
A list of 29 genomes realted to ZymoBIOMICS Microbial Community Standards
A list of 16 common host genomes such as human, mouse, rat, etc. These genomes are included for host reads removal purposes. Their reads are not included downstream analysis.

K-mer to read approximation

We estimate the numbers of reads belonging to each microbial genome by multiplying f_unique_weighted values of that genome from sourmash to the total number of reads after trimming and low complexity filtering.

Sample filtering criteria

The following samples not removed from downstream analysis at various steps:

Samples with a file size <1 kb after trimming and read length filtering
Samples with no identified microbial genomes by sourmash after host genome removal
Samples with less than 1,000 total reads assigned to non-host genomes
Samples that fail to produce a sourmash result after 3 tries with increased compute resources, up to 180GB of memory. This rarely happens and only happens to samples with extremely complex profile.