Parapipe: A Bioinformatics Pipeline for Genomic Analysis of Eukaryotic Pathogens from NGS Data

Overview

Parapipe is a modular, ISO-accreditable bioinformatics pipeline designed to process next-generation sequencing (NGS) datasets from eukaryotic pathogens. It provides a robust, scalable, and reproducible framework for genomic surveillance, outbreak tracking, and epidemiological investigations.

Parapipe supports paired-end NGS data and performs thorough quality control, so that high-quality genomic insights can be obtained even when full genome coverage is not achieved.

Features

Quality Control: Automated read pre-processing, filtering, and QC reporting (fastp, FastQC, MultiQC, Trim Galore).
Variant Calling & SNP Analysis: Variant calling and high-resolution SNP space analysis (Samtools, Bowtie2, FreeBayes).
Multiplicity of Infection (MOI) Characterisation: Identifies mixed infections and within-host diversity (PyMOI, Moimix).
Phylogenomic Analysis: Generates SNP-based phylogenetic trees and clustering (Ardal, IQtree2).
Modular and Scalable: Built with Nextflow DSL2 and containerised with Singularity for portability and reproducibility.

Installation

Prerequisites

The bioinformatics tools run by the pipeline are containerised with Singularity/Apptainer, but Go, Nextflow, and Singularity/Apptainer themselves must be installed on the host.
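
Before building the containers, it can help to confirm that these prerequisites are on the PATH. The following is a minimal sketch and not part of Parapipe: the function names have and check_prereqs are hypothetical, and the executable names (go, nextflow, singularity, apptainer) are assumed to be the ones each tool installs by default.

```shell
# Pre-flight check for the host prerequisites (illustrative, not part of Parapipe).

# have TOOL: succeed if TOOL resolves to a command on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# check_prereqs: print one status line per prerequisite and
# return non-zero if anything is missing.
check_prereqs() {
    status=0
    for tool in go nextflow; do
        if have "$tool"; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            status=1
        fi
    done
    # Assumption: either container runtime is sufficient.
    if have singularity || have apptainer; then
        echo "found:   singularity/apptainer"
    else
        echo "missing: singularity or apptainer"
        status=1
    fi
    return "$status"
}
```

After sourcing these definitions, running check_prereqs prints three status lines and exits non-zero if any tool is missing, so it can gate the container build step.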

Installation Steps

Clone the repository:

git clone https://github.com/ArthurVM/Parapipe.git

Build Singularity containers:

cd Parapipe/singularity
bash singularity_build.sh
cd ..

Run Parapipe

Parapipe runs using the standard Nextflow command-line syntax. A number of parameters can be set with command-line arguments, but Parapipe can also be run out of the box.

Quick Start

To run Parapipe out of the box with default parameters, provide a directory containing paired-end FASTQ files, a pattern describing how the files are paired, a reference species to map to, and an output directory.

Suppose we have a dataset of paired-end Cryptosporidium parvum NGS files in the directory /user/my_cparvum_data/, which looks like:

/user/my_cparvum_data/
            ├── sample1_1.fastq.gz
            ├── sample1_2.fastq.gz
            ├── sample2_1.fastq.gz
            ├── sample2_2.fastq.gz
            ├── sample3_1.fastq.gz
            ├── sample3_2.fastq.gz
            ├── sample4_1.fastq.gz
            └── sample4_2.fastq.gz

The --pattern argument describes the suffixes used to pair files. In this case the suffix for forward reads is _1.fastq.gz, and for reverse reads _2.fastq.gz. We can therefore use bash-style glob syntax and set --pattern to '*_{1,2}.fastq.gz'. The * wildcard captures any file ending in _1.fastq.gz or _2.fastq.gz, and files are paired by their shared prefix (e.g. sample1). This prefix is then used as the sample ID during the run.
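
To preview which files would be paired under this naming scheme before launching a run, a small shell function can list the pairs. This is a sketch only: Parapipe and Nextflow do the pairing themselves, list_pairs is a hypothetical helper, and the _1.fastq.gz and _2.fastq.gz suffixes are hard-coded to match the example above.

```shell
# list_pairs DIR: print "sample_id<TAB>forward<TAB>reverse" for every complete
# pair of *_1.fastq.gz / *_2.fastq.gz files in DIR (illustrative helper).
# Forward reads without a matching reverse file are reported on stderr.
list_pairs() {
    for fwd in "$1"/*_1.fastq.gz; do
        [ -e "$fwd" ] || continue            # the glob matched nothing
        id=$(basename "$fwd" _1.fastq.gz)    # shared prefix, e.g. sample1
        rev="$1/${id}_2.fastq.gz"
        if [ -e "$rev" ]; then
            printf '%s\t%s\t%s\n' "$id" "$fwd" "$rev"
        else
            echo "warning: no reverse read found for $id" >&2
        fi
    done
}
```

For the example directory, list_pairs /user/my_cparvum_data would print one line per sample, starting with the IDs sample1 to sample4. A warning on stderr points to a missing or misnamed mate file.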

In the Parapipe directory, simply run:

nextflow run ./parapipe.nf --input_dir /user/my_cparvum_data/ --pattern '*_{1,2}.fastq.gz' --ref cryptosporidium_parvum --output_dir ./my_cparvum_run -profile singularity

Advanced Usage

A small number of parameters can be tuned to control which samples are included in phylogenomic analysis and how alleles are filtered.

❯ nextflow run ./parapipe.nf --help

Usage:
  nextflow run main.nf --input_dir [fullPath] --pattern [regexPattern] --genome [refGenomeID]

Description:
  General purpose parasite genomics pipeline. DEVELOPMENT VERSION.

Mandatory Arguments:
  --input_dir         the full path to the directory containing raw read files in paired-end
                      fastq format.
  --output_dir        output directory. Module output files will be written to subdirectories
                      within this parent directory.
  --pattern           regex pattern to match pairs of fastq files.
  --ref               the species ID of the reference genome to download and map against.
  --yaml              YAML file containing typing profiles.

Optional Arguments:
  --database          the path to a directory containing vcf files for constructing a
                      phylogenetic tree.
  --mincov            the minimum fraction of the genome which must be covered to a depth of
                      5x to include a sample in phylogenetic analysis. Default=0.8.
  --missing           the maximum allele missingness to include SNPs in MOI analysis. Default=0.1.
  --maf               the minimum minor allele frequency to include SNPs in MOI analysis. Default=0.05.
  --mac               the minimum minor allele depth to include SNPs in MOI analysis. Default=5.

Profiles:
singularity           run with singularity image
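
To make the --mincov definition concrete, the sketch below computes the fraction of reference positions covered to at least 5x from a per-position depth table and compares it with a threshold. It is not Parapipe's implementation. It assumes a three-column, tab-separated file (contig, position, depth) with one row per reference position, such as the output of samtools depth -a, and the function names covered_fraction and passes_mincov are hypothetical.

```shell
# covered_fraction DEPTH_FILE [MIN_DEPTH]
# Print the fraction of rows whose depth (column 3) is >= MIN_DEPTH (default 5).
covered_fraction() {
    awk -v min="${2:-5}" '
        $3 >= min { covered++ }
        END {
            if (NR > 0) printf "%.3f\n", covered / NR
            else print "0.000"
        }
    ' "$1"
}

# passes_mincov DEPTH_FILE [MINCOV]
# Succeed if the covered fraction is >= MINCOV (default 0.8, as in the help text).
passes_mincov() {
    frac=$(covered_fraction "$1")
    awk -v f="$frac" -v t="${2:-0.8}" 'BEGIN { exit !(f >= t) }'
}
```

With the default settings, a sample in which 8 of every 10 reference positions reach 5x depth gives a covered fraction of 0.800 and passes, while one at 0.700 would be left out of phylogenetic analysis under the rule described above.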

Output

Parapipe writes an output directory containing key intermediate and analysis files that can be used for further analysis. A report is also produced and written to the output directory as Parapipe_report.html.

Suggestions and Requests

If you have suggestions for functionality, questions about usage or report interpretation, or would simply like to discuss Parapipe, please don't hesitate to contact me at morrisa28@cardiff.ac.uk. Alternatively, you can open an issue using the GitHub issues tab.

Citing Parapipe

If you use Parapipe to perform analysis of your dataset, please cite the paper: https://www.microbiologyresearch.org/content/journal/acmi/10.1099/acmi.0.000993.v3
