wexygen - A GATK-Based Whole Exome and Whole Genome Sequencing Analysis Toolkit

wexygen is a high-performance, GATK Best Practices-compliant WES/WGS pipeline toolkit built in C++ and Python. It provides fully automated preprocessing, configuration generation, annotation, workflow execution, and logging. The toolkit integrates industry-standard tools including GATK, ANNOVAR, and NIRVANA, and supports both Snakemake and Nextflow workflows with SLURM cluster execution.

This work is ongoing. It is extensively based on, and strongly inspired by, the excellent work of wes_gatk, a project that is considerably more complete than wexygen. If you are looking for a more robust toolkit, please check out their project, WES_gatk.

Features

  • Full GATK Best Practices-based workflow for WES and WGS data.
  • Supports both Snakemake and Nextflow pipelines.
  • SLURM scheduler integration for cluster-based execution.
  • Automated generation of workflow files during CMake build:
    • workflow_files/snakemake/
    • workflow_files/nextflow/
  • Automated environment setup:
    • GATK installation
    • ANNOVAR installation
    • NIRVANA installation
    • Reference genome downloads
    • Known-sites VCF downloads
  • Detailed logging and execution tracing.
  • Preprocessing entrypoint (preprocessor.py) for preparing raw FASTQ/inputs.
  • Unified runner binary (./wexygen) that detects and executes the selected workflow engine.
  • Supports:
    • Variant calling
    • Annotation via ANNOVAR and NIRVANA
    • QC metrics and contamination estimation (SVD)
    • Joint use of multiple known-variant resources
  • Reproducible output structure for downstream analysis.
  • Built in C++17 and Python 3, portable across Linux systems.
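
The engine detection mentioned above can be pictured with a small Python sketch. This is an illustration only, not the logic inside the wexygen binary: it assumes that "detecting" an engine means finding its executable (snakemake or nextflow) on PATH, and the function names available_engines and select_engine are hypothetical.

```python
import shutil

# The two workflow engines the README lists as supported.
ENGINES = ("snakemake", "nextflow")


def available_engines(which=shutil.which):
    """Return the supported engines whose executable is found on PATH."""
    return [name for name in ENGINES if which(name) is not None]


def select_engine(requested, which=shutil.which):
    """Return the requested engine, or raise if it is unsupported or not installed."""
    if requested not in ENGINES:
        raise ValueError(f"unsupported engine: {requested}")
    if requested not in available_engines(which):
        raise RuntimeError(f"{requested} was not found on PATH")
    return requested
```

The which parameter defaults to shutil.which and exists only so the lookup can be replaced in tests.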

Installation

wexygen is built using CMake. The build system automatically generates required support scripts and workflow directories.

mkdir build
cd build
cmake ..
make -j

During the build, the following components are generated inside the build/ directory:

  • preprocessor.py
  • wexygen binary
  • workflow_files/snakemake/
  • workflow_files/nextflow/
  • download_and_setup_data.sh
  • run_wexygen.sh
  • Auto-installer scripts for:
    • GATK
    • ANNOVAR
    • NIRVANA
    • Reference data
    • Known variants
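
A quick way to confirm the build produced everything in the list above is a small check script. The sketch below is not part of wexygen; it only tests for the named files and directories, and it skips the auto-installer scripts because their file names are not given here. The helper name missing_artifacts is hypothetical.

```python
from pathlib import Path

# Names taken from the list above; entries ending in "/" are directories.
EXPECTED_ARTIFACTS = [
    "preprocessor.py",
    "wexygen",
    "workflow_files/snakemake/",
    "workflow_files/nextflow/",
    "download_and_setup_data.sh",
    "run_wexygen.sh",
]


def missing_artifacts(build_dir):
    """Return the expected artifacts that are absent from build_dir."""
    root = Path(build_dir)
    missing = []
    for name in EXPECTED_ARTIFACTS:
        path = root / name
        # Directory entries must be directories; everything else must be a file.
        found = path.is_dir() if name.endswith("/") else path.is_file()
        if not found:
            missing.append(name)
    return missing
```

Calling missing_artifacts("build") from the project root returns an empty list when every listed component is present.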

After building, run the data setup script:

./download_and_setup_data.sh

Usage

Step 1: Preprocess Input Data

The preprocessor organizes the raw FASTQ files, verifies their metadata, and prepares the WES/WGS run directory.

python preprocessor.py \
  -i ./test/raw_data \
  -o ./test/output \
  --overwrite
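
To give a feel for the kind of organizing involved, here is a minimal Python sketch that groups paired-end FASTQ files by sample. It is not the code in preprocessor.py, whose rules may differ. The naming convention (<sample>_R1 / <sample>_R2, with an optional numeric chunk such as _001) is an assumption, and pair_fastqs is a hypothetical helper.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention: <sample>_R1.fastq.gz and <sample>_R2.fastq.gz,
# optionally with a numeric chunk (for example _001) before the extension.
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+?)_R(?P<read>[12])(?:_\d+)?\.f(?:ast)?q(?:\.gz)?$"
)


def pair_fastqs(filenames):
    """Group FASTQ file names by sample and return (pairs, problems)."""
    by_sample = defaultdict(dict)
    problems = []
    for name in sorted(filenames):
        match = FASTQ_PATTERN.match(Path(name).name)
        if match is None:
            problems.append(f"unrecognized name: {name}")
            continue
        # Note: a second file for the same sample and read replaces the first,
        # so multi-lane samples would need extra handling.
        by_sample[match["sample"]][match["read"]] = name
    pairs = {}
    for sample, reads in by_sample.items():
        if set(reads) == {"1", "2"}:
            pairs[sample] = (reads["1"], reads["2"])
        else:
            problems.append(f"incomplete pair for sample: {sample}")
    return pairs, problems
```

Reporting problems as a list instead of raising on the first one makes it possible to see every misnamed or unpaired file in a single pass.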

Step 2: Run wexygen

The following example demonstrates a WES analysis using Snakemake, GATK Best Practices, and ANNOVAR/NIRVANA annotation:

./wexygen WES \
    -i ./test/raw_data \
    -o ./test/output \
    --reference-fasta ./test/tools/broad/Homo_sapiens_assembly38.fasta \
    --bed-file ./test/exome_bed/sureSelect_V6_60M.bed \
    --gff-file ./test/gff/Homo_sapiens.GRCh38.109.gff3.gz \
    --nirvana-path ./test/nirvana \
    --annovar-path ./test/annovar \
    --haplotype-db-file /home/propenster/src/compbio/gatk_res/Homo_sapiens_assembly38.haplotype_database.txt \
    --known-variants-snps ./test/known_variants/Homo_sapiens_assembly38.dbsnp138.vcf \
    --known-variants-indels ./test/known_variants/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --known-variants-indels2 ./test/known_variants/Homo_sapiens_assembly38.known_indels.vcf.gz \
    --reference-index ./test/tools/broad/Homo_sapiens_assembly38.fasta \
    --svd-prefix ./test/tools/broad/Homo_sapiens_assembly38.contam \
    --generate-confs-only \
    --use-snakemake \
    --threads 12
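
With this many path arguments, a pre-flight check can catch a mistyped path before a long run starts. The Python sketch below is a convenience outside wexygen itself. The flags and paths are copied from the example command; missing_inputs is a hypothetical helper. It leaves out -o (the output directory may not exist yet) and --svd-prefix (assumed here to be a file-name prefix, not a file).

```python
from pathlib import Path

# Flags and paths copied from the example command above.
EXAMPLE_INPUTS = {
    "-i": "./test/raw_data",
    "--reference-fasta": "./test/tools/broad/Homo_sapiens_assembly38.fasta",
    "--bed-file": "./test/exome_bed/sureSelect_V6_60M.bed",
    "--gff-file": "./test/gff/Homo_sapiens.GRCh38.109.gff3.gz",
    "--nirvana-path": "./test/nirvana",
    "--annovar-path": "./test/annovar",
    "--known-variants-snps": "./test/known_variants/Homo_sapiens_assembly38.dbsnp138.vcf",
    "--known-variants-indels": "./test/known_variants/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    "--known-variants-indels2": "./test/known_variants/Homo_sapiens_assembly38.known_indels.vcf.gz",
}


def missing_inputs(args):
    """Return (flag, path) pairs whose path does not exist on disk."""
    return [(flag, path) for flag, path in args.items() if not Path(path).exists()]


def report(args):
    """Print one line per missing input and return True when none are missing."""
    missing = missing_inputs(args)
    for flag, path in missing:
        print(f"{flag}: not found: {path}")
    return not missing
```

Running report(EXAMPLE_INPUTS) from the directory where ./wexygen will be invoked lists any argument whose path cannot be found.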

(c) Faith (propenster) Olusegun 2025
