wexygen - A GATK-Based Whole Exome and Whole Genome Sequencing Analysis Toolkit

wexygen is a high-performance, GATK Best Practices-compliant WES/WGS pipeline toolkit built in C++ and Python. It provides fully automated preprocessing, configuration generation, annotation, workflow execution, and logging. The toolkit integrates industry-standard tools including GATK, ANNOVAR, and NIRVANA, and supports both Snakemake and Nextflow workflows with SLURM cluster execution.

This work is ongoing. It is heavily based on and inspired by the excellent wes_gatk project, which is far more complete than wexygen. If you are looking for a more robust toolkit, please check out wes_gatk.

Features

Full GATK Best Practices-based workflow for WES and WGS data.
Supports both Snakemake and Nextflow pipelines.
SLURM scheduler integration for cluster-based execution.
Automated generation of workflow files during CMake build:
- workflow_files/snakemake/
- workflow_files/nextflow/
Automated environment setup:
- GATK installation
- ANNOVAR installation
- NIRVANA installation
- Reference genome downloads
- Known-sites VCF downloads
Detailed logging and execution tracing.
Preprocessing entrypoint (preprocessor.py) for preparing raw FASTQ/inputs.
Unified runner binary (./wexygen) that detects and executes the selected workflow engine.
Supports:
- Variant calling
- Annotation via ANNOVAR and NIRVANA
- QC metrics and contamination estimation (SVD)
- Joint use of multiple known-variant resources
Reproducible output structure for downstream analysis.
Built in C++17 and Python 3, portable across Linux systems.

Installation

wexygen is built using CMake. The build system automatically generates required support scripts and workflow directories.

mkdir build
cd build
cmake ..
make -j

During the build, the following components are generated inside the build/ directory:

preprocessor.py
wexygen binary
workflow_files/snakemake/
workflow_files/nextflow/
download_and_setup_data.sh
run_wexygen.sh
Auto-installer scripts for:
- GATK
- ANNOVAR
- NIRVANA
- Reference data
- Known variants
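To confirm that the build produced the components listed above, a small check along the following lines may help. This is only a sketch and not part of wexygen: the function name `missing_artifacts` is hypothetical, and the list covers only the artifacts named explicitly above (the auto-installer scripts are left out because their file names are not given here).

```python
from pathlib import Path

# Artifact names copied from the list of generated components above.
# The auto-installer scripts are omitted because their names are not documented.
EXPECTED_ARTIFACTS = [
    "preprocessor.py",
    "wexygen",
    "workflow_files/snakemake",
    "workflow_files/nextflow",
    "download_and_setup_data.sh",
    "run_wexygen.sh",
]


def missing_artifacts(build_dir):
    """Return the expected artifacts that do not exist under build_dir."""
    build = Path(build_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (build / name).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root after building.
    missing = missing_artifacts("build")
    if missing:
        print("Missing build artifacts:", ", ".join(missing))
    else:
        print("All expected build artifacts are present.")
```

If anything is reported missing, re-running `cmake ..` and `make -j` inside build/ is the first thing to try.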

After building, run the data setup script:

./download_and_setup_data.sh

Usage

Step 1: Preprocess Input Data

The preprocessor organizes raw FASTQ files, verifies metadata, and prepares the WES/WGS run directory.

python preprocessor.py \
  -i ./test/raw_data \
  -o ./test/output \
  --overwrite
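To give a rough idea of what "organizing raw FASTQ" can involve, the sketch below groups paired-end files by sample name. It is an illustration only and does not reproduce the logic of preprocessor.py. The `_R1`/`_R2` naming convention, the `FASTQ_PATTERN` regex, and the `pair_fastqs` function are all assumptions made for this example.

```python
import re
from collections import defaultdict

# ASSUMPTION: files are named <sample>_R1[_NNN].fastq[.gz] / <sample>_R2[_NNN].fastq[.gz]
# (".fq" is also accepted). The real preprocessor may use a different convention.
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+?)_R(?P<read>[12])(?:_\d+)?\.f(?:ast)?q(?:\.gz)?$"
)


def pair_fastqs(filenames):
    """Group FASTQ file names into (R1, R2) pairs keyed by sample name.

    Returns (paired, unpaired): a dict of complete pairs and a sorted list
    of sample names that are missing one mate. Non-FASTQ names are ignored.
    """
    by_sample = defaultdict(dict)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            continue  # not a FASTQ file under the assumed convention
        by_sample[match["sample"]][match["read"]] = name

    paired = {}
    unpaired = []
    for sample, reads in sorted(by_sample.items()):
        if "1" in reads and "2" in reads:
            paired[sample] = (reads["1"], reads["2"])
        else:
            unpaired.append(sample)
    return paired, unpaired
```

Calling `pair_fastqs(os.listdir("./test/raw_data"))` would then show which samples have both mates before the run directory is prepared.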

Step 2: Run wexygen

The following example demonstrates a WES analysis using Snakemake, GATK Best Practices, and ANNOVAR/NIRVANA annotation:

./wexygen WES \
    -i ./test/raw_data \
    -o ./test/output \
    --reference-fasta ./test/tools/broad/Homo_sapiens_assembly38.fasta \
    --bed-file ./test/exome_bed/sureSelect_V6_60M.bed \
    --gff-file ./test/gff/Homo_sapiens.GRCh38.109.gff3.gz \
    --nirvana-path ./test/nirvana \
    --annovar-path ./test/annovar \
    --haplotype-db-file /home/propenster/src/compbio/gatk_res/Homo_sapiens_assembly38.haplotype_database.txt \
    --known-variants-snps ./test/known_variants/Homo_sapiens_assembly38.dbsnp138.vcf \
    --known-variants-indels ./test/known_variants/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --known-variants-indels2 ./test/known_variants/Homo_sapiens_assembly38.known_indels.vcf.gz \
    --reference-index ./test/tools/broad/Homo_sapiens_assembly38.fasta \
    --svd-prefix ./test/tools/broad/Homo_sapiens_assembly38.contam \
    --generate-confs-only \
    --use-snakemake \
    --threads 12
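Because the command takes many path arguments, it can be convenient to assemble it from a script and check that the inputs exist before launching. The following is a minimal sketch under that assumption. The helper names `build_wexygen_command` and `missing_paths` are hypothetical, and only flags that appear in the example above are used.

```python
from pathlib import Path


def build_wexygen_command(mode, options, switches=(), binary="./wexygen"):
    """Assemble an argv list: binary, mode, '--flag value' pairs, then bare switches."""
    command = [binary, mode]
    for flag, value in options.items():
        command += [flag, str(value)]
    command += list(switches)
    return command


def missing_paths(options, path_flags):
    """Return the flags in path_flags whose value does not exist on disk."""
    return [
        flag
        for flag in path_flags
        if flag in options and not Path(options[flag]).exists()
    ]


if __name__ == "__main__":
    # A subset of the options from the example above.
    options = {
        "-i": "./test/raw_data",
        "-o": "./test/output",
        "--reference-fasta": "./test/tools/broad/Homo_sapiens_assembly38.fasta",
        "--bed-file": "./test/exome_bed/sureSelect_V6_60M.bed",
        "--threads": 12,
    }
    # --svd-prefix and -o are not checked: one is a prefix, the other an output path.
    missing = missing_paths(options, ["-i", "--reference-fasta", "--bed-file"])
    if missing:
        print("Inputs not found for:", ", ".join(missing))
    command = build_wexygen_command("WES", options, switches=["--use-snakemake"])
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once every input has been found.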

(c) Faith (propenster) Olusegun 2025
