pHapCompass is a probabilistic framework for polyploid haplotype assembly supporting both
short-read and long-read sequencing data.
- Installation
- Input Requirements
- Basic Usage
- Command-line Arguments
- Simulation Pipeline
- Evaluation
- Output Format
- Example Output
- Availability of Simulated Datasets
- Citation
- Contact
Before installing pHapCompass, ensure you have:
- Python 3.10+
- C compiler (gcc or cc)
- make build tool
- git (for downloading submodules)
- zlib development headers (required for compiling extractHAIRS)
Note: pHapCompass has been tested on Ubuntu, Debian, Fedora, and RHEL/Rocky Linux.
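Most of the prerequisites listed above can be checked before installing. The following is a minimal sketch, not part of pHapCompass: the function name `check_prerequisites` is hypothetical, and the script does not check for the zlib development headers, which still have to be verified through the system package manager.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a dict mapping each prerequisite name to True (found) or False."""
    return {
        # Python interpreter running this script must be new enough
        "python": sys.version_info[:2] >= min_python,
        # Either gcc or cc must be on PATH
        "c_compiler": any(shutil.which(c) is not None for c in ("gcc", "cc")),
        "make": shutil.which("make") is not None,
        "git": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Run it with the same interpreter you intend to use for installation (for example `python3.11` on RHEL/Rocky Linux) so that the version check reflects the right Python.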
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install build-essential git python3-pip zlib1g-dev libbz2-dev liblzma-dev
Fedora:
sudo dnf install -y git gcc make zlib-devel bzip2-devel xz-devel python3-pip
RHEL/Rocky Linux:
sudo dnf install -y git gcc make zlib-devel bzip2-devel xz-devel epel-release
# RHEL ships with Python 3.9 by default - install Python 3.11 explicitly:
sudo dnf install -y python3.11 python3.11-pip
# Then use "python3.11 -m pip" instead of "pip" when installing pHapCompass (see below)
Option 1: Install from GitHub (Recommended)
This automatically compiles extractHAIRS during installation:
git clone --recursive https://github.com/bayesomicslab/pHapCompass.git
cd pHapCompass
pip install -e .
The --recursive flag is important: it downloads required submodules (htslib and samtools).
On RHEL/Rocky Linux, use Python 3.11 explicitly:
python3.11 -m pip install -e .
Option 2: Manual Compilation (if automatic compilation fails)
git clone --recursive https://github.com/bayesomicslab/pHapCompass.git
cd pHapCompass
# Manually compile extractHAIRS
cd third_party/extract_poly
make
cd ../..
# Install pHapCompass
pip install -e .
Option 3: Install without extractHAIRS
If you plan to use pre-computed fragment files (.frag):
pip install git+https://github.com/bayesomicslab/pHapCompass.git
Then use --frag-path to provide pre-computed fragment files.
"gcc: command not found" or "make: command not found"
- Install build tools using the commands in the Prerequisites section above.
"zlib.h: No such file or directory" during compilation
- Install zlib development headers:
# Ubuntu/Debian
sudo apt-get install zlib1g-dev
# Fedora/RHEL
sudo dnf install zlib-devel
"Package requires a different Python: 3.9.x not in >=3.10" on RHEL/Rocky Linux
- Install Python 3.11 and use it explicitly:
sudo dnf install -y epel-release python3.11 python3.11-pip
python3.11 -m pip install -e .
- Or use conda to manage the Python version:
conda create -n phapcompass python=3.10 -y
conda activate phapcompass
pip install -e .
Git submodule errors
- Ensure you cloned with --recursive, or initialize manually:
git submodule update --init --recursive
"extractHAIRS binary not found" during runtime
- Check if the binary was compiled:
ls -lh src/phapcompass/bin/extractHAIRS
- If not present, manually compile as shown in Option 2 above.
For more detailed troubleshooting, see TROUBLESHOOTING.md.
# Check that pHapCompass is installed
phapcompass --help
# Test with example data (this also exercises extractHAIRS)
phapcompass --data-type short \
--bam-path test_data/short_data_example/0.bam \
--vcf-path test_data/ref_example/Chr1_unphased.vcf \
--result-path output.vcf.gz
To run pHapCompass, you need:
Required
- BAM file: aligned reads from a single individual
- VCF file: containing heterozygous SNPs (biallelic or multiallelic)
Optional
- A pre-computed .frag fragment file
The tool infers ploidy automatically from the VCF unless specified.
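To illustrate what inferring ploidy from a VCF can look like, here is a minimal sketch that counts the alleles in the first sample's GT field. This is an assumption about one plausible approach, not a description of how pHapCompass actually implements it; the function name `infer_ploidy_from_vcf_lines` is hypothetical, and the sketch expects standard tab-separated VCF records.

```python
import re


def infer_ploidy_from_vcf_lines(lines):
    """Guess ploidy as the number of alleles in the first GT value found.

    Assumption: every variant in the file has the same ploidy, so the
    first data line with a GT entry is representative.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt = fields[9].split(":")[format_keys.index("GT")]
        # Alleles are separated by "|" (phased) or "/" (unphased)
        return len(re.split(r"[|/]", gt))
    return None  # no usable genotype found
```

For a tetraploid record whose sample column reads `0/0/1/1`, this returns 4. Passing --ploidy explicitly avoids relying on any such inference.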
The standard usage is to run pHapCompass directly from BAM + VCF, letting the internal polyploid extractHAIRS generate fragments.
From BAM + VCF (recommended)
phapcompass --data-type short \
--bam-path sample.bam \
--vcf-path sample.vcf.gz \
--result-path output_short.vcf.gz
Optional hyperparameters:
- --mw: MEC weight (default: 10.0)
- --lw: likelihood weight (default: 1.0)
- --sw: FFBS sample weight (default: 1.0)
- --epsilon: sequencing error rate (default: 1e-5)
- --uncertainty [N]: enable N-sample FFBS solution sampling (default N=3)
Example with custom parameters:
phapcompass --data-type short \
--bam-path sample.bam \
--vcf-path sample.vcf.gz \
--result-path output_short.vcf.gz \
--mw 8 --lw 2 --sw 0.5
Note: The weights do not have to be between 0 and 1.
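When running many samples or sweeping over weights, it can be convenient to assemble the command in a script. The sketch below only builds the argument list from the flags documented above; the helper name `build_short_read_command` is hypothetical and not part of pHapCompass.

```python
def build_short_read_command(bam_path, vcf_path, result_path,
                             mw=None, lw=None, sw=None):
    """Build the argument list for a short-read pHapCompass run.

    Only flags documented in this README are used; weights left as None
    are omitted so the tool falls back to its own defaults.
    """
    cmd = [
        "phapcompass",
        "--data-type", "short",
        "--bam-path", bam_path,
        "--vcf-path", vcf_path,
        "--result-path", result_path,
    ]
    for flag, value in (("--mw", mw), ("--lw", lw), ("--sw", sw)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_short_read_command(
        "sample.bam", "sample.vcf.gz", "output_short.vcf.gz",
        mw=8, lw=2, sw=0.5)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once pHapCompass is installed and on your PATH.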
Using a precomputed fragment file
phapcompass --data-type short \
--frag-path sample.frag \
--vcf-path sample.vcf.gz \
--result-path output_short.vcf.gz
From BAM + VCF
phapcompass --data-type long \
--bam-path sample.bam \
--vcf-path sample.vcf.gz \
--result-path output_long.vcf.gz
Optional hyperparameters:
- --delta: transition penalty parameter (default: 5)
- --learning-rate: optimization learning rate (default: 0.02)
- --epsilon: sequencing error rate (default: 1e-5)
- --uncertainty [N]: enable N-sample solution sampling (default N=3)
Example with custom parameters:
phapcompass --data-type long \
--bam-path sample.bam \
--vcf-path sample.vcf.gz \
--result-path output_long.vcf.gz \
--delta 4 --learning-rate 0.01 --epsilon 0.00002
Using a precomputed fragment file
phapcompass --data-type long \
--frag-path sample.frag \
--vcf-path sample.vcf.gz \
--result-path output_long.vcf.gz
Core I/O
| Argument | Description |
|---|---|
| --bam-path PATH | BAM file; triggers internal extractHAIRS. |
| --frag-path PATH | Optional: use an existing fragment file. |
| --vcf-path PATH | Required. Input VCF containing heterozygous SNPs. |
| --result-path PATH | Required. Output VCF path. |
| --ploidy INT | Optional. If omitted, inferred from VCF. |
Model selection
- --data-type short
- --data-type long
Short-read model hyperparameters
- --mw: MEC weight
- --lw: likelihood weight
- --sw: FFBS sample weight
Long-read model hyperparameters
- --delta
- --learning-rate
Other
- --epsilon: sequencing error rate
- --uncertainty [N]: enable sampling mode (N samples; default = 3)
- --verbose
pHapCompass includes a simulator for generating polyploid haplotype references (and optionally reads) for benchmarking. The pipeline is organized as:
1. Haplotype simulation (required)
2. Read simulation (optional; uses output of step 1)
Simulate haplotype references
phapcompass simulation haplotypes -h
Autopolyploidy example:
phapcompass simulation haplotypes \
--reference_path reference/potato_tetra/He1_Chr1_only.fasta \
--output_dir sim_out \
--structure autopolyploidy \
--num_samples 1 \
--ploidies 4 \
--mutation_rates 0.001
Allopolyploidy example:
phapcompass simulation haplotypes \
--reference_path reference/potato_tetra/He1_Chr1_only.fasta \
--output_dir sim_out \
--structure allopolyploidy \
--num_samples 1 \
--sg_rates 0.0005 0.0001 \
--mutation_rates 0.00005 0.0001
Note: the haplotype simulator uses a fixed region window (500000–1000000) when shifted=True. Ensure your reference contig is long enough, or adjust the window in src/phapcompass/simulator/simulate_haplotypes.py.
Simulate reads (planned)
Read simulation is under development and will be exposed through the same pipeline entry.
pHapCompass includes utilities to evaluate predicted polyploid haplotypes against truth.
phapcompass eval -h
VER (Vector Error Rate)
phapcompass eval ver \
--truth-vcf path/to/truth.vcf.gz \
--pred-vcf path/to/pred.vcf.gz \
--ploidy 4
MEC (Minimum Error Correction)
Using a fragment file:
phapcompass eval mec \
--pred-vcf path/to/pred.vcf.gz \
--ploidy 4 \
--frag path/to/reads.frag
Using a BAM file:
phapcompass eval mec \
--pred-vcf path/to/pred.vcf.gz \
--ploidy 4 \
--bam path/to/reads.bam \
--vcf path/to/input_unphased.vcf.gz
Geometric MEC
phapcompass eval geom-mec \
--pred-vcf path/to/pred.vcf.gz \
--ploidy 4 \
--frag path/to/reads.frag
pHapCompass outputs a single phased polyploid VCF with the following FORMAT fields:
- GT: Genotype (phased or unphased)
- PS: Phase-set identifier
If uncertainty mode is enabled, probability headers are added (one per solution):
##phapcompass_solution=<ID=i,Probability=p_i>
GT formatting
- Phased alleles use pipes: 0|1|0
- Unphased alleles use slashes: 0/1/0
PS formatting
- Integer block ID for phased SNPs
- . for unphased positions
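The GT and PS conventions above can be read back with a few lines of Python. This is a minimal sketch for a single-solution `GT:PS` sample value; the function name `parse_gt_ps` is hypothetical, and it assumes the sample column contains exactly these two fields.

```python
import re


def parse_gt_ps(sample_field):
    """Split a 'GT:PS' sample value into (alleles, phased, phase_set).

    alleles   -- list of ints (None for a missing '.' allele)
    phased    -- True if the genotype uses '|' separators
    phase_set -- int block ID, or None when PS is '.'
    """
    gt, ps = sample_field.split(":")
    phased = "|" in gt
    alleles = [None if a == "." else int(a) for a in re.split(r"[|/]", gt)]
    phase_set = None if ps == "." else int(ps)
    return alleles, phased, phase_set
```

For example, `parse_gt_ps("0|1|0:3529")` yields the alleles `[0, 1, 0]`, a phased flag of `True`, and phase set `3529`, while `parse_gt_ps("0/1/0:.")` yields the same alleles with `False` and `None`.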
Multisolution output (uncertainty mode)
If --uncertainty N is used, GT and PS fields for different solutions appear separated by :, and probabilities appear in the VCF header only:
GT:PS
0|0|1:3529 : 0|1|0:3529 : 1|0|0:3529
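A multisolution sample value like the one above can be separated into per-solution (GT, PS) pairs. The sketch below assumes the layout shown in this example, that is, alternating GT and PS tokens separated by colons with optional surrounding spaces; the function name `split_solutions` is hypothetical.

```python
def split_solutions(sample_field):
    """Return a list of (GT, PS) string pairs, one per sampled solution.

    Assumption: tokens alternate GT, PS, GT, PS, ... and are separated
    by ':' with optional whitespace around the separator.
    """
    tokens = [token.strip() for token in sample_field.split(":")]
    if len(tokens) % 2 != 0:
        raise ValueError(f"expected GT/PS pairs, got {len(tokens)} tokens")
    return [(tokens[i], tokens[i + 1]) for i in range(0, len(tokens), 2)]
```

The order of the returned pairs follows the order in the sample column, which can then be matched against the solution IDs in the header.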
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
##phapcompass_solution=<ID=1,Probability=0.812345>
##phapcompass_solution=<ID=2,Probability=0.187655>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
Chr1 3529 . A T . PASS . GT:PS 0|0|1:0
Chr1 3781 . G C . PASS . GT:PS 1|0|0:0
Chr1 5934 . A T . PASS . GT:PS 1|0|0:0
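When uncertainty mode is enabled, the per-solution probabilities can be recovered from the header lines. This is a minimal sketch that matches the `##phapcompass_solution=<ID=i,Probability=p_i>` pattern shown above; the function name `read_solution_probabilities` is hypothetical.

```python
import re

# Matches header lines such as:
#   ##phapcompass_solution=<ID=1,Probability=0.812345>
_SOLUTION_RE = re.compile(
    r"^##phapcompass_solution=<ID=(\d+),Probability=([0-9.eE+-]+)>$"
)


def read_solution_probabilities(lines):
    """Return {solution_id: probability} parsed from VCF header lines."""
    probabilities = {}
    for line in lines:
        match = _SOLUTION_RE.match(line.strip())
        if match:
            probabilities[int(match.group(1))] = float(match.group(2))
    return probabilities
```

Applied to the example header above, this gives solution 1 a probability of 0.812345 and solution 2 a probability of 0.187655. For a gzipped output file, the lines can be read with `gzip.open(path, "rt")` from the standard library.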
A subset of our simulated polyploid benchmarking data is publicly available:
Zenodo dataset:
https://zenodo.org/records/17667753
The remaining datasets will be released upon acceptance of the manuscript.
If you use pHapCompass, please cite our preprint:
Hosseini et al.
pHapCompass: Probabilistic Assembly and Uncertainty Quantification of Polyploid Haplotype Phase
arXiv:2512.04393
https://doi.org/10.48550/arXiv.2512.04393
@article{hosseini2025phapcompass,
title={pHapCompass: Probabilistic Assembly and Uncertainty Quantification of Polyploid Haplotype Phase},
author={Hosseini, Marjan and Veiner, Ella and Bergendahl, Thomas and Yasenpoor, Tala and Smith, Zane and Staton, Margaret and Aguiar, Derek},
journal={arXiv preprint arXiv:2512.04393},
year={2025}
}
For questions or issues, please open a GitHub issue on the project repository.