Native HMM segmentation by etal · Pull Request #1001 · etal/cnvkit

etal · 2026-02-18T06:13:30Z

No description provided.

Replace the pomegranate-based HMM with a direct numpy/scipy implementation featuring distance-dependent transitions, joint (log2, BAF) emissions parameterized by (purity, ploidy), and grid search over marginal likelihood. This removes ~500MB of transitive dependencies (pomegranate, PyTorch, numba, llvmlite). Key changes: - hmm.py: Forward/Viterbi algorithms in log space, batched grid search vectorized across all (purity, ploidy) grid points per arm - commands.py: Add --purity-output flag to segment command for saving purity/ploidy estimates as TSV - Remove pomegranate from all dependency and environment files - Update documentation to reflect new HMM implementation - Remove stale pomegranate ImportWarning workarounds from test files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…icks Remove run_exports with Jinja2 pin_subpackage template from conda meta.yaml -- not appropriate for a pure Python package and caused check-yaml hook failure. Fix single-backtick RST inline code to double-backtick. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Skip purity/ploidy grid search when no autosomal chromosomes are found (e.g. chrX/chrY/chrM-only data), falling back to purity=1.0, ploidy=2.0 with a warning. Rename state_cn to state_indices for clarity, since the variable holds HMM state indices, not copy number values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Use aggregated minor allele counts and read depths (from VCF) instead of derived mirrored BAF ratios for the HMM emission model. Beta-binomial naturally accounts for sequencing depth, respects the [0,1] domain, and eliminates the mirroring mismatch where observations were >= 0.5 but expected values could be < 0.5. - Add VariantArray.baf_counts_by_ranges() to aggregate counts per bin - Add _betabinom_logpmf() vectorized helper using gammaln/betaln - Rename _expected_baf → _expected_minor_freq (corrects misleading name) - Update log_emission_probs, _batched_emission_baf, prepare_observations, grid_search_purity_ploidy, and segment_hmm for count-based interface - Clamp alt_count to [0, depth] for robustness against inconsistent VCFs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

etal · 2026-02-18T06:13:51Z

@claude Please review.

claude · 2026-02-18T06:14:07Z

Claude finished @etal's task —— View job

PR Review: Native HMM Segmentation

Reviewing PR #1001 which implements native HMM segmentation to replace the pomegranate dependency.

Review Checklist:

Read project guidelines (CLAUDE.md)
Review main HMM implementation changes (cnvlib/segmentation/hmm.py)
Analyze test updates and coverage
Check dependency removal changes
Review documentation updates
Assess overall code quality and performance implications
Provide comprehensive review feedback

Review Summary

This PR represents a significant and well-executed refactoring that removes the pomegranate dependency and implements a native HMM segmentation algorithm using pure numpy/scipy. The implementation is mathematically sound and follows good software engineering practices.

✅ Strengths

Architecture & Design:

Clean native implementation: The new HMM uses only numpy/scipy, eliminating the external pomegranate dependency
Improved BAF modeling: Replaces Gaussian BAF emission with beta-binomial distribution on raw allele counts, which is more appropriate for count data
Efficient grid search: Batched computation across purity/ploidy grid points for better performance
Modular design: Well-separated functions for emissions, transitions, Viterbi, and forward algorithms

Mathematical Soundness:

Distance-dependent transitions: p_stay(d) = exp(-d/L) with copy number distance penalties
Proper emission models: Gaussian for log2 ratios, beta-binomial for allele counts
Numerical stability: Uses logsumexp, gammaln, betaln for stable log-space computations
Edge case handling: Proper handling of zero-depth variants, single-bin arms, empty autosomes

Code Quality:

Type annotations: Comprehensive type hints with numpy typing
Error handling: Robust handling of edge cases and invalid data
Documentation: Clear docstrings explaining parameters and mathematical formulations
Modern Python: Uses PEP 604 union syntax, proper match/case statements

🔧 Implementation Details

Key algorithmic improvements:

cnvlib/segmentation/hmm.py:103-121: New _betabinom_logpmf function for numerically stable beta-binomial calculations
cnvlib/segmentation/hmm.py:123-189: Enhanced emission probability calculation with joint log2/BAF modeling
cnvlib/segmentation/hmm.py:574-668: Efficient batched grid search implementation
cnvlib/vary.py:63-82: New baf_counts_by_ranges method for aggregating allele counts

Dependency management:

✅ Clean removal of pomegranate >=1.0.0 from requirements/core.txt and requirements/min.txt
✅ Removal of pomegranate-related warnings from test files
✅ Updated conda environment files consistently

🧪 Testing & Validation

Test coverage:

✅ Existing HMM tests in test/test_commands.py:1034-1050 remain functional
✅ Tests cover all HMM variants: hmm, hmm-tumor, hmm-germline, and joint log2+BAF mode
⚠️ Missing unit tests: No specific tests for new mathematical functions (_betabinom_logpmf, emission calculations, grid search)

Code quality improvements:

✅ Removed pomegranate import warnings
✅ Applied consistent code formatting (ruff)
✅ Updated type annotations throughout

🎯 Suggestions for Improvement

Add unit tests for core mathematical functions:

def test_betabinom_logpmf(self):
    """Test beta-binomial log PMF calculation."""

def test_emission_probabilities(self):
    """Test log emission probability calculations."""

def test_grid_search_batching(self):
    """Test batched grid search efficiency."""

Performance benchmarking: Consider adding timing comparisons vs. the old pomegranate implementation
Documentation: The algorithm change is significant enough to warrant a note in the changelog or documentation about the improved BAF modeling

⚠️ Minor Issues

cnvlib/segmentation/hmm.py:28: BETABINOM_RHO = 200.0 - Consider adding a comment explaining this concentration parameter choice
cnvlib/segmentation/hmm.py:703: The docstring mentions "optional joint (log2, BAF) emissions" but could be clearer about the beta-binomial model
cnvlib/vary.py:76: The clamping of alt_count to [0, depth] is good defensive programming but might mask VCF data quality issues

📊 Performance Considerations

Expected improvements:

✅ Reduced dependencies and memory usage (no pomegranate)
✅ Vectorized operations should be faster than pomegranate's general-purpose HMM
✅ Batched grid search is more efficient than individual parameter evaluations

Potential concerns:

The beta-binomial calculations are more computationally intensive than Gaussian, but this is appropriate for the data model

🎉 Conclusion

This is an excellent refactoring that removes a complex external dependency while improving the mathematical foundation of the HMM segmentation. The beta-binomial emission model is more appropriate for count data than the previous Gaussian BAF model, and the implementation is robust and well-engineered.

Recommendation: ✅ APPROVE

The core changes are sound, tests pass, and this represents a significant improvement to the codebase architecture and mathematical modeling. The minor suggestions above are optional enhancements rather than blocking issues.

codecov · 2026-02-18T06:19:39Z

Codecov Report

❌ Patch coverage is 83.00654% with 52 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.43%. Comparing base (8246923) to head (c57a2c2).
⚠️ Report is 9 commits behind head on master.

Files with missing lines	Patch %	Lines
cnvlib/segmentation/hmm.py	84.32%	38 Missing and 7 partials ⚠️
cnvlib/commands.py	16.66%	5 Missing ⚠️
cnvlib/vary.py	83.33%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1001      +/-   ##
==========================================
+ Coverage   61.68%   62.43%   +0.74%     
==========================================
  Files          72       72              
  Lines        6968     7168     +200     
  Branches     1213     1239      +26     
==========================================
+ Hits         4298     4475     +177     
- Misses       2296     2316      +20     
- Partials      374      377       +3

Flag	Coverage Δ
unittests	`62.43% <83.00%> (+0.74%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

16 tests covering _betabinom_logpmf (vs scipy reference, zero-depth, broadcasting, extreme parameters), _expected_minor_freq (diploid het, LOH, CN=0, impure, all-states bound check), log_emission_probs (log2-only, with BAF counts, zero-depth neutrality, het-favors-diploid), and _batched_emission_baf (shape, zero-depth zeroed). Also improve BETABINOM_RHO comment and segment_hmm docstring clarity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fixes the RUF059 lint error.

etal and others added 4 commits February 17, 2026 19:54

etal and others added 3 commits February 18, 2026 09:26

segmetrics: Fix typecheck error from np.percentile's py3.12 stub

1530c21

hmm: prefix unused unpacked variables with an underscore

11d2c8a

Fixes the RUF059 lint error.

etal force-pushed the hmmnew branch from 76f1057 to 11d2c8a Compare February 19, 2026 04:26

Remove superfluous file

c57a2c2

etal merged commit f1e8e43 into master Feb 19, 2026
14 checks passed

etal deleted the hmmnew branch February 19, 2026 04:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Native HMM segmentation#1001

Native HMM segmentation#1001
etal merged 8 commits intomasterfrom
hmmnew

etal commented Feb 18, 2026

Uh oh!

etal commented Feb 18, 2026

Uh oh!

claude bot commented Feb 18, 2026 •

edited

Loading

Uh oh!

codecov bot commented Feb 18, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

etal commented Feb 18, 2026

Uh oh!

etal commented Feb 18, 2026

Uh oh!

claude bot commented Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Native HMM Segmentation

Review Checklist:

Review Summary

✅ Strengths

🔧 Implementation Details

🧪 Testing & Validation

🎯 Suggestions for Improvement

⚠️ Minor Issues

📊 Performance Considerations

🎉 Conclusion

Uh oh!

codecov bot commented Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude bot commented Feb 18, 2026 •

edited

Loading

codecov bot commented Feb 18, 2026 •

edited

Loading