Skip to content

Comments

Native HMM segmentation#1001

Merged
etal merged 8 commits intomasterfrom
hmmnew
Feb 19, 2026
Merged

Native HMM segmentation#1001
etal merged 8 commits intomasterfrom
hmmnew

Conversation

@etal
Copy link
Owner

@etal etal commented Feb 18, 2026

No description provided.

etal and others added 4 commits February 17, 2026 19:54
Replace the pomegranate-based HMM with a direct numpy/scipy implementation
featuring distance-dependent transitions, joint (log2, BAF) emissions
parameterized by (purity, ploidy), and grid search over marginal likelihood.
This removes ~500MB of transitive dependencies (pomegranate, PyTorch, numba,
llvmlite).

Key changes:
- hmm.py: Forward/Viterbi algorithms in log space, batched grid search
  vectorized across all (purity, ploidy) grid points per arm
- commands.py: Add --purity-output flag to segment command for saving
  purity/ploidy estimates as TSV
- Remove pomegranate from all dependency and environment files
- Update documentation to reflect new HMM implementation
- Remove stale pomegranate ImportWarning workarounds from test files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…icks

Remove run_exports with Jinja2 pin_subpackage template from conda
meta.yaml -- not appropriate for a pure Python package and caused
check-yaml hook failure. Fix single-backtick RST inline code to
double-backtick.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Skip purity/ploidy grid search when no autosomal chromosomes are found
(e.g. chrX/chrY/chrM-only data), falling back to purity=1.0, ploidy=2.0
with a warning. Rename state_cn to state_indices for clarity, since the
variable holds HMM state indices, not copy number values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use aggregated minor allele counts and read depths (from VCF) instead of
derived mirrored BAF ratios for the HMM emission model. Beta-binomial
naturally accounts for sequencing depth, respects the [0,1] domain, and
eliminates the mirroring mismatch where observations were >= 0.5 but
expected values could be < 0.5.

- Add VariantArray.baf_counts_by_ranges() to aggregate counts per bin
- Add _betabinom_logpmf() vectorized helper using gammaln/betaln
- Rename _expected_baf → _expected_minor_freq (corrects misleading name)
- Update log_emission_probs, _batched_emission_baf, prepare_observations,
  grid_search_purity_ploidy, and segment_hmm for count-based interface
- Clamp alt_count to [0, depth] for robustness against inconsistent VCFs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@etal
Copy link
Owner Author

etal commented Feb 18, 2026

@claude Please review.

@claude
Copy link

claude bot commented Feb 18, 2026

Claude finished @etal's task —— View job


PR Review: Native HMM Segmentation

Reviewing PR #1001 which implements native HMM segmentation to replace the pomegranate dependency.

Review Checklist:

  • Read project guidelines (CLAUDE.md)
  • Review main HMM implementation changes (cnvlib/segmentation/hmm.py)
  • Analyze test updates and coverage
  • Check dependency removal changes
  • Review documentation updates
  • Assess overall code quality and performance implications
  • Provide comprehensive review feedback

Review Summary

This PR represents a significant and well-executed refactoring that removes the pomegranate dependency and implements a native HMM segmentation algorithm using pure numpy/scipy. The implementation is mathematically sound and follows good software engineering practices.

Strengths

Architecture & Design:

  • Clean native implementation: The new HMM uses only numpy/scipy, eliminating the external pomegranate dependency
  • Improved BAF modeling: Replaces Gaussian BAF emission with beta-binomial distribution on raw allele counts, which is more appropriate for count data
  • Efficient grid search: Batched computation across purity/ploidy grid points for better performance
  • Modular design: Well-separated functions for emissions, transitions, Viterbi, and forward algorithms

Mathematical Soundness:

  • Distance-dependent transitions: p_stay(d) = exp(-d/L) with copy number distance penalties
  • Proper emission models: Gaussian for log2 ratios, beta-binomial for allele counts
  • Numerical stability: Uses logsumexp, gammaln, betaln for stable log-space computations
  • Edge case handling: Proper handling of zero-depth variants, single-bin arms, empty autosomes

Code Quality:

  • Type annotations: Comprehensive type hints with numpy typing
  • Error handling: Robust handling of edge cases and invalid data
  • Documentation: Clear docstrings explaining parameters and mathematical formulations
  • Modern Python: Uses PEP 604 union syntax, proper match/case statements

🔧 Implementation Details

Key algorithmic improvements:

  • cnvlib/segmentation/hmm.py:103-121: New _betabinom_logpmf function for numerically stable beta-binomial calculations
  • cnvlib/segmentation/hmm.py:123-189: Enhanced emission probability calculation with joint log2/BAF modeling
  • cnvlib/segmentation/hmm.py:574-668: Efficient batched grid search implementation
  • cnvlib/vary.py:63-82: New baf_counts_by_ranges method for aggregating allele counts

Dependency management:

  • ✅ Clean removal of pomegranate >=1.0.0 from requirements/core.txt and requirements/min.txt
  • ✅ Removal of pomegranate-related warnings from test files
  • ✅ Updated conda environment files consistently

🧪 Testing & Validation

Test coverage:

  • ✅ Existing HMM tests in test/test_commands.py:1034-1050 remain functional
  • ✅ Tests cover all HMM variants: hmm, hmm-tumor, hmm-germline, and joint log2+BAF mode
  • ⚠️ Missing unit tests: No specific tests for new mathematical functions (_betabinom_logpmf, emission calculations, grid search)

Code quality improvements:

  • ✅ Removed pomegranate import warnings
  • ✅ Applied consistent code formatting (ruff)
  • ✅ Updated type annotations throughout

🎯 Suggestions for Improvement

  1. Add unit tests for core mathematical functions:

    def test_betabinom_logpmf(self):
        """Test beta-binomial log PMF calculation."""
    
    def test_emission_probabilities(self):
        """Test log emission probability calculations."""
    
    def test_grid_search_batching(self):
        """Test batched grid search efficiency."""
  2. Performance benchmarking: Consider adding timing comparisons vs. the old pomegranate implementation

  3. Documentation: The algorithm change is significant enough to warrant a note in the changelog or documentation about the improved BAF modeling

⚠️ Minor Issues

  1. cnvlib/segmentation/hmm.py:28: BETABINOM_RHO = 200.0 - Consider adding a comment explaining this concentration parameter choice
  2. cnvlib/segmentation/hmm.py:703: The docstring mentions "optional joint (log2, BAF) emissions" but could be clearer about the beta-binomial model
  3. cnvlib/vary.py:76: The clamping of alt_count to [0, depth] is good defensive programming but might mask VCF data quality issues

📊 Performance Considerations

Expected improvements:

  • ✅ Reduced dependencies and memory usage (no pomegranate)
  • ✅ Vectorized operations should be faster than pomegranate's general-purpose HMM
  • ✅ Batched grid search is more efficient than individual parameter evaluations

Potential concerns:

  • The beta-binomial calculations are more computationally intensive than Gaussian, but this is appropriate for the data model

🎉 Conclusion

This is an excellent refactoring that removes a complex external dependency while improving the mathematical foundation of the HMM segmentation. The beta-binomial emission model is more appropriate for count data than the previous Gaussian BAF model, and the implementation is robust and well-engineered.

Recommendation: ✅ APPROVE

The core changes are sound, tests pass, and this represents a significant improvement to the codebase architecture and mathematical modeling. The minor suggestions above are optional enhancements rather than blocking issues.


@codecov
Copy link

codecov bot commented Feb 18, 2026

Codecov Report

❌ Patch coverage is 83.00654% with 52 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.43%. Comparing base (8246923) to head (c57a2c2).
⚠️ Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
cnvlib/segmentation/hmm.py 84.32% 38 Missing and 7 partials ⚠️
cnvlib/commands.py 16.66% 5 Missing ⚠️
cnvlib/vary.py 83.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1001      +/-   ##
==========================================
+ Coverage   61.68%   62.43%   +0.74%     
==========================================
  Files          72       72              
  Lines        6968     7168     +200     
  Branches     1213     1239      +26     
==========================================
+ Hits         4298     4475     +177     
- Misses       2296     2316      +20     
- Partials      374      377       +3     
Flag Coverage Δ
unittests 62.43% <83.00%> (+0.74%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

etal and others added 3 commits February 18, 2026 09:26
16 tests covering _betabinom_logpmf (vs scipy reference, zero-depth,
broadcasting, extreme parameters), _expected_minor_freq (diploid het,
LOH, CN=0, impure, all-states bound check), log_emission_probs (log2-only,
with BAF counts, zero-depth neutrality, het-favors-diploid), and
_batched_emission_baf (shape, zero-depth zeroed).

Also improve BETABINOM_RHO comment and segment_hmm docstring clarity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@etal etal merged commit f1e8e43 into master Feb 19, 2026
14 checks passed
@etal etal deleted the hmmnew branch February 19, 2026 04:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant