Skip to content

Conversation

@svdrecbd
Copy link

This PR introduces critical algorithmic and I/O optimizations to the kb_python core processing paths. By replacing sequential Python loops with vectorized operations and implementing memory-safe streaming, these changes enable the tool to scale gracefully to multi-million cell datasets.

Optimizations:

  • collapse_anndata (Vectorized): Replaced manual sparse column loops with a single sparse matrix-matrix multiplication (X @ S).
    • Performance: Achieved a ~2900x speedup on a 100k sample and reduced full-scale (6.7M cells) processing time from ~1 hour to 3.4 seconds.
  • do_sum_matrices (1-Pass Merge): Switched to a single-pass streaming merge with a temporary body file.
    • Performance: Improved runtime by 1.72x while maintaining O(1) memory safety.
    • Integrity: Uses strict integer arithmetic when processing count matrices to ensure bit-exact precision (addressing float rounding risks).
  • generate_kite_fasta (KITE Ref Gen): Optimized collision detection from $O(N^2)$ to $O(N)$ using dictionary-based lookups.
    • Performance: Reduced runtime for 10,000 features from ~80s to ~3.5s.
  • validate_mtx: Replaced full-file loading with header-only inspection using scipy.io.mminfo.
    • Performance: Reduced validation time by ~1800x, turning a linear file read into a constant-time check.

Dataset Used for Validation:
All performance metrics and parity checks were conducted using the following real-world dataset:

Scientific Parity & Compliance:

  • Numerical Parity: Confirmed exact scientific parity ((X_old != X_new).nnz == 0) across all optimized paths using the 6.7M cell dataset.
  • MatrixMarket Compliance: Body lines are explicitly formatted as integers when multimapping is disabled, ensuring strict compliance with the .mtx format.
  • Regression Suite: All existing tests pass. Added new regression tests for do_sum_matrices covering integer formatting and complex sparse overlapping merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant