Version 2.1 - Refactored Edition with Streamlined ML Architecture
intronIC is a bioinformatics tool for extracting and classifying intron sequences as U12-type (minor) or U2-type (major) using a support vector machine trained on position-weight matrix scores.
pip install intronIC# Classify introns (default model loaded automatically)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8
# Extract sequences only (no classification)
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8
# Train a custom model (optional - most users don't need this)
intronIC train -n my_model -p 8# Quick installation test using bundled test data
intronIC test -p 4
# Or show where test data is located
intronIC test --show-only- Changelog - Release notes and version history
For complete documentation, see the intronIC Wiki:
- Quick Start Guide - Installation, dependencies, resource usage
- Overview - Classification approach and scientific background
- Usage Info - Complete CLI reference
- Output Files - File formats and interpretation
- Technical Details - Algorithm and ML architecture
- Example Usage - Common workflows
- About - Background and motivation
- Improved cross-species model: Default model now uses isotonic calibration for better accuracy
intronIC testcommand: Quick installation validation with bundled test data- Bug fixes: Test command metrics, sklearn warnings completely suppressed
- See CHANGELOG.md for full release history
This refactored version maintains 100% algorithmic fidelity and CLI compatibility with the original intronIC while providing a modernized, maintainable codebase:
- Corrected ML Architecture: Fixed double-scaling issue and train/test mismatch
- Single scaling step via RobustScaler with centering (removes composition bias)
- Configurable augmented features (5D standard or custom)
- Two-stage optimization (C via balanced_accuracy, calibration via log-loss)
- Modular Architecture: Organized into logical packages instead of a single 6,000+-line file
- Enhanced Code Quality: Type hints throughout, immutable data structures, better error handling
- Bug Fixes: Corrected data leakage in z-score normalization, fixed type_id assignment
- Modern Tooling: Support for
pixianduvpackage managers - Improved Documentation: Comprehensive wiki and inline documentation
- SVM-based classification with probability scores (0-100%)
- Default pretrained model loaded automatically - works for virtually all species
- Streaming mode (default) for ~85% memory reduction on large genomes
- Parallel processing for improved performance (
-p 8recommended) - Fast runtimes: ~6-10 minutes for human genome with default settings
- Comprehensive metadata including phase, position, parent gene/transcript
Most eukaryotic introns (~99.5%) are spliced by the major (U2-type) spliceosome, while a small fraction (~0.5%) are spliced by the minor (U12-type) spliceosome. U12-type introns have:
- Highly conserved TCCTTAAC branch point motif
- Terminal dinucleotides: AT-AC (~25%) or GT-AG (~75%)
- Functional importance and evolutionary conservation
intronIC identifies U12-type introns using:
- PWM Scoring: Apply position-weight matrices to 5' splice site, branch point, and 3' splice site
- Normalization: Convert raw scores to z-scores (prevents data leakage)
- SVM Classification: Linear SVM with balanced class weights outputs probability scores
For detailed algorithm description, see the Technical Details wiki page.
If you use intronIC in your research, please cite:
Devlin C Moyer, Graham E Larue, Courtney E Hershberger, Scott W Roy, Richard A Padgett. Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research, Volume 48, Issue 13, 27 July 2020, Pages 7066–7078. https://doi.org/10.1093/nar/gkaa464
- Documentation: intronIC Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
git clone https://github.com/glarue/intronIC.git
cd intronIC
make install # Set up development environment
make test # Run testsintronIC is released under the GNU General Public License v3.0.
