This classifier uses genomic data from the MalariaGEN Vector Observatory, a comprehensive resource for studying malaria vector populations worldwide.
A high-performance Python library for classifying mosquito species using genomic data. This library provides a complete toolkit for loading, validating, and predicting species from Zarr-based genomic datasets, accessible through both a Python API and a command-line interface.
The Taxon Classifier is designed to make genomic-based species prediction straightforward and efficient. It uses a collection of trained models, each focused on a specific genomic partition, to classify mosquito samples. The library is built to handle both local and remote datasets, making it flexible for various research and operational workflows.
- Python API & CLI: Use the library directly in your Python scripts or from the command line.
- Local & URL Support: Load data from local Zarr files or directly from URLs.
- Built-in Data Validation: Automatically validates input datasets to ensure compatibility and integrity.
- Advanced Prediction Control: Customize predictions by selecting the top N classifiers or a specific set of genomic partitions.
- Batch Processing Pipeline: Process multiple Zarr files in a single run with progress tracking and parallel processing.
- Directory Processing: Automatically process all Zarr files in a directory with recursive search.
- URL Batch Processing: Process multiple URLs simultaneously with automatic download management.
- Progress Tracking: Real-time progress updates with ETA calculations for batch operations.
- Parallel Processing: Support for thread-based and process-based parallelism for improved performance.
- Result Aggregation: Combine and analyze results from multiple files with comprehensive reporting.
- Export Options: Save results in multiple formats (JSON, CSV, text, joblib) with detailed reports.
- Verbose & Quiet Logging: Choose between detailed, PyTorch-style logging or a quiet mode for cleaner output.
- Memory Management: Automatic and manual memory cleanup to manage resources efficiently.
- Utility Functions: Includes helpers for managing temporary files and inspecting the library's configuration.
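The export options above can also be reproduced manually. The sketch below saves a prediction results dictionary (shaped like the quick-start example's output; the sample ID and species value are hypothetical placeholders) as both JSON and a flat CSV:

```python
import csv
import json

# Hypothetical results dict, shaped like the output of .predict()
results = {
    "status": "success",
    "results": {
        "AR0047-C": {"prediction": "gambiae", "partitions_used": 5, "total_partitions": 50},
    },
}

# JSON: keep the full nested structure
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

# CSV: one flat row per sample
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample_id", "prediction", "partitions_used", "total_partitions"])
    for sample_id, r in results["results"].items():
        writer.writerow([sample_id, r["prediction"], r["partitions_used"], r["total_partitions"]])
```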
To get started, clone the repository and install the required dependencies using the provided requirements.txt file.
```bash
git clone <repository_url>
cd vector-taxon-classifier
pip install -r src/requirements.txt
```

Here's a simple example of the end-to-end prediction workflow. This snippet initializes the classifier, loads data from a remote URL, and runs a prediction using the top 5 classifiers.
```python
import sys
sys.path.append('src')

from main import Taxon_Classifier

# 1. Initialize the classifier
classifier = Taxon_Classifier()

# 2. Load and validate data from a URL
# This will download the file to a local './temp_downloads' directory
url = "https://vo_agam_output.cog.sanger.ac.uk/AR0047-C.gatk.zarr.zip"
data = classifier.input(url)

# 3. Run prediction
if data["status"] == "success":
    results = classifier.predict(data, top_n_classifiers=5)

    # 4. Print results
    if results["status"] == "success":
        for sample_id, result in results['results'].items():
            print(f"Sample: {sample_id}")
            print(f"  Prediction: {result['prediction']}")
            print(f"  Partitions Used: {result['partitions_used']}/{result['total_partitions']}")
```

For a detailed, step-by-step guide to all features of this library, please refer to the Jupyter notebooks located in the /notebooks directory.
- 01_Basic_Setup.ipynb: Installation and initialization.
- 02_Data_Loading_and_Validation.ipynb: Loading local/remote data and understanding validation.
- 03_Basic_Prediction.ipynb: Standard end-to-end prediction workflow.
- 04_Advanced_Prediction.ipynb: Using advanced prediction options.
- 05_CLI_and_Utilities.ipynb: Using the CLI and utility functions.
- 06_Batch_Processing.ipynb: Batch processing pipeline with progress tracking and parallel processing.
The core of the library is the Taxon_Classifier class.
Initializes the classifier.
- n_jobs: The number of parallel jobs to run for predictions.
- log_level: The logging level for console output ("DEBUG", "INFO", "WARNING", "ERROR").
Loads and validates a Zarr dataset from a local path or URL.
- Returns: A dictionary containing the loaded data, sample IDs, and validation results.
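A minimal sketch of guarding on the dictionary returned by .input() before predicting. Only the "status" key appears in the quick-start example above; treat any other assumptions about the payload layout as exactly that:

```python
# Sketch: check whether a dataset loaded and validated successfully.
# The "status" key matches the quick-start example; other keys in the
# returned dict (loaded data, sample IDs, validation results) are not
# inspected here.
def is_loaded(data) -> bool:
    """Return True when .input() reported a successful load."""
    return isinstance(data, dict) and data.get("status") == "success"

print(is_loaded({"status": "success"}))  # True
print(is_loaded({"status": "error"}))    # False
```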
.predict(data: Dict, classifier_ids: Optional[List[str]] = None, top_n_classifiers: int = 50, verbose: bool = True) -> Dict
Runs the prediction on the loaded data.
- data: The dictionary returned by the .input() method.
- classifier_ids: A list of specific partition IDs to use for prediction.
- top_n_classifiers: An integer to use only the top N classifiers.
- verbose: A boolean to enable or disable detailed logging.
- Returns: A dictionary with prediction results.
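The returned dictionary can be consumed programmatically. The helper below (a convenience sketch, not part of the library) formats one line per sample using the keys shown in the quick-start example; the sample ID and species value in the mock dict are hypothetical:

```python
# Sketch: summarize a .predict() results dict. The keys used here
# (status, results, prediction, partitions_used, total_partitions)
# match the quick-start example above.
def summarize(results: dict) -> list:
    """Format one line per sample from a prediction results dict."""
    if results.get("status") != "success":
        return []
    return [
        f"{sample_id}: {r['prediction']} ({r['partitions_used']}/{r['total_partitions']} partitions)"
        for sample_id, r in results["results"].items()
    ]

example = {
    "status": "success",
    "results": {
        "AR0047-C": {"prediction": "gambiae", "partitions_used": 5, "total_partitions": 50},
    },
}
for line in summarize(example):
    print(line)  # AR0047-C: gambiae (5/50 partitions)
```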
Manually clears model caches and triggers garbage collection.
Deletes all files from the ./temp_downloads directory.
Returns a list of all available partition IDs.
Returns a dictionary with detailed information about the classifier's parameters and configuration.
Processes multiple Zarr files in batch with progress tracking and parallel processing.
- input_paths: List of paths to Zarr files or URLs
- max_workers: Maximum number of parallel workers (default: 1)
- parallel_mode: "thread", "process", or "sequential" (default: "thread")
- output_directory: Directory to save results (default: ./batch_results)
- output_format: Output format for individual results (default: "json")
- classifier_ids: Specific partitions to use (default: all available)
- top_n_classifiers: Top N classifiers to use (default: 50)
- verbose: Enable detailed logging (default: True)
- save_individual_results: Save results for each file individually (default: True)
- save_summary: Save batch summary (default: True)
- cleanup_temp_files: Clean up temporary downloaded files (default: True)
- progress_callback: Optional callback function for progress updates
- Returns: Batch processing results with summary and aggregated predictions
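A progress callback for process_batch() might look like the sketch below. The exact callback signature is not documented here, so the (completed, total) form is an assumption; adapt it to whatever signature the library actually invokes:

```python
import time

# Sketch of a progress callback with a simple ETA estimate.
# Assumption: the callback receives (completed, total) counts.
def make_progress_callback():
    start = time.monotonic()

    def on_progress(completed: int, total: int) -> str:
        elapsed = time.monotonic() - start
        rate = completed / elapsed if elapsed > 0 else 0.0
        remaining = (total - completed) / rate if rate > 0 else float("inf")
        msg = f"[{completed}/{total}] ETA: {remaining:.0f}s"
        print(msg)
        return msg

    return on_progress

# Hypothetical usage:
#   classifier.process_batch(paths, progress_callback=make_progress_callback())
```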
Processes all Zarr files in a directory.
- directory_path: Path to directory containing Zarr files
- file_pattern: File pattern to match (default: "*.zarr.zip")
- recursive: Search subdirectories recursively (default: True)
- **kwargs: Additional arguments passed to process_batch()
- Returns: Batch processing results
Processes multiple URLs in batch.
- urls: List of URLs to Zarr files
- **kwargs: Additional arguments passed to process_batch()
- Returns: Batch processing results
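When save_individual_results is enabled, per-file JSON results can be merged into one summary afterwards. The helper below is a convenience sketch, not a library function, and it assumes each JSON file carries a top-level "results" mapping as in the quick-start example:

```python
import json
from pathlib import Path

# Sketch: merge per-sample predictions from every JSON result file
# in a batch output directory into a single dict.
def aggregate_results(output_directory: str) -> dict:
    combined = {}
    for path in sorted(Path(output_directory).glob("*.json")):
        payload = json.loads(path.read_text())
        combined.update(payload.get("results", {}))
    return combined
```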
The CLI provides access to the library's main features from the terminal.
| Command | Description |
|---|---|
| partitions | Lists all available partition IDs. |
| predict | Runs a prediction on a local Zarr file. |
| batch | Processes multiple Zarr files in batch. |
| batch-dir | Processes all Zarr files in a directory. |
| batch-urls | Processes multiple URLs in batch. |
| cleanup | Deletes all temporary downloaded files. |
| memory | Clears model caches and frees up memory. |
| help | Displays detailed help and parameter information. |
Example commands:
```bash
# Single file prediction
python src/cli.py predict /path/to/your/data.zarr.zip --top-n-classifiers 10 --quiet

# Batch process multiple files
python src/cli.py batch file1.zarr.zip file2.zarr.zip --max-workers 4 --parallel-mode thread

# Process all files in a directory
python src/cli.py batch-dir /path/to/zarr/files --recursive --output-directory ./results

# Process multiple URLs
python src/cli.py batch-urls https://url1.com/file1.zarr.zip https://url2.com/file2.zarr.zip
```