Taxon Classifier: Mosquito Species Prediction Library

Taxon Classifier is a high-performance Python library for classifying mosquito species from genomic data. It provides a complete toolkit for loading and validating Zarr-based genomic datasets and predicting species from them, accessible through both a Python API and a command-line interface.

The classifier uses genomic data from the MalariaGEN Vector Observatory, a comprehensive resource for studying malaria vector populations worldwide.

Overview

The Taxon Classifier is designed to make genomic-based species prediction straightforward and efficient. It uses a collection of trained models, each focused on a specific genomic partition, to classify mosquito samples. The library is built to handle both local and remote datasets, making it flexible for various research and operational workflows.
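
The source does not say how the per-partition outputs are combined into a single call. As a minimal sketch, assuming each partition model emits one label (or nothing when it cannot be applied to a sample) and that a simple majority vote decides, the aggregation could look like this. The function name `combine_partition_votes`, the partition IDs, and the labels are hypothetical; only the `prediction`, `partitions_used`, and `total_partitions` keys mirror the result fields printed in the Quickstart.

```python
from collections import Counter
from typing import Dict, Optional


def combine_partition_votes(votes: Dict[str, Optional[str]]) -> dict:
    """Majority vote over per-partition labels (hypothetical aggregation rule).

    `votes` maps a partition ID to the label its model produced, or None
    when that partition could not be used for the sample.
    """
    usable = [label for label in votes.values() if label is not None]
    if not usable:
        prediction = None
    else:
        # most_common(1) returns the label with the highest count; ties go
        # to the label that was encountered first.
        prediction = Counter(usable).most_common(1)[0][0]
    return {
        "prediction": prediction,
        "partitions_used": len(usable),
        "total_partitions": len(votes),
    }


# Placeholder partition IDs and labels, for illustration only.
example = combine_partition_votes(
    {"p1": "species_A", "p2": "species_A", "p3": "species_B", "p4": None}
)
```

The library's actual rule may weight partitions differently; the sketch only shows why a result can report fewer partitions used than the total.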

Features

Python API & CLI: Use the library directly in your Python scripts or from the command line.
Local & URL Support: Load data from local Zarr files or directly from URLs.
Built-in Data Validation: Automatically validates input datasets to ensure compatibility and integrity.
Advanced Prediction Control: Customize predictions by selecting the top N classifiers or a specific set of genomic partitions.
Batch Processing Pipeline: Process multiple Zarr files in a single run with progress tracking and parallel processing.
Directory Processing: Automatically process all Zarr files in a directory with recursive search.
URL Batch Processing: Process multiple URLs simultaneously with automatic download management.
Progress Tracking: Real-time progress updates with ETA calculations for batch operations.
Parallel Processing: Support for thread-based and process-based parallelism for improved performance.
Result Aggregation: Combine and analyze results from multiple files with comprehensive reporting.
Export Options: Save results in multiple formats (JSON, CSV, text, joblib) with detailed reports.
Verbose & Quiet Logging: Choose between detailed, PyTorch-style logging or a quiet mode for cleaner output.
Memory Management: Automatic and manual memory cleanup to manage resources efficiently.
Utility Functions: Includes helpers for managing temporary files and inspecting the library's configuration.

Installation

To get started, clone the repository and install the required dependencies using the provided requirements.txt file.

git clone <repository_url>
cd vector-taxon-classifier
pip install -r src/requirements.txt

Quickstart

Here's a simple example of the end-to-end prediction workflow. This snippet initializes the classifier, loads data from a remote URL, and runs a prediction using the top 5 classifiers.

import sys
sys.path.append('src')
from main import Taxon_Classifier

# 1. Initialize the classifier
classifier = Taxon_Classifier()

# 2. Load and validate data from a URL
# This will download the file to a local './temp_downloads' directory
url = "https://vo_agam_output.cog.sanger.ac.uk/AR0047-C.gatk.zarr.zip"
data = classifier.input(url)

# 3. Run prediction
if data["status"] == "success":
    results = classifier.predict(data, top_n_classifiers=5)

    # 4. Print results
    if results["status"] == "success":
        for sample_id, result in results['results'].items():
            print(f"Sample: {sample_id}")
            print(f"  Prediction: {result['prediction']}")
            print(f"  Partitions Used: {result['partitions_used']}/{result['total_partitions']}")
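
The result dictionary printed above can also be flattened for downstream tools. As a minimal sketch, using only the keys that appear in the Quickstart (`results`, `prediction`, `partitions_used`, `total_partitions`), the helper below writes one CSV row per sample. `results_to_csv` and the `fake_results` stand-in are hypothetical and not part of the library, and the library's own export options may produce a different layout.

```python
import csv
import io


def results_to_csv(results: dict) -> str:
    """Flatten a prediction result dictionary into CSV text (illustrative helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", "prediction", "partitions_used", "total_partitions"])
    # Sort by sample ID so the output order is stable between runs.
    for sample_id, result in sorted(results["results"].items()):
        writer.writerow([
            sample_id,
            result["prediction"],
            result["partitions_used"],
            result["total_partitions"],
        ])
    return buffer.getvalue()


# Stand-in for the dictionary returned by classifier.predict(); values are placeholders.
fake_results = {
    "status": "success",
    "results": {
        "sample_1": {"prediction": "species_A", "partitions_used": 5, "total_partitions": 5},
    },
}
csv_text = results_to_csv(fake_results)
```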

Tutorial Notebooks

For a detailed, step-by-step guide on using all features of this library, please refer to the Jupyter notebooks located in the /notebooks directory.

01_Basic_Setup.ipynb: Installation and initialization.
02_Data_Loading_and_Validation.ipynb: Loading local/remote data and understanding validation.
03_Basic_Prediction.ipynb: Standard end-to-end prediction workflow.
04_Advanced_Prediction.ipynb: Using advanced prediction options.
05_CLI_and_Utilities.ipynb: Using the CLI and utility functions.
06_Batch_Processing.ipynb: Batch processing pipeline with progress tracking and parallel processing.

API Reference

The core of the library is the Taxon_Classifier class.

`Taxon_Classifier(n_jobs: int = 1, log_level: str = "INFO")`

Initializes the classifier.

n_jobs: The number of parallel jobs to run for predictions.
log_level: The logging level for console output ("DEBUG", "INFO", "WARNING", "ERROR").

`.input(data_path: Union[str, Path]) -> Dict`

Loads and validates a Zarr dataset from a local path or URL.

Returns: A dictionary containing the loaded data, sample IDs, and validation results.
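
Because `.input()` accepts either a local path or a URL, calling code sometimes needs to know which one it holds, for example to decide whether temporary downloads should be cleaned up afterwards. The sketch below shows one way to tell them apart with the standard library. The helper `is_remote` is hypothetical, and the library may apply a different check internally.

```python
from pathlib import Path
from typing import Union
from urllib.parse import urlparse


def is_remote(data_path: Union[str, Path]) -> bool:
    """Return True when data_path looks like an http(s) URL (hypothetical helper)."""
    parsed = urlparse(str(data_path))
    # Require both a web scheme and a host, so Windows drive letters such as
    # "C:" and plain relative paths are treated as local.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```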

`.predict(data: Dict, classifier_ids: Optional[List[str]] = None, top_n_classifiers: int = 50, verbose: bool = True) -> Dict`

Runs the prediction on the loaded data.

data: The dictionary returned by the .input() method.
classifier_ids: A list of specific partition IDs to use for prediction (default: None).
top_n_classifiers: Use only the top N classifiers (default: 50).
verbose: Enable or disable detailed logging (default: True).
Returns: A dictionary with prediction results.
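
The source does not specify how the "top N" classifiers are ranked or what happens when both `classifier_ids` and `top_n_classifiers` are supplied. As a sketch under two stated assumptions, namely that the available partitions are already ordered best first and that an explicit `classifier_ids` list takes precedence, the selection could be resolved like this. `select_partitions` is a hypothetical helper, not a library function.

```python
from typing import List, Optional


def select_partitions(
    available: List[str],
    classifier_ids: Optional[List[str]] = None,
    top_n_classifiers: int = 50,
) -> List[str]:
    """Pick the partitions to run (assumed precedence: explicit IDs, then top N)."""
    if classifier_ids is not None:
        # Reject IDs that are not in the available list instead of silently
        # skipping them, so typos surface early.
        unknown = [pid for pid in classifier_ids if pid not in available]
        if unknown:
            raise ValueError(f"Unknown partition IDs: {unknown}")
        return list(classifier_ids)
    if top_n_classifiers < 1:
        raise ValueError("top_n_classifiers must be at least 1")
    # Assumes `available` is ordered best first; slicing past the end is safe.
    return available[:top_n_classifiers]
```

In practice the `available` list would come from `.get_partitions()`; whether that list is ranked is an assumption of this sketch.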

`.cleanup_memory(verbose: bool = True) -> Dict`

Manually clears model caches and triggers garbage collection.
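
For readers unfamiliar with the pattern, clearing a cache and then forcing a garbage collection pass can be sketched as follows. This is not the library's implementation: the standalone function, the `model_cache` dictionary, and the keys of the returned dictionary are all assumptions made for illustration.

```python
import gc
from typing import Dict


def cleanup_memory(model_cache: Dict[str, object], verbose: bool = True) -> dict:
    """Empty a model cache and run the garbage collector (illustrative sketch)."""
    cleared = len(model_cache)
    model_cache.clear()  # drop the references so the models become collectable
    collected = gc.collect()  # returns the number of unreachable objects found
    if verbose:
        print(f"Cleared {cleared} cached models; gc found {collected} unreachable objects")
    return {"status": "success", "models_cleared": cleared, "objects_collected": collected}
```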

`.cleanup_temp_files() -> Dict`

Deletes all files from the ./temp_downloads directory.
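
The behaviour described above can be approximated with `pathlib`. This is a sketch rather than the library's code: the standalone function, its `temp_dir` parameter, and the `files_deleted` key are hypothetical, and it removes only regular files directly inside the directory.

```python
import tempfile
from pathlib import Path


def cleanup_temp_files(temp_dir: str = "./temp_downloads") -> dict:
    """Delete every regular file directly inside temp_dir (illustrative sketch)."""
    root = Path(temp_dir)
    if not root.is_dir():
        # Nothing was ever downloaded, so there is nothing to remove.
        return {"status": "success", "files_deleted": 0}
    deleted = 0
    for entry in root.iterdir():
        if entry.is_file():
            entry.unlink()
            deleted += 1
    return {"status": "success", "files_deleted": deleted}


# Demonstration against a throwaway directory instead of ./temp_downloads.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.zarr.zip").write_bytes(b"")
    demo = cleanup_temp_files(tmp)
    remaining = list(Path(tmp).iterdir())
```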

`.get_partitions() -> List[str]`

Returns a list of all available partition IDs.

`.help() -> Dict`

Returns a dictionary with detailed information about the classifier's parameters and configuration.

`.process_batch(input_paths, **kwargs) -> Dict`

Processes multiple Zarr files in a batch, with progress tracking and optional parallel processing.

input_paths: List of paths to Zarr files or URLs
max_workers: Maximum number of parallel workers (default: 1)
parallel_mode: "thread", "process", or "sequential" (default: "thread")
output_directory: Directory to save results (default: ./batch_results)
output_format: Output format for individual results (default: "json")
classifier_ids: Specific partitions to use (default: all available)
top_n_classifiers: Top N classifiers to use (default: 50)
verbose: Enable detailed logging (default: True)
save_individual_results: Save results for each file individually (default: True)
save_summary: Save batch summary (default: True)
cleanup_temp_files: Clean up temporary downloaded files (default: True)
progress_callback: Optional callback function for progress updates
Returns: Batch processing results with summary and aggregated predictions
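
To make the interaction between `max_workers`, thread-based parallelism, and `progress_callback` concrete, here is a minimal sketch built on `concurrent.futures`. It is not the library's implementation. The names `run_batch`, `process_one`, and `fake_process` are hypothetical, the callback signature `(done, total, eta_seconds)` is an assumption because the source does not document it, and the `"error"` status and `message` key are likewise illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Optional


def run_batch(
    input_paths: List[str],
    process_one: Callable[[str], dict],
    max_workers: int = 1,
    progress_callback: Optional[Callable[[int, int, float], None]] = None,
) -> Dict[str, dict]:
    """Run process_one over every path in a thread pool, reporting progress."""
    start = time.monotonic()
    total = len(input_paths)
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, path): path for path in input_paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one failed file should not stop the batch
                results[path] = {"status": "error", "message": str(exc)}
            if progress_callback is not None:
                elapsed = time.monotonic() - start
                # ETA: average time per finished file times the files remaining.
                eta_seconds = elapsed / done * (total - done)
                progress_callback(done, total, eta_seconds)
    return results


# Stub standing in for "load one file and predict"; it does no real work.
def fake_process(path: str) -> dict:
    if path.endswith("bad.zarr.zip"):
        raise ValueError("unreadable file")
    return {"status": "success"}


progress_log = []
batch = run_batch(
    ["file1.zarr.zip", "file2.zarr.zip", "bad.zarr.zip"],
    fake_process,
    max_workers=2,
    progress_callback=lambda done, total, eta: progress_log.append((done, total)),
)
```

Threads suit this workload when most of the time is spent waiting on downloads or disk; a process pool is the usual alternative when the work is CPU-bound.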

`.process_directory(directory_path, **kwargs) -> Dict`

Processes all Zarr files in a directory.

directory_path: Path to directory containing Zarr files
file_pattern: File pattern to match (default: "*.zarr.zip")
recursive: Search subdirectories recursively (default: True)
**kwargs: Additional arguments passed to process_batch()
Returns: Batch processing results
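
The effect of `file_pattern` and `recursive` can be illustrated with `pathlib` globbing. The helper `find_zarr_files` is hypothetical and only sketches the file discovery step; how the library actually searches the directory is not stated in the source.

```python
import tempfile
from pathlib import Path
from typing import List


def find_zarr_files(
    directory_path: str,
    file_pattern: str = "*.zarr.zip",
    recursive: bool = True,
) -> List[Path]:
    """Return matching files under directory_path in sorted order (sketch)."""
    root = Path(directory_path)
    if not root.is_dir():
        raise NotADirectoryError(f"Not a directory: {root}")
    # rglob descends into subdirectories; glob looks at the top level only.
    matches = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    return sorted(matches)


# Demonstration with empty placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "a.zarr.zip").write_bytes(b"")
    (Path(tmp) / "nested" / "b.zarr.zip").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("ignored")
    all_names = [p.name for p in find_zarr_files(tmp)]
    top_names = [p.name for p in find_zarr_files(tmp, recursive=False)]
```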

`.process_urls_batch(urls, **kwargs) -> Dict`

Processes multiple URLs in a batch.

urls: List of URLs to Zarr files
**kwargs: Additional arguments passed to process_batch()
Returns: Batch processing results
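
Before a URL batch starts, it can help to check the URLs and see where each file would land locally. The sketch below assumes that downloads are stored under `./temp_downloads` using the last path segment of the URL as the file name; the first part follows the Quickstart comment, the second is an assumption. `plan_downloads` is a hypothetical helper and performs no network access.

```python
from pathlib import Path, PurePosixPath
from typing import Dict, List
from urllib.parse import urlparse


def plan_downloads(urls: List[str], download_dir: str = "./temp_downloads") -> Dict[str, Path]:
    """Map each URL to the local path it would be saved to (illustrative sketch)."""
    plan: Dict[str, Path] = {}
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an http(s) URL: {url}")
        # Use the last segment of the URL path as the local file name.
        filename = PurePosixPath(parsed.path).name
        if not filename:
            raise ValueError(f"URL has no file name: {url}")
        plan[url] = Path(download_dir) / filename  # a repeated URL maps to one entry
    return plan


download_plan = plan_downloads([
    "https://url1.com/file1.zarr.zip",
    "https://url2.com/file2.zarr.zip",
])
```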

Command-Line Interface (CLI) Reference

The CLI provides access to the library's main features from the terminal.

Command	Description
`partitions`	Lists all available partition IDs.
`predict`	Runs a prediction on a local Zarr file.
`batch`	Process multiple Zarr files in batch.
`batch-dir`	Process all Zarr files in a directory.
`batch-urls`	Process multiple URLs in batch.
`cleanup`	Deletes all temporary downloaded files.
`memory`	Clears model caches and frees up memory.
`help`	Displays detailed help and parameter information.

Example commands:

# Single file prediction
python src/cli.py predict /path/to/your/data.zarr.zip --top-n-classifiers 10 --quiet

# Batch process multiple files
python src/cli.py batch file1.zarr.zip file2.zarr.zip --max-workers 4 --parallel-mode thread

# Process all files in a directory
python src/cli.py batch-dir /path/to/zarr/files --recursive --output-directory ./results

# Process multiple URLs
python src/cli.py batch-urls https://url1.com/file1.zarr.zip https://url2.com/file2.zarr.zip
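
When the CLI needs to be driven from another Python program, the command line can be assembled as a list and handed to `subprocess`. The sketch below uses only the `predict` subcommand and the `--top-n-classifiers` and `--quiet` flags shown in the examples above. The helper `build_predict_command` and its `cli_script` parameter are hypothetical.

```python
import sys
from typing import List, Optional


def build_predict_command(
    data_path: str,
    top_n_classifiers: Optional[int] = None,
    quiet: bool = False,
    cli_script: str = "src/cli.py",
) -> List[str]:
    """Build the argument list for a single-file prediction (illustrative helper)."""
    # sys.executable keeps the call on the same interpreter and environment.
    command = [sys.executable, cli_script, "predict", data_path]
    if top_n_classifiers is not None:
        command += ["--top-n-classifiers", str(top_n_classifiers)]
    if quiet:
        command.append("--quiet")
    return command


command = build_predict_command(
    "/path/to/your/data.zarr.zip", top_n_classifiers=10, quiet=True
)
# To execute it: subprocess.run(command, check=True)
```

Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces.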
