malariagen/vector-taxon-classifier-prediction

Taxon Classifier: Mosquito Species Prediction Library

This classifier uses genomic data from the MalariaGEN Vector Observatory, a comprehensive resource for studying malaria vector populations worldwide.

Taxon Classifier is a Python library for classifying mosquito species from genomic data. It provides tools for loading, validating, and predicting species from Zarr-based genomic datasets, and is accessible through both a Python API and a command-line interface.

Overview

The Taxon Classifier is designed to make genomic-based species prediction straightforward and efficient. It uses a collection of trained models, each focused on a specific genomic partition, to classify mosquito samples. The library is built to handle both local and remote datasets, making it flexible for various research and operational workflows.
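
This README does not describe how the per-partition models are combined into a single call. Purely as an illustration, one common way to combine several classifiers is a majority vote. The function name `majority_vote`, the partition IDs, and the labels below are hypothetical and are not taken from the library.

```python
from collections import Counter
from typing import Dict, Tuple


def majority_vote(partition_calls: Dict[str, str]) -> Tuple[str, float]:
    """Return the most common label and the fraction of partitions that agree.

    `partition_calls` maps a partition ID to the label predicted by that
    partition's model. This is an illustrative scheme, not the library's rule.
    """
    if not partition_calls:
        raise ValueError("at least one partition call is required")
    counts = Counter(partition_calls.values())
    label, votes = counts.most_common(1)[0]
    return label, votes / len(partition_calls)


# Hypothetical partition IDs and labels, for illustration only.
calls = {"part_1": "label_1", "part_2": "label_1", "part_3": "label_2"}
label, agreement = majority_vote(calls)
print(f"{label} ({agreement:.0%} of partitions agree)")
```

The agreement fraction is loosely analogous to the partitions_used/total_partitions ratio printed in the Quickstart, although the two are not necessarily computed the same way.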

Features

  • Python API & CLI: Use the library directly in your Python scripts or from the command line.
  • Local & URL Support: Load data from local Zarr files or directly from URLs.
  • Built-in Data Validation: Automatically validates input datasets to ensure compatibility and integrity.
  • Advanced Prediction Control: Customize predictions by selecting the top N classifiers or a specific set of genomic partitions.
  • Batch Processing Pipeline: Process multiple Zarr files in a single run with progress tracking and parallel processing.
  • Directory Processing: Automatically process all Zarr files in a directory with recursive search.
  • URL Batch Processing: Process multiple URLs simultaneously with automatic download management.
  • Progress Tracking: Real-time progress updates with ETA calculations for batch operations.
  • Parallel Processing: Support for thread-based and process-based parallelism for improved performance.
  • Result Aggregation: Combine and analyze results from multiple files with comprehensive reporting.
  • Export Options: Save results in multiple formats (JSON, CSV, text, joblib) with detailed reports.
  • Verbose & Quiet Logging: Choose between detailed, PyTorch-style logging or a quiet mode for cleaner output.
  • Memory Management: Automatic and manual memory cleanup to manage resources efficiently.
  • Utility Functions: Includes helpers for managing temporary files and inspecting the library's configuration.

Installation

To get started, clone the repository and install the required dependencies using the provided requirements.txt file.

git clone <repository_url>
cd vector-taxon-classifier-prediction
pip install -r src/requirements.txt

Quickstart

Here's a simple example of the end-to-end prediction workflow. This snippet initializes the classifier, loads data from a remote URL, and runs a prediction using the top 5 classifiers.

import sys
sys.path.append('src')
from main import Taxon_Classifier

# 1. Initialize the classifier
classifier = Taxon_Classifier()

# 2. Load and validate data from a URL
# This will download the file to a local './temp_downloads' directory
url = "https://vo_agam_output.cog.sanger.ac.uk/AR0047-C.gatk.zarr.zip"
data = classifier.input(url)

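# 3. Run prediction only if loading and validation succeeded
if data["status"] == "success":
    results = classifier.predict(data, top_n_classifiers=5)

    # 4. Print results, or report the status if prediction failed
    if results["status"] == "success":
        for sample_id, result in results["results"].items():
            print(f"Sample: {sample_id}")
            print(f"  Prediction: {result['prediction']}")
            print(f"  Partitions Used: {result['partitions_used']}/{result['total_partitions']}")
    else:
        print(f"Prediction failed with status: {results['status']}")
else:
    print(f"Loading failed with status: {data['status']}")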
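
When a dataset contains many samples, a per-label tally is often easier to read than one block per sample. The sketch below assumes only the keys that the Quickstart reads ("status", "results", "prediction"). The helper name `summarize_predictions` and the example dictionary are made up for illustration.

```python
from collections import Counter
from typing import Any, Dict


def summarize_predictions(results: Dict[str, Any]) -> Counter:
    """Count how many samples were assigned to each predicted label."""
    if results.get("status") != "success":
        raise RuntimeError(f"prediction did not succeed: {results.get('status')!r}")
    return Counter(r["prediction"] for r in results["results"].values())


# Made-up results dictionary with the same keys the Quickstart reads.
example = {
    "status": "success",
    "results": {
        "sample_A": {"prediction": "label_1", "partitions_used": 5, "total_partitions": 5},
        "sample_B": {"prediction": "label_1", "partitions_used": 4, "total_partitions": 5},
        "sample_C": {"prediction": "label_2", "partitions_used": 5, "total_partitions": 5},
    },
}

for label, n_samples in summarize_predictions(example).most_common():
    print(f"{label}: {n_samples} sample(s)")
```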

Tutorial Notebooks

For a detailed, step-by-step guide on using all features of this library, please refer to the Jupyter notebooks located in the /notebooks directory.

  • 01_Basic_Setup.ipynb: Installation and initialization.
  • 02_Data_Loading_and_Validation.ipynb: Loading local/remote data and understanding validation.
  • 03_Basic_Prediction.ipynb: Standard end-to-end prediction workflow.
  • 04_Advanced_Prediction.ipynb: Using advanced prediction options.
  • 05_CLI_and_Utilities.ipynb: Using the CLI and utility functions.
  • 06_Batch_Processing.ipynb: Batch processing pipeline with progress tracking and parallel processing.

API Reference

The core of the library is the Taxon_Classifier class.

Taxon_Classifier(n_jobs: int = 1, log_level: str = "INFO")

Initializes the classifier.

  • n_jobs: The number of parallel jobs to run for predictions.
  • log_level: The logging level for console output ("DEBUG", "INFO", "WARNING", "ERROR").

.input(data_path: Union[str, Path]) -> Dict

Loads and validates a Zarr dataset from a local path or URL.

  • Returns: A dictionary containing the loaded data, sample IDs, and validation results.

.predict(data: Dict, classifier_ids: Optional[List[str]] = None, top_n_classifiers: int = 50, verbose: bool = True) -> Dict

Runs the prediction on the loaded data.

  • data: The dictionary returned by the .input() method.
  • classifier_ids: A list of specific partition IDs to use for prediction.
  • top_n_classifiers: An integer to use only the top N classifiers.
  • verbose: A boolean to enable or disable detailed logging.
  • Returns: A dictionary with prediction results.

.cleanup_memory(verbose: bool = True) -> Dict

Manually clears model caches and triggers garbage collection.

.cleanup_temp_files() -> Dict

Deletes all files from the ./temp_downloads directory.

.get_partitions() -> List[str]

Returns a list of all available partition IDs.

.help() -> Dict

Returns a dictionary with detailed information about the classifier's parameters and configuration.

.process_batch(input_paths, **kwargs) -> Dict

Processes multiple Zarr files as a batch, with progress tracking and optional parallel processing.

  • input_paths: List of paths to Zarr files or URLs
  • max_workers: Maximum number of parallel workers (default: 1)
  • parallel_mode: "thread", "process", or "sequential" (default: "thread")
  • output_directory: Directory to save results (default: ./batch_results)
  • output_format: Output format for individual results (default: "json")
  • classifier_ids: Specific partitions to use (default: all available)
  • top_n_classifiers: Top N classifiers to use (default: 50)
  • verbose: Enable detailed logging (default: True)
  • save_individual_results: Save results for each file individually (default: True)
  • save_summary: Save batch summary (default: True)
  • cleanup_temp_files: Clean up temporary downloaded files (default: True)
  • progress_callback: Optional callback function for progress updates
  • Returns: Batch processing results with summary and aggregated predictions

.process_directory(directory_path, **kwargs) -> Dict

Processes all Zarr files in a directory that match file_pattern.

  • directory_path: Path to directory containing Zarr files
  • file_pattern: File pattern to match (default: "*.zarr.zip")
  • recursive: Search subdirectories recursively (default: True)
  • **kwargs: Additional arguments passed to process_batch()
  • Returns: Batch processing results
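
To preview which files a given file_pattern and recursive setting would pick up, the matching can be approximated with pathlib. This is a sketch of the idea and not the library's own discovery code. `find_zarr_files` is a hypothetical helper, and it assumes the pattern is an ordinary glob.

```python
import tempfile
from pathlib import Path
from typing import List


def find_zarr_files(directory: str, file_pattern: str = "*.zarr.zip",
                    recursive: bool = True) -> List[Path]:
    """Return paths under `directory` matching the pattern, in sorted order."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(directory)
    # rglob() descends into subdirectories; glob() stays at the top level.
    matches = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    return sorted(matches)


# Build a throwaway directory tree to demonstrate both modes.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.zarr.zip").touch()
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "nested" / "b.zarr.zip").touch()
    (Path(tmp) / "notes.txt").touch()
    print("recursive:", [p.name for p in find_zarr_files(tmp)])
    print("top level:", [p.name for p in find_zarr_files(tmp, recursive=False)])
```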

.process_urls_batch(urls, **kwargs) -> Dict

Processes multiple URLs as a batch.

  • urls: List of URLs to Zarr files
  • **kwargs: Additional arguments passed to process_batch()
  • Returns: Batch processing results
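
URL lists assembled from spreadsheets or manifests often contain duplicates or stray entries. A light pre-filter, sketched below, keeps only distinct http(s) URLs before they are passed to process_urls_batch(). `clean_url_list` is a hypothetical helper. Whether the library performs similar checks itself is not stated here.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def clean_url_list(urls: Iterable[str]) -> List[str]:
    """Drop duplicates and anything that is not an http(s) URL, keeping order."""
    seen = set()
    cleaned: List[str] = []
    for url in urls:
        url = url.strip()
        parsed = urlparse(url)
        if parsed.scheme in ("http", "https") and parsed.netloc and url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


candidates = [
    "https://url1.com/file1.zarr.zip",
    "https://url1.com/file1.zarr.zip",   # duplicate
    "  https://url2.com/file2.zarr.zip ",  # surrounding whitespace
    "/local/path/file3.zarr.zip",        # not a URL
]
print(clean_url_list(candidates))
```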

Command-Line Interface (CLI) Reference

The CLI provides access to the library's main features from the terminal.

Command      Description
partitions   Lists all available partition IDs.
predict      Runs a prediction on a local Zarr file.
batch        Process multiple Zarr files in batch.
batch-dir    Process all Zarr files in a directory.
batch-urls   Process multiple URLs in batch.
cleanup      Deletes all temporary downloaded files.
memory       Clears model caches and frees up memory.
help         Displays detailed help and parameter information.

Example commands:

# Single file prediction
python src/cli.py predict /path/to/your/data.zarr.zip --top-n-classifiers 10 --quiet

# Batch process multiple files
python src/cli.py batch file1.zarr.zip file2.zarr.zip --max-workers 4 --parallel-mode thread

# Process all files in a directory
python src/cli.py batch-dir /path/to/zarr/files --recursive --output-directory ./results

# Process multiple URLs
python src/cli.py batch-urls https://url1.com/file1.zarr.zip https://url2.com/file2.zarr.zip
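
The CLI can also be driven from a Python script, for example inside a workflow manager. The sketch below only builds the argument list for the predict command, using the --top-n-classifiers and --quiet flags shown above. `build_predict_command` is a hypothetical helper, and the command is assumed to be run from the repository root.

```python
import sys
from typing import List, Optional


def build_predict_command(zarr_path: str, top_n: Optional[int] = None,
                          quiet: bool = False) -> List[str]:
    """Build the argument list for `python src/cli.py predict ...`."""
    cmd = [sys.executable, "src/cli.py", "predict", zarr_path]
    if top_n is not None:
        cmd += ["--top-n-classifiers", str(top_n)]
    if quiet:
        cmd.append("--quiet")
    return cmd


cmd = build_predict_command("/path/to/your/data.zarr.zip", top_n=10, quiet=True)
print(" ".join(cmd[1:]))
# To execute it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.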