A CLI tool for easily downloading pathology data and running PathoBench for slide-level benchmarks for pathology foundation models.

RhizoNymph/Patho-Bench-cli

Patho-Bench-cli

Unified downloader for Patho-Bench datasets.

Installation

# Using uv
uv pip install -e .

# Or using pip
pip install -e .

Usage

Download task definitions

First, download the Patho-Bench task definitions from HuggingFace:

patho-bench-cli tasks

List available datasets

# List all providers
patho-bench-cli list

# List datasets for a specific provider
patho-bench-cli list cptac
patho-bench-cli list panda
patho-bench-cli list ovarian_bevacizumab
patho-bench-cli list post_nat_brca
patho-bench-cli list idr

Download slides

# Download CPTAC slides needed for Patho-Bench (dry-run, creates manifest)
patho-bench-cli download cptac -o /path/to/slides

# Actually download
patho-bench-cli download cptac --download -o /path/to/slides

# Download specific datasets
patho-bench-cli download cptac --datasets cptac_ccrcc cptac_brca --download -o /path/to/slides

# Download with per-task symlinks
patho-bench-cli download cptac --download --create-symlinks -o /path/to/slides

# Download PANDA slides
patho-bench-cli download panda --download -o /path/to/slides

# Download Ovarian Bevacizumab Response slides
patho-bench-cli download ovarian_bevacizumab --download -o /path/to/slides

# Download POST-NAT-BRCA slides
patho-bench-cli download post_nat_brca --download -o /path/to/slides

# Download IDR slides (via BioImage Archive)
patho-bench-cli download idr --download -o /path/to/slides

If --datasets is omitted, all datasets for the provider are downloaded.
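When scripting downloads for several providers, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of the tool itself: `build_download_command` is a hypothetical helper that only combines the flags shown above (`--datasets`, `--download`, `--create-symlinks`, `-o`) into an argument list suitable for `subprocess.run`.

```python
from typing import List, Optional, Sequence


def build_download_command(
    provider: str,
    out_dir: str,
    datasets: Optional[Sequence[str]] = None,
    download: bool = False,
    create_symlinks: bool = False,
) -> List[str]:
    """Build the argv list for `patho-bench-cli download` (hypothetical helper)."""
    cmd = ["patho-bench-cli", "download", provider]
    if datasets:
        # Omitting --datasets means "all datasets for the provider".
        cmd += ["--datasets", *datasets]
    if download:
        # Without --download the CLI performs a dry run and only writes a manifest.
        cmd.append("--download")
    if create_symlinks:
        cmd.append("--create-symlinks")
    cmd += ["-o", out_dir]
    return cmd


# Print the dry-run command for CPTAC instead of executing it.
print(" ".join(build_download_command("cptac", "/path/to/slides")))
```

To actually run a command, pass the returned list to `subprocess.run(cmd, check=True)`, which raises an exception if the CLI exits with a non-zero status.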

Full dataset download

Download entire datasets (not just Patho-Bench slides):

patho-bench-cli download cptac --full --datasets cptac_ccrcc
patho-bench-cli download panda --full
patho-bench-cli download ovarian_bevacizumab --full
patho-bench-cli download post_nat_brca --full

Verify and clean slides

Recursively check whether whole-slide image (WSI) files can be opened with OpenSlide:

# Verify slides in a directory
patho-bench-cli verify /path/to/slides

# Verify and delete invalid slides
patho-bench-cli verify /path/to/slides --delete

# Use specific number of parallel jobs
patho-bench-cli verify /path/to/slides --jobs 8
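For readers who want to see what such a check involves, here is a rough sketch of a recursive, parallel slide check. It is not the tool's implementation: the extension list, the function names, and the use of a thread pool are assumptions made for illustration. The only OpenSlide calls used are `openslide.OpenSlide(path)` and `close()`, and the opener is injectable so the logic can be exercised without OpenSlide installed.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Set, Tuple

# Assumed set of WSI extensions; the CLI's actual list is not documented here.
WSI_EXTENSIONS: Set[str] = {".svs", ".tif", ".tiff", ".ndpi", ".mrxs"}


def open_with_openslide(path: Path) -> None:
    """Try to open a slide; raises if OpenSlide cannot read it."""
    import openslide  # imported lazily so the rest of the sketch runs without it

    slide = openslide.OpenSlide(str(path))
    slide.close()


def find_slides(root: Path) -> List[Path]:
    """Recursively collect files whose extension looks like a WSI format."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in WSI_EXTENSIONS
    )


def verify_slides(
    root: Path,
    opener: Callable[[Path], None] = open_with_openslide,
    jobs: int = 4,
    delete: bool = False,
) -> List[Path]:
    """Return the slides under `root` that fail to open, optionally deleting them."""

    def check(path: Path) -> Tuple[Path, bool]:
        try:
            opener(path)
            return path, True
        except Exception:
            # Any failure to open is treated as an invalid slide.
            return path, False

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(check, find_slides(root)))

    invalid = [path for path, ok in results if not ok]
    if delete:
        for path in invalid:
            path.unlink()
    return invalid
```

Calling `verify_slides(Path("/path/to/slides"), jobs=8)` would return the list of unreadable files, and passing `delete=True` mirrors the effect described for `--delete`.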

Embed slides

Embed slides using a patch encoder. This creates a directory of embeddings for each dataset.

patho-bench-cli embed cptac --datasets cptac_ccrcc --slides-dir /path/to/slides --embeddings-dir /path/to/embeddings --patch_encoder open-midnight --create-symlinks --mag 20 --patch_size 224

If --datasets is omitted, all datasets for the provider are embedded. If you point the command at the by_task directory created by the download command and include --create-symlinks, it creates a by_task directory inside the embeddings directory. There, each dataset directory contains task directories with symlinks into the embeddings/dataset directory. All by_task slides are embedded into the same embeddings/dataset directory, so a slide shared by several tasks is embedded only once.
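The layout described above can be illustrated with a small sketch. This is one plausible reading of the description, not the tool's actual code: it assumes one symlink per embedding file, and the function name, the `.h5` file names, and the task names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def link_embeddings_by_task(
    embeddings_dir: Path,
    dataset: str,
    task_files: Dict[str, Iterable[str]],
) -> List[Path]:
    """Create by_task/<dataset>/<task>/<file> symlinks into <dataset>/<file>.

    `task_files` maps a task name to the embedding file names it uses.
    Each embedding is stored once in embeddings_dir/<dataset>; tasks only
    hold symlinks, so shared slides are never duplicated on disk.
    """
    dataset_dir = Path(embeddings_dir) / dataset
    links: List[Path] = []
    for task, names in task_files.items():
        task_dir = Path(embeddings_dir) / "by_task" / dataset / task
        task_dir.mkdir(parents=True, exist_ok=True)
        for name in names:
            link = task_dir / name
            # Skip links that already exist so the function can be re-run safely.
            if not (link.is_symlink() or link.exists()):
                link.symlink_to((dataset_dir / name).resolve())
            links.append(link)
    return links
```

With two tasks that share a slide, the dataset directory still holds a single embedding file for it, and each task directory contains a link pointing to that file.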

Data Sources

| Provider | Source | Authentication |
|---|---|---|
| cptac | TCIA (The Cancer Imaging Archive) | None |
| panda | Kaggle Competition | ~/.kaggle/kaggle.json |
| imp | INESCTEC Open Datasets | None |
| ovarian_bevacizumab | TCIA Ovarian Bevacizumab Response | None |
| post_nat_brca | TCIA POST-NAT-BRCA | None |
| idr | Image Data Resource (OpenMicroscopy) via BioImage Archive | None |

IDR Datasets

IDR slides are downloaded from the EBI BioImage Archive via direct HTTP. No additional dependencies are required.

Available IDR datasets:

Development

# Install with dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest

Adding New Providers

Create a new provider by implementing DatasetProvider in patho_bench_cli/providers/:

from patho_bench_cli.providers.base import DatasetProvider

class MyProvider(DatasetProvider):
    @property
    def name(self) -> str:
        return "my_provider"
    
    # ... implement other methods

Then register it in patho_bench_cli/providers/registry.py.
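The real `DatasetProvider` interface and the contents of `registry.py` are not shown in this README, so the following is only a self-contained sketch of the general pattern (abstract base class plus a name-keyed registry). Everything in it is a stand-in: `ProviderStub`, `list_datasets`, `PROVIDERS`, and `register` are hypothetical names, not the package's API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class ProviderStub(ABC):
    """Stand-in for the package's DatasetProvider base class (assumed shape)."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Identifier used on the command line, e.g. `list <name>`."""

    @abstractmethod
    def list_datasets(self) -> List[str]:
        """Hypothetical method returning the dataset names this provider offers."""


# Hypothetical registry: maps provider name to provider class.
PROVIDERS: Dict[str, Type[ProviderStub]] = {}


def register(cls: Type[ProviderStub]) -> Type[ProviderStub]:
    """Class decorator that adds a provider to the registry under its name."""
    provider_name = cls().name
    if provider_name in PROVIDERS:
        raise ValueError(f"provider {provider_name!r} is already registered")
    PROVIDERS[provider_name] = cls
    return cls


@register
class MyProvider(ProviderStub):
    @property
    def name(self) -> str:
        return "my_provider"

    def list_datasets(self) -> List[str]:
        return ["my_dataset"]  # placeholder dataset name
```

A lookup such as `PROVIDERS["my_provider"]()` then yields an instance the CLI could dispatch to. Check `patho_bench_cli/providers/base.py` and `registry.py` for the methods and registration mechanism the package actually uses.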
