Unified downloader for Patho-Bench datasets.
# Using uv
uv pip install -e .
# Or using pip
pip install -e .

First, download the Patho-Bench task definitions from HuggingFace:
patho-bench-cli tasks

# List all providers
patho-bench-cli list
# List datasets for a specific provider
patho-bench-cli list cptac
patho-bench-cli list panda
patho-bench-cli list ovarian_bevacizumab
patho-bench-cli list post_nat_brca
patho-bench-cli list idr

# Download CPTAC slides needed for Patho-Bench (dry-run, creates manifest)
patho-bench-cli download cptac -o /path/to/slides
# Actually download
patho-bench-cli download cptac --download -o /path/to/slides
# Download specific datasets
patho-bench-cli download cptac --datasets cptac_ccrcc cptac_brca --download -o /path/to/slides
# Download with per-task symlinks
patho-bench-cli download cptac --download --create-symlinks -o /path/to/slides
# Download PANDA slides
patho-bench-cli download panda --download -o /path/to/slides
# Download Ovarian Bevacizumab Response slides
patho-bench-cli download ovarian_bevacizumab --download -o /path/to/slides
# Download POST-NAT-BRCA slides
patho-bench-cli download post_nat_brca --download -o /path/to/slides
# Download IDR slides (via BioImage Archive)
patho-bench-cli download idr --download -o /path/to/slides

Omitting --datasets downloads all datasets for the provider.
Download entire datasets (not just Patho-Bench slides):
patho-bench-cli download cptac --full --datasets cptac_ccrcc
patho-bench-cli download panda --full
patho-bench-cli download ovarian_bevacizumab --full
patho-bench-cli download post_nat_brca --full

Recursively check if WSI files can be opened with OpenSlide:
# Verify slides in a directory
patho-bench-cli verify /path/to/slides
# Verify and delete invalid slides
patho-bench-cli verify /path/to/slides --delete
# Use specific number of parallel jobs
patho-bench-cli verify /path/to/slides --jobs 8

Embed slides using a patch encoder. This creates a directory of embeddings for each dataset.
patho-bench-cli embed cptac --datasets cptac_ccrcc --slides-dir /path/to/slides --embeddings-dir /path/to/embeddings --patch_encoder open-midnight --create-symlinks --mag 20 --patch_size 224

Omitting --datasets embeds all datasets for the provider. If you point the command at the by_task directory created by the download command and include --create-symlinks, it creates a by_task directory inside the embeddings directory. Each dataset directory there contains one directory per task, holding symlinks into the shared embeddings/dataset directory. All by_task slides are embedded into that single embeddings/dataset directory, which avoids duplicate slide embeddings.
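The symlink layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the function name `link_task_embeddings`, the `task_to_slides` mapping, and the `.h5` file suffix are all hypothetical.

```python
from pathlib import Path


def link_task_embeddings(embeddings_dir, dataset, task_to_slides, suffix=".h5"):
    """Create by_task/<dataset>/<task>/ symlinks into <embeddings_dir>/<dataset>/.

    Hypothetical helper: the directory layout, the suffix, and the arguments
    are assumptions made for illustration only.
    """
    embeddings_dir = Path(embeddings_dir)
    shared_dir = embeddings_dir / dataset  # holds the single real copy per slide
    created = []
    for task, slide_ids in task_to_slides.items():
        task_dir = embeddings_dir / "by_task" / dataset / task
        task_dir.mkdir(parents=True, exist_ok=True)
        for slide_id in slide_ids:
            target = shared_dir / f"{slide_id}{suffix}"
            link = task_dir / target.name
            # Skip slides with no embedding yet and links that already exist.
            if not target.exists() or link.is_symlink() or link.exists():
                continue
            link.symlink_to(target.resolve())
            created.append(link)
    return created
```

A slide that belongs to several tasks gets one link per task, while only one embedding file exists on disk.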
| Provider | Source | Authentication |
|---|---|---|
| `cptac` | TCIA (The Cancer Imaging Archive) | None |
| `panda` | Kaggle Competition | `~/.kaggle/kaggle.json` |
| `imp` | INESCTEC Open Datasets | None |
| `ovarian_bevacizumab` | TCIA Ovarian Bevacizumab Response | None |
| `post_nat_brca` | TCIA POST-NAT-BRCA | None |
| `idr` | Image Data Resource (OpenMicroscopy) via BioImage Archive | None |
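Since panda is the only provider listed with an authentication requirement, it can help to check the Kaggle token before starting a download. The sketch below is a hypothetical helper, not part of patho-bench-cli. It assumes the usual Kaggle token file: a JSON object with `username` and `key` fields.

```python
import json
from pathlib import Path


def kaggle_credentials_ok(path=None):
    """Return True if the Kaggle token file exists and has both fields set.

    Hypothetical pre-flight check; the default location is the
    ~/.kaggle/kaggle.json path named in the provider table.
    """
    if path is None:
        path = Path.home() / ".kaggle" / "kaggle.json"
    path = Path(path)
    if not path.is_file():
        return False
    try:
        token = json.loads(path.read_text())
    except (OSError, ValueError):  # unreadable file or invalid JSON
        return False
    return (
        isinstance(token, dict)
        and bool(token.get("username"))
        and bool(token.get("key"))
    )
```

Calling `kaggle_credentials_ok()` with no argument checks the default location. Pass an explicit path to test a token stored elsewhere.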
IDR slides are downloaded from the EBI BioImage Archive via direct HTTP. No additional dependencies required.
Available IDR datasets:
# Install with dev dependencies
uv pip install -e ".[dev]"
# Run tests
pytest

Create a new provider by implementing DatasetProvider in patho_bench_cli/providers/:
from patho_bench_cli.providers.base import DatasetProvider


class MyProvider(DatasetProvider):
    @property
    def name(self) -> str:
        return "my_provider"

    # ... implement other methods

Then register it in patho_bench_cli/providers/registry.py.
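The contents of registry.py are not shown here, so the following is only a self-contained sketch of one common way a name-to-class registry can work. The `DatasetProvider` class below is a stand-in for the real base class, whose interface may differ. `PROVIDERS`, `register`, and `get_provider` are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class DatasetProvider(ABC):
    """Stand-in for the real base class; its actual interface may differ."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Short identifier used on the command line."""


# Hypothetical registry mapping provider names to provider classes.
PROVIDERS: Dict[str, Type[DatasetProvider]] = {}


def register(provider_cls: Type[DatasetProvider]) -> Type[DatasetProvider]:
    """Add a provider class to PROVIDERS, rejecting duplicate names."""
    name = provider_cls().name
    if name in PROVIDERS:
        raise ValueError(f"provider {name!r} is already registered")
    PROVIDERS[name] = provider_cls
    return provider_cls


@register
class MyProvider(DatasetProvider):
    @property
    def name(self) -> str:
        return "my_provider"


def get_provider(name: str) -> DatasetProvider:
    """Instantiate a registered provider, or raise KeyError listing known names."""
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise KeyError(
            f"unknown provider {name!r}; known: {sorted(PROVIDERS)}"
        ) from None
```

With a registry of this shape, a lookup such as `get_provider("my_provider")` returns a ready-to-use instance, and a duplicate name fails at import time, not during a download.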