OpenAPI Biomedical Database Downloader

A tool that automatically downloads data from public biomedical databases using the API specifications they publish on SmartAPI. Provide a SmartAPI/OpenAPI spec, and the tool discovers entity types, handles pagination, and exports each entity type to a CSV file.

✅ Verified with SmartAPI: HuBMAP, SenNet, CFDE, WikiPathways, LINCS Data Portal, ClinGen LDH

For AI Agents: Input/Output Specification

Input Requirements

Parameter	Required	Type	Description
`--openapi`	Yes	File path or URL	OpenAPI specification (JSON or YAML format)
`--out-dir`	No	Directory path	Output directory for CSV files (default: `.`)
`--base-url`	No	URL	API base URL (auto-detected from spec if available)
`--max-rows-per-entity`	No	Integer	Limit rows per entity (for testing/sampling)
`--max-pages`	No	Integer	Safety limit on pagination cycles (default: `10000`)
`--use-search-api`	No	Flag	Use Elasticsearch POST /search for complete data

Output

The tool produces:

CSV files - One file per entity type (e.g., Donors.csv, Samples.csv, Datasets.csv)
Console output - Progress information and detected endpoints (to stderr)

Example Agent Workflow

# Step 1: Get OpenAPI spec from SmartAPI
curl -s "https://smart-api.info/api/metadata/{SMARTAPI_ID}" > spec.json

# Step 2: Run the downloader
python3 openapi_downloader.py --openapi spec.json --out-dir ./output

# Step 3: Output files are in ./output/
# → Donors.csv, Samples.csv, Datasets.csv, Files.csv

Supported Databases

✅ Verified with SmartAPI (use with this tool)

Database	SmartAPI ID	What You Get
HuBMAP	`7aaf02b838022d564da776b03f357158`	Donors, Samples, Datasets, Files
SenNet	`7d838c9dee0caa2f8fe57173282c5812`	Datasets (with provenance)
CFDE	`d1ac2227e079aa3cae4e1cd696431ff8`	Genes, Variants, RegulatoryElements
WikiPathways	`45f6ce9f9f2072b581ab85771e2ab15b`	Pathways (3,278), Organisms (48)
LINCS Data Portal	`1ad2cba40cb25cd70d00aa8fba9cfaf3`	Drug mechanisms, Disease indications
ClinGen LDH	`5f76a78a6b80eef423677db7cd81140e`	Genes, Variants, ClinVar submissions

🔧 Pattern Detection Ready (no SmartAPI spec available)

These databases have detection patterns in the code but don't have SmartAPI specs. Use their REST APIs directly:

Database	Data	Direct API
GTEx	Gene expression	`https://gtexportal.org/api/v2/`
Harmonizome	Gene-function	`https://maayanlab.cloud/Harmonizome/api/1.0/`
Monarch	Disease-phenotype	`https://api-v3.monarchinitiative.org/`
RGD	Rat genomics	`https://rest.rgd.mcw.edu/rgdws/`
IMPC	Mouse phenotyping	`https://www.ebi.ac.uk/mi/impc/solr/`
UniProt	Proteins	`https://rest.uniprot.org/`
ChEMBL	Drug molecules	`https://www.ebi.ac.uk/chembl/api/data/`

Quick Start

# Install dependencies
pip install requests pyyaml pandas

# Download HuBMAP data
curl -s "https://smart-api.info/api/metadata/7aaf02b838022d564da776b03f357158" > hubmap.json
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./hubmap_data

# Check results
ls hubmap_data/
# → Donors.csv  Samples.csv  Datasets.csv  Files.csv

How It Works: Pagination Handling

The Problem: Heterogeneous API Designs

Biomedical databases expose their data through REST APIs, but each uses a different pagination strategy depending on its backend architecture:

Pagination Type	Mechanism	Used By
Offset/Limit	`?offset=0&limit=100` then `?offset=100&limit=100`	CFDE, Monarch, ChEMBL
Page-based	`?page=0&size=100` then `?page=1&size=100`	GTEx
Cursor-based	`?cursor=abc123` (opaque token from previous response)	Harmonizome
Elasticsearch	POST body with `{"from": 0, "size": 100}`	HuBMAP (with `--use-search-api`)
Solr	`?start=0&rows=100`	IMPC
No pagination	Single request returns complete dataset	SenNet, RGD
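To make the offset/limit row of the table concrete, here is a minimal sketch of such a loop. It is not the tool's `paginate_request()`; the names `iter_offset_limit`, `fetch_page`, and `fake_fetch` are hypothetical, and an in-memory list stands in for the HTTP call so the example runs offline. The stopping rule (a page shorter than `limit` ends the loop) is an assumption that holds for many APIs but not all.

```python
from typing import Callable, Iterator, List, Optional


def iter_offset_limit(
    fetch_page: Callable[[int, int], List[dict]],
    limit: int = 100,
    max_pages: int = 10000,
    max_rows: Optional[int] = None,
) -> Iterator[dict]:
    """Yield records page by page until a short or empty page is returned."""
    offset = 0
    yielded = 0
    for _ in range(max_pages):  # safety cap, similar in spirit to --max-pages
        page = fetch_page(offset, limit)
        for record in page:
            if max_rows is not None and yielded >= max_rows:
                return
            yield record
            yielded += 1
        if len(page) < limit:  # assumed end-of-data signal
            return
        offset += limit


# Stand-in for an HTTP call such as GET /Gene/id?offset=...&limit=...
FAKE_RECORDS = [{"id": i} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> List[dict]:
    return FAKE_RECORDS[offset:offset + limit]


rows = list(iter_offset_limit(fake_fetch, limit=100))
print(len(rows))  # 250
```

Page-based and Solr-style pagination follow the same shape with different parameter names; cursor-based pagination replaces the numeric offset with the token returned by the previous response.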

The Solution: Pattern-Based Detection

The tool analyzes the OpenAPI specification to:

Identify the API type - Match URL patterns and parameter names against known database signatures
Discover all entity types - Extract available data types (e.g., donors, samples, datasets, genes)
Configure pagination - Set the correct parameters for iterating through results
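The first step, identifying the API type, could look roughly like the sketch below. The `SIGNATURES` table and `identify_api` function are hypothetical and only illustrate the idea of matching spec paths against known patterns; the two path templates are the ones quoted in the detection examples in this README, and the real detectors may use additional signals such as parameter names.

```python
import re
from typing import Optional

# Hypothetical signatures: API name -> regex that one of its spec paths should match.
SIGNATURES = {
    "HuBMAP": re.compile(r"^/param-search/\{entity_type\}$"),
    "CFDE": re.compile(r"^/\{entType\}/id$"),
}


def identify_api(spec: dict) -> Optional[str]:
    """Return the first API name whose signature matches a path in the spec."""
    for path in spec.get("paths", {}):
        for name, pattern in SIGNATURES.items():
            if pattern.match(path):
                return name
    return None


hubmap_like = {"paths": {"/param-search/{entity_type}": {}}}
cfde_like = {"paths": {"/{entType}/id": {}}}
print(identify_api(hubmap_like))        # HuBMAP
print(identify_api(cfde_like))          # CFDE
print(identify_api({"paths": {"/x": {}}}))  # None
```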

Example: HuBMAP Detection

Input: OpenAPI spec with path "/param-search/{entity_type}"

Step 1: Pattern matcher identifies HuBMAP-style API
Step 2: Extract entity types from spec → ["donors", "samples", "datasets", "files"]
Step 3: Generate endpoints for each:
        
        /param-search/donors   → Donors.csv
        /param-search/samples  → Samples.csv
        /param-search/datasets → Datasets.csv
        /param-search/files    → Files.csv

Example: CFDE Detection

Input: OpenAPI spec with path "/{entType}/id" and enum ["Gene", "Variant", "RegulatoryElement"]

Step 1: Pattern matcher identifies CFDE-style API
Step 2: Extract entity types from parameter enum → ["Gene", "Variant", "RegulatoryElement"]
Step 3: Generate endpoints with offset/limit pagination:
        
        /Gene/id?offset=0&limit=1000              → Gene.csv
        /Variant/id?offset=0&limit=1000           → Variant.csv
        /RegulatoryElement/id?offset=0&limit=1000 → Regulatoryelement.csv
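Steps 2 and 3 of the CFDE example can be sketched as follows. The helper names `path_param_enum` and `expand_endpoints` are hypothetical, the spec fragment is a hand-written minimal example rather than the real CFDE spec, and `$ref` parameters are not resolved here.

```python
from typing import List


def path_param_enum(spec: dict, path: str, param_name: str) -> List[str]:
    """Return the enum values declared for a path parameter, or [] if none."""
    path_item = spec.get("paths", {}).get(path, {})
    # Parameters may be declared on the path item or on the GET operation.
    params = list(path_item.get("parameters", []))
    params += path_item.get("get", {}).get("parameters", [])
    for param in params:
        if param.get("name") == param_name and param.get("in") == "path":
            # OpenAPI 3 nests the enum under "schema"; Swagger 2 puts it on the parameter.
            return list(param.get("schema", {}).get("enum") or param.get("enum", []))
    return []


def expand_endpoints(path: str, param_name: str, values: List[str], limit: int = 1000) -> List[str]:
    """Substitute each enum value into the path template and add the first-page query."""
    placeholder = "{" + param_name + "}"
    return [f"{path.replace(placeholder, value)}?offset=0&limit={limit}" for value in values]


spec = {
    "paths": {
        "/{entType}/id": {
            "get": {
                "parameters": [
                    {
                        "name": "entType",
                        "in": "path",
                        "required": True,
                        "schema": {"type": "string", "enum": ["Gene", "Variant", "RegulatoryElement"]},
                    }
                ]
            }
        }
    }
}

entity_types = path_param_enum(spec, "/{entType}/id", "entType")
endpoints = expand_endpoints("/{entType}/id", "entType", entity_types)
for endpoint in endpoints:
    print(endpoint)
```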

Usage Examples

Basic Usage

# Download all entity types
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./data

# Limit rows (for testing)
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./data --max-rows-per-entity 100

# Use Elasticsearch API for complete HuBMAP data
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./data --use-search-api

Download from All 6 SmartAPI Databases

#!/bin/bash
mkdir -p data && cd data

# HuBMAP - Human BioMolecular Atlas Program
curl -s "https://smart-api.info/api/metadata/7aaf02b838022d564da776b03f357158" > hubmap.json
python3 ../openapi_downloader.py --openapi hubmap.json --out-dir ./hubmap

# SenNet - Cellular Senescence Network
curl -s "https://smart-api.info/api/metadata/7d838c9dee0caa2f8fe57173282c5812" > sennet.json
python3 ../openapi_downloader.py --openapi sennet.json --out-dir ./sennet

# CFDE - Common Fund Data Ecosystem
curl -s "https://smart-api.info/api/metadata/d1ac2227e079aa3cae4e1cd696431ff8" > cfde.json
python3 ../openapi_downloader.py --openapi cfde.json --out-dir ./cfde

# WikiPathways - Biological Pathways
curl -s "https://smart-api.info/api/metadata/45f6ce9f9f2072b581ab85771e2ab15b" > wikipathways.json
python3 ../openapi_downloader.py --openapi wikipathways.json \
  --base-url "http://webservice.wikipathways.org" --out-dir ./wikipathways

# LINCS Data Portal - Drug-Gene Interactions
curl -s "https://smart-api.info/api/metadata/1ad2cba40cb25cd70d00aa8fba9cfaf3" > lincs.json
python3 ../openapi_downloader.py --openapi lincs.json \
  --base-url "http://lincsportal.ccs.miami.edu/dcic/api" --out-dir ./lincs

# ClinGen LDH - Clinical Genetics Linked Data
curl -s "https://smart-api.info/api/metadata/5f76a78a6b80eef423677db7cd81140e" > clingen.json
python3 ../openapi_downloader.py --openapi clingen.json \
  --base-url "https://genboree.org/ldh" --out-dir ./clingen

Command Line Reference

python3 openapi_downloader.py [OPTIONS]

Required:
  --openapi PATH          OpenAPI spec file (JSON/YAML) or URL

Optional:
  --out-dir DIR           Output directory for CSV files (default: .)
  --base-url URL          API base URL (auto-detected if not provided)
  --max-rows-per-entity N Limit rows per entity type
  --max-pages N           Safety limit on pagination cycles (default: 10000)
  --use-search-api        Use POST /search endpoint (HuBMAP complete data)

Output Format

CSV Files

Each entity type produces one CSV file with:

Flattened JSON using dot notation (e.g., a nested `metadata` object with an `age` field becomes the column `metadata.age`)
One row per record
All fields from the API response

Example Donors.csv:

uuid,hubmap_id,created_timestamp,data_access_level,entity_type,metadata.age,metadata.sex
abc123,HBM123.XYZ.456,1609459200000,public,Donor,45,Male
def456,HBM789.ABC.123,1609545600000,public,Donor,32,Female
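The dot-notation flattening described above can be sketched with a small recursive function. This is an illustration, not the code inside `write_entity_csv()`; the `flatten` helper is hypothetical, and it makes the simplifying assumption that lists are kept as single values rather than expanded into columns.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dot-separated keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value  # scalars and lists are kept as-is
    return flat


donor = {
    "uuid": "abc123",
    "entity_type": "Donor",
    "metadata": {"age": 45, "sex": "Male"},
}
print(flatten(donor))
# {'uuid': 'abc123', 'entity_type': 'Donor', 'metadata.age': 45, 'metadata.sex': 'Male'}
```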

Console Output

Detected 4 HuBMAP endpoint(s)
Detected entity endpoints:
  - donors: GET /param-search/donors (list_key=None, pagination=none)
  - samples: GET /param-search/samples (list_key=None, pagination=none)
  - datasets: GET /param-search/datasets (list_key=None, pagination=none)
  - files: GET /param-search/files (list_key=None, pagination=none)

Fetching all records for entity 'donors' from /param-search/donors ...
Wrote 500 rows to ./data/Donors.csv

Error Handling

The tool skips or reports common HTTP errors instead of aborting the whole run:

Error	Behavior
400 Bad Request	Skips invalid entity type
401 Unauthorized	Skips endpoint requiring auth
403 Forbidden	Skips restricted endpoint
404 Not Found	Skips missing endpoint
502 Bad Gateway	Reports server error, continues
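One way to express the table above in code is a small classifier that maps a status code to an action. The function name `classify_status`, the action labels, and the reason strings are hypothetical; how the tool treats codes not listed in the table is not documented here, so the final `"raise"` branch is purely an assumption.

```python
# Status codes the error table lists as "skip"; the reason strings are illustrative.
SKIPPABLE = {
    400: "invalid entity type",
    401: "authentication required",
    403: "restricted endpoint",
    404: "missing endpoint",
}


def classify_status(status_code: int) -> str:
    """Map an HTTP status code to one of: 'ok', 'skip', 'report', 'raise'."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code in SKIPPABLE:
        return "skip"      # move on to the next endpoint
    if status_code == 502:
        return "report"    # log the server error, then continue
    return "raise"         # assumption: anything else is treated as unexpected


for code in (200, 401, 502, 500):
    print(code, classify_status(code))
```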

Architecture

openapi_downloader.py
│
├── load_openapi()           # Load spec from file or URL
│
├── EndpointDetectorRegistry # Strategy pattern for detection
│   ├── @register("GTEx")    # Decorator-based registration
│   ├── @register("HuBMAP")
│   └── ... (15 detectors)
│
├── find_entity_endpoints()  # Main detection entry point
│
├── paginate_request()       # Handle all pagination types
│   ├── page-based
│   ├── offset-based
│   ├── cursor-based
│   ├── elasticsearch
│   └── solr
│
└── write_entity_csv()       # Export to CSV with flattening
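The decorator-based registration shown in the tree can be illustrated with a short sketch. `DetectorRegistry` and `detect_hubmap` below are hypothetical stand-ins loosely modeled on the names in the diagram; the actual `EndpointDetectorRegistry` may have a different interface.

```python
from typing import Callable, Dict, List

Detector = Callable[[dict], List[str]]


class DetectorRegistry:
    """Collects detector functions under a database name."""

    _detectors: Dict[str, Detector] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Detector], Detector]:
        def decorator(func: Detector) -> Detector:
            cls._detectors[name] = func
            return func
        return decorator

    @classmethod
    def detect(cls, spec: dict) -> Dict[str, List[str]]:
        """Run every registered detector and keep the ones that found endpoints."""
        results: Dict[str, List[str]] = {}
        for name, detector in cls._detectors.items():
            endpoints = detector(spec)
            if endpoints:
                results[name] = endpoints
        return results


@DetectorRegistry.register("HuBMAP")
def detect_hubmap(spec: dict) -> List[str]:
    # Illustrative rule only: any path under /param-search/ counts as a match.
    return [p for p in spec.get("paths", {}) if p.startswith("/param-search/")]


spec = {"paths": {"/param-search/{entity_type}": {}, "/status": {}}}
print(DetectorRegistry.detect(spec))
# {'HuBMAP': ['/param-search/{entity_type}']}
```

With this layout, supporting another database means adding one decorated function; the detection entry point does not need to change.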

Requirements

Python >= 3.9
requests
pyyaml
pandas

Install: pip install requests pyyaml pandas

Troubleshooting

"No entity endpoints detected"

Verify the spec URL uses /api/metadata/ (not /ui/)
Check if the database requires a specific --base-url

"Could not infer API base URL"

python3 openapi_downloader.py --openapi spec.json --base-url https://api.example.org

Empty CSV files

Some endpoints require authentication (401 errors are skipped)
Check the console output for skipped endpoints
