psui3905/openapi-crawler

OpenAPI Biomedical Database Downloader

A tool that automatically downloads data from public biomedical databases using their API specifications from SmartAPI. Given a SmartAPI/OpenAPI spec, it discovers entity types, handles pagination, and exports the results to CSV files, one per entity type.

✅ Verified with SmartAPI: HuBMAP, SenNet, CFDE, WikiPathways, LINCS Data Portal, ClinGen LDH


For AI Agents: Input/Output Specification

Input Requirements

| Parameter | Required | Type | Description |
| --- | --- | --- | --- |
| --openapi | Yes | File path or URL | OpenAPI specification (JSON or YAML format) |
| --out-dir | No | Directory path | Output directory for CSV files (default: .) |
| --base-url | No | URL | API base URL (auto-detected from spec if available) |
| --max-rows-per-entity | No | Integer | Limit rows per entity (for testing/sampling) |
| --use-search-api | No | Flag | Use Elasticsearch POST /search for complete data |

Output

The tool produces:

  1. CSV files - One file per entity type (e.g., Donors.csv, Samples.csv, Datasets.csv)
  2. Console output - Progress information and detected endpoints (to stderr)

Example Agent Workflow

# Step 1: Get OpenAPI spec from SmartAPI
curl -s "https://smart-api.info/api/metadata/{SMARTAPI_ID}" > spec.json

# Step 2: Run the downloader
python3 openapi_downloader.py --openapi spec.json --out-dir ./output

# Step 3: Output files are in ./output/
# → Donors.csv, Samples.csv, Datasets.csv, Files.csv
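For agents that drive the tool from Python rather than a shell, the workflow above could be wrapped roughly as follows. This is a minimal sketch: `build_command` is a hypothetical helper, not part of the repository, and it uses only the flags listed under Input Requirements.

```python
import shlex
from typing import List, Optional


def build_command(openapi: str, out_dir: str = ".",
                  base_url: Optional[str] = None,
                  max_rows: Optional[int] = None,
                  use_search_api: bool = False) -> List[str]:
    """Assemble an argv list using only the documented flags (hypothetical helper)."""
    cmd = ["python3", "openapi_downloader.py",
           "--openapi", openapi, "--out-dir", out_dir]
    if base_url:
        cmd += ["--base-url", base_url]
    if max_rows is not None:
        cmd += ["--max-rows-per-entity", str(max_rows)]
    if use_search_api:
        cmd.append("--use-search-api")
    return cmd


cmd = build_command("spec.json", out_dir="./output", max_rows=100)
print(shlex.join(cmd))
# To actually run the downloader: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when a path or URL contains spaces or special characters.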

Supported Databases

✅ Verified with SmartAPI (use with this tool)

| Database | SmartAPI ID | What You Get |
| --- | --- | --- |
| HuBMAP | 7aaf02b838022d564da776b03f357158 | Donors, Samples, Datasets, Files |
| SenNet | 7d838c9dee0caa2f8fe57173282c5812 | Datasets (with provenance) |
| CFDE | d1ac2227e079aa3cae4e1cd696431ff8 | Genes, Variants, RegulatoryElements |
| WikiPathways | 45f6ce9f9f2072b581ab85771e2ab15b | Pathways (3,278), Organisms (48) |
| LINCS Data Portal | 1ad2cba40cb25cd70d00aa8fba9cfaf3 | Drug mechanisms, Disease indications |
| ClinGen LDH | 5f76a78a6b80eef423677db7cd81140e | Genes, Variants, ClinVar submissions |

🔧 Pattern Detection Ready (no SmartAPI spec available)

These databases have detection patterns in the code but don't have SmartAPI specs. Use their REST APIs directly:

| Database | Data | Direct API |
| --- | --- | --- |
| GTEx | Gene expression | https://gtexportal.org/api/v2/ |
| Harmonizome | Gene-function | https://maayanlab.cloud/Harmonizome/api/1.0/ |
| Monarch | Disease-phenotype | https://api-v3.monarchinitiative.org/ |
| RGD | Rat genomics | https://rest.rgd.mcw.edu/rgdws/ |
| IMPC | Mouse phenotyping | https://www.ebi.ac.uk/mi/impc/solr/ |
| UniProt | Proteins | https://rest.uniprot.org/ |
| ChEMBL | Drug molecules | https://www.ebi.ac.uk/chembl/api/data/ |

Quick Start

# Install dependencies
pip install requests pyyaml pandas

# Download HuBMAP data
curl -s "https://smart-api.info/api/metadata/7aaf02b838022d564da776b03f357158" > hubmap.json
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./hubmap_data

# Check results
ls hubmap_data/
# → Donors.csv  Samples.csv  Datasets.csv  Files.csv

How It Works: Pagination Handling

The Problem: Heterogeneous API Designs

Biomedical databases expose their data through REST APIs, but each uses a different pagination strategy depending on its backend architecture:

| Pagination Type | Mechanism | Used By |
| --- | --- | --- |
| Offset/Limit | ?offset=0&limit=100 then ?offset=100&limit=100 | CFDE, Monarch, ChEMBL |
| Page-based | ?page=0&size=100 then ?page=1&size=100 | GTEx |
| Cursor-based | ?cursor=abc123 (opaque token from previous response) | Harmonizome |
| Elasticsearch | POST body with {"from": 0, "size": 100} | HuBMAP (with --use-search-api) |
| Solr | ?start=0&rows=100 | IMPC |
| No pagination | Single request returns complete dataset | SenNet, RGD |
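To make the differences concrete, the sketch below computes the parameters for a given page under each index-based style in the table. `page_params` and its `style` names are illustrative assumptions, not the tool's actual API. Cursor-based pagination is left out because its token comes from the previous response and cannot be computed from a page index.

```python
from typing import Any, Dict


def page_params(style: str, page_index: int, page_size: int = 100) -> Dict[str, Any]:
    """Return query parameters (or the POST body for Elasticsearch) for one page.

    Hypothetical helper; parameter names follow the table above.
    """
    start = page_index * page_size  # index of the first record on this page
    if style == "offset":
        return {"offset": start, "limit": page_size}
    if style == "page":
        return {"page": page_index, "size": page_size}
    if style == "solr":
        return {"start": start, "rows": page_size}
    if style == "elasticsearch":
        return {"from": start, "size": page_size}
    raise ValueError(f"unsupported pagination style: {style}")


for style in ("offset", "page", "solr", "elasticsearch"):
    print(style, page_params(style, 0), page_params(style, 1))
```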

The Solution: Pattern-Based Detection

The tool analyzes the OpenAPI specification to:

  1. Identify the API type - Match URL patterns and parameter names against known database signatures
  2. Discover all entity types - Extract available data types (e.g., donors, samples, datasets, genes)
  3. Configure pagination - Set the correct parameters for iterating through results

Example: HuBMAP Detection

Input: OpenAPI spec with path "/param-search/{entity_type}"

Step 1: Pattern matcher identifies HuBMAP-style API
Step 2: Extract entity types from spec → ["donors", "samples", "datasets", "files"]
Step 3: Generate endpoints for each:
        
        /param-search/donors   → Donors.csv
        /param-search/samples  → Samples.csv
        /param-search/datasets → Datasets.csv
        /param-search/files    → Files.csv
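Step 3 above amounts to substituting each entity type into the path template and deriving a file name. A minimal sketch follows, assuming the CSV name is produced with `str.capitalize()`, which is consistent with the file names shown in this README but is not confirmed by the source; `expand_endpoints` is a hypothetical name.

```python
from typing import Dict, List


def expand_endpoints(path_template: str, param: str,
                     entity_types: List[str]) -> Dict[str, str]:
    """Map each concrete path to the CSV file name it would produce."""
    placeholder = "{" + param + "}"
    return {
        # capitalize() upper-cases the first letter and lower-cases the rest
        path_template.replace(placeholder, entity): entity.capitalize() + ".csv"
        for entity in entity_types
    }


endpoints = expand_endpoints("/param-search/{entity_type}", "entity_type",
                             ["donors", "samples", "datasets", "files"])
for path, csv_name in endpoints.items():
    print(f"{path} -> {csv_name}")
```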

Example: CFDE Detection

Input: OpenAPI spec with path "/{entType}/id" and enum ["Gene", "Variant", "RegulatoryElement"]

Step 1: Pattern matcher identifies CFDE-style API
Step 2: Extract entity types from parameter enum → ["Gene", "Variant", "RegulatoryElement"]
Step 3: Generate endpoints with offset/limit pagination:
        
        /Gene/id?offset=0&limit=1000              → Gene.csv
        /Variant/id?offset=0&limit=1000           → Variant.csv
        /RegulatoryElement/id?offset=0&limit=1000 → Regulatoryelement.csv
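Step 2 of the CFDE example, reading entity types from a path parameter's enum, might look like the sketch below. The spec fragment is a hypothetical example shaped like an OpenAPI 3 path item (the real CFDE spec may differ), and `path_param_enum` is an illustrative name rather than a function from the repository.

```python
from typing import Any, Dict, List

# Hypothetical spec fragment; the real CFDE spec may be structured differently.
spec: Dict[str, Any] = {
    "paths": {
        "/{entType}/id": {
            "get": {
                "parameters": [
                    {"name": "entType", "in": "path", "required": True,
                     "schema": {"type": "string",
                                "enum": ["Gene", "Variant", "RegulatoryElement"]}},
                    {"name": "offset", "in": "query", "schema": {"type": "integer"}},
                    {"name": "limit", "in": "query", "schema": {"type": "integer"}},
                ]
            }
        }
    }
}


def path_param_enum(spec: Dict[str, Any], path: str, param_name: str) -> List[str]:
    """Return the enum values declared for a path parameter, or [] if none."""
    operation = spec.get("paths", {}).get(path, {}).get("get", {})
    for param in operation.get("parameters", []):
        if param.get("name") == param_name and param.get("in") == "path":
            return list(param.get("schema", {}).get("enum", []))
    return []


entity_types = path_param_enum(spec, "/{entType}/id", "entType")
first_pages = [f"/{entity}/id?offset=0&limit=1000" for entity in entity_types]
print(first_pages)
```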

Usage Examples

Basic Usage

# Download all entity types
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./data

# Limit rows (for testing)
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./data --max-rows-per-entity 100

# Use Elasticsearch API for complete HuBMAP data
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./data --use-search-api

Download from All 6 SmartAPI Databases

#!/bin/bash
mkdir -p data && cd data

# HuBMAP - Human BioMolecular Atlas Program
curl -s "https://smart-api.info/api/metadata/7aaf02b838022d564da776b03f357158" > hubmap.json
python3 ../openapi_downloader.py --openapi hubmap.json --out-dir ./hubmap

# SenNet - Cellular Senescence Network
curl -s "https://smart-api.info/api/metadata/7d838c9dee0caa2f8fe57173282c5812" > sennet.json
python3 ../openapi_downloader.py --openapi sennet.json --out-dir ./sennet

# CFDE - Common Fund Data Ecosystem
curl -s "https://smart-api.info/api/metadata/d1ac2227e079aa3cae4e1cd696431ff8" > cfde.json
python3 ../openapi_downloader.py --openapi cfde.json --out-dir ./cfde

# WikiPathways - Biological Pathways
curl -s "https://smart-api.info/api/metadata/45f6ce9f9f2072b581ab85771e2ab15b" > wikipathways.json
python3 ../openapi_downloader.py --openapi wikipathways.json \
  --base-url "http://webservice.wikipathways.org" --out-dir ./wikipathways

# LINCS Data Portal - Drug-Gene Interactions
curl -s "https://smart-api.info/api/metadata/1ad2cba40cb25cd70d00aa8fba9cfaf3" > lincs.json
python3 ../openapi_downloader.py --openapi lincs.json \
  --base-url "http://lincsportal.ccs.miami.edu/dcic/api" --out-dir ./lincs

# ClinGen LDH - Clinical Genetics Linked Data
curl -s "https://smart-api.info/api/metadata/5f76a78a6b80eef423677db7cd81140e" > clingen.json
python3 ../openapi_downloader.py --openapi clingen.json \
  --base-url "https://genboree.org/ldh" --out-dir ./clingen

Command Line Reference

python3 openapi_downloader.py [OPTIONS]

Required:
  --openapi PATH          OpenAPI spec file (JSON/YAML) or URL

Optional:
  --out-dir DIR           Output directory for CSV files (default: .)
  --base-url URL          API base URL (auto-detected if not provided)
  --max-rows-per-entity N Limit rows per entity type
  --max-pages N           Safety limit on pagination cycles (default: 10000)
  --use-search-api        Use POST /search endpoint (HuBMAP complete data)
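The interaction between --max-pages and --max-rows-per-entity can be pictured with a small offset/limit loop. This is a sketch under assumptions: `fetch_all` and `fake_fetch` are hypothetical stand-ins, and the tool's real loop and stop conditions may differ.

```python
from typing import Callable, Dict, List, Optional


def fetch_all(fetch_page: Callable[[int, int], List[Dict]],
              page_size: int = 100,
              max_pages: int = 10000,
              max_rows: Optional[int] = None) -> List[Dict]:
    """Collect records page by page until data runs out or a cap is reached."""
    rows: List[Dict] = []
    for page_index in range(max_pages):  # page cap: never loop more than max_pages times
        batch = fetch_page(page_index * page_size, page_size)
        rows.extend(batch)
        if max_rows is not None and len(rows) >= max_rows:
            return rows[:max_rows]  # row cap: trim any overshoot
        if len(batch) < page_size:  # short or empty page means no more data
            break
    return rows


# Stand-in for an HTTP call: serves slices of 250 fake records.
DATA = [{"id": i} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    return DATA[offset:offset + limit]


print(len(fetch_all(fake_fetch)))                # no caps
print(len(fetch_all(fake_fetch, max_rows=120)))  # row cap applies
print(len(fetch_all(fake_fetch, max_pages=1)))   # page cap applies
```

The page cap guards against APIs that ignore the offset and keep returning full pages, which would otherwise loop forever.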

Output Format

CSV Files

Each entity type produces one CSV file with:

  • Flattened JSON using dot notation (e.g., metadata.age)
  • One row per record
  • All fields from the API response

Example Donors.csv:

uuid,hubmap_id,created_timestamp,data_access_level,entity_type,metadata.age,metadata.sex
abc123,HBM123.XYZ.456,1609459200000,public,Donor,45,Male
def456,HBM789.ABC.123,1609545600000,public,Donor,32,Female
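The dot-notation flattening described above can be sketched as a short recursive function. This is an illustration, not the tool's actual implementation: `flatten` is a hypothetical name, and lists are left as single values here, which may differ from how the tool treats them.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dot-separated keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # recurse into nested objects
        else:
            flat[name] = value  # scalars and lists are kept as-is
    return flat


record = {"uuid": "abc123", "entity_type": "Donor",
          "metadata": {"age": 45, "sex": "Male"}}
print(flatten(record))
```

pandas.json_normalize produces similar dot-separated column names for nested records; whether the tool relies on it is not stated here.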

Console Output

Detected 4 HuBMAP endpoint(s)
Detected entity endpoints:
  - donors: GET /param-search/donors (list_key=None, pagination=none)
  - samples: GET /param-search/samples (list_key=None, pagination=none)
  - datasets: GET /param-search/datasets (list_key=None, pagination=none)
  - files: GET /param-search/files (list_key=None, pagination=none)

Fetching all records for entity 'donors' from /param-search/donors ...
Wrote 500 rows to ./data/Donors.csv

Error Handling

The tool handles common HTTP errors without aborting the run:

| Error | Behavior |
| --- | --- |
| 400 Bad Request | Skips invalid entity type |
| 401 Unauthorized | Skips endpoint requiring auth |
| 403 Forbidden | Skips restricted endpoint |
| 404 Not Found | Skips missing endpoint |
| 502 Bad Gateway | Reports server error, continues |
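The table above can be read as a mapping from status code to action. The sketch below encodes it; `classify_status` is a hypothetical helper, and the handling of status codes not listed in the table is an assumption.

```python
from typing import Tuple

# Status codes the table says are skipped, with the stated reason.
SKIP_REASONS = {
    400: "invalid entity type",
    401: "endpoint requires auth",
    403: "restricted endpoint",
    404: "missing endpoint",
}


def classify_status(status_code: int) -> Tuple[str, str]:
    """Return (action, reason), where action is 'ok', 'skip', or 'report'."""
    if status_code in SKIP_REASONS:
        return ("skip", SKIP_REASONS[status_code])
    if status_code == 502:
        return ("report", "server error")
    if 200 <= status_code < 300:
        return ("ok", "")
    # Assumption: anything else is reported rather than silently skipped.
    return ("report", f"unexpected HTTP {status_code}")


for code in (200, 401, 404, 502):
    print(code, classify_status(code))
```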

Architecture

openapi_downloader.py
│
├── load_openapi()           # Load spec from file or URL
│
├── EndpointDetectorRegistry # Strategy pattern for detection
│   ├── @register("GTEx")    # Decorator-based registration
│   ├── @register("HuBMAP")
│   └── ... (15 detectors)
│
├── find_entity_endpoints()  # Main detection entry point
│
├── paginate_request()       # Handle all pagination types
│   ├── page-based
│   ├── offset-based
│   ├── cursor-based
│   ├── elasticsearch
│   └── solr
│
└── write_entity_csv()       # Export to CSV with flattening
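The decorator-based registration shown in the tree could be structured along these lines. This is a minimal sketch of the strategy pattern under assumptions: the method names, detector signature, and `detect_hubmap` body are illustrative and are not taken from the repository.

```python
from typing import Callable, Dict, List, Optional, Tuple

Detector = Callable[[dict], Optional[List[str]]]


class EndpointDetectorRegistry:
    """Holds named detector functions and tries them in registration order."""

    def __init__(self) -> None:
        self._detectors: Dict[str, Detector] = {}

    def register(self, name: str) -> Callable[[Detector], Detector]:
        def decorator(func: Detector) -> Detector:
            self._detectors[name] = func
            return func  # leave the decorated function usable on its own
        return decorator

    def detect(self, spec: dict) -> Tuple[Optional[str], List[str]]:
        for name, detector in self._detectors.items():
            endpoints = detector(spec)
            if endpoints:
                return name, endpoints
        return None, []


registry = EndpointDetectorRegistry()


@registry.register("HuBMAP")
def detect_hubmap(spec: dict) -> Optional[List[str]]:
    # Illustrative rule: match the path pattern mentioned in this README.
    if "/param-search/{entity_type}" in spec.get("paths", {}):
        return [f"/param-search/{e}"
                for e in ("donors", "samples", "datasets", "files")]
    return None


print(registry.detect({"paths": {"/param-search/{entity_type}": {}}}))
```

With this layout, supporting another database means adding one decorated function rather than editing a central if/else chain.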

Requirements

Python >= 3.9
requests
pyyaml
pandas

Install: pip install requests pyyaml pandas


Troubleshooting

"No entity endpoints detected"

  • Verify the spec URL uses /api/metadata/ (not /ui/)
  • Check if the database requires a specific --base-url

"Could not infer API base URL"

Pass the base URL explicitly with --base-url:

python3 openapi_downloader.py --openapi spec.json --base-url https://api.example.org

Empty CSV files

  • Some endpoints require authentication (401 errors are skipped)
  • Check the console output for skipped endpoints
