GitHub - sidataplus/athena2duckdb: Convert an extracted OMOP vocabulary download from OHDSI Athena into a ready-to-query DuckDB file

OMOP Athena → DuckDB

athena2duckdb converts an extracted OMOP vocabulary download from OHDSI Athena into a ready-to-query DuckDB file with typed CDM tables, primary keys, and automatic row-count validation. The loader ingests the standard OMOP vocab files (CONCEPT.csv, VOCABULARY.csv, etc.) and skips auxiliary exports like CONCEPT_CPT4.csv or README.txt by default.
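The file selection described above can be pictured as a filename filter. The following is a minimal sketch, not the project's actual code: `discover_vocab_files` and `STANDARD_TABLES` are hypothetical names, the table list is copied from the sample output in this README, and the matching rule (case-insensitive file stem, `.csv` suffix only) is an assumption.

```python
import tempfile
from pathlib import Path

# Table names copied from the sample output in this README.
STANDARD_TABLES = {
    "concept", "concept_ancestor", "concept_class", "concept_relationship",
    "concept_synonym", "domain", "drug_strength", "source_to_concept_map",
    "relationship", "vocabulary",
}


def discover_vocab_files(input_dir: Path) -> dict[str, Path]:
    """Map table name -> file for each recognised vocabulary CSV in input_dir.

    Hypothetical helper: the matching rule here is an assumption.
    """
    found: dict[str, Path] = {}
    for path in sorted(input_dir.iterdir()):
        table = path.stem.lower()
        if path.is_file() and path.suffix.lower() == ".csv" and table in STANDARD_TABLES:
            found[table] = path
    return found


# Build a throwaway export directory with two standard and two auxiliary files.
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp)
    for name in ("CONCEPT.csv", "VOCABULARY.csv", "CONCEPT_CPT4.csv", "README.txt"):
        (export / name).write_text("", encoding="utf-8")
    discovered = sorted(discover_vocab_files(export))

print(discovered)
```

Under these assumptions, CONCEPT_CPT4.csv and README.txt are left out because their stems do not match a standard table name.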

Features

Discovers standard OMOP vocabulary files such as CONCEPT.csv, VOCABULARY.csv, CONCEPT_RELATIONSHIP.csv, and more.
Streams each file into DuckDB using read_csv with quoting/escaping disabled, which prevents parse failures caused by embedded quotes or backslashes. The CLI shows a live progress bar per table.
Loads recognised vocab files into typed tables (INTEGER, DATE, VARCHAR) that match the CDM DDL with primary keys already enforced (secondary indexes can be added later if needed).
Always verifies row counts after loading, so a mismatch between a table and its source file is reported.
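To see why disabling quoting matters for Athena-style files, the sketch below parses the same tab-delimited text twice. It uses Python's standard `csv` module as a stand-in for DuckDB's read_csv, so it illustrates the idea rather than the loader itself, and the sample rows are made up.

```python
import csv
import io

# Made-up tab-delimited rows; the first data row has a field that starts
# with an unbalanced double quote.
SAMPLE = 'concept_id\tconcept_name\n1\t"Open quote\n2\tPlain name\n'


def parse(text: str, quoting: int) -> list[list[str]]:
    """Parse tab-delimited text with the given csv quoting mode."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    return list(reader)


# Quote-aware parsing treats the stray quote as the start of a quoted field,
# so the following line is absorbed into that field.
quote_aware = parse(SAMPLE, csv.QUOTE_MINIMAL)

# With quoting disabled, every quote is an ordinary character and each
# physical line stays one row.
literal = parse(SAMPLE, csv.QUOTE_NONE)

print(len(quote_aware), "rows with quote handling")
print(len(literal), "rows with quoting disabled")
print(literal[1])
```

The quote-aware pass merges the last two lines into a single row, while the pass with quoting disabled keeps all three rows and preserves the quote character in the value.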

Installation (local)

uv sync

Or build the wheel and install it from the project root:

uv build
pip install dist/athena2duckdb-*.whl

CLI Usage

uv run athena2duckdb /path/to/athena-export -o vocab.duckdb --verbose

Arguments:

Argument	Description
`input_dir`	Directory that contains the Athena CSV/TSV files.
`-o, --out`	Output DuckDB database file (default `vocab.duckdb`).
`--sep`	Field delimiter (default tab).
`--encoding`	Source file encoding (default `UTF-8`).
`--threads`	Number of DuckDB threads to use.
`--schema`	DuckDB schema name for created tables (default `main`).
`--overwrite`	Replace an existing DuckDB file if present.
`--verbose`	Emit INFO-level logs during the load.

Example

uv run athena2duckdb data/ -o vocab.duckdb

Sample output:

Loaded 10 tables into vocab.duckdb.
Tables: concept, concept_ancestor, concept_class, concept_relationship, concept_synonym,
domain, drug_strength, source_to_concept_map, relationship, vocabulary
OK        table=concept                  csv_rows=93,547 table_rows=93,547
...
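The csv_rows and table_rows figures in the output above can be thought of as two independent counts that must agree. The sketch below shows one way to obtain the file-side count, assuming one record per line with a single header row (plausible when quoting is disabled, but not confirmed by the source). `count_data_rows` is a hypothetical helper, and the table-side count is a hard-coded stand-in for a `SELECT count(*)` on the loaded table.

```python
import tempfile
from pathlib import Path


def count_data_rows(path: Path, encoding: str = "utf-8") -> int:
    """Count non-blank lines in a delimited file, excluding the header row."""
    with path.open("r", encoding=encoding) as handle:
        lines = sum(1 for line in handle if line.strip())
    return max(lines - 1, 0)


# Write a small made-up file with a header and three data rows.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "DOMAIN.csv"
    sample.write_text(
        "domain_id\tdomain_name\nDrug\tDrug\nVisit\tVisit\nRace\tRace\n",
        encoding="utf-8",
    )
    csv_rows = count_data_rows(sample)

table_rows = 3  # stand-in for the row count reported by the database
matches = csv_rows == table_rows
print(f"csv_rows={csv_rows:,} table_rows={table_rows:,} match={matches}")
```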

Programmatic API

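from pathlib import Path
from athena2duckdb import CSVOptions, load_vocab_dir, verify_row_counts

# Load the Athena export in data/ into vocab.duckdb, creating tables in the cdm schema.
summary = load_vocab_dir(Path("data"), Path("vocab.duckdb"), schema="cdm")
# Re-run the row-count check against the files that were loaded.
results = verify_row_counts(summary.db_path, summary.vocab_files, schema=summary.schema)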

Testing

uv run pytest

License

This project is licensed under the MIT License. See LICENSE.
