omop-rag is a Python CLI tool designed for accurate mapping of unstructured clinical event names to standardised OMOP concepts.
It leverages a Retrieval-Augmented Generation (RAG) approach:
- Vector Search: Uses a pre-trained clinical Sentence Transformer model (MedEmbed-large-v0.1) to find the top 10 most similar OMOP concepts (Retrieval) in the already vectorised `concept_embeddings.pt` file, which is a PyTorch vector database generated from all the LOINC lab test concepts in the `lab_concepts.csv` file.
- QA Matching: Employs a Question-Answering (QA) model (deepset/roberta-base-squad2) to select the single best match from those 10 candidates (Generation/Refinement).
In our case, the input documents are OMOP laboratory test concepts. These are embedded into a vector database, which the user can query to retrieve the concepts most closely related to free-text lab test events. Another LLM agent can then use this shortened context to produce a more accurate match.
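The retrieval step described above can be illustrated with a minimal sketch. It assumes cosine similarity as the ranking measure and uses tiny hand-written vectors in place of real MedEmbed embeddings; the function names `cosine` and `top_k` and the concept keys are hypothetical and are not taken from the project's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, concept_vecs, k=10):
    """Return the k (concept_key, score) pairs most similar to the query."""
    scored = [(key, cosine(query_vec, vec)) for key, vec in concept_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real concept embeddings (assumption).
concepts = {
    "concept_a": [1.0, 0.0, 0.0],
    "concept_b": [0.9, 0.1, 0.0],
    "concept_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

for key, score in top_k(query, concepts, k=2):
    print(key, round(score, 4))
```

With real embeddings the same ranking would normally be done as one batched tensor operation rather than a Python loop; the loop here only keeps the idea visible.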
This project uses Poetry for dependency management.
- Install Poetry (if you haven't already):
  (Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -
- Clone the repository:
  git clone https://github.com/answerdigital/omop-rag.git
  cd omop-rag
- Install dependencies: Poetry will create a virtual environment and install all necessary packages.
  poetry install
The application is run via poetry run python main.py followed by one of three subcommands: create-embeddings, similar-search, or best-match.
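As a rough illustration of how a three-subcommand CLI like this could be wired up, here is a sketch using the standard library's `argparse`. The subcommand and flag names are the ones documented in this README; the `build_parser` function is hypothetical, and the project's actual `main.py` may use a different library or structure.

```python
import argparse


def build_parser():
    """Build a parser with the three documented subcommands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    create = sub.add_parser("create-embeddings")
    create.add_argument("--concepts-file", required=True)
    create.add_argument("--embeddings-file", required=True)

    search = sub.add_parser("similar-search")
    for flag in ("--concepts-file", "--embeddings-file", "--input-csv", "--output-json"):
        search.add_argument(flag, required=True)

    match = sub.add_parser("best-match")
    match.add_argument("--input-json", required=True)
    match.add_argument("--output-csv", required=True)
    return parser


# Parse an example invocation instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["best-match",
     "--input-json", "results/lab_similar_results.json",
     "--output-csv", "results/lab_matches.csv"]
)
print(args.command, args.input_json, args.output_csv)
```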
Skip this step if you want to use our provided LOINC lab test embeddings: the create-embeddings process may take a while without a GPU. The create-embeddings command preprocesses your raw concept data into a vector database. It requires two arguments.
| Argument | Description | Example Path |
|---|---|---|
| --concepts-file | Path to the input CSV containing OMOP concepts (concept_name, concept_id columns). | data/drug_concepts.csv |
| --embeddings-file | Path to save the resulting PyTorch embeddings (.pt file). | embeddings/drug/drug_embeddings.pt |
Example:
poetry run python main.py create-embeddings \
--concepts-file data/drug_concepts.csv \
--embeddings-file embeddings/drug/drug_embeddings.pt

The similar-search command queries the vector database to find the top 10 similar concepts for each event in your input CSV. It requires four arguments.
| Argument | Description | Example Path |
|---|---|---|
| --concepts-file | Path to the original concepts CSV (used for mapping IDs to names). | data/lab/lab_concepts.csv |
| --embeddings-file | Path to the pre-created embeddings (.pt file) to search against. | embeddings/lab/concept_embeddings.pt |
| --input-csv | Path to the CSV containing the raw events to query (must have an EVENT column). | data/lab/lab_events.csv |
| --output-json | Path to save the search results JSON file. | results/lab_similar_results.json |
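Since the input CSV must have an EVENT column, it can help to check a file before running a long search. The sketch below shows one way to do that with the standard library; `read_events` is a hypothetical helper and not part of omop-rag, and the in-memory sample stands in for a real file such as data/lab/lab_events.csv.

```python
import csv
import io


def read_events(handle):
    """Return the non-empty EVENT values from an open CSV file handle."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "EVENT" not in reader.fieldnames:
        raise ValueError("input CSV must contain an EVENT column")
    # Skip rows whose EVENT cell is missing or only whitespace.
    return [row["EVENT"].strip() for row in reader
            if row["EVENT"] and row["EVENT"].strip()]


# In-memory sample standing in for a real events file.
sample = io.StringIO("EVENT\nHaemoglobin levels in blood\ncreatinine levels in blood\n")
print(read_events(sample))
```

For a file on disk, the same helper could be called as `read_events(open(path, newline="", encoding="utf-8"))`.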
Example:
poetry run python main.py similar-search \
--concepts-file data/lab/lab_concepts.csv \
--embeddings-file embeddings/lab/concept_embeddings.pt \
--input-csv data/lab/lab_events.csv \
--output-json results/lab_similar_results.json

Example Output (results/lab_similar_results.json snippet):
[
  {
    "input": "Haemoglobin levels in blood",
    "similar_concepts": [
      {
        "id": 3005872,
        "name": "Hemoglobin [Presence] in Blood",
        "score": 0.866
      },
      // ... 9 more results
    ]
  },
  {
    "input": "creatinine levels in blood",
    "similar_concepts": [
      {
        "id": 3051825,
        "name": "Creatinine [Mass/volume] in Blood",
        "score": 0.8939
      },
      // ... 9 more results
    ]
  }
]

The best-match command takes the JSON output from similar-search, uses a QA model to select the single best match, and exports the final mapping to a clean CSV. It requires two arguments.
| Argument | Description | Example Path |
|---|---|---|
--input-json |
Path to the input JSON file from the similar-search step. |
results/lab_similar_results.json |
--output-csv |
Path to save the final concept mapping CSV file. | results/lab_matches.csv |
Example:
poetry run python main.py best-match \
--input-json results/lab_similar_results.json \
--output-csv results/lab_matches.csv

Example Output (results/lab_matches.csv):
| raw_event_input | concept_id | concept_name |
|---|---|---|
| Haemoglobin levels in blood | 3005872 | Hemoglobin [Presence] in Blood |
| creatinine levels in blood | 3051825 | Creatinine [Mass/volume] in Blood |
| o2 sat test | 3016502 | Oxygen saturation in Arterial blood |
| ph blood | 3010421 | pH of Blood |
| potassium levels in blood | 21490733 | Potassium [Mass/volume] in Blood |
| na levels in blood | 3000285 | Sodium [Moles/volume] in Blood |
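When reviewing the final matches, it can be useful to compare them against a simple baseline that takes the highest-scoring candidate from the similar-search JSON for each event. The sketch below is not how best-match works (that step uses the QA model); `top_scored` is a hypothetical helper, and the second candidate in the sample is a made-up placeholder included only so there is something to compare.

```python
import json


def top_scored(results):
    """Map each input event to its highest-scoring candidate concept."""
    picks = {}
    for entry in results:
        candidates = entry["similar_concepts"]
        if candidates:
            picks[entry["input"]] = max(candidates, key=lambda c: c["score"])
    return picks


# Same shape as the similar-search output; the id-0 entry is a placeholder.
sample = json.loads("""
[
  {
    "input": "Haemoglobin levels in blood",
    "similar_concepts": [
      {"id": 0, "name": "placeholder candidate", "score": 0.5},
      {"id": 3005872, "name": "Hemoglobin [Presence] in Blood", "score": 0.866}
    ]
  }
]
""")

for event, concept in top_scored(sample).items():
    print(event, "->", concept["id"], concept["name"])
```

Events where this baseline and the best-match CSV disagree are natural candidates for manual review.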
To run the complete concept mapping pipeline from raw concepts to final matches, execute the three commands in sequence:
# 1. Create the vector database
poetry run python main.py create-embeddings --concepts-file data/lab/lab_concepts.csv --embeddings-file embeddings/lab/concept_embeddings.pt
# 2. Find the top 10 similar concepts for each event
poetry run python main.py similar-search --concepts-file data/lab/lab_concepts.csv --embeddings-file embeddings/lab/concept_embeddings.pt --input-csv data/lab/lab_events.csv --output-json results/lab_similar_results.json
# 3. Use the QA model to select the single best match
poetry run python main.py best-match --input-json results/lab_similar_results.json --output-csv results/lab_matches.csv
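If you run the pipeline often, the three commands can be driven from a small Python script. The sketch below builds the commands shown above and, when asked, runs them in order, stopping at the first failure; `pipeline_commands` and `run_pipeline` are hypothetical helpers that are not shipped with omop-rag.

```python
import shlex
import subprocess

BASE = ["poetry", "run", "python", "main.py"]


def pipeline_commands(concepts, embeddings, events, results_json, matches_csv):
    """Build the three documented commands as argument lists."""
    return [
        BASE + ["create-embeddings",
                "--concepts-file", concepts,
                "--embeddings-file", embeddings],
        BASE + ["similar-search",
                "--concepts-file", concepts,
                "--embeddings-file", embeddings,
                "--input-csv", events,
                "--output-json", results_json],
        BASE + ["best-match",
                "--input-json", results_json,
                "--output-csv", matches_csv],
    ]


def run_pipeline(commands):
    """Run each command in order; check=True raises on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = pipeline_commands(
    "data/lab/lab_concepts.csv",
    "embeddings/lab/concept_embeddings.pt",
    "data/lab/lab_events.csv",
    "results/lab_similar_results.json",
    "results/lab_matches.csv",
)
# Print the commands for inspection; call run_pipeline(commands) to execute them.
for cmd in commands:
    print(shlex.join(cmd))
```

Passing argument lists to `subprocess.run` avoids shell quoting problems with paths that contain spaces.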