Tushar edited this page May 14, 2025 · 1 revision

PdcmDataTransformers Developer Wiki

Overview

The PdcmDataTransformers repository contains Jupyter notebooks and Python code for bespoke data transformations used in the PDCM Finder project. PDCM Finder is a cancer research platform that aggregates and standardizes data from patient-derived models (xenografts, organoids, cell lines) across multiple sources. This repository provides ETL (Extract–Transform–Load) pipelines for various model data providers, converting raw source data into the common PDCM data schema and supporting the integration of diverse biological and clinical information. Key ETL workflows include Cell Model Passports, Human Cancer Model Initiative (HCMI), Jackson Laboratory (JAX), PDMR, PDXNet, and others. Each pipeline typically loads source data, applies cleaning and mapping steps, and outputs standardized JSON/CSV aligned with the PDCM model.

Architecture and Workflows

The codebase is organized into source-specific modules (folders) and utility scripts. Each module handles one data provider or task.

Source ETL Pipelines (in subfolders named after providers):

  • CMP – Processes Cell Model Passports data.
  • HCMI – Processes data from the NCI Human Cancer Models Initiative.
  • HCI-BCM – Processes data from the Baylor College of Medicine branch of HCMI.
  • JAX – Processes data from The Jackson Laboratory.
  • PDMR – Processes data from the Patient-Derived Models Repository (NCI).
  • PDXNet – Processes data from the PDXNet consortium.
  • DFCI – Processes data from Dana-Farber Cancer Institute.
  • NKI – Processes data from the Netherlands Cancer Institute.
  • IRCCS-CRC – Processes data from the Italian IRCCS-CRC consortium.
  • CRL, cccells, CancerModelsFinder, Others, PMLB, etc. – Similar ETL scripts for additional data sources or special datasets.

Each pipeline is typically a Jupyter notebook (or set of notebooks) that reads raw files (CSV/Excel/JSON), cleans and maps fields (e.g. renaming columns, handling missing values, converting dates), and produces output files in the PDCM format. The workflows often end by writing one or more standardized JSON or CSV files. These modules share the common goal of data harmonization and follow the PDCM Finder’s strategy to “standardise, harmonise and integrate the complex and diverse data associated with PDCMs”.
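As a concrete illustration, a cleaning-and-mapping step in such a notebook might look like the following sketch. The column names, mappings, and placeholder values here are hypothetical, not the actual PDCM schema:

```python
import pandas as pd

# Hypothetical raw provider export; real files and column names vary by source.
raw = pd.DataFrame({
    "Model ID": ["JX-001", "JX-002", "JX-002"],
    "Tumour Type": ["primary", None, "metastatic"],
    "Collection Date": ["2021-03-04", "2020-11-17", "2020-11-17"],
})

# Rename provider-specific columns to standard field names (illustrative).
clean = raw.rename(columns={
    "Model ID": "model_id",
    "Tumour Type": "tumour_type",
    "Collection Date": "collection_date",
})

# Typical cleaning steps: deduplicate, fill missing values, convert dates.
clean = clean.drop_duplicates(subset="model_id", keep="first")
clean["tumour_type"] = clean["tumour_type"].fillna("Not provided")
clean["collection_date"] = pd.to_datetime(clean["collection_date"])

records = clean.to_dict(orient="records")
```

A real pipeline would end by writing `clean` out with `to_csv` or `to_json` for ingestion into PDCM Finder.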

Utility Scripts:

utils.py – Common helper functions used by multiple pipelines, such as data loading, field normalization, and basic cleaning routines. For example, utility functions may rename dataframe columns, fill or drop missing values, convert units, and handle logging. Each major pipeline script calls functions from utils.py to avoid code duplication.
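Helpers in this spirit might look like the sketch below. The function names and signatures are illustrative assumptions, not the actual contents of utils.py:

```python
import logging

import pandas as pd

logger = logging.getLogger("pdcm_transform")


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case column names and replace spaces with underscores."""
    return df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))


def fill_missing(df: pd.DataFrame, value: str = "Not provided") -> pd.DataFrame:
    """Replace missing cells with a standard placeholder, logging the count."""
    logger.info("Filling %d missing cells", int(df.isna().sum().sum()))
    return df.fillna(value)


df = pd.DataFrame({"Model Name": ["m1", None]})
df = fill_missing(normalize_columns(df))
```

Centralizing steps like these keeps the per-provider notebooks focused on source-specific mapping logic.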

  • EnsemblToVCF – Contains code to convert variant data from Ensembl (or other formats) into VCF format, if genomic variant transformation is needed.
  • cbioportal – Contains scripts to transform PDCM data into formats compatible with cBioPortal or other downstream platforms.
  • cccells – Likely processes curated lists of cancer cell lines (e.g. CCLE) into the PDCM schema.
  • citation – Scripts to attach or format literature citations and references for models (e.g., fetching DOIs or PubMed IDs).
  • data_model_changes – Notebooks analyzing or migrating changes in the PDCM data model (e.g. schema updates between versions).
  • Clean-up-scripts – Additional scripts for generic data cleaning tasks, deduplication, or one-off corrections.
  • Visualization – Notebooks containing exploratory plots or checks for transformed data (for QA/QC or reporting).
  • resources and publications – Likely hold reference files (e.g. static lists of resources or publications) used by other scripts.
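For the EnsemblToVCF case, the core of such a conversion can be sketched as below. The variant table columns are hypothetical stand-ins; real Ensembl exports carry many more fields:

```python
import pandas as pd

# Hypothetical variant table; actual Ensembl exports have more columns.
variants = pd.DataFrame({
    "chromosome": ["7", "17"],
    "position": [140753336, 7673802],
    "ref_allele": ["A", "G"],
    "alt_allele": ["T", "A"],
})

# Minimal VCF 4.2 header: file-format line plus the fixed column header.
VCF_HEADER = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
])


def to_vcf_lines(df: pd.DataFrame) -> list[str]:
    """Render each variant row as a minimal VCF body line ('.' = missing)."""
    return [
        f"{r.chromosome}\t{r.position}\t.\t{r.ref_allele}\t{r.alt_allele}\t.\t.\t."
        for r in df.itertuples(index=False)
    ]


vcf_text = VCF_HEADER + "\n" + "\n".join(to_vcf_lines(variants))
```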

Each pipeline generally proceeds as follows:

  • Extraction – Read in raw data files (e.g. CSV, TSV, Excel, JSON) from the source.
  • Transformation – Use pandas (or similar) to clean, filter, and map fields. This may involve calling utility functions (from utils.py) to normalize columns, remove duplicates, or translate identifiers.
  • Load/Output – Write the cleaned data into PDCM-standard JSON or CSV outputs, which can then be ingested into the PDCM Finder database.
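The three-step pattern above can be sketched as a set of small functions. Everything here is an assumption for illustration (an in-memory CSV, made-up field names, and a generic JSON output) rather than the repository's actual code:

```python
import io
import json

import pandas as pd

# Stand-in for a raw provider file on disk.
RAW = "ID,Diagnosis\nM1,melanoma\nM1,melanoma\nM2,glioma\n"


def extract(source) -> pd.DataFrame:
    """Read a raw source file (CSV here; other pipelines read Excel/JSON)."""
    return pd.read_csv(source)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Rename provider fields and drop duplicate models."""
    return (
        df.rename(columns={"ID": "model_id", "Diagnosis": "diagnosis"})
          .drop_duplicates(subset="model_id")
    )


def load(df: pd.DataFrame) -> str:
    """Serialize to JSON records (illustrative, not the real PDCM schema)."""
    return json.dumps(df.to_dict(orient="records"), indent=2)


output = load(transform(extract(io.StringIO(RAW))))
```

In the notebooks these stages are usually interleaved with exploratory cells, but the extract → transform → load shape is the same.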
