NAX-Narcolepsy

NLP-based narcolepsy detection from electronic health record (EHR) clinical notes. This repository contains two complementary approaches:

Discriminative Modeling -- Classifies individual clinical notes as belonging to patients with or without narcolepsy (NT1, NT2/IH).
Predictive Modeling -- Computes a longitudinal risk score from pre-diagnostic clinical notes to identify patients likely to be diagnosed with narcolepsy in the future.

Repository Structure

NAX-Narcolepsy/
├── discriminative-modeling/       # Note-level classification
│   ├── narcolepsy_model.py        # Feature extraction & prediction pipeline
│   ├── model_comp.py              # Model training & evaluation framework
│   ├── discriminative-model.ipynb # Usage instructions and example workflow
│   ├── config.yaml                # Feature definitions and model paths
│   ├── env.toml                   # Environment configuration
│   └── models/                    # Pre-trained classifiers
│       ├── nt1_vs_not.joblib
│       ├── nt2_vs_not.joblib
│       └── nt12_vs_not.joblib
│
├── predictive-modeling/           # Pre-diagnostic risk scores
│   ├── features_update/           # Input feature data (parquet)
│   │   ├── nt1/
│   │   └── nt2ih/
│   ├── risk_score_v2/             # Active risk score pipeline
│   │   ├── risk_score_v2.py       # Main training/evaluation/plotting script
│   │   └── METHODS.md             # Detailed methodology
│   └── pooled-logistic-regression/  # Alternative PLR approach (archived)
│
├── paper_figures/                 # Notebooks & scripts to reproduce paper figures
│   ├── confusion_matrices.ipynb   # Confusion matrix plots
│   ├── roc_prc.ipynb              # ROC and precision-recall curves
│   ├── swimmer_plot.ipynb         # Swimmer plot of patient timelines
│   └── feature_heatmap.py         # Feature evolution heatmaps (cases vs controls)
│
├── timeline-viewer/               # Annotation tool (git submodule)
│
└── LICENSE

Discriminative Modeling

Classifies whether a clinical note belongs to a patient with narcolepsy. Uses keyword matching with negation detection, ICD code features, and medication features extracted from clinical notes.

Models

Three pre-trained Random Forest classifiers:

nt1_vs_not -- NT1 vs. non-narcolepsy
nt2_vs_not -- NT2/IH vs. non-narcolepsy
nt12_vs_not -- Any narcolepsy (NT1 + NT2/IH) vs. non-narcolepsy

Features

924 features per visit, including:

Clinical keywords (~446 stemmed terms with negation detection, e.g., cataplexi_, sleepi attack_, narcolepsi_neg_)
ICD codes (regex matching for narcolepsy diagnosis codes, e.g., G47.41, G47.42)
Medications (27 narcolepsy-relevant drugs: modafinil, Xyrem, stimulants, antidepressants, etc.)

Usage

See discriminative-modeling/discriminative-model.ipynb for a complete walkthrough including:

Loading clinical data (notes, ICD codes, medications) from parquet files
Feature extraction using the NarcolepsyModel class
Model inference for all three classification tasks
Leave-one-source-out cross-validation and model comparison

See predictive-modeling/predictive-model.ipynb for additional feature extraction code for predictive modeling

Dependencies

polars, pandas, NLTK, Ray, scikit-learn, joblib

Predictive Modeling

Computes a longitudinal risk score from clinical notes written before a narcolepsy diagnosis is made, to support earlier referral for diagnostic testing.

Data

Source: BDSP (5 academic medical centers: BCH, BIDMC, Emory, MGB, Stanford)
Cohort: 196 any-narcolepsy training cases (66 NT1), 11,049 controls
Features: Same 924 NLP features as the discriminative models, extracted per visit

Method

SGD logistic regression with L1 penalty and balanced minibatches
Chi-squared feature prefiltering (top 100 features)
Training window: [-2.5yr, -0.5yr] before diagnosis (0.5yr horizon exclusion prevents learning from diagnostic-workup visits)
Alpha (regularization) selected via modal cross-validation across folds
Validation: stratified 5-fold CV (primary), leave-one-site-out CV (secondary)

Results

Outcome	5-fold CV AUC	LOSO AUC	Training Cases
Any Narcolepsy (NT1 + NT2/IH)	0.835	0.797	196
NT1 Only	0.838	0.788	66

Running

cd predictive-modeling/risk_score_v2
python risk_score_v2.py both

This trains both outcome models (any_narcolepsy, NT1), runs all cross-validation, trains the final model, and generates all figures and tables:

v2_performance_combined.png -- CV / LOSO / resubstitution AUC and AUPRC
v2_distributions_combined.png -- Risk score distributions for cases vs. controls
v2_features_combined.png -- Top predictive features by coefficient magnitude
v2_trajectories_combined.png -- Risk score trajectories aligned to diagnosis (logit scale)
v2_nnt_analysis.png -- Number Needed to Test analysis (assumed prevalence 0.08%)
v2_loso_by_site.csv -- LOSO performance broken down by site

Dependencies

numpy, pandas, scikit-learn, matplotlib, seaborn, scipy, pyarrow

Paper Figures

Jupyter notebooks and scripts in paper_figures/ reproduce the main figures from the manuscript. Each notebook loads data from ../data (relative to the notebook) and requires the discriminative-modeling data files (features, notes, models).

feature_heatmap.py generates feature evolution heatmaps showing how selected predictive model features (non-zero L1 coefficients) evolve over time leading up to diagnosis in cases vs. matched controls. Run with python feature_heatmap.py to generate both outcome heatmaps, or pass any_narcolepsy or nt1 for a single outcome.

Annotation Tool

The timeline-viewer/ directory is a git submodule pointing to bdsp-core/timeline-viewer, a web application for reviewing and annotating patient clinical timelines. See its own README for setup instructions.

License

CC BY-NC 4.0 (Attribution-NonCommercial 4.0 International). Commercial use is prohibited. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
discriminative-modeling		discriminative-modeling
manuscript		manuscript
paper_figures		paper_figures
predictive-modeling		predictive-modeling
timeline-viewer @ d7fb04c		timeline-viewer @ d7fb04c
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
REPRODUCIBILITY.md		REPRODUCIBILITY.md
build_manuscript_figures.sh		build_manuscript_figures.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NAX-Narcolepsy

Repository Structure

Discriminative Modeling

Models

Features

Usage

Dependencies

Predictive Modeling

Data

Method

Results

Running

Dependencies

Paper Figures

Annotation Tool

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

NAX-Narcolepsy

Repository Structure

Discriminative Modeling

Models

Features

Usage

Dependencies

Predictive Modeling

Data

Method

Results

Running

Dependencies

Paper Figures

Annotation Tool

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages