Cerebras/exome_bench

ExomeBench: A Benchmark for Clinical Variant Interpretation in Exome Regions 🧬

Evaluation Code | arXiv Paper | Hugging Face Dataset

ExomeBench consists of datasets and code. The datasets are licensed under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0). The code is licensed under Apache 2.0. ExomeBench is a research benchmark. It is not a diagnostic tool and should not be used to make clinical decisions.

1. Project Overview

ExomeBench is a benchmark dataset designed for the evaluation of models in the field of clinical genomics, specifically focusing on the interpretation of genetic variants in exome regions. This repo contains the code to fine-tune and evaluate models on the ExomeBench dataset using the Hugging Face Transformers library.

The ExomeBench dataset is derived from ClinVar (Nov 2024 release), a publicly accessible database maintained by the National Center for Biotechnology Information (NCBI). ClinVar provides comprehensive information on the clinical significance of genetic variants and their associations with human diseases. This dataset focuses on variants located in exome-specific regions and includes input sequences generated from the Human Reference Genome (HRG, GRCh38).

This dataset provides a valuable resource for researchers and practitioners working on genetic variant analysis and its clinical implications. Exome-specific regions are critically important because they encompass all protein-coding regions of the genome, where disease-associated variants are most likely to occur. By focusing on exome-specific regions and using sequences from the Human Reference Genome, this dataset enables robust evaluation of models on clinically significant tasks.

2. Data Curation

Data Collection

  • Source: Variants are sourced from the ClinVar database.
  • Clinical Significance: ClinVar provides detailed information on the clinical significance of each variant and its association with human diseases.

Data Filtering

  • Assertion Criteria: We include only variants with at least one submitter providing an interpretation and satisfying the assertion criteria for reliability.
  • Variant Type: Only single-nucleotide variants (SNVs) are selected.
  • Exome-Specific Regions: Only variants located in exome-specific regions (GENCODE v38) are retained.
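
The two structural filters above (SNV-only, exome-only) can be sketched in plain Python. This is only an illustration: the repository's own pipeline relies on bedtools for region handling, and the function names, record layout, and coordinates below are hypothetical.

```python
from bisect import bisect_right

NUCLEOTIDES = {"A", "C", "G", "T"}


def is_snv(ref: str, alt: str) -> bool:
    """True when both alleles are single, distinct nucleotides."""
    return ref in NUCLEOTIDES and alt in NUCLEOTIDES and ref != alt


def in_regions(pos: int, regions: list[tuple[int, int]]) -> bool:
    """True when pos lies in one of the half-open [start, end) intervals.

    `regions` must be sorted by start and non-overlapping (for example,
    merged exons on a single chromosome).
    """
    starts = [start for start, _ in regions]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and regions[i][0] <= pos < regions[i][1]


# Hypothetical records: (position, ref allele, alt allele).
variants = [(105, "A", "G"), (150, "C", "CT"), (400, "G", "T"), (210, "T", "C")]
exons = [(100, 120), (200, 260)]  # made-up coordinates

kept = [v for v in variants if is_snv(v[1], v[2]) and in_regions(v[0], exons)]
print(kept)  # [(105, 'A', 'G'), (210, 'T', 'C')]
```

The insertion at position 150 fails the SNV check, and the variant at position 400 falls outside both intervals, so only two records survive.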

Sequence Generation

  • Human Reference Genome (HRG, GRCh38): For each variant, generate input sequences from the HRG using the variants from the ClinVar database.
  • Sequence Length: The length of the sequences is a parameter, typically set to 100 base pairs (bp).
  • Variant Positioning: The variant is centered within the sequence, which is read in from a FASTA file.
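
A minimal sketch of the windowing step, assuming 0-based coordinates and that "centered" means the variant sits at index length // 2 of the window. The function name `centered_window` is hypothetical, and substituting the alternate allele into the window is an assumption about how the variant is encoded, not something the description above confirms.

```python
from typing import Optional


def centered_window(reference: str, pos: int, length: int = 100,
                    alt: Optional[str] = None) -> str:
    """Return `length` bases of `reference` with 0-based `pos` at index length // 2.

    If `alt` is given, the base at the variant position is replaced by it.
    """
    start = pos - length // 2
    end = start + length
    if start < 0 or end > len(reference):
        raise ValueError("window extends past the reference sequence")
    window = reference[start:end]
    if alt is not None:
        offset = pos - start
        window = window[:offset] + alt + window[offset + 1:]
    return window


# Toy reference; a real run would read the chromosome from a GRCh38 FASTA file.
reference = "ACGT" * 50  # 200 bp
seq = centered_window(reference, pos=101, length=100, alt="T")
print(len(seq), seq[50])  # 100 T
```

With an even window length there is no exact middle base, so the choice of index length // 2 is a convention; the actual dataset may place the variant one position to the left.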

Dataset Format

Each dataset entry consists of two main fields:

  • sequence (str): A DNA sequence centered around the variant.
  • label (int): Task-specific integer-encoded class index.
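
As an illustration of this layout, the sketch below checks a single entry. The helper `validate_entry` is hypothetical, and two details are assumptions rather than documented facts: that sequences use only the characters A, C, G, T, and N, and that class indices start at 0.

```python
VALID_BASES = set("ACGTN")  # assumed alphabet


def validate_entry(entry: dict, num_classes: int) -> None:
    """Raise ValueError if the entry does not match the sequence/label layout."""
    sequence, label = entry.get("sequence"), entry.get("label")
    if not isinstance(sequence, str) or not sequence:
        raise ValueError("sequence must be a non-empty string")
    if set(sequence.upper()) - VALID_BASES:
        raise ValueError("sequence contains non-nucleotide characters")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(label, bool) or not isinstance(label, int):
        raise ValueError("label must be an integer")
    if not 0 <= label < num_classes:
        raise ValueError(f"label must be in [0, {num_classes})")


# Made-up entry for a three-class task.
entry = {"sequence": "ACGTTGCAAC", "label": 2}
validate_entry(entry, num_classes=3)  # passes silently
```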

3. Tasks

ExomeBench includes five supervised tasks, each framed as a classification problem:

  • Pathogenic Variant Prediction (PV)
    Classify exome variants into four clinical significance categories: pathogenic, likely pathogenic, likely benign, or benign. To prevent leakage, the train/test split is made by gene, so variants from the same gene do not appear in both splits.

  • Phenotype Association

    • Cancer-Predisposing Syndrome (CPS): Determine if a variant is linked to Hereditary Cancer-Predisposing Syndrome.
    • Cardiovascular Phenotype (CP): Predict whether a variant is associated with cardiovascular conditions.
  • Gene Localization

    • BRCA Classification (BRCA): Identify whether a variant belongs to BRCA1, BRCA2, or neither.
    • Top 5 Genes Prediction (TFG): Classify a variant into one of the five most frequently represented genes in the dataset.
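
For the TFG task, one plausible way to derive class labels is to rank genes by how many variants they contain and index the top five. The sketch below shows that idea on made-up data; the function name, the tie-breaking rule, and the gene list are all assumptions, not the repository's implementation.

```python
from collections import Counter


def top_gene_labels(genes: list[str], k: int = 5) -> dict[str, int]:
    """Map the k most frequent gene symbols to class indices 0..k-1."""
    counts = Counter(genes)
    # Sort by descending count, then by name, so ties resolve deterministically.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return {gene: idx for idx, gene in enumerate(ranked[:k])}


# Made-up gene annotations, one per variant.
genes = ["BRCA2", "BRCA1", "TTN", "BRCA2", "ATM", "TTN", "BRCA2"]
label_map = top_gene_labels(genes, k=2)
print(label_map)  # {'BRCA2': 0, 'TTN': 1}
```

Variants whose gene is missing from the map would presumably be excluded from the task, since every TFG example belongs to one of the five classes.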

4. SOTA Model Performances

Please see the experiments folder for hyperparameter details and the results folder for the best model performance metrics. Below we report the best test-set MCC (Matthews correlation coefficient) for each task and model.

Test-set MCC by task:

Model                   PV      CPS     CP      BRCA    TFG
STRAND[1]               0.360   0.937   0.774   0.877   0.996
DNABERT-2-117M          0.162   0.876   0.549   0.552   0.996
HyenaDNA-Tiny-1k        0.135   0.816   0.445   0.700   0.994
NT-Multispecies-2.5B    0.306   0.624   0.293   0.422   0.991

Note: For some models and tasks, the seed settings in the STRAND paper differed slightly from those used in this repository, which may lead to minor variations in the reported results. As a result, on a saturated task such as TFG, the ordering of models by MCC may differ slightly from the ordering reported in the paper.
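
For readers unfamiliar with the metric, here is a dependency-free sketch of the multiclass Matthews correlation coefficient, which reduces to the usual binary formula when there are two classes. Whether the repository computes MCC in exactly this way is an assumption; this is the standard generalization found in common libraries.

```python
from collections import Counter
from math import sqrt


def mcc(y_true: list[int], y_pred: list[int]) -> float:
    """Multiclass Matthews correlation coefficient in [-1, 1]."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    s = len(y_true)                                  # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correct predictions
    t_k = Counter(y_true)                            # true count per class
    p_k = Counter(y_pred)                            # predicted count per class
    numerator = c * s - sum(t_k[k] * p_k[k] for k in t_k)
    denominator = sqrt((s * s - sum(v * v for v in p_k.values()))
                       * (s * s - sum(v * v for v in t_k.values())))
    # Convention: return 0.0 when the score is undefined (e.g. one class only).
    return numerator / denominator if denominator else 0.0


print(mcc([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]))  # 1.0
print(mcc([0, 0, 1, 1], [0, 1, 1, 0]))              # 0.0
```

A score of 1 means perfect agreement, 0 means no better than chance, and negative values mean systematic disagreement. Unlike accuracy, MCC stays informative when classes are imbalanced.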

5. Installation & Setup

Prerequisites

  • Python 3.10 or higher
  • Conda or Mamba (recommended)

Setup with Conda (Recommended)

1. Clone the Repository

git clone https://github.com/Cerebras/exome_bench.git
cd exome_bench

2. Create Conda Environment

conda create -n exome_bench python=3.10
conda activate exome_bench

3. Install bedtools (Required for Dataset Generation)

conda install -c bioconda bedtools

4. Install Python Packages

pip install -r requirements.txt
pip install pybedtools==0.12.0

Note: If you only use the pre-generated datasets from Hugging Face, you do not need bedtools or pybedtools: skip step 3 and the pybedtools line in step 4. Both are required only for regenerating datasets from ClinVar and reference genome files.
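
If you do plan to regenerate datasets, a quick way to confirm that bedtools is reachable from the active environment is to look it up on PATH. The helper name `missing_tools` is hypothetical; `shutil.which` is part of the Python standard library.

```python
import shutil


def missing_tools(tools: list[str]) -> list[str]:
    """Return the names in `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(["bedtools"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("bedtools is available")
```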

6. Fine-tuning and Evaluation

You can easily fine-tune and evaluate your model on the ExomeBench tasks:

Evaluate the Model

You can evaluate your model on the test split by specifying pretrained_model and task_name in eval_only mode.

python main.py \
  --pretrained_model InstaDeepAI/nucleotide-transformer-2.5b-multi-species \
  --output_dir results \
  --mode eval_only \
  --task_name brca
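
To script this evaluation from Python, for example to loop over several checkpoints, the command can be assembled as an argument list. The sketch below uses only the flags shown in the command above; the function name `build_eval_command` is hypothetical, and the script is assumed to run from the repository root.

```python
import subprocess
import sys


def build_eval_command(pretrained_model: str, task_name: str,
                       output_dir: str = "results") -> list[str]:
    """Build the argv list for an eval_only run of main.py."""
    return [
        sys.executable, "main.py",
        "--pretrained_model", pretrained_model,
        "--output_dir", output_dir,
        "--mode", "eval_only",
        "--task_name", task_name,
    ]


cmd = build_eval_command(
    "InstaDeepAI/nucleotide-transformer-2.5b-multi-species", "brca"
)
print(" ".join(cmd[1:]))
# To actually launch the evaluation from the repository root:
# subprocess.run(cmd, check=True)
```

Passing a list instead of a shell string avoids quoting problems in model names or paths.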

Fine-Tune and Evaluate the Model

You can fine-tune the model and supply training arguments through a YAML file.

python main.py \
  --config configs/brca.yaml

(Optional) Perform Hyper-Parameter Optimization, Fine-Tune, and Evaluate the Model

Setting --mode hp_optimize runs a hyper-parameter search and then fine-tunes the model with the best configuration found.

python main.py \
  --config configs/brca.yaml \
  --mode hp_optimize

7. Contact

Corresponding Email: exome-bench@cerebras.net

8. Citation

This benchmark was developed as part of the efforts supporting the paper: Introducing STRAND: A Foundational Sequence Transformer for Range Adaptive Nucleotide Decoding in collaboration with Mayo Clinic. If you find our work valuable, please consider giving the project a star and citing it in your research:

@article{ExomeBench,
    title={Introducing a foundational sequence transformer for range adaptive nucleotide decoding (STRAND)},
    author={Ayanian, Shant and others},
    journal={Briefings in Bioinformatics},
    volume={26},
    number={6},
    year={2025},
    doi={10.1093/bib/bbaf618}
}

Thank you for your support!

9. Acknowledgments

The fine-tuning process developed for this repository uses the Hugging Face Transformers library.

10. License

The code in this project is licensed under the Apache 2.0 License. For more details, please see the LICENSE file.
