OTAR3088: Automated Knowledge Extraction for Biomedical Literature

This repository hosts the codebase and resources for the OTAR3088 project — a collaborative initiative between Europe PMC (EPMC), ChEMBL, and Open Targets.

The project aims to modernise and extend the existing Named Entity Recognition (NER) workflows used by EPMC and Open Targets to cover a broader range of biomedical entities relevant to drug discovery — including variants, biomarkers, tissues/cell types, adverse events, and assay conditions.

By incorporating these new entity types, the project seeks to provide higher confidence in the relevance of target–disease associations and to enhance downstream knowledge extraction and integration.

Key Objectives

Extend existing NER pipelines to support new biomedical entity types.
Develop a modular, flexible framework that enables easy replacement or integration of new NLP models and datasets as they become available.
Explore and benchmark modern NLP architectures (e.g., Transformer-based models) and advanced fine-tuning techniques for biomedical text mining.
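
As an illustration of the second objective, the following is a minimal sketch of how swappable NER backends could be wired together. It is not taken from this repository: the `Entity` dataclass, the `register` decorator, `get_tagger`, the toy `dictionary_tagger`, and the entity labels are all hypothetical names chosen for the example, and the actual pipeline may be organised quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Entity:
    """A single recognised span (hypothetical structure for this sketch)."""
    text: str
    label: str
    start: int
    end: int


# A tagger is any callable that maps raw text to a list of entities.
Tagger = Callable[[str], List[Entity]]

# Hypothetical registry: backends are looked up by name, e.g. from a config value.
_REGISTRY: Dict[str, Tagger] = {}


def register(name: str) -> Callable[[Tagger], Tagger]:
    """Decorator that stores a tagger in the registry under `name`."""
    def decorator(fn: Tagger) -> Tagger:
        _REGISTRY[name] = fn
        return fn
    return decorator


@register("dictionary")
def dictionary_tagger(text: str) -> List[Entity]:
    """Toy backend: case-insensitive lookup in a tiny hand-written lexicon."""
    lexicon = {"hepatotoxicity": "ADVERSE_EVENT", "liver": "TISSUE"}
    lowered = text.lower()
    found = []
    for term, label in lexicon.items():
        start = lowered.find(term)  # first occurrence only, for brevity
        if start != -1:
            end = start + len(term)
            found.append(Entity(text[start:end], label, start, end))
    return sorted(found, key=lambda e: e.start)


def get_tagger(name: str) -> Tagger:
    """Return the registered tagger, or raise a clear error for unknown names."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown tagger {name!r}; available: {sorted(_REGISTRY)}"
        ) from None


if __name__ == "__main__":
    tagger = get_tagger("dictionary")
    for entity in tagger("Hepatotoxicity was observed in liver tissue."):
        print(entity)
```

Under this design, adding a new model would mean registering one more function under a new name, so calling code that selects a backend by name would not need to change.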

🧩 Repository Structure

OTAR3088/
│
├── Entity-Extraction-Modular-pipeline/      # Main modular pipeline for biomedical NER
│   ├── steps/                               
│   ├── configs/                             # YAML configuration files (Hydra-based)
│   ├── pipelines/                           # Data preprocessing and model training pipelines
│   ├── utils/                               # Helper functions and utilities
│   └── README.md                            # Documentation for this module (multi-page)
│
├── Data_mining/                             # Scripts & notebooks for dataset exploration or sourcing
├── Data_extraction-Query/                   # Query-based data extraction workflows
├── scripts/                                 # General-purpose or legacy scripts
└── README.md                                # Central project documentation (this file)
