We invite researchers and practitioners to participate in Track 2: Sentence segmentation of real-life clinical notes at BioCreative IX @ IJCAI. This track focuses on developing automatic systems that accurately segment clinical notes into sentences. Participants are required to identify the span of each sentence.
Sentence segmentation is a fundamental linguistic task widely used as a pre-processing step in many NLP applications. While modern LLMs and sparse attention mechanisms in transformer networks have reduced the need for sentence-level inputs in some NLP tasks, many models are still designed and tested on shorter sequences. The need for sentence segmentation is particularly pronounced in clinical notes, as most clinical NLP tasks depend on this information for annotation and model training. In 2024, we introduced a baseline system for sentence segmentation that outperformed several standard tools, such as Stanza, SpaCy, and Trankit. To foster further innovation in sentence segmentation techniques, we organized a shared task as part of the BioCreative IX workshop.
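To see why off-the-shelf rules fall short on clinical text, consider a minimal punctuation-based splitter (a stdlib sketch with a hypothetical helper, not one of the tools above): dosage shorthand and abbreviations trigger spurious boundaries.

```python
import re

def naive_segment(text):
    """Return (start, end) character spans from a crude rule:
    split after '.', '!' or '?' when followed by whitespace or end of text.
    """
    spans, start = [], 0
    for m in re.finditer(r"[.!?](?:\s+|$)", text):
        spans.append((start, m.start() + 1))  # span ends after the punctuation
        start = m.end()
    if start < len(text):
        spans.append((start, len(text)))  # trailing chunk without final punctuation
    return spans

# The dosage shorthand 'q.d.' is wrongly treated as a sentence boundary:
print(naive_segment("Pt given 50 mg q.d. Improved overnight."))
# → [(0, 19), (20, 39)] -- two spans where there should be one sentence pair split correctly
```

This single sentence pair is split at "q.d." rather than only at the true boundary, which is exactly the kind of error the shared task aims to eliminate.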
To join this shared task, please register for Track 2: Sentence segmentation of real-life clinical notes through the BioCreative IX Shared Task Registration Form. Upon registration, participants will gain access to the full dataset and annotation guidelines.
The following requirements apply to both data preparation and evaluation scripts:
- Python 3.6 or higher
- pandas library
Additional requirement for evaluation:
- numpy library
This shared task utilizes a corpus of clinical notes derived from the MIMIC-III Database.
Participants are required to complete the necessary training and sign a data use agreement to access the MIMIC-III Clinical Database (v1.4). After gaining access, download 'NOTEEVENTS.csv.gz' and the ROW_ID files from the Google Drive (the download link is provided in the Google group), then run the collect_notes_from_mimic.py script to retrieve the clinical notes matching the provided ROW_IDs.
The script requires three command-line arguments:
- --note_ids_path: The file path to the text file containing the note IDs.
- --mimic_path: The directory path containing the MIMIC-III v1.4 CSV files (NOTEEVENTS.csv.gz, PRESCRIPTIONS.csv.gz and PATIENTS.csv.gz).
- --output_path: The file path where the processed corpus CSV will be saved.
```
python collect_notes_from_mimic.py --note_ids_path data/biocreative/sample/sample_row_id.txt --mimic_path ./mimic-iii/1.4 --output_path data/biocreative/sample/sample_corpus.csv
```
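Conceptually, the retrieval step amounts to filtering NOTEEVENTS by ROW_ID. The sketch below is an assumption about the core logic, not the official script (which may also consult PRESCRIPTIONS.csv.gz and PATIENTS.csv.gz); column names follow the MIMIC-III schema.

```python
import pandas as pd

def collect_notes(note_ids_path, mimic_path, output_path):
    """Filter NOTEEVENTS down to the shared-task ROW_IDs and save a corpus CSV."""
    # One ROW_ID per line in the ids file.
    with open(note_ids_path) as f:
        row_ids = {int(line.strip()) for line in f if line.strip()}

    # NOTEEVENTS.csv.gz is large; read only the columns we need.
    notes = pd.read_csv(
        f"{mimic_path}/NOTEEVENTS.csv.gz",
        usecols=["ROW_ID", "CATEGORY", "TEXT"],
        compression="gzip",
    )
    corpus = notes[notes["ROW_ID"].isin(row_ids)]
    corpus.to_csv(output_path, index=False)
    return corpus
```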
Ground truth annotations are provided in CSV format. A sample of the training set annotations is available at data/biocreative/sample. Each annotation file is named using the ROW_ID of the clinical note from MIMIC-III and includes information on sentence boundaries and types. There are two types of text chunks:
- Sentence
- Unstructured
The distinction between these types is detailed in our annotation guidelines, which will be provided upon registration. Registered participants will have access to the full dataset and annotation guidelines.
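As a rough illustration of how such an annotation file might be consumed, the sketch below assumes hypothetical column names (start, end, type); the authoritative layout is defined in the annotation guidelines distributed on registration.

```python
import io
import pandas as pd

# Hypothetical layout: character offsets into the note plus a chunk type per row.
csv_text = """start,end,type
0,27,Sentence
28,60,Unstructured
61,95,Sentence
"""
ann = pd.read_csv(io.StringIO(csv_text))

# Keep only true sentences, e.g. when preparing boundary-detection training data.
sentences = ann[ann["type"] == "Sentence"]
print(len(sentences))  # → 2
```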
The evaluation metric for this task is adapted from the 2018 Universal Dependencies parsing shared task: predicted sentence boundaries are compared against the gold annotations. evaluation.py is our evaluation script.
The script requires three command-line arguments:
- --gold_dir: Directory containing gold annotation CSV files
- --pred_dir: Directory containing prediction CSV files
- --output_dir: Directory where evaluation scores will be saved
```
python evaluation.py --gold_dir data/biocreative/sample/annotation --pred_dir path/to/your/predictions --output_dir path/to/your/scores
```
The script outputs a formatted table showing:
- Individual file performance
- Macro-average metrics (precision, recall, F1)
- Micro-average metrics (precision, recall, F1)
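In essence, boundary evaluation reduces to comparing predicted spans against gold spans. The simplified sketch below assumes exact-match (start, end) spans and is only an approximation; the official evaluation.py is authoritative.

```python
def prf(gold_spans, pred_spans):
    """Micro precision/recall/F1 over exact-match (start, end) spans of one file."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # spans predicted with exactly the right boundaries
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_average(per_file_scores):
    """Macro average: unweighted mean of per-file (p, r, f) tuples."""
    n = len(per_file_scores)
    return tuple(sum(s[i] for s in per_file_scores) / n for i in range(3))
```

For micro-averaging across files, the true-positive, prediction, and gold counts would instead be pooled over all files before computing precision and recall.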
Please submit your system predictions at Codebench.
The baseline system is available at baseline/SliderSplit.py.