Authors: Ethan Rodgers‑Gates, Brian Ma, Damien Liebov
Course: CMSC395: Natural Language Processing, University of Richmond
This repository implements a hybrid GPT‑BERT model (BabyLM 10M), unifying causal and masked language modeling in a single transformer. We apply curriculum learning on the Baby‑cosmo‑fine‑10M dataset and evaluate on BLiMP, GLUE and EWOK benchmarks.
- Pretrained Model Name: ltg/gpt-bert-babylm-small
- Hugging Face URL: https://huggingface.co/ltg/gpt-bert-babylm-small
- Baseline Implementation: Cloned from ltgoslo/gpt-bert and extended with our modifications.
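As a rough mental model of the hybrid objective (this is an illustrative simplification written for this README, not the ltgoslo/gpt-bert code, and every name below is made up): a single transformer is run with a causal attention mask for GPT-style next-token prediction, and with full bidirectional attention for BERT-style masked-token prediction, sharing one set of weights and one prediction head.

```python
# Illustrative sketch only (not the ltgoslo/gpt-bert implementation):
# one shared transformer that can run with either a causal or a bidirectional mask.
import torch
import torch.nn as nn

class TinyHybridLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # shared head for both objectives

    def forward(self, ids, causal):
        # causal=True  -> GPT-style: autoregressive mask, predict the next token
        # causal=False -> BERT-style: bidirectional attention, predict masked tokens
        seq_len = ids.size(1)
        x = self.embed(ids) + self.pos(torch.arange(seq_len, device=ids.device))
        attn_mask = None
        if causal:
            attn_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(ids.device)
        h = self.encoder(x, mask=attn_mask)
        return self.lm_head(h)

model = TinyHybridLM()
ids = torch.randint(0, 1000, (2, 16))
next_token_logits = model(ids, causal=True)    # GPT-style pass
masked_token_logits = model(ids, causal=False) # BERT-style pass
```

In GPT-BERT itself, the two objectives are mixed during pretraining so that both update the same weights (see the citation at the end of this README).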
Information about the data, including download links, can be found in the ReadMe.md files in the corresponding directories. Notes on the downloads:
- Checkpoints.zip contains the baseline and modified checkpoint files.
- To run evaluation on these checkpoints, they must first be converted to HF models (instructions in run.slurm).
- The converted HF models are already uploaded to the GitHub repository (saved in hf_format_models) and do not need to be downloaded separately.
Once downloaded, unzip the files into their appropriate subfolders.
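After converting a checkpoint to HF format, a quick way to sanity-check it is to load it with transformers. The snippet below is a hedged sketch: the hf_format_models/baseline path is a placeholder subfolder name, and trust_remote_code=True is assumed to be needed because GPT-BERT models ship custom modeling code.

```python
# Rough sketch: verify that a converted checkpoint loads from hf_format_models/.
# "baseline" is a hypothetical subfolder name; replace it with the actual one.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_dir = "hf_format_models/baseline"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir, trust_remote_code=True)
print(model.config)
```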
How to execute the run.slurm script:
- Make sure it is executable: chmod +x run.slurm
- Submit the job to SLURM: sbatch /home/dl5ja/shared_cfinegan/group4_work/395SharedTask/run.slurm
- Monitor the job: squeue
run.slurm will:
- Create the Tokenizer
- Tokenize the corpus (splitting into test/train)
- Pretrain the GPT‑BERT model
- Evaluate on BLiMP and EWOK
All logs and outputs appear in slurm-<jobid>.out
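For intuition about the tokenizer-creation step listed above, a new tokenizer could be trained roughly as in the sketch below. The actual logic lives in the scripts that run.slurm calls; the corpus path, vocabulary size, and special tokens here are assumptions.

```python
# Hedged sketch of tokenizer creation with the `tokenizers` library; the real
# scripts invoked by run.slurm may differ. Paths and vocab size are assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=16384,  # assumed value
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["data/train.txt"], trainer=trainer)  # hypothetical corpus path
tokenizer.save("tokenizer.json")
```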
Once training and evaluation finish, you can load and test your model:
- Testing file for masked language modeling: MLM Testing
- Testing file for text generation: Text Generation Testing
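As a rough sketch of what those two tests do (the actual testing files in this repo may differ; the model path, prompts, mask token, and availability of both masked-LM and causal-LM heads are assumptions):

```python
# Minimal sketch of the two tests; the repo's testing files may differ.
# Model path, prompts, and head availability are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForCausalLM

model_dir = "hf_format_models/baseline"  # hypothetical path
tok = AutoTokenizer.from_pretrained(model_dir)

# Masked language modeling: predict the most likely token at the mask position.
# Assumes the tokenizer defines "[MASK]" as its mask token.
mlm = AutoModelForMaskedLM.from_pretrained(model_dir, trust_remote_code=True)
inputs = tok("The cat sat on the [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits
mask_pos = (inputs["input_ids"] == tok.mask_token_id).nonzero(as_tuple=True)[1]
print(tok.decode(logits[0, mask_pos].argmax(-1)))

# Text generation: continue a prompt autoregressively.
clm = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
prompt = tok("Once upon a time", return_tensors="pt")
out = clm.generate(**prompt, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```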
Citation: Lucas Georges Gabriel Charpentier and David Samuel. 2024. BERT or GPT: why not both?. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 262–283, Miami, FL, USA. Association for Computational Linguistics.