Authors: Ethan Rodgers‑Gates, Brian Ma, Damien Liebov
Course: CMSC395: Natural Language Processing, University of Richmond
This repository implements a hybrid GPT‑BERT model (BabyLM 10M), unifying causal and masked language modeling in a single transformer. We apply curriculum learning on the Baby‑cosmo‑fine‑10M dataset and evaluate on BLiMP, GLUE and EWOK benchmarks.
- Pretrained Model Name: ltg/gpt-bert-babylm-small
- Hugging Face URL: https://huggingface.co/ltg/gpt-bert-babylm-small
- Baseline Implementation: Cloned from ltgoslo/gpt-bert and extended with our modifications.
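As a rough mental model of the hybrid objective (this is an illustrative simplification written for this README, not the ltgoslo/gpt-bert code, and every name below is made up): a single transformer is run with a causal attention mask for GPT-style next-token prediction, and with full bidirectional attention for BERT-style masked-token prediction, sharing one set of weights and one prediction head.

```python
# Illustrative sketch only (not the ltgoslo/gpt-bert implementation):
# one shared transformer that can run with either a causal or a bidirectional mask.
import torch
import torch.nn as nn

class TinyHybridLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # shared head for both objectives

    def forward(self, ids, causal):
        # causal=True  -> GPT-style: autoregressive mask, predict the next token
        # causal=False -> BERT-style: bidirectional attention, predict masked tokens
        seq_len = ids.size(1)
        x = self.embed(ids) + self.pos(torch.arange(seq_len, device=ids.device))
        attn_mask = None
        if causal:
            attn_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(ids.device)
        h = self.encoder(x, mask=attn_mask)
        return self.lm_head(h)

model = TinyHybridLM()
ids = torch.randint(0, 1000, (2, 16))
next_token_logits = model(ids, causal=True)    # GPT-style pass
masked_token_logits = model(ids, causal=False) # BERT-style pass
```

In GPT-BERT itself, the two objectives are mixed during pretraining so that both update the same weights (see the citation at the end of this README).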
Information about the data, including download links, can be found in the ReadMe.md files in the corresponding directories. Notes on the downloads:
- Checkpoints.zip contains the baseline and modified checkpoint files.
- To run evaluation on these checkpoints, they must first be converted to HF models (instructions in run.slurm).
- The converted HF models are already uploaded to the GitHub repository (saved in hf_format_models) and do not need to be downloaded separately.
Once downloaded, unzip the files into their appropriate subfolders.
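After converting a checkpoint to HF format, a quick way to sanity-check it is to load it with transformers. The snippet below is a hedged sketch: the hf_format_models/baseline path is a placeholder subfolder name, and trust_remote_code=True is assumed to be needed because GPT-BERT models ship custom modeling code.

```python
# Rough sketch: verify that a converted checkpoint loads from hf_format_models/.
# "baseline" is a hypothetical subfolder name; replace it with the actual one.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_dir = "hf_format_models/baseline"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir, trust_remote_code=True)
print(model.config)
```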
How to execute the run.slurm script:
- Make sure it is executable: chmod +x run.slurm
- Submit the job to SLURM: sbatch /home/dl5ja/shared_cfinegan/group4_work/395SharedTask/run.slurm
- Monitor the job: squeue
run.slurm will:
- Create the Tokenizer
- Tokenize the corpus (splitting into test/train)
- Pretrain the GPT‑BERT model
- Evaluate on BLiMP and EWOK
All logs and outputs appear in slurm-<jobid>.out
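For intuition about the tokenizer-creation step listed above, a new tokenizer could be trained roughly as in the sketch below. The actual logic lives in the scripts that run.slurm calls; the corpus path, vocabulary size, and special tokens here are assumptions.

```python
# Hedged sketch of tokenizer creation with the `tokenizers` library; the real
# scripts invoked by run.slurm may differ. Paths and vocab size are assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=16384,  # assumed value
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["data/train.txt"], trainer=trainer)  # hypothetical corpus path
tokenizer.save("tokenizer.json")
```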
Once training and evaluation finish, you can load and test your model:
- Testing file for masked language modeling: MLM Testing
- Testing file for text generation: Text Generation Testing
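As a rough sketch of what those two tests do (the actual testing files in this repo may differ; the model path, prompts, mask token, and availability of both masked-LM and causal-LM heads are assumptions):

```python
# Minimal sketch of the two tests; the repo's testing files may differ.
# Model path, prompts, and head availability are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForCausalLM

model_dir = "hf_format_models/baseline"  # hypothetical path
tok = AutoTokenizer.from_pretrained(model_dir)

# Masked language modeling: predict the most likely token at the mask position.
# Assumes the tokenizer defines "[MASK]" as its mask token.
mlm = AutoModelForMaskedLM.from_pretrained(model_dir, trust_remote_code=True)
inputs = tok("The cat sat on the [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits
mask_pos = (inputs["input_ids"] == tok.mask_token_id).nonzero(as_tuple=True)[1]
print(tok.decode(logits[0, mask_pos].argmax(-1)))

# Text generation: continue a prompt autoregressively.
clm = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
prompt = tok("Once upon a time", return_tensors="pt")
out = clm.generate(**prompt, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```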
Citation: Lucas Georges Gabriel Charpentier and David Samuel. 2024. BERT or GPT: why not both?. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 262–283, Miami, FL, USA. Association for Computational Linguistics.