This assignment explores fine-tuning transformer models for code completion: specifically, fine-tuning CodeT5-small (approximately 60 million parameters) to predict missing if conditions in Python functions.
The model takes as input a function in which a single if condition has been replaced by a special mask token and attempts to predict the masked condition. Our work included preparing a dataset of 50,000 training samples by masking the if conditions and flattening each function to a single line, tokenizing the input with a pre-trained tokenizer, and training the model on this data.
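The masking-and-flattening step can be sketched in plain Python. This is an illustrative assumption, not the assignment's exact preprocessing code: the helper name `mask_if_condition` is hypothetical, and `<extra_id_0>` is CodeT5's first sentinel token, a plausible choice of mask.

```python
# Sketch of the dataset-preparation step: replace the first `if`
# condition in a function with a mask token and flatten the source
# to a single line. Helper name and token choice are assumptions.

MASK_TOKEN = "<extra_id_0>"  # CodeT5's first sentinel token

def mask_if_condition(source: str) -> tuple[str, str]:
    """Mask the first `if` condition; return (masked input, target)."""
    condition = ""
    lines = source.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("if ") and stripped.endswith(":"):
            condition = stripped[len("if "):-1]            # ground-truth label
            lines[i] = line.replace(condition, MASK_TOKEN, 1)
            break
    # Flatten: join non-empty lines into a single whitespace-separated line.
    flattened = " ".join(l.strip() for l in lines if l.strip())
    return flattened, condition

src = """def clamp(x, lo, hi):
    if x < lo:
        return lo
    return min(x, hi)
"""
masked, target = mask_if_condition(src)
print(masked)   # def clamp(x, lo, hi): if <extra_id_0>: return lo return min(x, hi)
print(target)   # x < lo
```

A production version would use an AST or tokenizer-based parser rather than line matching, since real functions can contain `elif` branches, multi-line conditions, and nested blocks.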
This project is implemented in Python 3.9+ and is compatible with macOS, Linux, and Windows.
- Clone the repository to your workspace:
git clone https://github.com/jdkuffa/cs420-assignment2.git
- Navigate into the repository:
cd cs420-assignment2
- Set up a virtual environment and activate it. On macOS/Linux:
python -m venv ./venv/
source venv/bin/activate
- On Windows, install virtualenv first:
pip install virtualenv
- Create a virtual environment:
python -m virtualenv venv
- Activate the environment:
venv\Scripts\activate
The name of your virtual environment should now appear in parentheses at the start of your command prompt.
To deactivate the virtual environment, use the command:
deactivate
Install the required dependencies:
pip install -r requirements.txt
Clone the CodeXGLUE repository. This is needed to calculate the CodeBLEU score.
git clone -q https://github.com/microsoft/CodeXGLUE.git
- Run main.py.
This program fine-tunes the pre-trained CodeT5-small Transformer model from Hugging Face to automatically recommend suitable if conditions in Python functions. After preparing the dataset by masking the if conditions and tokenizing the input with Hugging Face's pre-trained AutoTokenizer, the model is fine-tuned on this data and evaluated with multiple metrics: exact match, BLEU, and CodeBLEU.
python main.py
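Of the metrics above, exact match is the simplest: the fraction of predicted conditions that equal the reference after whitespace normalization. A minimal sketch (an illustrative assumption, not the code in main.py):

```python
# Exact-match metric: share of predictions identical to the
# reference condition once runs of whitespace are collapsed.

def exact_match(predictions, references):
    normalize = lambda s: " ".join(s.split())
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["x < lo", "x==0", "len(items) > 0"]
refs  = ["x < lo", "x == 0", "len(items) > 0"]
print(exact_match(preds, refs))  # 2 of 3 match after normalization
```

BLEU and CodeBLEU relax this all-or-nothing criterion by giving partial credit for n-gram overlap (CodeBLEU additionally weights syntax and data-flow matches), which is why the assignment reports all three.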
The assignment report is available in the file "Assignment_Report.pdf".