This assignment explores fine-tuning transformer models for code completion: specifically, fine-tuning CodeT5-small (approximately 60 million parameters) to predict missing if conditions in Python functions.
The model takes as input a function in which a single if condition has been replaced by a special mask token and attempts to predict the masked condition. Our work included preparing a dataset of 50,000 training samples by masking the if conditions and flattening each function to a single line, tokenizing the input with a pre-trained tokenizer, and training the model on this data.
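The masking-and-flattening step can be sketched in plain Python. This is an illustrative assumption, not the assignment's exact preprocessing code: the helper name `mask_if_condition` is hypothetical, and `<extra_id_0>` is CodeT5's first sentinel token, a plausible choice of mask.

```python
# Sketch of the dataset-preparation step: replace the first `if`
# condition in a function with a mask token and flatten the source
# to a single line. Helper name and token choice are assumptions.

MASK_TOKEN = "<extra_id_0>"  # CodeT5's first sentinel token

def mask_if_condition(source: str) -> tuple[str, str]:
    """Mask the first `if` condition; return (masked input, target)."""
    condition = ""
    lines = source.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("if ") and stripped.endswith(":"):
            condition = stripped[len("if "):-1]            # ground-truth label
            lines[i] = line.replace(condition, MASK_TOKEN, 1)
            break
    # Flatten: join non-empty lines into a single whitespace-separated line.
    flattened = " ".join(l.strip() for l in lines if l.strip())
    return flattened, condition

src = """def clamp(x, lo, hi):
    if x < lo:
        return lo
    return min(x, hi)
"""
masked, target = mask_if_condition(src)
print(masked)   # def clamp(x, lo, hi): if <extra_id_0>: return lo return min(x, hi)
print(target)   # x < lo
```

A production version would use an AST or tokenizer-based parser rather than line matching, since real functions can contain `elif` branches, multi-line conditions, and nested blocks.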
This project is implemented in Python 3.9+ and is compatible with macOS, Linux, and Windows.
- Clone the repository to your workspace:
git clone https://github.com/jdkuffa/cs420-assignment2.git
- Navigate into the repository:
cd cs420-assignment2
- Set up a virtual environment and activate it. On macOS/Linux:
python -m venv ./venv/
source venv/bin/activate
- On Windows, install virtualenv first:
pip install virtualenv
- Create a virtual environment:
python -m virtualenv venv
- Activate the environment:
venv\Scripts\activate
The name of your virtual environment should now appear in parentheses at the start of your command prompt.
To deactivate the virtual environment, use the command:
deactivate
Install the required dependencies:
pip install -r requirements.txt
Clone the CodeXGLUE repository. This is needed to calculate the CodeBLEU score.
git clone -q https://github.com/microsoft/CodeXGLUE.git
- Run main.py.
This program fine-tunes the pre-trained CodeT5-small Transformer model from Hugging Face to automatically recommend suitable if conditions in Python functions. After preparing the dataset by masking the if conditions and tokenizing the input with Hugging Face's pre-trained AutoTokenizer, the model is fine-tuned on this data and evaluated with multiple metrics: exact match, BLEU, and CodeBLEU.
python main.py
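Of the metrics above, exact match is the simplest: the fraction of predicted conditions that equal the reference after whitespace normalization. A minimal sketch (an illustrative assumption, not the code in main.py):

```python
# Exact-match metric: share of predictions identical to the
# reference condition once runs of whitespace are collapsed.

def exact_match(predictions, references):
    normalize = lambda s: " ".join(s.split())
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["x < lo", "x==0", "len(items) > 0"]
refs  = ["x < lo", "x == 0", "len(items) > 0"]
print(exact_match(preds, refs))  # 2 of 3 match after normalization
```

BLEU and CodeBLEU relax this all-or-nothing criterion by giving partial credit for n-gram overlap (CodeBLEU additionally weights syntax and data-flow matches), which is why the assignment reports all three.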
The assignment report is available in the file "Assignment_Report.pdf".