This project analyzes various prompting strategies using two large language models: GPT-4 and Mistral Large. To evaluate the quality of the outputs, a third model (Llama 4 Maverick) is used as a judge, providing comparative assessments and insights into model behavior and performance.
This project is implemented in Python 3.9+ and is compatible with macOS, Linux, and Windows.
- Clone the repository to your workspace:

  ```shell
  git clone https://github.com/jdkuffa/cs420-assignment3.git
  ```

- Navigate into the repository:

  ```shell
  cd cs420-assignment3
  ```

- Set up a virtual environment and activate it (macOS/Linux):

  ```shell
  python -m venv venv
  source venv/bin/activate
  ```

- On Windows, install virtualenv:

  ```shell
  pip install virtualenv
  ```

- Create a virtual environment:

  ```shell
  python -m virtualenv venv
  ```

- Activate the environment:

  ```shell
  venv\Scripts\activate
  ```

  The name of your virtual environment should now appear within parentheses just before your prompt. To deactivate the virtual environment, use the command:

  ```shell
  deactivate
  ```

- Install the required dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Add a GitHub token:

  Create a file named token.txt and go to this link to make a fine-grained PAT to add to this file. You can do this manually or from the command line as shown below (feel free to replace nano with your preferred editor):

  ```shell
  touch token.txt && nano token.txt
  ```
- Run data_automation.py to process the incoming data.csv file containing prompts and problems:

  ```shell
  python3 data_automation.py
  ```
- Run judge_model.py to add a column to the output database containing the judge model's analyses:

  ```shell
  python3 judge_model.py
  ```
- Run evaluation_metrics.py to add a column to the output database containing the exact match, BLEU, or embedding-based similarity scores:

  ```shell
  python3 evaluation_metrics.py
  ```

The assignment report is available in the root directory, labelled as "analysis-report.pdf".
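To illustrate what these scores measure, here is a dependency-free sketch of exact match and a simplified unigram BLEU with a brevity penalty. The actual implementation in evaluation_metrics.py may differ (e.g. it may use a library BLEU with higher-order n-grams):

```python
import math
from collections import Counter


def exact_match(pred, ref):
    """1 if prediction and reference are identical after trimming, else 0."""
    return int(pred.strip() == ref.strip())


def bleu1(pred, ref):
    """Unigram BLEU: clipped token precision times a brevity penalty."""
    pred_toks, ref_toks = pred.split(), ref.split()
    if not pred_toks:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((Counter(pred_toks) & Counter(ref_toks)).values())
    precision = overlap / len(pred_toks)
    # Penalize predictions shorter than the reference.
    if len(pred_toks) >= len(ref_toks):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref_toks) / len(pred_toks))
    return brevity * precision
```

Embedding-based similarity would instead encode both strings with a sentence-embedding model and take the cosine of the two vectors, which requires an external model and is omitted here.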
For the extra credit, we included Llama 4 Maverick 17B 128E Instruct FP8 as a judge model.
We wrote the judge_model.py script to write the model's resulting comparison and analysis to output_db.csv under the column "Output Model 3: Meta Llama 4 Maverick."
The metrics have also been added to the analyses sections of the report under each task's table for each prompt.