andrewkchan/pbf-bench


This repository contains the static leaderboard webpage and the source code to generate and evaluate the PBF Comics Benchmark. The benchmark evaluates AI models' visual understanding and comic-explanation ability on a dataset of 285 comics from Nicholas Gurewitch's Perry Bible Fellowship (https://pbfcomics.com/). These comics are interesting because:

Methodology

Ground truth labels

Each comic was assigned a ground-truth label (an explanation of the comic) in two steps. First, candidate labels were generated from GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash with the prompt "Explain this comic. Describe what's happening and explain the humor or message." using the generate_explanations.py script. Then a human labeler either selected the best candidate or wrote their own label using the webapp in labeling_app.py. This was a labor-saving measure; most of the final labels ended up with moderate to significant human modifications on top of the candidates.
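The candidate-generation step can be sketched as follows. This is a minimal illustration, not the actual contents of generate_explanations.py; the `ask_model` callback is a hypothetical shim standing in for the per-provider API calls.

```python
# Sketch of candidate-label generation (illustrative; the real logic
# lives in generate_explanations.py).
PROMPT = ("Explain this comic. Describe what's happening and explain "
          "the humor or message.")

# The three candidate-label models named in the methodology.
CANDIDATE_MODELS = ["gpt-4o", "claude-3.5-sonnet", "gemini-2.0-flash"]

def generate_candidates(comic_ids, ask_model):
    """ask_model(model, comic_id, prompt) -> str is a hypothetical
    callback wrapping the provider-specific vision API call."""
    candidates = {}
    for comic_id in comic_ids:
        candidates[comic_id] = {
            model: ask_model(model, comic_id, PROMPT)
            for model in CANDIDATE_MODELS
        }
    return candidates

# A human labeler then reviews candidates[comic_id] in labeling_app.py
# and selects the best candidate or rewrites it into the final label.
```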

Benchmark scores

The benchmark tasks models with explaining each comic using the same prompt as above. Explanations are collected for all 285 comics. Claude 4 Opus then acts as a judge: given the predicted explanation, the ground-truth explanation, and the comic image, it assigns scores (1-10) on:

  1. Accuracy (weight 40%): Does it correctly identify what's happening?
  2. Completeness (weight 25%): Does it cover all important visual elements?
  3. Insight (weight 25%): Does it understand the humor or message?
  4. Clarity (weight 10%): Is it well-written and easy to understand?

See judge.py for more. The overall score is a weighted sum of these individual scores.
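The weighted sum described above can be sketched as below; the weights match the four criteria listed, though the exact function names in judge.py may differ.

```python
# Overall score as a weighted sum of the four judge sub-scores (each 1-10).
# Weights follow the criteria above: accuracy 40%, completeness 25%,
# insight 25%, clarity 10%.
WEIGHTS = {"accuracy": 0.40, "completeness": 0.25, "insight": 0.25, "clarity": 0.10}

def overall_score(scores: dict[str, float]) -> float:
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Example: strong on accuracy and clarity, weaker on insight.
score = overall_score({"accuracy": 9, "completeness": 8, "insight": 6, "clarity": 9})
```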

Running the benchmark

To run the benchmark and update the leaderboard with new results, create a .env file in the root directory containing your API keys, following the .env.example template, then:

# Run new benchmark, updating models in models_config.yaml if needed
python run_benchmark.py
# Generate leaderboard static website
python generate_leaderboard.py

# Commit and push the updated leaderboard
git add docs/index.html
git commit -m "Update leaderboard with latest results"
git push
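For reference, the .env file holds the provider API keys. The authoritative variable names are in .env.example; the names below are illustrative assumptions covering the three providers mentioned in the methodology.

```shell
# .env (illustrative -- copy the real variable names from .env.example)
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
```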

To run and update for just a subset of models, use the --models argument:

python run_benchmark.py --models gemini-2.5-flash

About

Visual understanding and comic explanation benchmark for LLMs
