GitHub - andrewkchan/pbf-bench: Visual understanding and comic explanation benchmark for LLMs

This repository contains the static leaderboard webpage and the source code to generate and evaluate the PBF Comics Benchmark. The goal of this benchmark is to evaluate AI model visual understanding and comic explanation using a dataset of 285 comics from Nicholas Gurewitch's Perry Bible Fellowship comics (https://pbfcomics.com/). These comics are interesting because:

They're highly visual, unlike other popular comics such as XKCD or Saturday Morning Breakfast Cereal, which derive much of their humor from captions. Many comics in the series do not have any words at all.
There is a diversity of styles, entities, and situations in the series. While the early PBF comics feature mostly human characters in a similar cartoony style, the series as a whole mixes black-and-white, watercolor, cartoon, and realistic visual styles, and features animal, human, anthropomorphic, and abstract characters.
The themes are complex and range from slapstick humor to subversions of common tropes in pop culture. Sometimes the humor comes from what's left out of the picture and sometimes it comes from how the visual styles of the picture are creatively mixed. Unlike other series with existing datasets, such as "Yes, But", there is no single format that the comics follow. Some comics are just a single panel while others have dozens.
They are not yet perfectly understood by AI. As of mid-2025, while the series is well-known and has been on the internet long enough that frontier models know about it, individual comics do not have enough material about them online for their themes or messages to be memorized by models. The crude comic humor is also not yet well understood by AI models (as also seen in Hu 2024).

Methodology

Ground truth labels

Each comic was assigned a ground truth label (an explanation of the comic) in two steps. First, candidate labels were generated by GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash with the prompt "Explain this comic. Describe what's happening and explain the humor or message." using the generate_explanations.py script. Then a human labeler either selected the best candidate label or wrote their own label using the webapp found in labeling_app.py. Generating candidates was a labor-saving measure; most of the final labels ended up with moderate to significant human modifications on top of the candidates.

Benchmark scores

The benchmark tasks models with explaining each comic via the same prompt as above. Explanations are collected for all 285 comics. Claude 4 Opus is then used as a judge to grade the predicted explanation for a given comic; it is given the predicted explanation, the ground truth explanation, and the comic image, and gives scores (1-10) on:

Accuracy (weight 40%): Does it correctly identify what's happening?
Completeness (weight 25%): Does it cover all important visual elements?
Insight (weight 25%): Does it understand the humor or message?
Clarity (weight 10%): Is it well-written and easy to understand?

See judge.py for more. The overall score is a weighted sum of these individual scores.
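As a rough illustration of the weighted sum, here is a minimal sketch using the weights listed above. It is not taken from judge.py: the names WEIGHTS and overall_score, the dictionary keys, and the example scores are all assumptions made for illustration.

```python
# Hypothetical sketch of the weighted sum described above.
# Weights come from the criteria list; all names here are illustrative
# and are not claimed to match judge.py.
WEIGHTS = {
    "accuracy": 0.40,
    "completeness": 0.25,
    "insight": 0.25,
    "clarity": 0.10,
}


def overall_score(scores: dict[str, float]) -> float:
    """Return the weighted sum of the four criterion scores (each 1-10)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 10:
            raise ValueError(f"{name} score must be between 1 and 10")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores for a single comic explanation.
example = {"accuracy": 8, "completeness": 6, "insight": 7, "clarity": 9}
print(round(overall_score(example), 2))  # 0.4*8 + 0.25*6 + 0.25*7 + 0.1*9 = 7.35
```

Because the weights sum to 1, the overall score stays on the same 1-10 scale as the individual criteria.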

Running the benchmark

To run the benchmark and update the leaderboard with new results, copy the .env.example template to a .env file in the root directory and fill in your API keys, then:

# Run new benchmark, updating models in models_config.yaml if needed
python run_benchmark.py
# Generate leaderboard static website
python generate_leaderboard.py

# Commit and push the updated leaderboard
git add docs/index.html
git commit -m "Update leaderboard with latest results"
git push

To run and update for just a subset of models, use the --models argument:

python run_benchmark.py --models gemini-2.5-flash

