
WebGames

This is a collection of challenges for general-purpose web-browsing AI agents.

They're designed to be:

  • easy for humans to complete
  • hard for AI agents to complete
  • fast and simple to run
    • just client-side state and a single-page JavaScript app
  • easy to evaluate
    • each task provides a unique password on successful completion

Read the announcement blog post on the Convergence website: https://convergence.ai/introducing-webgames/

Try it now

🎮 webgames.convergence.ai

Run locally

```sh
cd webgames
pnpm install
pnpm run dev
```

Download tasks

Tasks are available as a dataset on Hugging Face.

Alternatively, you can download them from the webgames website:

  1. Go to webgames.convergence.ai?showDownloads=true
  2. Click the download buttons in the top-right corner (csv or jsonl available)
  3. Verify your agent's solutions against the `solution` field in `messages[-1]` (or equivalent), or use the Inspect AI eval scaffolding in the `eval` folder.
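The verification step above can be sketched in a few lines of Python. This is a minimal sketch, not the project's own tooling: it assumes the jsonl export has one task per line and that the completion password lives under a `solution` key in the last entry of `messages` — adjust the keys to match the actual export schema.

```python
import json


def load_tasks(jsonl_path):
    """Parse a jsonl task export: one JSON object per non-empty line."""
    with open(jsonl_path) as f:
        return [json.loads(line) for line in f if line.strip()]


def is_solved(task, agent_output):
    """Check whether the agent's transcript contains the task's password.

    Assumed schema: task["messages"][-1]["solution"] holds the unique
    password shown on successful completion.
    """
    password = task["messages"][-1]["solution"]
    return password in agent_output
```

Because each task emits a unique password, a plain substring check over the agent's final output is enough to score a run without replaying the browser session.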

Results

We tested 10 large vision-language AI agents on 150 interactive web challenges across 5 capability domains.

Key Findings

🎯 Performance Gap: Best AI model (gemini-2.5-pro-preview-05-06) achieved ~50% success vs 95.7% human success on base tasks.

Overall Success Rates

Performance by Category

(Figure: radar chart of per-category success rates)

  • Technical Fluency: Mixed results. Models struggled with right-clicking, horizontal scrolling, and file operations
  • Real-Time Responsiveness: Challenging due to latency. Gemini models performed better with multi-action planning
  • Adversarial Resistance: Varied responses to deception/instruction conflicts. Claude 3.7 showed strong robustness
  • Cognitive Abilities: Success correlated with model size. High variance (Qwen-7B: very low, Gemini Pro: 43.8%)
  • Visual Comprehension: Worst category. Pixel-copy tasks largely unsolved

Difficulty Impact

(Figure: success rate vs task difficulty)

  • 14% drop from easy → base difficulty
  • 15% drop from base → hard difficulty

Notable Insights

  • 61/150 tasks unsolved by any model
  • 11/53 task families completely unsolved
  • Game-like tasks combining vision + real-time response were particularly difficult
  • Text-only vs Vision+Text: 15% performance drop without vision (but sometimes text-only performed better on reasoning tasks!)
  • Model Scaling: Qwen2.5-VL showed significant gains from 7B→32B (12.4%→25%), minimal gains 32B→72B

Full details in the WebGames paper and analysis/outputs2/ directory.
