Extensible Python-first benchmark comparing VLMs (CLIP-style and LLaVA-style) to children's behavioral data from LEVANTE. R is used for downloading trials (Redivis) and for statistical comparison (DevBench-style metrics); Python is used for config, data loaders, model adapters, and the evaluation runner.
```bash
python3 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e .            # install this package
# Optional: pip install -r requirements-transformers.txt  # for CLIP
```

Pinned dependencies: `requirements.txt`. Dev dependencies: `requirements-dev.txt`.
- Data (R): Install R and the `redivis` package, then run `Rscript scripts/download_levante_data.R` to fetch trials into `data/raw/<version>/`.
- Assets (Python): Run `python scripts/download_levante_assets.py [--version YYYY-MM-DD]` to download the corpus and images from the public LEVANTE bucket into `data/assets/<version>/`.
- Evaluate: Run `levante-bench list-tasks` and `levante-bench list-models` to see what is available, then `levante-bench run-eval --task egma-math --model clip_base [--version VERSION]`.
- Compare (R): Run `levante-bench run-comparison --task egma-math --model clip_base`, or run `Rscript comparison/compare_levante.R --task TASK --model MODEL` directly.
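The comparison step produces DevBench-style metrics in R. As a rough illustration of what that kind of model-to-child comparison involves (not the actual implementation, and with made-up function names and toy numbers), one common approach is to compare the model's per-trial choice distribution to children's response distribution, e.g. via a mean KL divergence:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions over the same answer options."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_trial_kl(child_dists, model_dists):
    """Average per-trial divergence between child and model choice distributions."""
    pairs = list(zip(child_dists, model_dists))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

# Hypothetical 2-trial task with 3 answer options per trial.
children = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
model = [[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]
print(mean_trial_kl(children, model))  # lower = closer match to children
```

A lower score means the model's answer preferences track children's more closely; the actual metrics used here are the R implementation's, documented with the comparison scripts.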
See docs/README.md for data schema, releases, adding tasks/models, and secrets setup.
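To give a feel for what adding a model involves, here is a minimal sketch of a model adapter. The real interface lives in this package; the `Trial` shape, class name, and `score` method below are assumptions for illustration only, with a token-overlap stub in place of an actual CLIP checkpoint:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trial:
    """One benchmark trial: a prompt and its candidate images (IDs or paths)."""
    prompt: str
    image_ids: List[str]

class DummyClipAdapter:
    """Illustrative adapter: a real adapter would load a CLIP model and
    return image-text similarity scores instead of this stub."""

    def score(self, trial: Trial) -> List[float]:
        # Stub: score each candidate by crude prompt/image-id token overlap.
        prompt_tokens = set(trial.prompt.lower().split())
        return [
            len(prompt_tokens & set(img.lower().replace("_", " ").split()))
            for img in trial.image_ids
        ]

trial = Trial(prompt="the red ball", image_ids=["red_ball", "blue_cup"])
adapter = DummyClipAdapter()
print(adapter.score(trial))  # → [2, 0]: first image matches the prompt best
```

The evaluation runner would then call `score` per trial and aggregate accuracy against the answer key; see docs/README.md for the actual adapter registration steps.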
Cite the LEVANTE manuscript and the DevBench (NeurIPS 2024) paper when using this benchmark.