Minimal agentic framework for iterative code improvement using a tree-based search with draft → debug → improve operations.
AIDE/
├── aide/                                # Core agent
│   ├── agent.py                         # AIDE agent and tree search
│   └── backend.py                       # LiteLLM backend integration
├── benchmarks/                          # Benchmark integrations
│   ├── base.py                          # Common interface for all benchmarks
│   ├── ale_bench/                       # ALE-Bench integration
│   │   └── adapter.py                   # ALE-Bench <-> AIDE glue
│   ├── mle_bench/                       # MLE-Bench integration (Kaggle competitions)
│   │   ├── adapter.py                   # MLE-Bench <-> AIDE glue
│   │   └── splits/                      # Train/val data splitting strategies (per-competition)
│   └── rsi_bench/                       # RSI-Bench integration
│       ├── adapter.py                   # RSI-Bench <-> AIDE glue
│       ├── task_configs.py              # Task configurations
│       └── rsi-bench/                   # RSI-Bench submodule
├── experiments/                         # Experiment runners
│   ├── cli.py                           # Unified CLI entry point
│   ├── schema.py                        # Experiment metadata schemas
│   ├── device_scheduler.py              # Queue-based GPU scheduling (+ optional CPU pinning)
│   ├── ale_bench/                       # ALE-Bench experiments
│   │   ├── config.py                    # ALE paper-matching configs
│   │   └── run_lite_aide.py             # Run AIDE on ALE-Bench LITE
│   ├── mle_bench/                       # MLE-Bench experiments
│   │   ├── config.py                    # MLE-Bench default configs
│   │   └── run_aide.py                  # Run AIDE on Kaggle competitions
│   └── rsi_bench/                       # RSI-Bench experiments
│       ├── config.py                    # RSI-Bench configs with per-task steps
│       └── run_rsi_aide.py              # Run AIDE on RSI-Bench tasks
├── scripts/                             # Analysis & utility scripts
│   ├── generate_results_table.py        # Generate results summary
│   ├── generate_performance_curve.py    # Plot performance curves
│   ├── visualize_embedding_tree.py      # Embed & visualize solution tree
│   └── migrate_rsi_eval_metadata.py     # Migrate old RSI-Bench experiments to include eval costs
├── viz/                                 # Visualization library
│   ├── run_model.py                     # Run log data model
│   └── viz_run.py                       # Plotting utilities
├── tests/                               # Tests
├── examples/                            # Usage examples (thin wrappers)
├── .github/
│   └── workflows/                       # GitHub Actions workflows (CI, linting, etc.)
│       └── lint.yml                     # Lint & format code via Ruff on push to main
├── README.md
└── pyproject.toml
# Clone with submodules (for RSI-Bench support)
git clone --recurse-submodules git@github.com:WecoAI/aide2.git
cd aide2
# Or if already cloned, initialize submodules
git submodule update --init --recursive
uv sync
which python # make sure that path is something like <something>/aide2/.venv/bin/python
export LITELLM_AIDE_API_KEY="your-api-key"

If your virtual environment is not being activated automatically, you can either run the following or install direnv and add the following to a .envrc:

source .venv/bin/activate

To lint/format, run:

ruff check --fix && ruff format .

import time
from aide.agent import AIDE, AgentConfig
from aide.backend import LLMBackend
backend = LLMBackend()
agent = AIDE(
    config=AgentConfig(
        steps=10,            # stop after 10 steps
        cost_budget=5.0,     # or after $5 spent (USD)
        time_budget=3600.0,  # or after 1 hour (seconds)
    ),
    source_code="your code here",
    metric_name="accuracy",
    maximize=True,
    generate_chat=backend.chat,
    eval_chat_json=backend.chat_json,
)
# Run until any constraint is hit
step = 0
start_time = time.time()
while True:
    # stopping conditions
    step_condition = agent.config.steps is not None and step >= agent.config.steps
    elapsed_time = time.time() - start_time
    time_condition = agent.config.time_budget is not None and elapsed_time >= agent.config.time_budget
    total_cost = sum(n.usage.get("cost", 0.0) for n in agent.solution_tree.nodes)
    cost_condition = agent.config.cost_budget is not None and total_cost >= agent.config.cost_budget
    if step_condition or time_condition or cost_condition:
        break
    node = agent.next_candidate()
    exec_output = execute(node.code)  # you implement this
    agent.update_latest_candidate(exec_output)
    step += 1

The LiteLLM backend automatically tracks costs via the `x-litellm-response-cost` header. Access per-node costs via `node.usage["cost"]` or sum them to find the total cost.
Note: The cost_budget constraint includes both LLM optimization costs (node.usage["cost"]) and evaluation costs (node.eval_cost). Evaluation costs (e.g., Modal compute, LLM API spend during eval) are tracked per-node and included in budget calculations for RSI-Bench. For ALE-Bench and MLE-Bench, only LLM costs are currently tracked.
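As a quick sanity check, the total spend can be recomputed from the solution tree at any point. A minimal sketch, assuming a node's `eval_cost` is absent or `None` when no evaluation cost was recorded:

# Recompute total spend from the tree (LLM costs are always tracked; eval costs only where supported).
llm_cost = sum(n.usage.get("cost", 0.0) for n in agent.solution_tree.nodes)
eval_cost = sum(getattr(n, "eval_cost", 0.0) or 0.0 for n in agent.solution_tree.nodes)
print(f"LLM: ${llm_cost:.2f}  eval: ${eval_cost:.2f}  total: ${llm_cost + eval_cost:.2f}")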
python tests/test_sorting.py

Optimizes bubble sort (~4950 comparisons) → mergesort variant (~500 comparisons). Takes ~30-60 seconds, uses ~10k input / ~3k output tokens.
Run AIDE on ALE-Bench (AtCoder Heuristic Contest problems) using ALE-Bench as the executor and AIDE for code generation & tree search.
# In the AIDE repo root
uv sync
# Install ALE-Bench toolkit
uv pip install git+https://github.com/SakanaAI/ALE-Bench.git
# Add yourself to the docker group (required for ALE-Bench) (when on Linux)
sudo usermod -aG docker $USER
# Log out and back in, or run: newgrp docker
# Build ALE-Bench Docker images (CPU only)
git clone https://github.com/SakanaAI/ALE-Bench.git
cd ALE-Bench
bash ./scripts/docker_build_all.sh $(id -u) $(id -g)
cd ..
# Set LiteLLM configuration
export LITELLM_AIDE_API_KEY="your-api-key"

# Quick test (6 steps, dev config)
python -m experiments.cli ale-lite --config dev
# Run only a specific problem
python -m experiments.cli ale-lite --config dev --only ahc016
# Production run with cost budget ($5/task, 200 max steps)
python -m experiments.cli ale-lite --config iterate
# Run with multiple seeds and hierarchical organization
python -m experiments.cli ale-lite --config iterate --num-seeds 3 --parallel 30 \
--purpose ablation_drafts --num-drafts 10
# Creates: results/ale-bench/experiments/ablation_drafts/d10/exp_iterate_YYYYMMDD_HHMMSS/
# Run with time budget (10 minutes = 600 seconds) - agent stops when time exceeded
python -m experiments.cli ale-lite --config dev --steps 100 --time-budget 600 --only ahc008
# Run with cost budget ($5 USD) - agent stops when cost exceeded
python -m experiments.cli ale-lite --config dev --steps 100 --cost-budget 5.0 --only ahc008
# Run with periodic checkpoint evaluation (private eval at intervals)
python -m experiments.cli ale-lite --config dev --cost-budget 5.0 \
--checkpoint-budget-type cost --checkpoint-frequency 0.5 --only ahc016
# Evaluates best solution at $0.50, $1.00, $1.50, ... thresholds
# Run with LLM-based output extraction (richer analysis, higher cost)
python -m experiments.cli ale-lite --config dev --use-llm-eval --only ahc016

For reproducibility, run multiple seeds in parallel using screen:
# Create 3 screen sessions for 3 seeds (all 10 tasks run in parallel by default)
screen -dmS ale_seed1 && screen -S ale_seed1 -X stuff 'cd /path/to/aide && sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite --config iterate" 2>&1 | tee results/ale-bench/seed1.log\n'
sleep 2 # Stagger starts for different timestamps
screen -dmS ale_seed2 && screen -S ale_seed2 -X stuff 'cd /path/to/aide && sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite --config iterate" 2>&1 | tee results/ale-bench/seed2.log\n'
sleep 2
screen -dmS ale_seed3 && screen -S ale_seed3 -X stuff 'cd /path/to/aide && sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite --config iterate" 2>&1 | tee results/ale-bench/seed3.log\n'

Monitoring:
# List screen sessions
screen -ls
# Attach to a session
screen -r ale_seed1 # Use Ctrl+A+D to detach
# Watch logs
tail -f results/ale-bench/seed1.log
# Check aggregated progress
cat results/ale-bench/experiments/exp_iterate_*/progress.json | python3 -m json.tool

Recommended settings:
- `--parallel 10`: All 10 LITE tasks in parallel (default)
- `--config iterate`: $5 cost budget per task (200 max steps)
- 3 seeds: Standard for statistical significance
Configs are defined in experiments/ale_bench/config.py:
| Preset | Steps | Cost Budget | Description |
|---|---|---|---|
| `dev` | 6 | — | Quick development/testing |
| `iterate` | 200 | $5 | Production runs with cost budget (recommended) |
All presets use:
- 10 LITE tasks: ahc008, ahc011, ahc015, ahc016, ahc024, ahc025, ahc026, ahc027, ahc039, ahc046
- C++20 as the target language (matching ALE-Agent scaffolding)
- 4 hour time limit per problem (ALE-Bench enforced)
ALE-Bench supports different test case counts via the AleBenchAdapter parameters:
| Configuration | Public Tests | Private Tests | Use Case |
|---|---|---|---|
| `lite=True` (default) | 5 | ~40-50 | Fast development |
| `lite=False, private_lite=True` | 50 | ~40-50 | Better search signal, fast private eval |
| `lite=True, private_lite=False` | 5 | 2,000-3,000 | Fast search, accurate ranking |
| `lite=False, private_lite=False` | 50 | 2,000-3,000 | Official benchmarking |
The private_lite parameter allows independent configuration of public vs private evaluation test counts. This is useful for getting 10x more signal during search (50 vs 5 public tests) while keeping private evaluation fast.
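For example, the second row above would roughly correspond to constructing the adapter as below. This is a sketch only: the import path follows the repo layout, and any other required constructor arguments are omitted; see benchmarks/ale_bench/adapter.py for the actual signature.

from benchmarks.ale_bench.adapter import AleBenchAdapter  # import path assumed from the repo layout

# 50 public tests for a stronger search signal, ~40-50 private tests for fast private eval.
adapter = AleBenchAdapter(lite=False, private_lite=True)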
| CLI Flag | Env Variable | Default | Description |
|---|---|---|---|
| `--config` | `AIDE_ALE_CONFIG` | `dev` | Config preset (`dev`, `iterate`) |
| `--steps` | `AIDE_ALE_STEPS` | (from config) | Override number of AIDE steps |
| `--time-budget` | — | (from config) | Time budget in seconds (agent stops when exceeded) |
| `--cost-budget` | — | (from config) | Cost budget in USD (agent stops when exceeded) |
| `--num-workers` | `AIDE_ALE_NUM_WORKERS` | `4` | ALE-Bench CPU workers per problem |
| `--out-dir` | `AIDE_ALE_RUN_DIR` | `results/ale-bench/experiments` | Output directory for run JSON files |
| `--only` | `AIDE_ALE_ONLY` | (all) | Run only a specific problem (e.g., `ahc016`) |
| `--parallel` | `AIDE_ALE_PARALLEL` | `10` | Number of tasks to run in parallel |
| `--num-seeds` | — | `1` | Number of random seeds/runs per task |
| `--num-drafts` | — | (from config) | Override search policy `num_drafts` |
| `--batch-size` | — | `1` | Multi-AIDE batch size (siblings before parent re-selection) |
| `--purpose` | — | — | Experiment purpose for hierarchical organization (e.g., `ablation_drafts`) |
| `--variation` | — | — | Experiment variation (auto-detected for `ablation_drafts` from `num_drafts`) |
| `--checkpoint-budget-type` | — | — | Budget type for periodic checkpoint evaluation (`cost`, `step`, `time`) |
| `--checkpoint-frequency` | — | — | Frequency for checkpoints (e.g., 0.5 for $0.50, 2 for 2 steps, 120 for 120s) |
| — | `AIDE_ALE_DOCKER_LOCK` | `/tmp/aide_ale_docker_start.lock` | Host-wide lock file for serializing Docker session creation |
| — | `AIDE_ALE_PRIVATE_MAX_PARALLEL` | `2` | Max concurrent private evals across the host (P=2 optimal for accuracy) |
| — | `AIDE_ALE_DOCKER_PRIVATE_SLOT_PREFIX` | `/tmp/aide_ale_docker_private_slot` | Lock file prefix for private eval slots |
| — | `AIDE_ALE_PUBLIC_MAX_PARALLEL` | `8` | Max concurrent public evals across the host |
| — | `AIDE_ALE_DOCKER_PUBLIC_SLOT_PREFIX` | `/tmp/aide_ale_docker_public_slot` | Lock file prefix for public eval slots |
Results are saved to timestamped experiment folders under results/ale-bench/experiments/:
- `results/ale-bench/experiments/{purpose}/{variation}/exp_{config}_{timestamp}/` — experiment directory
- `meta.json` — experiment metadata (source of truth)
- `ale_{task_id}_seed{N}_aide_run.json` — full run trajectory with `human_eval`
- `ale_{task_id}_seed{N}_private_eval.json` — private eval sidecar for easy querying
- `ale_{task_id}_seed{N}_log.txt` — task execution log
- `progress.json` — live progress tracking
Long-running experiments automatically checkpoint every step. If a run crashes or is interrupted, it will automatically resume from the latest checkpoint:
# Run gets interrupted at step 47...
# Just re-run the same command - it will resume from the latest checkpoint
python -m experiments.cli ale-lite --config iterate --only ahc016

Output when resuming:
[ahc016] Found checkpoint at step 47: results/ale-bench/experiments/.../ale_ahc016_seed0_aide_run.json
[ahc016] ✓ Resuming from step 47 (loaded 47 nodes, last_step=46)
[ahc016] Current best metric: 0.85
Budget behavior on resume:
- Cost budget: Cumulative across resumes (restored from checkpoint nodes)
- Time budget: Resets on each run (measured from current run start only)
You can also programmatically load checkpoints:
from aide.agent import load_run
agent, last_step = load_run(
    path="checkpoint.json",
    generate_chat=my_chat_fn,
    eval_chat_json=my_eval_fn,
)
# Continue from step last_step + 1

Run AIDE on RSI-Bench (Recursive Self-Improvement Benchmark) tasks for code optimization. RSI-Bench evaluates systems on tasks like circle packing, kernel optimization, and neural architecture search.
Termination is multi-dimensional: runs stop when any constraint (steps, cost, or time) is reached. Cost budget is typically the binding constraint for most tasks.
| Task | Metric | Optimize | Cost Budget | GPU | Description |
|---|---|---|---|---|---|
| `circle_packing` | `reported_sum_of_radii` | Maximize | $10 | No | Pack 26 circles in unit square |
| `prefix_sum` | `time_ms` | Minimize | $5 | Yes | Triton prefix sum kernel |
| `nanogpt_inference` | `average_system_generation_time` | Minimize | $15 | Yes | NanoGPT inference latency |
| `structural_break` | `test_auc` | Maximize | $5 | No | Time series structural breaks |
| `nats_bench` | `test-accuracy` | Maximize | $5 | No | Neural architecture search |
| `ds_1000` | `overall_score` | Maximize | $20 | No | DS-1000 code generation (LLM-based eval) |
| `extract_line_plot` | `accuracy` | Maximize | $10 | No | Extract data from charts (LLM-based eval) |
Note: All tasks have recommended_steps=200, but cost budget is usually the binding constraint. LLM-based evaluation tasks (ds_1000, extract_line_plot) have higher cost budgets due to evaluation API costs.
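The sketch below shows a hypothetical shape for one entry in task_configs.py, consistent with the table above; the real dataclass and field names in benchmarks/rsi_bench/task_configs.py may differ.

from dataclasses import dataclass

@dataclass
class RSITaskConfig:  # hypothetical shape; see benchmarks/rsi_bench/task_configs.py for the real definition
    task_id: str
    metric_name: str
    maximize: bool
    cost_budget: float      # USD
    needs_gpu: bool
    recommended_steps: int = 200

CIRCLE_PACKING = RSITaskConfig(
    task_id="circle_packing",
    metric_name="reported_sum_of_radii",
    maximize=True,
    cost_budget=10.0,
    needs_gpu=False,
)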
# Initialize the RSI-Bench submodule (if not already done during clone)
git submodule update --init --recursive
# Set up each task's environment (run from each task directory)
# Example for circle_packing:
cd benchmarks/rsi_bench/rsi-bench/tasks/circle_packing
./setup.sh # Creates venv, installs deps, sets up direnv
cd -
# Repeat for other tasks you want to run...

Modal provides GPU access for tasks like `prefix_sum` and `nanogpt_inference`, and is recommended for all tasks to avoid local environment issues.
# Install Modal CLI
uv tool install modal
# Authenticate with Modal (creates ~/.modal.toml)
modal token new
# Verify authentication
modal run --help

Create a `.env.rsi-bench` file or export these variables:
# Required
export LITELLM_AIDE_API_KEY="sk-..."
# Required for nats_bench, extract_line_plot (Hugging Face model access)
export HF_TOKEN="hf_..."
# Required for ds_1000, extract_line_plot (LiteLLM for code execution)
export LITELLM_MANAGEMENT_API_KEY="..."
export LITELLM_BASE_URL="..."

# Quick test (dev config, single task)
python -m experiments.cli rsi --config dev --task circle_packing
# Run with multiple seeds in parallel (each seed gets isolated workspace)
python -m experiments.cli rsi --config dev --task circle_packing --num-seeds 3 --parallel 3
# Run with Modal cloud evaluation
python -m experiments.cli rsi --config dev --task circle_packing --use-modal
# Run all standalone tasks
python -m experiments.cli rsi --preset standalone --config full
# Run with time budget (10 minutes = 600 seconds) - agent stops when time exceeded
python -m experiments.cli rsi --config dev --time-budget 600 --task circle_packing
# Run with cost budget ($5 USD) - agent stops when cost exceeded
python -m experiments.cli rsi --config dev --cost-budget 5.0 --task circle_packing
# Override model and reasoning effort
python -m experiments.cli rsi --config full --model gpt-5.1-codex --reasoning-effort high --task circle_packing
# Run with LLM-based output extraction (richer analysis, higher cost)
python -m experiments.cli rsi --config dev --use-llm-eval --task circle_packing
# Run with hierarchical organization (purpose/variation)
python -m experiments.cli rsi --config full --purpose model_comparison --variation gpt52
# Creates: results/rsi-bench/model_comparison/gpt52/exp_full_YYYYMMDD_HHMMSS/

from experiments.rsi_bench.run_rsi_aide import run_rsi_experiment
# Quick test on standalone tasks (no external APIs needed)
run_rsi_experiment(
    config_name='dev',
    preset='standalone',  # circle_packing, structural_break
)

# Full run on Modal with custom settings
run_rsi_experiment(
    config_name='full',
    preset='full',
    use_modal=True,
    model_override='gpt-5.1-codex',
    reasoning_effort_override='high',
    parallel=5,  # Run all full tasks in parallel
)

# Single task
run_rsi_experiment(
    config_name='dev',
    task='circle_packing',
    use_modal=True,
    time_budget_override=600,  # 10 minutes (in seconds)
    cost_budget_override=5.0,  # $5 USD
)

# Source environment
source .env.rsi-bench
# Quick dev run (local execution)
AIDE_RSI_CONFIG=dev AIDE_RSI_PRESET=lite python -m experiments.rsi_bench.run_rsi_aide
# Full run on Modal
AIDE_RSI_CONFIG=full \
AIDE_RSI_PRESET=full \
AIDE_RSI_USE_MODAL=1 \
AIDE_RSI_PARALLEL=5 \
python -m experiments.rsi_bench.run_rsi_aide
# Override model and reasoning effort
AIDE_RSI_CONFIG=full \
AIDE_RSI_MODEL=gpt-5.1-codex \
AIDE_RSI_REASONING_EFFORT=high \
python -m experiments.rsi_bench.run_rsi_aide

Configs define termination constraints. The `full` config uses per-task budgets from `task_configs.py`:
| Config | Steps | Cost Budget | Time Budget | Description |
|---|---|---|---|---|
| `dev` | 5 | $1 | Per-task (12h) | Quick testing (fixed steps/cost for all tasks) |
| `full` | 200 | Per-task | Per-task (12h) | Full optimization (uses per-task defaults) |
The agent stops when any constraint (steps, cost, or time) is reached. For full config, cost budget is typically the binding constraint.
| Preset | Tasks | Description |
|---|---|---|
| `lite` | `circle_packing`, `structural_break` | Quick testing |
| `full` | All tasks except `circle_packing`, `nats_bench` | Default benchmark (5 tasks) |
| `extended` | All 7 tasks | Full benchmark including all tasks |
| `cpu_only` | All non-GPU tasks | Works without CUDA |
| `standalone` | Tasks without external API requirements | No `HF_TOKEN` or LiteLLM needed |
| CLI Flag | Env Variable | Default | Description |
|---|---|---|---|
| `--config` | `AIDE_RSI_CONFIG` | `dev` | Config preset (`dev`, `full`) |
| `--task` | `AIDE_RSI_TASK` | (all in preset) | Run specific task only |
| `--preset` | `AIDE_RSI_PRESET` | `full` | Task preset (`lite`, `full`, `extended`, `cpu_only`, `standalone`) |
| `--steps` | `AIDE_RSI_STEPS` | (from config) | Override steps for ALL tasks |
| `--model` | `AIDE_RSI_MODEL` | `gpt-5.1-codex` | Override model name |
| `--reasoning-effort` | `AIDE_RSI_REASONING_EFFORT` | — | Override reasoning effort (`low`, `medium`, `high`) |
| `--time-budget` | `AIDE_RSI_TIME_BUDGET` | — | Time budget in seconds (agent stops when exceeded) |
| `--cost-budget` | `AIDE_RSI_COST_BUDGET` | — | Cost budget in USD (agent stops when exceeded) |
| `--parallel` | `AIDE_RSI_PARALLEL` | `1` | Number of tasks to run in parallel |
| `--num-seeds` | — | `1` | Number of random seeds per task (creates isolated workspaces) |
| `--use-modal` | `AIDE_RSI_USE_MODAL` | `false` | Use Modal for cloud evaluation |
| `--use-llm-eval` | `AIDE_RSI_USE_LLM_EVAL` | `false` | Use LLM for output extraction (richer analysis, higher cost). Regex otherwise. |
| `--purpose` | — | — | Experiment purpose for hierarchical organization (e.g., `ablation_drafts`, `model_comparison`) |
| `--variation` | — | — | Experiment variation (e.g., `d10`, `gpt52`) |
| `--out-dir` | `AIDE_RSI_RUN_DIR` | `results/rsi-bench` | Output directory |
| — | `AIDE_RSI_RUN_SETUP` | `false` | Run data prep before optimization |
Note: The agent stops when any constraint (steps, time, or cost) is reached.
Multi-seed runs: When using --num-seeds, each seed runs in an isolated workspace copy of the task directory (following RSI-Bench's design). This enables safe parallel execution where each seed has its own optimize.py and results/ folder. Workspaces are automatically cleaned up after each run completes.
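The workspace isolation follows the general pattern sketched below; this is a simplified illustration, not the exact implementation.

import shutil
import tempfile
from pathlib import Path

def make_seed_workspace(task_dir: Path, seed: int) -> Path:
    """Copy the task directory so each seed gets its own optimize.py and results/ folder."""
    root = Path(tempfile.mkdtemp(prefix=f"{task_dir.name}_seed{seed}_"))
    workspace = root / task_dir.name
    shutil.copytree(task_dir, workspace)
    return workspace

# ... run AIDE against the copy, then clean up when the run completes:
# shutil.rmtree(workspace.parent, ignore_errors=True)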
Results are saved to timestamped experiment folders:
results/rsi-bench/
├── exp_{config}_{timestamp}/                        # Without --purpose
│   ├── experiment_log.md                            # Summary with results table
│   ├── progress_{task_id}.json                      # Real-time progress
│   ├── rsi_{task_id}_aide_run.json                  # Full run trajectory
│   ├── rsi_{task_id}_checkpoint_step{N}.json        # Checkpoints
│   └── rsi_{task_id}_log.txt                        # Task execution log
├── {purpose}/exp_{config}_{timestamp}/              # With --purpose only
│   └── ...                                          # Same structure as above
└── {purpose}/{variation}/exp_{config}_{timestamp}/  # With --purpose and --variation
    └── ...                                          # Same structure as above
# Watch all progress files
watch -n 5 'for f in results/rsi-bench/exp_*/progress_*.json; do echo "=== $f ==="; cat "$f"; done'
# Check specific task
cat results/rsi-bench/exp_full_*/progress_circle_packing.json | jq
# View experiment summary
cat results/rsi-bench/exp_full_*/experiment_log.md

Experiments automatically checkpoint every step. Each CLI invocation creates a fresh timestamped directory (`exp_{config}_{timestamp}/`).
RSI-Bench supports cross-run resume via the --resume-from flag:
# Resume incomplete tasks from a previous experiment
python -m experiments.cli rsi --resume-from results/rsi-bench/exp_full_20260105_024454

Budget behavior on resume (differs from ALE-Bench/MLE-Bench):
- Cost budget: Cumulative across resumes (includes both LLM costs and eval costs)
- Time budget: Cumulative across resumes (restored from progress files)
- Eval costs: Tracked per-node (`node.eval_cost`) and included in budget calculations
The resume system validates that the current config matches the original experiment (model, search policy, etc.) and prompts for confirmation if there are mismatches.
- First run may be slow: Modal builds container images on first use
- GPU tasks require Modal: `prefix_sum` and `nanogpt_inference` need a GPU
- Data setup: Some tasks need data prep on first Modal run (handled automatically with `run_setup=True`)
# Run with data prep (first time)
AIDE_RSI_RUN_SETUP=1 AIDE_RSI_USE_MODAL=1 python -m experiments.rsi_bench.run_rsi_aide

Run AIDE on MLE-Bench Kaggle-style ML tasks using local code execution (GPU scheduling supported when GPUs are available).
# In the AIDE repo root
# Install dependencies (includes deep learning, NLP, CV libraries)
# Use GIT_LFS_SKIP_SMUDGE=1 to avoid Git LFS download issues (mlebench uses Git LFS)
GIT_LFS_SKIP_SMUDGE=1 uv sync --group mle-bench
# Activate environment
source .venv/bin/activate
# Prepare competition data (downloads from Kaggle)
mlebench prepare -c spooky-author-identification
mlebench prepare -c leaf-classification
# ... prepare other competitions as needed

# Quick development run (5 steps, single task)
python -m experiments.cli mle-bench --task spooky-author-identification --steps 5
# Run with default config (100 steps, $5 cost budget, 24h time budget)
python -m experiments.cli mle-bench --task leaf-classification
# Run multiple tasks in parallel on GPU machines (1 worker per visible GPU; capped by --parallel)
python -m experiments.cli mle-bench --preset core --parallel 4
# Override model and cost budget
python -m experiments.cli mle-bench --task tweet-sentiment-extraction \
--model o4-mini --cost-budget 10.0
# Run with multiple seeds for statistical significance
python -m experiments.cli mle-bench --task stanford-covid-vaccine --num-seeds 3
# K-fold stratified split (balanced class distribution)
python -m experiments.cli mle-bench --task leaf-classification \
--split-mode kfold --n-folds 5
# Run with checkpoint evaluation at cost thresholds
python -m experiments.cli mle-bench --task leaf-classification \
--cost-budget 5.0 --checkpoint-budget-type cost --checkpoint-frequency 1.0

| Preset | Tasks | Description |
|---|---|---|
| `core` | 5 core tasks | Quick validation (spooky, stanford-covid, google-quest, etc.) |
| `lite` | 22 tasks | MLE-Bench "lite" split (smaller datasets) |
| `all` | 22 tasks | Alias of `lite` in this repo (for compatibility) |
MLE-Bench uses queue-based dynamic GPU scheduling for efficient parallel execution:
- One worker per GPU: Each GPU runs one task at a time
- Dynamic load balancing: When a GPU finishes, it pulls the next task from the queue
- CPU oversubscription control: best-effort thread caps + CPU affinity pinning where supported (Linux) to reduce `n_jobs=-1` blowups (see the sketch after this list)
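In outline, the scheduler behaves like the sketch below. This is a simplified illustration: `run_task` stands in for the real per-task MLE-Bench runner, and the GPU count is assumed.

import multiprocessing as mp
import os

def run_task(task_id: str) -> None:
    # Stand-in for the real per-task MLE-Bench runner.
    print(f"GPU {os.environ['CUDA_VISIBLE_DEVICES']}: running {task_id}")

def gpu_worker(gpu_id, task_queue, cpu_threads):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)              # one worker per GPU
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(cpu_threads)                        # best-effort thread cap
    while True:
        task_id = task_queue.get()                                # pull next task dynamically
        if task_id is None:                                       # sentinel: queue drained
            break
        run_task(task_id)

if __name__ == "__main__":
    tasks = ["spooky-author-identification", "leaf-classification"]
    n_gpus = 2                                                    # assumed GPU count
    cpu_threads = max(1, (os.cpu_count() or n_gpus) // n_gpus)    # floor(cpus / workers)
    task_queue = mp.Queue()
    for t in tasks:
        task_queue.put(t)
    for _ in range(n_gpus):
        task_queue.put(None)
    workers = [mp.Process(target=gpu_worker, args=(i, task_queue, cpu_threads)) for i in range(n_gpus)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()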
| CLI Flag | Env Var | Default | Description |
|---|---|---|---|
| `--task` | — | (all in preset) | Run specific task(s) (repeatable) |
| `--preset` | — | `core` | Task preset (`core`, `lite`, `all`) |
| `--steps` | — | `100` | Maximum optimization steps |
| `--model` | — | `o4-mini` | LLM model for code generation |
| `--cost-budget` | — | `5.0` | Cost budget in USD |
| `--time-budget` | — | `86400` | Time budget in seconds (24h default) |
| `--num-drafts` | — | `15` | Initial draft solutions before improving |
| `--parallel` | — | `1` | Parallel GPU workers (ignored without GPUs) |
| `--num-seeds` | — | `1` | Random seeds per task |
| `--cpu-threads` | — | (auto) | CPU threads per GPU worker (default: floor(cpus / effective_parallel)) |
| `--out-dir` | `AIDE_MLE_RUN_DIR` | `results/mle-bench` | Output directory |
| `--checkpoint-budget-type` | — | (off) | Periodic private eval trigger (`cost`, `step`, `time`) |
| `--checkpoint-frequency` | — | (off) | Periodic private eval frequency (e.g., 0.5, 10, 60) |
| `--split-mode` | — | `simple` | Data split: `simple` or `kfold` |
| `--val-ratio` | — | `0.2` | Validation ratio for simple split |
| `--n-folds` | — | `5` | Number of folds for k-fold |
| `--random-seed` | — | `42` | Random seed for split |
Results are saved to results/mle-bench/exp_{timestamp}/:
results/mle-bench/exp_20251217_143022/
├── summary.json # Experiment metadata and results
├── mle_{task_id}_aide_run.json # Full run trajectory
├── mle_{task_id}_private_eval.json # Private test evaluation
├── mle_{task_id}_resources.json # Resource usage rollup (per task run)
├── mle_{task_id}_step_resources.jsonl # Per-step resource aggregates (JSONL)
├── mle_{task_id}_checkpoint_step{N}.json # Periodic checkpoints
├── mle_{task_id}_checkpoints.json # Checkpoint eval results (if enabled)
└── mle_{task_id}_step{N}_grading.json # Per-step grading reports
MLE-Bench experiments checkpoint every 10 steps. If a run crashes or is interrupted, re-run the same command to resume:
# Run gets interrupted...
python -m experiments.cli mle-bench --task leaf-classification --steps 50
# Re-run to resume from latest checkpoint
python -m experiments.cli mle-bench --task leaf-classification --steps 50
# Output: Resuming from checkpoint_step30.json

Budget behavior on resume:
- Cost budget: Cumulative across resumes (restored from checkpoint nodes)
- Time budget: Resets on each run (measured from current run start only)
# List experiment hierarchy
python scripts/generate_results_table.py results/ale-bench/experiments --list-hierarchy
# Generate results for a specific purpose/variation
python scripts/generate_results_table.py results/ale-bench/experiments --purpose ablation_drafts --variation d10
# Or specify directory directly
python scripts/generate_results_table.py results/ale-bench/experiments/ablation_drafts/d10
# Legacy: Generate from ale_runs with seed count
python scripts/generate_results_table.py ale_runs --seeds 5

Output: `results_summary.md` in the target directory.
Example output:
## Per-Task Results (Averaged Across Seeds)
| Task | Avg Score | Avg Perf | Avg Rank |
|------|-----------|----------|----------|
| ahc008 | 2.08e+05 | 780 | 625.0 |
| ahc039 | 9.64e+03 | 1333 | 343.0 |
| **AGGREGATE** | — | **786** | **692.2** |

# Generate performance improvement plot for a specific purpose/variation
python scripts/generate_performance_curve.py results/ale-bench/experiments --purpose ablation_drafts --variation d10
# Or specify directory directly
python scripts/generate_performance_curve.py results/ale-bench/experiments/ablation_drafts/d10
# Custom output path and seeds
python scripts/generate_performance_curve.py results/ale-bench/experiments/ablation_drafts/d10 --seeds 3 --output my_curve.png

Output: `performance_curve.png` in the target directory.
Shows normalized metric improvement over iterations, aggregated across seeds and tasks.
Embed solution code with OpenAI embeddings and visualize the search tree structure in 2D:
# Single task, multiple seeds - color by seed
python scripts/visualize_embedding_tree.py \
--runs "results/ale-bench/experiments/ablation_drafts/d10/*/ale_ahc015_*_aide_run.json" \
--projection tsne \
--color-by seed \
--cache-dir results/ale-bench/.embedding_cache \
--output embedding_ahc015_seeds.png
# Color by step range (gradient showing search progression over time)
python scripts/visualize_embedding_tree.py \
--runs "results/ale-bench/experiments/ablation_drafts/d60/*/ale_ahc015_seed*_aide_run.json" \
--projection tsne \
--color-by step \
--cache-dir results/ale-bench/.embedding_cache \
--output embedding_ahc015_by_step.png
# Color by stage (draft/debug/improve) with metric heatmap overlay
python scripts/visualize_embedding_tree.py \
--runs "results/ale-bench/experiments/*/exp_*/ale_ahc016_seed*_aide_run.json" \
--projection tsne \
--color-by stage \
--heatmap \
--cache-dir results/ale-bench/.embedding_cache \
--output embedding_ahc016_stage_heatmap.png
# Filter to single task from mixed runs
python scripts/visualize_embedding_tree.py \
--runs "results/ale-bench/experiments/*/exp_*/ale_*_aide_run.json" \
--task ahc015 \
--color-by seed \
--output embedding_ahc015_filtered.png

Options:
- `--projection`: `umap` (default), `tsne`, or `pca`
- `--color-by`: `stage`, `task`, `seed`, `run`, or `step` (gradient by step ranges)
- `--heatmap`: Overlay metric heatmap showing high/low performing regions
- `--cache-dir`: Cache embeddings to avoid repeated API calls
- `--task`: Filter to a single task when loading multiple runs
Requirements:
uv pip install scikit-learn # For t-SNE/PCA
uv pip install umap-learn # For UMAP (optional)
export LITELLM_AIDE_API_KEY="your-api-key"Output shows solution code embeddings projected to 2D with tree edges connecting parent→child nodes. Use --color-by seed to compare how different seeds explore the solution space, or --color-by step to visualize search progression over time with a gradient.
Run private evaluations at periodic intervals during the experiment to track performance over budget consumption:
# Checkpoint by cost (every $0.50)
python -m experiments.cli ale-lite --config dev --cost-budget 5.0 \
--checkpoint-budget-type cost --checkpoint-frequency 0.5 --only ahc016
# Checkpoint by step (every 2 steps)
python -m experiments.cli ale-lite --config dev --steps 10 \
--checkpoint-budget-type step --checkpoint-frequency 2 --only ahc016
# Checkpoint by time (every 120 seconds)
python -m experiments.cli ale-lite --config dev --time-budget 600 \
--checkpoint-budget-type time --checkpoint-frequency 120 --only ahc016

Output: `ale_{task}_checkpoints.json` in the experiment directory with results at each checkpoint:
{
  "task_id": "ahc016",
  "seed": 0,
  "budget_type": "cost",
  "frequency": 0.5,
  "results": [
    {
      "checkpoint_value": 0.5,
      "actual_value": 0.59,
      "step": 4,
      "public_score": 274187466.0,
      "private_score": 1455185961.0,
      "rank": 414,
      "performance": 1355.0
    }
  ]
}

Evaluate performance at different cost thresholds ($1, $2, $3, $4, $5) after an experiment completes:
# Run cost checkpoint evaluation for an experiment
python scripts/run_cost_checkpoint_eval.py <experiment_dir> \
--output <output.json> \
--thresholds 1 2 3 4 5 \
--parallel 2
# Example: Evaluate experiment at $1-$5 checkpoints with 2 parallel workers
python scripts/run_cost_checkpoint_eval.py \
results/ale-bench/experiments/model_comparison/o3_d5/exp_100_20251205_133418 \
--output results/checkpoints/o3_d5_checkpoints.json \
--thresholds 1 2 3 4 5 \
--parallel 2

For each (task, seed, cost_threshold) combination, the script (sketched after this list):
- Finds the best node with cumulative cost ≤ threshold
- Runs private evaluation on that solution
- Records performance, rank, and score
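A minimal sketch of the node-selection step, assuming nodes are iterated in step order and carry the `usage` and `metric` fields described elsewhere in this README:

def best_node_under_budget(nodes, threshold, maximize=True):
    """Best evaluated node whose cumulative LLM cost stays within the threshold."""
    cumulative, eligible = 0.0, []
    for node in nodes:                                  # assumed to be in step order
        cumulative += node.usage.get("cost", 0.0)
        if cumulative > threshold:
            break
        if node.metric is not None:                     # skip failed / unevaluated nodes
            eligible.append(node)
    if not eligible:
        return None
    key = lambda n: n.metric
    return max(eligible, key=key) if maximize else min(eligible, key=key)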
Output format (JSON):
{
  "experiment_dir": "...",
  "thresholds": [1.0, 2.0, 3.0, 4.0, 5.0],
  "results": [
    {
      "task_id": "ahc008",
      "seed": 0,
      "cost_threshold": 1.0,
      "actual_cost": 0.95,
      "step": 15,
      "public_score": 1234567.0,
      "private_score": 9876543.0,
      "rank": 450,
      "performance": 1150.0
    },
    ...
  ]
}

This is useful for analyzing cost-performance tradeoffs and comparing models. See notes/PER_DOLLAR_O3_VS_CODEX.md for an example analysis comparing o3 vs gpt-5.1-codex.
The benchmark system uses a generic adapter interface. To add a new benchmark (e.g., RE-Bench, Kernel-Bench):
- Create `benchmarks/new_bench/adapter.py` implementing `BenchmarkAdapter`:
from benchmarks.base import TaskSpec, EvalResult, BenchmarkAdapter

class NewBenchAdapter(BenchmarkAdapter):
    name = "new_bench"
    metric_name = "score"
    maximize = True

    def list_tasks(self, preset: str = "full") -> list[TaskSpec]: ...
    def build_base_code(self, task: TaskSpec) -> str: ...
    def eval_candidate(self, task: TaskSpec, code: str) -> EvalResult: ...
    def close(self) -> None: ...

- Create `experiments/new_bench/run_experiment.py` using the adapter with AIDE (see the sketch below).
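A rough sketch of what such a runner could look like, reusing the agent loop from the Quick Start. How `EvalResult` maps onto the execution-output format `update_latest_candidate` expects is an assumption to check against the existing adapters.

from aide.agent import AIDE, AgentConfig
from aide.backend import LLMBackend
from benchmarks.new_bench.adapter import NewBenchAdapter  # your new adapter module

adapter = NewBenchAdapter()
task = adapter.list_tasks(preset="full")[0]
backend = LLMBackend()

agent = AIDE(
    config=AgentConfig(steps=10),
    source_code=adapter.build_base_code(task),
    metric_name=adapter.metric_name,
    maximize=adapter.maximize,
    generate_chat=backend.chat,
    eval_chat_json=backend.chat_json,
)

for _ in range(agent.config.steps):
    node = agent.next_candidate()
    result = adapter.eval_candidate(task, node.code)
    agent.update_latest_candidate(result)  # adapt EvalResult to the expected execution-output shape

adapter.close()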
Implement two callables matching these signatures:
def my_chat(system: str, user: str) -> tuple[str, dict[str, int]]:
    """Returns (response_text, {"input_tokens": N, "output_tokens": M})"""
    ...

def my_chat_json(system: str, user: str) -> tuple[dict, dict[str, int]]:
    """Returns (parsed_json, usage_dict)"""
    ...
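A minimal sketch of such callables using the official openai client; the model name and JSON-mode handling are assumptions, and the bundled LLMBackend in aide/backend.py already provides compatible chat and chat_json callables out of the box.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def my_chat(system: str, user: str) -> tuple[str, dict[str, int]]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model, swap for your provider's
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    usage = {"input_tokens": resp.usage.prompt_tokens, "output_tokens": resp.usage.completion_tokens}
    return resp.choices[0].message.content, usage

def my_chat_json(system: str, user: str) -> tuple[dict, dict[str, int]]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # prompts must ask for JSON output
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    usage = {"input_tokens": resp.usage.prompt_tokens, "output_tokens": resp.usage.completion_tokens}
    return json.loads(resp.choices[0].message.content), usage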
`AgentConfig` parameters:

| Parameter | Default | Description |
|---|---|---|
| `steps` | `None` | Max optimization steps (optional) |
| `cost_budget` | `None` | Max cost in USD (optional) |
| `time_budget` | `None` | Max time in seconds (optional) |
| `search_policy` | `SearchPolicyConfig()` | Search policy configuration (see below) |
| `model` | `gpt-5.1-codex` | OpenAI model for code generation |
| `chat_kwargs` | `{}` | Extra kwargs passed to LLM calls (e.g., `{"reasoning_effort": "high"}`) |
Note: At least one of steps, cost_budget, or time_budget must be specified. The agent stops when any constraint is reached.
Note: On its own, the `cost_budget` parameter constrains only the agent's LLM API costs. External evaluation costs (e.g., Modal GPU compute, ALE-Bench Docker execution) are counted only when the benchmark runner records them per node (currently RSI-Bench via `node.eval_cost`); otherwise they are not included in cost tracking.
Tip: reasoning_effort support varies by provider—Gemini and Anthropic support it natively, but OpenAI non-reasoning models (e.g., gpt-4o) do not and may raise UnsupportedParamsError. See aide/backend.py for details.
SearchPolicyConfig parameters (nested under search_policy):
| Parameter | Default | Description |
|---|---|---|
| `num_drafts` | 5 | Initial solutions before improving |
| `debug_prob` | 0.5 | Probability of debugging vs improving |
| `max_debug_depth` | 3 | Max consecutive debug attempts |
| `batch_size` | 1 | Multi-AIDE: siblings to generate before re-selecting parent (1 = original greedy) |
By default (batch_size=1), AIDE performs greedy parent selection after every step. Setting batch_size > 1 enables Multi-AIDE, which generates n siblings from the same parent before re-evaluating:
batch_size=1 (default): Generate → Select best → Generate → Select best → ...
batch_size=3 (Multi-AIDE): Generate 3 siblings → Select best → Generate 3 siblings → ...
This trades depth for breadth in the search, exploring more diverse strategies from each promising node. Multi-AIDE tends to produce more consistent results (better IQM), while the default greedy approach has higher variance with potential for breakthrough solutions on complex problems.
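Programmatically, Multi-AIDE is just a search-policy setting; a minimal sketch, assuming SearchPolicyConfig is importable from aide.agent alongside AgentConfig:

from aide.agent import AgentConfig, SearchPolicyConfig  # import location of SearchPolicyConfig assumed

config = AgentConfig(
    steps=50,
    search_policy=SearchPolicyConfig(
        num_drafts=5,
        batch_size=3,  # Multi-AIDE: generate 3 siblings before re-selecting a parent
    ),
)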
# Run with Multi-AIDE (batch_size=3)
python -m experiments.cli ale-lite --config dev --batch-size 3
# Compare Multi-AIDE vs baseline
python -m experiments.cli ale-lite --config iterate --num-seeds 3 --batch-size 1 # baseline
python -m experiments.cli ale-lite --config iterate --num-seeds 3 --batch-size 3  # Multi-AIDE

See notes/MULTI_AIDE_ANALYSIS.md for detailed analysis and when to use each approach.
AIDE can dump optimization runs to JSON for offline analysis and visualization.
from aide.agent import AIDE, dump_run, load_run
# Save a run (includes full tree, metrics, source code)
dump_run(agent, "my_run.json")
# Restore an agent from checkpoint to continue execution
agent, last_step = load_run(
    path="my_run.json",
    generate_chat=openai_chat,
    eval_chat_json=openai_chat_json,
)

Or use the environment variable with test_sorting.py:
AIDE_RUN_PATH=sorting_run.json python tests/test_sorting.py

from viz.run_model import load_run, get_best_node, get_path_to_root, compute_run_stats
run = load_run("sorting_run.json")
# Get statistics
stats = compute_run_stats(run)
print(f"Total nodes: {stats['total_nodes']}, Success rate: {stats['success_rate']:.1%}")
# Find best solution
best = get_best_node(run)
print(f"Best metric: {best.metric}")
# Trace path from root to best
path = get_path_to_root(run, best.id)
for node in path:
    print(f"Step {node.step}: {node.stage} -> metric={node.metric}")

from viz.run_model import load_run
from viz.viz_run import plot_metric_over_steps, plot_tree, save_metric_plot, save_tree_plot
run = load_run("sorting_run.json")
# Interactive plotting
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plot_metric_over_steps(run, ax=ax)
plt.show()
# Or save directly
save_metric_plot(run, "metric_over_steps.png")
save_tree_plot(run, "tree.png")

# Generate plots and print analysis
python examples/inspect_sorting_run.py sorting_run.json --save-plots

This produces:
- `sorting_run_metric.png` - Metric over steps scatter plot
- `sorting_run_tree.png` - Search tree structure visualization
uv pip install matplotlib networkx

`PermissionError(13, 'Permission denied')` - docker.sock
Fix: Ensure you're in the docker group and the group is active:
sudo usermod -aG docker $USER
newgrp docker  # Apply without logout

In rare cases with very high parallelization, tasks may hang due to:
- API rate limiting: OpenAI 429 errors cause exponential backoff across all workers
- Docker resource exhaustion: Too many concurrent containers
Symptoms: Processes at 0% CPU, no output for 5+ minutes, memory static.
Fix: If experiencing issues, reduce parallelism or check API rate limits. In practice, --parallel 10 works reliably for most setups.
When running multiple experiments simultaneously with high parallelism (e.g., 2 experiments with parallel=30), many ale_bench.start calls can overwhelm the Docker daemon socket.
Solution: Session creation is automatically serialized using a host-wide file lock at /tmp/aide_ale_docker_start.lock. This only affects initial Docker session startup—once sessions are created, evaluations run fully in parallel as before. The only cost is a few extra seconds at experiment startup.
To customize the lock file path (e.g., for per-user isolation):
export AIDE_ALE_DOCKER_LOCK="/tmp/my_custom_lock.lock"

Private evaluations are expensive (~30-60 seconds each) and can overwhelm the Docker daemon if too many run simultaneously. Private evals are automatically throttled across all processes on the host using an N-slot file-based semaphore (default: 2 concurrent private evals; public evals default to 8).
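The slot-based throttle works roughly like the sketch below (non-blocking fcntl locks over slot files named after the AIDE_ALE_DOCKER_PRIVATE_SLOT_PREFIX setting); this is a simplified illustration, not the exact implementation, and run_private_eval is hypothetical.

import contextlib
import fcntl
import os
import time

@contextlib.contextmanager
def acquire_slot(prefix: str, max_parallel: int):
    """Block until one of prefix.0 .. prefix.{max_parallel-1} can be exclusively locked."""
    while True:
        for i in range(max_parallel):
            fd = os.open(f"{prefix}.{i}", os.O_CREAT | os.O_RDWR, 0o666)
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking try-lock
            except BlockingIOError:
                os.close(fd)                                    # slot busy, try the next one
                continue
            try:
                yield i
                return
            finally:
                fcntl.flock(fd, fcntl.LOCK_UN)
                os.close(fd)
        time.sleep(1.0)                                         # every slot busy; retry shortly

# with acquire_slot("/tmp/aide_ale_docker_private_slot", max_parallel=2):
#     run_private_eval(...)  # hypothetical private-eval call guarded by the slot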
To customize the concurrency limit:
export AIDE_ALE_PRIVATE_MAX_PARALLEL=2  # default: 2 concurrent private evals (optimal for accuracy)

Running via `newgrp docker <<EOF ... EOF` may cause stdout buffering issues where output doesn't appear.
Fix: Use sg docker -c "command" instead, or redirect to a log file:
nohup sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite ..." > run.log 2>&1 &

Lessons learned from running ALE-Bench experiments.
Up to 90 parallel runs have been tested on the weco-gpu machine.
# Watch progress
tail -f results/ale-bench/experiments/*/exp_*/progress.json
# Check if stuck (no CPU = waiting on I/O)
ps aux | grep "experiments.cli" | grep -v grep
# Kill stuck experiment
pkill -f "experiments.cli ale-lite"

If an experiment fails mid-run:
- Completed tasks have `ale_{task_id}_aide_run.json` files
- Re-run only failed tasks with `--only {task_id}`
- Results go to a new timestamped folder
# Re-run single failed task
./.venv/bin/python3 -m experiments.cli ale-lite --config iterate --only ahc039

[TODO: Figure out the precise cost]
Before starting a multi-hour experiment:
- `docker ps` works without sudo
- `LITELLM_AIDE_API_KEY` is set
- Log output to file, not just terminal
- Use `screen` or `tmux` for long runs
The notes/ directory contains AI-generated documentation and investigation reports. These are working documents created during development and debugging sessions.
| File | Description |
|---|---|
| `ABLATION_RESULTS.md` | num_drafts ablation study results and analysis |
| `ALE_BENCH_DETAILED.md` | Advanced ALE-Bench options and analysis scripts |
| `DOCKER_ISSUE_REPORT.md` | Docker container stampede investigation and RAM disk solution |
| `EXPERIMENT_SCHEMA.md` | Experiment data organization and JSON schemas |
| `HISTORICAL_FINDINGS.md` | Multi-AIDE analysis, num_drafts ablation findings |
| `MIGRATION_STATUS.md` | Experiment schema migration tracking |
| `MULTI_AIDE_ANALYSIS.md` | Multi-AIDE batch-based parent selection analysis |
| `PER_DOLLAR_O3_VS_CODEX.md` | Cost checkpoint analysis comparing o3 vs gpt-5.1-codex |
Additional notes for specific experiments and investigations are also available in the directory.
These notes may be useful for understanding past issues and their resolutions.