Minimal agentic framework for iterative code improvement using a tree-based search with draft → debug → improve operations.
AIDE/
├── aide/                                # Core agent
│   ├── agent.py                         # AIDE agent and tree search
│   └── backend.py                       # LiteLLM backend integration
├── benchmarks/                          # Benchmark integrations
│   ├── base.py                          # Common interface for all benchmarks
│   ├── ale_bench/                       # ALE-Bench integration
│   │   └── adapter.py                   # ALE-Bench <-> AIDE glue
│   ├── mle_bench/                       # MLE-Bench integration (Kaggle competitions)
│   │   ├── adapter.py                   # MLE-Bench <-> AIDE glue
│   │   └── splits/                      # Train/val data splitting strategies (per-competition)
│   └── rsi_bench/                       # RSI-Bench integration
│       ├── adapter.py                   # RSI-Bench <-> AIDE glue
│       ├── task_configs.py              # Task configurations
│       └── rsi-bench/                   # RSI-Bench submodule
├── experiments/                         # Experiment runners
│   ├── cli.py                           # Unified CLI entry point
│   ├── schema.py                        # Experiment metadata schemas
│   ├── device_scheduler.py              # Queue-based GPU scheduling (+ optional CPU pinning)
│   ├── ale_bench/                       # ALE-Bench experiments
│   │   ├── config.py                    # ALE paper-matching configs
│   │   └── run_lite_aide.py             # Run AIDE on ALE-Bench LITE
│   ├── mle_bench/                       # MLE-Bench experiments
│   │   ├── config.py                    # MLE-Bench default configs
│   │   └── run_aide.py                  # Run AIDE on Kaggle competitions
│   └── rsi_bench/                       # RSI-Bench experiments
│       ├── config.py                    # RSI-Bench configs with per-task steps
│       └── run_rsi_aide.py              # Run AIDE on RSI-Bench tasks
├── scripts/                             # Analysis & utility scripts
│   ├── generate_results_table.py        # Generate results summary
│   ├── generate_performance_curve.py    # Plot performance curves
│   ├── visualize_embedding_tree.py      # Embed & visualize solution tree
│   └── migrate_rsi_eval_metadata.py     # Migrate old RSI-Bench experiments to include eval costs
├── viz/                                 # Visualization library
│   ├── run_model.py                     # Run log data model
│   └── viz_run.py                       # Plotting utilities
├── tests/                               # Tests
├── examples/                            # Usage examples (thin wrappers)
├── .github/
│   └── workflows/                       # GitHub Actions workflows (CI, linting, etc.)
│       └── lint.yml                     # Lint & format code via Ruff on push to main
├── README.md
└── pyproject.toml
# Clone with submodules (for RSI-Bench support)
git clone --recurse-submodules git@github.com:WecoAI/aide2.git
cd aide2
# Or if already cloned, initialize submodules
git submodule update --init --recursive
uv sync
which python # make sure that path is something like <something>/aide2/.venv/bin/python
export LITELLM_AIDE_API_KEY="your-api-key"

If your virtual environment is not being activated automatically, you can either run the following or install direnv and add the following to a .envrc:

source .venv/bin/activate

To lint/format, run:

ruff check --fix && ruff format .

import time
from aide.agent import AIDE, AgentConfig
from aide.backend import LLMBackend
backend = LLMBackend()
agent = AIDE(
    config=AgentConfig(
        steps=10,            # stop after 10 steps
        cost_budget=5.0,     # or after $5 spent (USD)
        time_budget=3600.0,  # or after 1 hour (seconds)
    ),
    source_code="your code here",
    metric_name="accuracy",
    maximize=True,
    generate_chat=backend.chat,
    eval_chat_json=backend.chat_json,
)
# Run until any constraint is hit
step = 0
start_time = time.time()
while True:
    # stopping conditions
    step_condition = agent.config.steps is not None and step >= agent.config.steps
    elapsed_time = time.time() - start_time
    time_condition = agent.config.time_budget is not None and elapsed_time >= agent.config.time_budget
    total_cost = sum(n.usage.get("cost", 0.0) for n in agent.solution_tree.nodes)
    cost_condition = agent.config.cost_budget is not None and total_cost >= agent.config.cost_budget
    if step_condition or time_condition or cost_condition:
        break
    node = agent.next_candidate()
    exec_output = execute(node.code)  # you implement this
    agent.update_latest_candidate(exec_output)
    step += 1

The LiteLLM backend automatically tracks costs via the `x-litellm-response-cost` header. Access per-node costs via `node.usage["cost"]` or sum them to find the total cost.
Note: The cost_budget constraint includes both LLM optimization costs (node.usage["cost"]) and evaluation costs (node.eval_cost). Evaluation costs (e.g., Modal compute, LLM API spend during eval) are tracked per-node and included in budget calculations for RSI-Bench. For ALE-Bench and MLE-Bench, only LLM costs are currently tracked.
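As a quick sanity check, the total spend can be recomputed from the solution tree at any point. A minimal sketch, assuming a node's `eval_cost` is absent or `None` when no evaluation cost was recorded:

# Recompute total spend from the tree (LLM costs are always tracked; eval costs only where supported).
llm_cost = sum(n.usage.get("cost", 0.0) for n in agent.solution_tree.nodes)
eval_cost = sum(getattr(n, "eval_cost", 0.0) or 0.0 for n in agent.solution_tree.nodes)
print(f"LLM: ${llm_cost:.2f}  eval: ${eval_cost:.2f}  total: ${llm_cost + eval_cost:.2f}")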
python tests/test_sorting.py

Optimizes bubble sort (~4950 comparisons) → mergesort variant (~500 comparisons). Takes ~30-60 seconds, uses ~10k input / ~3k output tokens.
Run AIDE on ALE-Bench (AtCoder Heuristic Contest problems) using ALE-Bench as the executor and AIDE for code generation & tree search.
# In the AIDE repo root
uv sync
# Install ALE-Bench toolkit
uv pip install git+https://github.com/SakanaAI/ALE-Bench.git
# Add yourself to the docker group (required for ALE-Bench) (when on Linux)
sudo usermod -aG docker $USER
# Log out and back in, or run: newgrp docker
# Build ALE-Bench Docker images (CPU only)
git clone https://github.com/SakanaAI/ALE-Bench.git
cd ALE-Bench
bash ./scripts/docker_build_all.sh $(id -u) $(id -g)
cd ..
# Set LiteLLM configuration
export LITELLM_AIDE_API_KEY="your-api-key"

# Quick test (6 steps, dev config)
python -m experiments.cli ale-lite --config dev
# Run only a specific problem
python -m experiments.cli ale-lite --config dev --only ahc016
# Production run with cost budget ($5/task, 200 max steps)
python -m experiments.cli ale-lite --config iterate
# Run with multiple seeds and hierarchical organization
python -m experiments.cli ale-lite --config iterate --num-seeds 3 --parallel 30 \
--purpose ablation_drafts --num-drafts 10
# Creates: results/ale-bench/experiments/ablation_drafts/d10/exp_iterate_YYYYMMDD_HHMMSS/
# Run with time budget (10 minutes = 600 seconds) - agent stops when time exceeded
python -m experiments.cli ale-lite --config dev --steps 100 --time-budget 600 --only ahc008
# Run with cost budget ($5 USD) - agent stops when cost exceeded
python -m experiments.cli ale-lite --config dev --steps 100 --cost-budget 5.0 --only ahc008
# Run with periodic checkpoint evaluation (private eval at intervals)
python -m experiments.cli ale-lite --config dev --cost-budget 5.0 \
--checkpoint-budget-type cost --checkpoint-frequency 0.5 --only ahc016
# Evaluates best solution at $0.50, $1.00, $1.50, ... thresholds
# Run with LLM-based output extraction (richer analysis, higher cost)
python -m experiments.cli ale-lite --config dev --use-llm-eval --only ahc016

For reproducibility, run multiple seeds in parallel using screen:
# Create 3 screen sessions for 3 seeds (all 10 tasks run in parallel by default)
screen -dmS ale_seed1 && screen -S ale_seed1 -X stuff 'cd /path/to/aide && sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite --config iterate" 2>&1 | tee results/ale-bench/seed1.log\n'
sleep 2 # Stagger starts for different timestamps
screen -dmS ale_seed2 && screen -S ale_seed2 -X stuff 'cd /path/to/aide && sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite --config iterate" 2>&1 | tee results/ale-bench/seed2.log\n'
sleep 2
screen -dmS ale_seed3 && screen -S ale_seed3 -X stuff 'cd /path/to/aide && sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite --config iterate" 2>&1 | tee results/ale-bench/seed3.log\n'

Monitoring:
# List screen sessions
screen -ls
# Attach to a session
screen -r ale_seed1 # Use Ctrl+A+D to detach
# Watch logs
tail -f results/ale-bench/seed1.log
# Check aggregated progress
cat results/ale-bench/experiments/exp_iterate_*/progress.json | python3 -m json.tool

Recommended settings:
- `--parallel 10`: All 10 LITE tasks in parallel (default)
- `--config iterate`: $5 cost budget per task (200 max steps)
- 3 seeds: Standard for statistical significance
Configs are defined in experiments/ale_bench/config.py:
| Preset | Steps | Cost Budget | Description |
|---|---|---|---|
| `dev` | 6 | — | Quick development/testing |
| `iterate` | 200 | $5 | Production runs with cost budget (recommended) |
All presets use:
- 10 LITE tasks: ahc008, ahc011, ahc015, ahc016, ahc024, ahc025, ahc026, ahc027, ahc039, ahc046
- C++20 as the target language (matching ALE-Agent scaffolding)
- 4 hour time limit per problem (ALE-Bench enforced)
ALE-Bench supports different test case counts via the AleBenchAdapter parameters:
| Configuration | Public Tests | Private Tests | Use Case |
|---|---|---|---|
| `lite=True` (default) | 5 | ~40-50 | Fast development |
| `lite=False, private_lite=True` | 50 | ~40-50 | Better search signal, fast private eval |
| `lite=True, private_lite=False` | 5 | 2,000-3,000 | Fast search, accurate ranking |
| `lite=False, private_lite=False` | 50 | 2,000-3,000 | Official benchmarking |
The private_lite parameter allows independent configuration of public vs private evaluation test counts. This is useful for getting 10x more signal during search (50 vs 5 public tests) while keeping private evaluation fast.
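For example, the second row above would roughly correspond to constructing the adapter as below. This is a sketch only: the import path follows the repo layout, and any other required constructor arguments are omitted; see benchmarks/ale_bench/adapter.py for the actual signature.

from benchmarks.ale_bench.adapter import AleBenchAdapter  # import path assumed from the repo layout

# 50 public tests for a stronger search signal, ~40-50 private tests for fast private eval.
adapter = AleBenchAdapter(lite=False, private_lite=True)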
| CLI Flag | Env Variable | Default | Description |
|---|---|---|---|
| `--config` | `AIDE_ALE_CONFIG` | `dev` | Config preset (`dev`, `iterate`) |
| `--steps` | `AIDE_ALE_STEPS` | (from config) | Override number of AIDE steps |
| `--time-budget` | — | (from config) | Time budget in seconds (agent stops when exceeded) |
| `--cost-budget` | — | (from config) | Cost budget in USD (agent stops when exceeded) |
| `--num-workers` | `AIDE_ALE_NUM_WORKERS` | `4` | ALE-Bench CPU workers per problem |
| `--out-dir` | `AIDE_ALE_RUN_DIR` | `results/ale-bench/experiments` | Output directory for run JSON files |
| `--only` | `AIDE_ALE_ONLY` | (all) | Run only a specific problem (e.g., `ahc016`) |
| `--parallel` | `AIDE_ALE_PARALLEL` | `10` | Number of tasks to run in parallel |
| `--num-seeds` | — | `1` | Number of random seeds/runs per task |
| `--num-drafts` | — | (from config) | Override search policy `num_drafts` |
| `--batch-size` | — | `1` | Multi-AIDE batch size (siblings before parent re-selection) |
| `--purpose` | — | — | Experiment purpose for hierarchical organization (e.g., `ablation_drafts`) |
| `--variation` | — | — | Experiment variation (auto-detected for `ablation_drafts` from `num_drafts`) |
| `--checkpoint-budget-type` | — | — | Budget type for periodic checkpoint evaluation (`cost`, `step`, `time`) |
| `--checkpoint-frequency` | — | — | Frequency for checkpoints (e.g., 0.5 for $0.50, 2 for 2 steps, 120 for 120s) |
| — | `AIDE_ALE_DOCKER_LOCK` | `/tmp/aide_ale_docker_start.lock` | Host-wide lock file for serializing Docker session creation |
| — | `AIDE_ALE_PRIVATE_MAX_PARALLEL` | `2` | Max concurrent private evals across the host (P=2 optimal for accuracy) |
| — | `AIDE_ALE_DOCKER_PRIVATE_SLOT_PREFIX` | `/tmp/aide_ale_docker_private_slot` | Lock file prefix for private eval slots |
| — | `AIDE_ALE_PUBLIC_MAX_PARALLEL` | `8` | Max concurrent public evals across the host |
| — | `AIDE_ALE_DOCKER_PUBLIC_SLOT_PREFIX` | `/tmp/aide_ale_docker_public_slot` | Lock file prefix for public eval slots |
Results are saved to timestamped experiment folders under results/ale-bench/experiments/:
- `results/ale-bench/experiments/{purpose}/{variation}/exp_{config}_{timestamp}/` — experiment directory
- `meta.json` — experiment metadata (source of truth)
- `ale_{task_id}_seed{N}_aide_run.json` — full run trajectory with `human_eval`
- `ale_{task_id}_seed{N}_private_eval.json` — private eval sidecar for easy querying
- `ale_{task_id}_seed{N}_log.txt` — task execution log
- `progress.json` — live progress tracking
Long-running experiments automatically checkpoint every step. If a run crashes or is interrupted, it will automatically resume from the latest checkpoint:
# Run gets interrupted at step 47...
# Just re-run the same command - it will resume from the latest checkpoint
python -m experiments.cli ale-lite --config iterate --only ahc016

Output when resuming:
[ahc016] Found checkpoint at step 47: results/ale-bench/experiments/.../ale_ahc016_seed0_aide_run.json
[ahc016] ✓ Resuming from step 47 (loaded 47 nodes, last_step=46)
[ahc016] Current best metric: 0.85
Budget behavior on resume:
- Cost budget: Cumulative across resumes (restored from checkpoint nodes)
- Time budget: Resets on each run (measured from current run start only)
You can also programmatically load checkpoints:
from aide.agent import load_run
agent, last_step = load_run(
    path="checkpoint.json",
    generate_chat=my_chat_fn,
    eval_chat_json=my_eval_fn,
)
# Continue from step last_step + 1

Run AIDE on RSI-Bench (Recursive Self-Improvement Benchmark) tasks for code optimization. RSI-Bench evaluates systems on tasks like circle packing, kernel optimization, and neural architecture search.
Termination is multi-dimensional: runs stop when any constraint (steps, cost, or time) is reached. Cost budget is typically the binding constraint for most tasks.
| Task | Metric | Optimize | Cost Budget | GPU | Description |
|---|---|---|---|---|---|
| `circle_packing` | `reported_sum_of_radii` | Maximize | $10 | No | Pack 26 circles in unit square |
| `prefix_sum` | `time_ms` | Minimize | $5 | Yes | Triton prefix sum kernel |
| `nanogpt_inference` | `average_system_generation_time` | Minimize | $15 | Yes | NanoGPT inference latency |
| `structural_break` | `test_auc` | Maximize | $5 | No | Time series structural breaks |
| `nats_bench` | `test-accuracy` | Maximize | $5 | No | Neural architecture search |
| `ds_1000` | `overall_score` | Maximize | $20 | No | DS-1000 code generation (LLM-based eval) |
| `extract_line_plot` | `accuracy` | Maximize | $10 | No | Extract data from charts (LLM-based eval) |
Note: All tasks have recommended_steps=200, but cost budget is usually the binding constraint. LLM-based evaluation tasks (ds_1000, extract_line_plot) have higher cost budgets due to evaluation API costs.
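The sketch below shows a hypothetical shape for one entry in task_configs.py, consistent with the table above; the real dataclass and field names in benchmarks/rsi_bench/task_configs.py may differ.

from dataclasses import dataclass

@dataclass
class RSITaskConfig:  # hypothetical shape; see benchmarks/rsi_bench/task_configs.py for the real definition
    task_id: str
    metric_name: str
    maximize: bool
    cost_budget: float      # USD
    needs_gpu: bool
    recommended_steps: int = 200

CIRCLE_PACKING = RSITaskConfig(
    task_id="circle_packing",
    metric_name="reported_sum_of_radii",
    maximize=True,
    cost_budget=10.0,
    needs_gpu=False,
)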
# Initialize the RSI-Bench submodule (if not already done during clone)
git submodule update --init --recursive
# Set up each task's environment (run from each task directory)
# Example for circle_packing:
cd benchmarks/rsi_bench/rsi-bench/tasks/circle_packing
./setup.sh # Creates venv, installs deps, sets up direnv
cd -
# Repeat for other tasks you want to run...

Modal provides GPU access for tasks like `prefix_sum` and `nanogpt_inference`, and is recommended for all tasks to avoid local environment issues.
# Install Modal CLI
uv tool install modal
# Authenticate with Modal (creates ~/.modal.toml)
modal token new
# Verify authentication
modal run --help

Create a `.env.rsi-bench` file or export these variables:
# Required
export LITELLM_AIDE_API_KEY="sk-..."
# Required for nats_bench, extract_line_plot (Hugging Face model access)
export HF_TOKEN="hf_..."
# Required for ds_1000, extract_line_plot (LiteLLM for code execution)
export LITELLM_MANAGEMENT_API_KEY="..."
export LITELLM_BASE_URL="..."

# Quick test (dev config, single task)
python -m experiments.cli rsi --config dev --task circle_packing
# Run with multiple seeds in parallel (each seed gets isolated workspace)
python -m experiments.cli rsi --config dev --task circle_packing --num-seeds 3 --parallel 3
# Run with Modal cloud evaluation
python -m experiments.cli rsi --config dev --task circle_packing --use-modal
# Run all standalone tasks
python -m experiments.cli rsi --preset standalone --config full
# Run with time budget (10 minutes = 600 seconds) - agent stops when time exceeded
python -m experiments.cli rsi --config dev --time-budget 600 --task circle_packing
# Run with cost budget ($5 USD) - agent stops when cost exceeded
python -m experiments.cli rsi --config dev --cost-budget 5.0 --task circle_packing
# Override model and reasoning effort
python -m experiments.cli rsi --config full --model gpt-5.1-codex --reasoning-effort high --task circle_packing
# Run with LLM-based output extraction (richer analysis, higher cost)
python -m experiments.cli rsi --config dev --use-llm-eval --task circle_packing
# Run with hierarchical organization (purpose/variation)
python -m experiments.cli rsi --config full --purpose model_comparison --variation gpt52
# Creates: results/rsi-bench/model_comparison/gpt52/exp_full_YYYYMMDD_HHMMSS/

from experiments.rsi_bench.run_rsi_aide import run_rsi_experiment
# Quick test on standalone tasks (no external APIs needed)
run_rsi_experiment(
    config_name='dev',
    preset='standalone',  # circle_packing, structural_break
)

# Full run on Modal with custom settings
run_rsi_experiment(
    config_name='full',
    preset='full',
    use_modal=True,
    model_override='gpt-5.1-codex',
    reasoning_effort_override='high',
    parallel=5,  # Run all full tasks in parallel
)

# Single task
run_rsi_experiment(
    config_name='dev',
    task='circle_packing',
    use_modal=True,
    time_budget_override=600,  # 10 minutes (in seconds)
    cost_budget_override=5.0,  # $5 USD
)

# Source environment
source .env.rsi-bench
# Quick dev run (local execution)
AIDE_RSI_CONFIG=dev AIDE_RSI_PRESET=lite python -m experiments.rsi_bench.run_rsi_aide
# Full run on Modal
AIDE_RSI_CONFIG=full \
AIDE_RSI_PRESET=full \
AIDE_RSI_USE_MODAL=1 \
AIDE_RSI_PARALLEL=5 \
python -m experiments.rsi_bench.run_rsi_aide
# Override model and reasoning effort
AIDE_RSI_CONFIG=full \
AIDE_RSI_MODEL=gpt-5.1-codex \
AIDE_RSI_REASONING_EFFORT=high \
python -m experiments.rsi_bench.run_rsi_aide

Configs define termination constraints. The `full` config uses per-task budgets from `task_configs.py`:
| Config | Steps | Cost Budget | Time Budget | Description |
|---|---|---|---|---|
| `dev` | 5 | $1 | Per-task (12h) | Quick testing (fixed steps/cost for all tasks) |
| `full` | 200 | Per-task | Per-task (12h) | Full optimization (uses per-task defaults) |
The agent stops when any constraint (steps, cost, or time) is reached. For full config, cost budget is typically the binding constraint.
| Preset | Tasks | Description |
|---|---|---|
| `lite` | `circle_packing`, `structural_break` | Quick testing |
| `full` | All tasks except `circle_packing`, `nats_bench` | Default benchmark (5 tasks) |
| `extended` | All 7 tasks | Full benchmark including all tasks |
| `cpu_only` | All non-GPU tasks | Works without CUDA |
| `standalone` | Tasks without external API requirements | No `HF_TOKEN` or LiteLLM needed |
| CLI Flag | Env Variable | Default | Description |
|---|---|---|---|
| `--config` | `AIDE_RSI_CONFIG` | `dev` | Config preset (`dev`, `full`) |
| `--task` | `AIDE_RSI_TASK` | (all in preset) | Run specific task only |
| `--preset` | `AIDE_RSI_PRESET` | `full` | Task preset (`lite`, `full`, `extended`, `cpu_only`, `standalone`) |
| `--steps` | `AIDE_RSI_STEPS` | (from config) | Override steps for ALL tasks |
| `--model` | `AIDE_RSI_MODEL` | `gpt-5.1-codex` | Override model name |
| `--reasoning-effort` | `AIDE_RSI_REASONING_EFFORT` | — | Override reasoning effort (`low`, `medium`, `high`) |
| `--time-budget` | `AIDE_RSI_TIME_BUDGET` | — | Time budget in seconds (agent stops when exceeded) |
| `--cost-budget` | `AIDE_RSI_COST_BUDGET` | — | Cost budget in USD (agent stops when exceeded) |
| `--parallel` | `AIDE_RSI_PARALLEL` | `1` | Number of tasks to run in parallel |
| `--num-seeds` | — | `1` | Number of random seeds per task (creates isolated workspaces) |
| `--use-modal` | `AIDE_RSI_USE_MODAL` | `false` | Use Modal for cloud evaluation |
| `--use-llm-eval` | `AIDE_RSI_USE_LLM_EVAL` | `false` | Use LLM for output extraction (richer analysis, higher cost). Regex otherwise. |
| `--purpose` | — | — | Experiment purpose for hierarchical organization (e.g., `ablation_drafts`, `model_comparison`) |
| `--variation` | — | — | Experiment variation (e.g., `d10`, `gpt52`) |
| `--out-dir` | `AIDE_RSI_RUN_DIR` | `results/rsi-bench` | Output directory |
| — | `AIDE_RSI_RUN_SETUP` | `false` | Run data prep before optimization |
Note: The agent stops when any constraint (steps, time, or cost) is reached.
Multi-seed runs: When using --num-seeds, each seed runs in an isolated workspace copy of the task directory (following RSI-Bench's design). This enables safe parallel execution where each seed has its own optimize.py and results/ folder. Workspaces are automatically cleaned up after each run completes.
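The workspace isolation follows the general pattern sketched below; this is a simplified illustration, not the exact implementation.

import shutil
import tempfile
from pathlib import Path

def make_seed_workspace(task_dir: Path, seed: int) -> Path:
    """Copy the task directory so each seed gets its own optimize.py and results/ folder."""
    root = Path(tempfile.mkdtemp(prefix=f"{task_dir.name}_seed{seed}_"))
    workspace = root / task_dir.name
    shutil.copytree(task_dir, workspace)
    return workspace

# ... run AIDE against the copy, then clean up when the run completes:
# shutil.rmtree(workspace.parent, ignore_errors=True)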
Results are saved to timestamped experiment folders:
results/rsi-bench/
├── exp_{config}_{timestamp}/                        # Without --purpose
│   ├── experiment_log.md                            # Summary with results table
│   ├── progress_{task_id}.json                      # Real-time progress
│   ├── rsi_{task_id}_aide_run.json                  # Full run trajectory
│   ├── rsi_{task_id}_checkpoint_step{N}.json        # Checkpoints
│   └── rsi_{task_id}_log.txt                        # Task execution log
├── {purpose}/exp_{config}_{timestamp}/              # With --purpose only
│   └── ...                                          # Same structure as above
└── {purpose}/{variation}/exp_{config}_{timestamp}/  # With --purpose and --variation
    └── ...                                          # Same structure as above
# Watch all progress files
watch -n 5 'for f in results/rsi-bench/exp_*/progress_*.json; do echo "=== $f ==="; cat "$f"; done'
# Check specific task
cat results/rsi-bench/exp_full_*/progress_circle_packing.json | jq
# View experiment summary
cat results/rsi-bench/exp_full_*/experiment_log.md

Experiments automatically checkpoint every step. Each CLI invocation creates a fresh timestamped directory (`exp_{config}_{timestamp}/`).
RSI-Bench supports cross-run resume via the --resume-from flag:
# Resume incomplete tasks from a previous experiment
python -m experiments.cli rsi --resume-from results/rsi-bench/exp_full_20260105_024454

Budget behavior on resume (differs from ALE-Bench/MLE-Bench):
- Cost budget: Cumulative across resumes (includes both LLM costs and eval costs)
- Time budget: Cumulative across resumes (restored from progress files)
- Eval costs: Tracked per-node (`node.eval_cost`) and included in budget calculations
The resume system validates that the current config matches the original experiment (model, search policy, etc.) and prompts for confirmation if there are mismatches.
- First run may be slow: Modal builds container images on first use
- GPU tasks require Modal: `prefix_sum` and `nanogpt_inference` need a GPU
- Data setup: Some tasks need data prep on first Modal run (handled automatically with `run_setup=True`)
# Run with data prep (first time)
AIDE_RSI_RUN_SETUP=1 AIDE_RSI_USE_MODAL=1 python -m experiments.rsi_bench.run_rsi_aide

Run AIDE on MLE-Bench Kaggle-style ML tasks using local code execution (GPU scheduling supported when GPUs are available).
# In the AIDE repo root
# Install dependencies (includes deep learning, NLP, CV libraries)
# Use GIT_LFS_SKIP_SMUDGE=1 to avoid Git LFS download issues (mlebench uses Git LFS)
GIT_LFS_SKIP_SMUDGE=1 uv sync --group mle-bench
# Activate environment
source .venv/bin/activate
# Prepare competition data (downloads from Kaggle)
mlebench prepare -c spooky-author-identification
mlebench prepare -c leaf-classification
# ... prepare other competitions as needed

# Quick development run (5 steps, single task)
python -m experiments.cli mle-bench --task spooky-author-identification --steps 5
# Run with default config (100 steps, $5 cost budget, 24h time budget)
python -m experiments.cli mle-bench --task leaf-classification
# Run multiple tasks in parallel on GPU machines (1 worker per visible GPU; capped by --parallel)
python -m experiments.cli mle-bench --preset core --parallel 4
# Override model and cost budget
python -m experiments.cli mle-bench --task tweet-sentiment-extraction \
--model o4-mini --cost-budget 10.0
# Run with multiple seeds for statistical significance
python -m experiments.cli mle-bench --task stanford-covid-vaccine --num-seeds 3
# K-fold stratified split (balanced class distribution)
python -m experiments.cli mle-bench --task leaf-classification \
--split-mode kfold --n-folds 5
# Run with checkpoint evaluation at cost thresholds
python -m experiments.cli mle-bench --task leaf-classification \
--cost-budget 5.0 --checkpoint-budget-type cost --checkpoint-frequency 1.0

| Preset | Tasks | Description |
|---|---|---|
| `core` | 5 core tasks | Quick validation (spooky, stanford-covid, google-quest, etc.) |
| `lite` | 22 tasks | MLE-Bench "lite" split (smaller datasets) |
| `all` | 22 tasks | Alias of `lite` in this repo (for compatibility) |
MLE-Bench uses queue-based dynamic GPU scheduling for efficient parallel execution:
- One worker per GPU: Each GPU runs one task at a time
- Dynamic load balancing: When a GPU finishes, it pulls the next task from the queue
- CPU oversubscription control: best-effort thread caps + CPU affinity pinning where supported (Linux) to reduce `n_jobs=-1` blowups (see the sketch after this list)
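In outline, the scheduler behaves like the sketch below. This is a simplified illustration: `run_task` stands in for the real per-task MLE-Bench runner, and the GPU count is assumed.

import multiprocessing as mp
import os

def run_task(task_id: str) -> None:
    # Stand-in for the real per-task MLE-Bench runner.
    print(f"GPU {os.environ['CUDA_VISIBLE_DEVICES']}: running {task_id}")

def gpu_worker(gpu_id, task_queue, cpu_threads):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)              # one worker per GPU
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(cpu_threads)                        # best-effort thread cap
    while True:
        task_id = task_queue.get()                                # pull next task dynamically
        if task_id is None:                                       # sentinel: queue drained
            break
        run_task(task_id)

if __name__ == "__main__":
    tasks = ["spooky-author-identification", "leaf-classification"]
    n_gpus = 2                                                    # assumed GPU count
    cpu_threads = max(1, (os.cpu_count() or n_gpus) // n_gpus)    # floor(cpus / workers)
    task_queue = mp.Queue()
    for t in tasks:
        task_queue.put(t)
    for _ in range(n_gpus):
        task_queue.put(None)
    workers = [mp.Process(target=gpu_worker, args=(i, task_queue, cpu_threads)) for i in range(n_gpus)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()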
| CLI Flag | Env Var | Default | Description |
|---|---|---|---|
| `--task` | — | (all in preset) | Run specific task(s) (repeatable) |
| `--preset` | — | `core` | Task preset (`core`, `lite`, `all`) |
| `--steps` | — | `100` | Maximum optimization steps |
| `--model` | — | `o4-mini` | LLM model for code generation |
| `--cost-budget` | — | `5.0` | Cost budget in USD |
| `--time-budget` | — | `86400` | Time budget in seconds (24h default) |
| `--num-drafts` | — | `15` | Initial draft solutions before improving |
| `--parallel` | — | `1` | Parallel GPU workers (ignored without GPUs) |
| `--num-seeds` | — | `1` | Random seeds per task |
| `--cpu-threads` | — | (auto) | CPU threads per GPU worker (default: floor(cpus / effective_parallel)) |
| `--out-dir` | `AIDE_MLE_RUN_DIR` | `results/mle-bench` | Output directory |
| `--checkpoint-budget-type` | — | (off) | Periodic private eval trigger (`cost`, `step`, `time`) |
| `--checkpoint-frequency` | — | (off) | Periodic private eval frequency (e.g., 0.5, 10, 60) |
| `--split-mode` | — | `simple` | Data split: `simple` or `kfold` |
| `--val-ratio` | — | `0.2` | Validation ratio for simple split |
| `--n-folds` | — | `5` | Number of folds for k-fold |
| `--random-seed` | — | `42` | Random seed for split |
Results are saved to results/mle-bench/exp_{timestamp}/:
results/mle-bench/exp_20251217_143022/
├── summary.json # Experiment metadata and results
├── mle_{task_id}_aide_run.json # Full run trajectory
├── mle_{task_id}_private_eval.json # Private test evaluation
├── mle_{task_id}_resources.json # Resource usage rollup (per task run)
├── mle_{task_id}_step_resources.jsonl # Per-step resource aggregates (JSONL)
├── mle_{task_id}_checkpoint_step{N}.json # Periodic checkpoints
├── mle_{task_id}_checkpoints.json # Checkpoint eval results (if enabled)
└── mle_{task_id}_step{N}_grading.json # Per-step grading reports
MLE-Bench experiments checkpoint every 10 steps. If a run crashes or is interrupted, re-run the same command to resume:
# Run gets interrupted...
python -m experiments.cli mle-bench --task leaf-classification --steps 50
# Re-run to resume from latest checkpoint
python -m experiments.cli mle-bench --task leaf-classification --steps 50
# Output: Resuming from checkpoint_step30.json

Budget behavior on resume:
- Cost budget: Cumulative across resumes (restored from checkpoint nodes)
- Time budget: Resets on each run (measured from current run start only)
# List experiment hierarchy
python scripts/generate_results_table.py results/ale-bench/experiments --list-hierarchy
# Generate results for a specific purpose/variation
python scripts/generate_results_table.py results/ale-bench/experiments --purpose ablation_drafts --variation d10
# Or specify directory directly
python scripts/generate_results_table.py results/ale-bench/experiments/ablation_drafts/d10
# Legacy: Generate from ale_runs with seed count
python scripts/generate_results_table.py ale_runs --seeds 5

Output: `results_summary.md` in the target directory.
Example output:
## Per-Task Results (Averaged Across Seeds)
| Task | Avg Score | Avg Perf | Avg Rank |
|------|-----------|----------|----------|
| ahc008 | 2.08e+05 | 780 | 625.0 |
| ahc039 | 9.64e+03 | 1333 | 343.0 |
| **AGGREGATE** | — | **786** | **692.2** |

# Generate performance improvement plot for a specific purpose/variation
python scripts/generate_performance_curve.py results/ale-bench/experiments --purpose ablation_drafts --variation d10
# Or specify directory directly
python scripts/generate_performance_curve.py results/ale-bench/experiments/ablation_drafts/d10
# Custom output path and seeds
python scripts/generate_performance_curve.py results/ale-bench/experiments/ablation_drafts/d10 --seeds 3 --output my_curve.png

Output: `performance_curve.png` in the target directory.
Shows normalized metric improvement over iterations, aggregated across seeds and tasks.
Embed solution code with OpenAI embeddings and visualize the search tree structure in 2D:
# Single task, multiple seeds - color by seed
python scripts/visualize_embedding_tree.py \
--runs "results/ale-bench/experiments/ablation_drafts/d10/*/ale_ahc015_*_aide_run.json" \
--projection tsne \
--color-by seed \
--cache-dir results/ale-bench/.embedding_cache \
--output embedding_ahc015_seeds.png
# Color by step range (gradient showing search progression over time)
python scripts/visualize_embedding_tree.py \
--runs "results/ale-bench/experiments/ablation_drafts/d60/*/ale_ahc015_seed*_aide_run.json" \
--projection tsne \
--color-by step \
--cache-dir results/ale-bench/.embedding_cache \
--output embedding_ahc015_by_step.png
# Color by stage (draft/debug/improve) with metric heatmap overlay
python scripts/visualize_embedding_tree.py \
--runs "results/ale-bench/experiments/*/exp_*/ale_ahc016_seed*_aide_run.json" \
--projection tsne \
--color-by stage \
--heatmap \
--cache-dir results/ale-bench/.embedding_cache \
--output embedding_ahc016_stage_heatmap.png
# Filter to single task from mixed runs
python scripts/visualize_embedding_tree.py \
--runs "results/ale-bench/experiments/*/exp_*/ale_*_aide_run.json" \
--task ahc015 \
--color-by seed \
--output embedding_ahc015_filtered.png

Options:
- `--projection`: `umap` (default), `tsne`, or `pca`
- `--color-by`: `stage`, `task`, `seed`, `run`, or `step` (gradient by step ranges)
- `--heatmap`: Overlay metric heatmap showing high/low performing regions
- `--cache-dir`: Cache embeddings to avoid repeated API calls
- `--task`: Filter to a single task when loading multiple runs
Requirements:
uv pip install scikit-learn # For t-SNE/PCA
uv pip install umap-learn # For UMAP (optional)
export LITELLM_AIDE_API_KEY="your-api-key"Output shows solution code embeddings projected to 2D with tree edges connecting parent→child nodes. Use --color-by seed to compare how different seeds explore the solution space, or --color-by step to visualize search progression over time with a gradient.
Run private evaluations at periodic intervals during the experiment to track performance over budget consumption:
# Checkpoint by cost (every $0.50)
python -m experiments.cli ale-lite --config dev --cost-budget 5.0 \
--checkpoint-budget-type cost --checkpoint-frequency 0.5 --only ahc016
# Checkpoint by step (every 2 steps)
python -m experiments.cli ale-lite --config dev --steps 10 \
--checkpoint-budget-type step --checkpoint-frequency 2 --only ahc016
# Checkpoint by time (every 120 seconds)
python -m experiments.cli ale-lite --config dev --time-budget 600 \
--checkpoint-budget-type time --checkpoint-frequency 120 --only ahc016

Output: `ale_{task}_checkpoints.json` in the experiment directory with results at each checkpoint:
{
  "task_id": "ahc016",
  "seed": 0,
  "budget_type": "cost",
  "frequency": 0.5,
  "results": [
    {
      "checkpoint_value": 0.5,
      "actual_value": 0.59,
      "step": 4,
      "public_score": 274187466.0,
      "private_score": 1455185961.0,
      "rank": 414,
      "performance": 1355.0
    }
  ]
}

Evaluate performance at different cost thresholds ($1, $2, $3, $4, $5) after an experiment completes:
# Run cost checkpoint evaluation for an experiment
python scripts/run_cost_checkpoint_eval.py <experiment_dir> \
--output <output.json> \
--thresholds 1 2 3 4 5 \
--parallel 2
# Example: Evaluate experiment at $1-$5 checkpoints with 2 parallel workers
python scripts/run_cost_checkpoint_eval.py \
results/ale-bench/experiments/model_comparison/o3_d5/exp_100_20251205_133418 \
--output results/checkpoints/o3_d5_checkpoints.json \
--thresholds 1 2 3 4 5 \
--parallel 2

For each (task, seed, cost_threshold) combination, the script (sketched after this list):
- Finds the best node with cumulative cost ≤ threshold
- Runs private evaluation on that solution
- Records performance, rank, and score
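A minimal sketch of the node-selection step, assuming nodes are iterated in step order and carry the `usage` and `metric` fields described elsewhere in this README:

def best_node_under_budget(nodes, threshold, maximize=True):
    """Best evaluated node whose cumulative LLM cost stays within the threshold."""
    cumulative, eligible = 0.0, []
    for node in nodes:                                  # assumed to be in step order
        cumulative += node.usage.get("cost", 0.0)
        if cumulative > threshold:
            break
        if node.metric is not None:                     # skip failed / unevaluated nodes
            eligible.append(node)
    if not eligible:
        return None
    key = lambda n: n.metric
    return max(eligible, key=key) if maximize else min(eligible, key=key)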
Output format (JSON):
{
  "experiment_dir": "...",
  "thresholds": [1.0, 2.0, 3.0, 4.0, 5.0],
  "results": [
    {
      "task_id": "ahc008",
      "seed": 0,
      "cost_threshold": 1.0,
      "actual_cost": 0.95,
      "step": 15,
      "public_score": 1234567.0,
      "private_score": 9876543.0,
      "rank": 450,
      "performance": 1150.0
    },
    ...
  ]
}

This is useful for analyzing cost-performance tradeoffs and comparing models. See notes/PER_DOLLAR_O3_VS_CODEX.md for an example analysis comparing o3 vs gpt-5.1-codex.
The benchmark system uses a generic adapter interface. To add a new benchmark (e.g., RE-Bench, Kernel-Bench):
- Create `benchmarks/new_bench/adapter.py` implementing `BenchmarkAdapter`:
from benchmarks.base import TaskSpec, EvalResult, BenchmarkAdapter

class NewBenchAdapter(BenchmarkAdapter):
    name = "new_bench"
    metric_name = "score"
    maximize = True

    def list_tasks(self, preset: str = "full") -> list[TaskSpec]: ...
    def build_base_code(self, task: TaskSpec) -> str: ...
    def eval_candidate(self, task: TaskSpec, code: str) -> EvalResult: ...
    def close(self) -> None: ...

- Create `experiments/new_bench/run_experiment.py` using the adapter with AIDE (see the sketch below).
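A rough sketch of what such a runner could look like, reusing the agent loop from the Quick Start. How `EvalResult` maps onto the execution-output format `update_latest_candidate` expects is an assumption to check against the existing adapters.

from aide.agent import AIDE, AgentConfig
from aide.backend import LLMBackend
from benchmarks.new_bench.adapter import NewBenchAdapter  # your new adapter module

adapter = NewBenchAdapter()
task = adapter.list_tasks(preset="full")[0]
backend = LLMBackend()

agent = AIDE(
    config=AgentConfig(steps=10),
    source_code=adapter.build_base_code(task),
    metric_name=adapter.metric_name,
    maximize=adapter.maximize,
    generate_chat=backend.chat,
    eval_chat_json=backend.chat_json,
)

for _ in range(agent.config.steps):
    node = agent.next_candidate()
    result = adapter.eval_candidate(task, node.code)
    agent.update_latest_candidate(result)  # adapt EvalResult to the expected execution-output shape

adapter.close()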
Implement two callables matching these signatures:
def my_chat(system: str, user: str) -> tuple[str, dict[str, int]]:
    """Returns (response_text, {"input_tokens": N, "output_tokens": M})"""
    ...

def my_chat_json(system: str, user: str) -> tuple[dict, dict[str, int]]:
    """Returns (parsed_json, usage_dict)"""
    ...
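A minimal sketch of such callables using the official openai client; the model name and JSON-mode handling are assumptions, and the bundled LLMBackend in aide/backend.py already provides compatible chat and chat_json callables out of the box.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def my_chat(system: str, user: str) -> tuple[str, dict[str, int]]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model, swap for your provider's
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    usage = {"input_tokens": resp.usage.prompt_tokens, "output_tokens": resp.usage.completion_tokens}
    return resp.choices[0].message.content, usage

def my_chat_json(system: str, user: str) -> tuple[dict, dict[str, int]]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # prompts must ask for JSON output
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    usage = {"input_tokens": resp.usage.prompt_tokens, "output_tokens": resp.usage.completion_tokens}
    return json.loads(resp.choices[0].message.content), usage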
`AgentConfig` parameters:

| Parameter | Default | Description |
|---|---|---|
| `steps` | `None` | Max optimization steps (optional) |
| `cost_budget` | `None` | Max cost in USD (optional) |
| `time_budget` | `None` | Max time in seconds (optional) |
| `search_policy` | `SearchPolicyConfig()` | Search policy configuration (see below) |
| `model` | `gpt-5.1-codex` | OpenAI model for code generation |
| `chat_kwargs` | `{}` | Extra kwargs passed to LLM calls (e.g., `{"reasoning_effort": "high"}`) |
Note: At least one of steps, cost_budget, or time_budget must be specified. The agent stops when any constraint is reached.
Note: On its own, the `cost_budget` parameter constrains only the agent's LLM API costs. External evaluation costs (e.g., Modal GPU compute, ALE-Bench Docker execution) are counted only when the benchmark runner records them per node (currently RSI-Bench via `node.eval_cost`); otherwise they are not included in cost tracking.
Tip: reasoning_effort support varies by provider—Gemini and Anthropic support it natively, but OpenAI non-reasoning models (e.g., gpt-4o) do not and may raise UnsupportedParamsError. See aide/backend.py for details.
SearchPolicyConfig parameters (nested under search_policy):
| Parameter | Default | Description |
|---|---|---|
| `num_drafts` | 5 | Initial solutions before improving |
| `debug_prob` | 0.5 | Probability of debugging vs improving |
| `max_debug_depth` | 3 | Max consecutive debug attempts |
| `batch_size` | 1 | Multi-AIDE: siblings to generate before re-selecting parent (1 = original greedy) |
By default (batch_size=1), AIDE performs greedy parent selection after every step. Setting batch_size > 1 enables Multi-AIDE, which generates n siblings from the same parent before re-evaluating:
batch_size=1 (default): Generate → Select best → Generate → Select best → ...
batch_size=3 (Multi-AIDE): Generate 3 siblings → Select best → Generate 3 siblings → ...
This trades depth for breadth in the search, exploring more diverse strategies from each promising node. Multi-AIDE tends to produce more consistent results (better IQM), while the default greedy approach has higher variance with potential for breakthrough solutions on complex problems.
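Programmatically, Multi-AIDE is just a search-policy setting; a minimal sketch, assuming SearchPolicyConfig is importable from aide.agent alongside AgentConfig:

from aide.agent import AgentConfig, SearchPolicyConfig  # import location of SearchPolicyConfig assumed

config = AgentConfig(
    steps=50,
    search_policy=SearchPolicyConfig(
        num_drafts=5,
        batch_size=3,  # Multi-AIDE: generate 3 siblings before re-selecting a parent
    ),
)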
# Run with Multi-AIDE (batch_size=3)
python -m experiments.cli ale-lite --config dev --batch-size 3
# Compare Multi-AIDE vs baseline
python -m experiments.cli ale-lite --config iterate --num-seeds 3 --batch-size 1 # baseline
python -m experiments.cli ale-lite --config iterate --num-seeds 3 --batch-size 3  # Multi-AIDE

See notes/MULTI_AIDE_ANALYSIS.md for detailed analysis and when to use each approach.
AIDE can dump optimization runs to JSON for offline analysis and visualization.
from aide.agent import AIDE, dump_run, load_run
# Save a run (includes full tree, metrics, source code)
dump_run(agent, "my_run.json")
# Restore an agent from checkpoint to continue execution
agent, last_step = load_run(
    path="my_run.json",
    generate_chat=openai_chat,
    eval_chat_json=openai_chat_json,
)

Or use the environment variable with test_sorting.py:
AIDE_RUN_PATH=sorting_run.json python tests/test_sorting.py

from viz.run_model import load_run, get_best_node, get_path_to_root, compute_run_stats
run = load_run("sorting_run.json")
# Get statistics
stats = compute_run_stats(run)
print(f"Total nodes: {stats['total_nodes']}, Success rate: {stats['success_rate']:.1%}")
# Find best solution
best = get_best_node(run)
print(f"Best metric: {best.metric}")
# Trace path from root to best
path = get_path_to_root(run, best.id)
for node in path:
    print(f"Step {node.step}: {node.stage} -> metric={node.metric}")

from viz.run_model import load_run
from viz.viz_run import plot_metric_over_steps, plot_tree, save_metric_plot, save_tree_plot
run = load_run("sorting_run.json")
# Interactive plotting
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plot_metric_over_steps(run, ax=ax)
plt.show()
# Or save directly
save_metric_plot(run, "metric_over_steps.png")
save_tree_plot(run, "tree.png")

# Generate plots and print analysis
python examples/inspect_sorting_run.py sorting_run.json --save-plots

This produces:
- `sorting_run_metric.png` - Metric over steps scatter plot
- `sorting_run_tree.png` - Search tree structure visualization
uv pip install matplotlib networkx

`PermissionError(13, 'Permission denied')` - docker.sock
Fix: Ensure you're in the docker group and the group is active:
sudo usermod -aG docker $USER
newgrp docker  # Apply without logout

In rare cases with very high parallelization, tasks may hang due to:
- API rate limiting: OpenAI 429 errors cause exponential backoff across all workers
- Docker resource exhaustion: Too many concurrent containers
Symptoms: Processes at 0% CPU, no output for 5+ minutes, memory static.
Fix: If experiencing issues, reduce parallelism or check API rate limits. In practice, --parallel 10 works reliably for most setups.
When running multiple experiments simultaneously with high parallelism (e.g., 2 experiments with parallel=30), many ale_bench.start calls can overwhelm the Docker daemon socket.
Solution: Session creation is automatically serialized using a host-wide file lock at /tmp/aide_ale_docker_start.lock. This only affects initial Docker session startup—once sessions are created, evaluations run fully in parallel as before. The only cost is a few extra seconds at experiment startup.
To customize the lock file path (e.g., for per-user isolation):
export AIDE_ALE_DOCKER_LOCK="/tmp/my_custom_lock.lock"

Private evaluations are expensive (~30-60 seconds each) and can overwhelm the Docker daemon if too many run simultaneously. Private evals are automatically throttled across all processes on the host using an N-slot file-based semaphore (default: 2 concurrent private evals; public evals default to 8).
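The slot-based throttle works roughly like the sketch below (non-blocking fcntl locks over slot files named after the AIDE_ALE_DOCKER_PRIVATE_SLOT_PREFIX setting); this is a simplified illustration, not the exact implementation, and run_private_eval is hypothetical.

import contextlib
import fcntl
import os
import time

@contextlib.contextmanager
def acquire_slot(prefix: str, max_parallel: int):
    """Block until one of prefix.0 .. prefix.{max_parallel-1} can be exclusively locked."""
    while True:
        for i in range(max_parallel):
            fd = os.open(f"{prefix}.{i}", os.O_CREAT | os.O_RDWR, 0o666)
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking try-lock
            except BlockingIOError:
                os.close(fd)                                    # slot busy, try the next one
                continue
            try:
                yield i
                return
            finally:
                fcntl.flock(fd, fcntl.LOCK_UN)
                os.close(fd)
        time.sleep(1.0)                                         # every slot busy; retry shortly

# with acquire_slot("/tmp/aide_ale_docker_private_slot", max_parallel=2):
#     run_private_eval(...)  # hypothetical private-eval call guarded by the slot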
To customize the concurrency limit:
export AIDE_ALE_PRIVATE_MAX_PARALLEL=2  # default: 2 concurrent private evals (optimal for accuracy)

Running via `newgrp docker <<EOF ... EOF` may cause stdout buffering issues where output doesn't appear.
Fix: Use sg docker -c "command" instead, or redirect to a log file:
nohup sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite ..." > run.log 2>&1 &

Lessons learned from running ALE-Bench experiments.
Up to 90 parallel runs have been tested on the weco-gpu machine.
# Watch progress
tail -f results/ale-bench/experiments/*/exp_*/progress.json
# Check if stuck (no CPU = waiting on I/O)
ps aux | grep "experiments.cli" | grep -v grep
# Kill stuck experiment
pkill -f "experiments.cli ale-lite"

If an experiment fails mid-run:
- Completed tasks have `ale_{task_id}_aide_run.json` files
- Re-run only failed tasks with `--only {task_id}`
- Results go to a new timestamped folder
# Re-run single failed task
./.venv/bin/python3 -m experiments.cli ale-lite --config iterate --only ahc039

[TODO: Figure out the precise cost]
Before starting a multi-hour experiment:
- `docker ps` works without sudo
- `LITELLM_AIDE_API_KEY` is set
- Log output to file, not just terminal
- Use `screen` or `tmux` for long runs
The notes/ directory contains AI-generated documentation and investigation reports. These are working documents created during development and debugging sessions.
| File | Description |
|---|---|
| `ABLATION_RESULTS.md` | num_drafts ablation study results and analysis |
| `ALE_BENCH_DETAILED.md` | Advanced ALE-Bench options and analysis scripts |
| `DOCKER_ISSUE_REPORT.md` | Docker container stampede investigation and RAM disk solution |
| `EXPERIMENT_SCHEMA.md` | Experiment data organization and JSON schemas |
| `HISTORICAL_FINDINGS.md` | Multi-AIDE analysis, num_drafts ablation findings |
| `MIGRATION_STATUS.md` | Experiment schema migration tracking |
| `MULTI_AIDE_ANALYSIS.md` | Multi-AIDE batch-based parent selection analysis |
| `PER_DOLLAR_O3_VS_CODEX.md` | Cost checkpoint analysis comparing o3 vs gpt-5.1-codex |
Additional notes for specific experiments and investigations are also available in the directory.
These notes may be useful for understanding past issues and their resolutions.