AIDE

Minimal agentic framework for iterative code improvement using a tree-based search with draft → debug → improve operations.

Project Structure

AIDE/
├── aide/                    # Core agent
│   ├── agent.py             # AIDE agent and tree search
│   └── backend.py           # LiteLLM backend integration
├── benchmarks/              # Benchmark integrations
│   ├── base.py              # Common interface for all benchmarks
│   ├── ale_bench/           # ALE-Bench integration
│   │   └── adapter.py       # ALE-Bench <-> AIDE glue
│   ├── mle_bench/           # MLE-Bench integration (Kaggle competitions)
│   │   ├── adapter.py       # MLE-Bench <-> AIDE glue
│   │   └── splits/          # Train/val data splitting strategies (per-competition)
│   └── rsi_bench/           # RSI-Bench integration
│       ├── adapter.py       # RSI-Bench <-> AIDE glue
│       ├── task_configs.py  # Task configurations
│       └── rsi-bench/       # RSI-Bench submodule
├── experiments/             # Experiment runners
│   ├── cli.py               # Unified CLI entry point
│   ├── schema.py            # Experiment metadata schemas
│   ├── device_scheduler.py  # Queue-based GPU scheduling (+ optional CPU pinning)
│   ├── ale_bench/           # ALE-Bench experiments
│   │   ├── config.py        # ALE paper-matching configs
│   │   └── run_lite_aide.py # Run AIDE on ALE-Bench LITE
│   ├── mle_bench/           # MLE-Bench experiments
│   │   ├── config.py        # MLE-Bench default configs
│   │   └── run_aide.py      # Run AIDE on Kaggle competitions
│   └── rsi_bench/           # RSI-Bench experiments
│       ├── config.py        # RSI-Bench configs with per-task steps
│       └── run_rsi_aide.py  # Run AIDE on RSI-Bench tasks
├── scripts/                 # Analysis & utility scripts
│   ├── generate_results_table.py    # Generate results summary
│   ├── generate_performance_curve.py # Plot performance curves
│   ├── visualize_embedding_tree.py  # Embed & visualize solution tree
│   └── migrate_rsi_eval_metadata.py # Migrate old RSI-bench experiments to include eval costs
├── viz/                     # Visualization library
│   ├── run_model.py         # Run log data model
│   └── viz_run.py           # Plotting utilities
├── tests/                   # Tests
├── examples/                # Usage examples (thin wrappers)
├── .github/
│   └── workflows/           # GitHub Actions workflows (CI, linting, etc)
│        └── lint.yml        # Lint & format code via Ruff on push to main
├── README.md
└── pyproject.toml

Quick Start

# Clone with submodules (for RSI-Bench support)
git clone --recurse-submodules git@github.com:WecoAI/aide2.git
cd aide2

# Or if already cloned, initialize submodules
git submodule update --init --recursive

uv sync
which python # make sure the path looks like <something>/aide2/.venv/bin/python
export LITELLM_AIDE_API_KEY="your-api-key"

If your virtual environment is not activated automatically, either run the following directly or install direnv and add it to a .envrc:

source .venv/bin/activate

To lint/format run:

ruff check --fix && ruff format .

Using the LiteLLM Backend

import time
from aide.agent import AIDE, AgentConfig
from aide.backend import LLMBackend

backend = LLMBackend()

agent = AIDE(
    config=AgentConfig(
        steps=10,           # stop after 10 steps
        cost_budget=5.0,    # or after $5 spent (USD)
        time_budget=3600.0, # or after 1 hour (seconds)
    ),
    source_code="your code here",
    metric_name="accuracy",
    maximize=True,
    generate_chat=backend.chat,
    eval_chat_json=backend.chat_json,
)

# Run until any constraint is hit
step = 0
start_time = time.time()
while True:
    # stopping conditions
    step_condition = agent.config.steps is not None and step >= agent.config.steps
    elapsed_time = time.time() - start_time
    time_condition = agent.config.time_budget is not None and elapsed_time >= agent.config.time_budget
    total_cost = sum(n.usage.get("cost", 0.0) for n in agent.solution_tree.nodes)
    cost_condition = agent.config.cost_budget is not None and total_cost >= agent.config.cost_budget
    if step_condition or time_condition or cost_condition:
        break
    node = agent.next_candidate()
    exec_output = execute(node.code)  # you implement this
    agent.update_latest_candidate(exec_output)
    step += 1

The LiteLLM backend automatically tracks costs via the x-litellm-response-cost header. Access per-node costs via node.usage["cost"] or sum them to find the total cost.

Note: The cost_budget constraint includes both LLM optimization costs (node.usage["cost"]) and evaluation costs (node.eval_cost). Evaluation costs (e.g., Modal compute, LLM API spend during eval) are tracked per-node and included in budget calculations for RSI-Bench. For ALE-Bench and MLE-Bench, only LLM costs are currently tracked.
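
Under these conventions, a budget check that accounts for both sources could look like the sketch below. It assumes only what is documented above: nodes expose usage["cost"] and, where tracked, an eval_cost attribute.

def total_spend(agent) -> float:
    """Sum LLM and (where tracked) evaluation costs over the whole solution tree."""
    nodes = agent.solution_tree.nodes
    llm_cost = sum(n.usage.get("cost", 0.0) for n in nodes)
    eval_cost = sum(getattr(n, "eval_cost", 0.0) or 0.0 for n in nodes)  # 0.0 where not tracked
    return llm_cost + eval_cost

A runner that wants evaluation costs included in the budget could compare total_spend(agent) against the cost budget in the stopping check shown above, instead of the LLM-only sum.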

Run the Example

python tests/test_sorting.py

Optimizes bubble sort (~4950 comparisons) → mergesort variant (~500 comparisons). Takes ~30-60 seconds, uses ~10k input / ~3k output tokens.

ALE-Bench Integration

Run AIDE on ALE-Bench (AtCoder Heuristic Contest problems) using ALE-Bench as the executor and AIDE for code generation & tree search.

Setup (one-time)

# In the AIDE repo root
uv sync

# Install ALE-Bench toolkit
uv pip install git+https://github.com/SakanaAI/ALE-Bench.git

# Add yourself to the docker group (required for ALE-Bench on Linux)
sudo usermod -aG docker $USER
# Log out and back in, or run: newgrp docker

# Build ALE-Bench Docker images (CPU only)
git clone https://github.com/SakanaAI/ALE-Bench.git
cd ALE-Bench
bash ./scripts/docker_build_all.sh $(id -u) $(id -g)
cd ..

# Set LiteLLM configuration
export LITELLM_AIDE_API_KEY="your-api-key"

Running ALE-Bench LITE Tasks

# Quick test (6 steps, dev config)
python -m experiments.cli ale-lite --config dev

# Run only a specific problem
python -m experiments.cli ale-lite --config dev --only ahc016

# Production run with cost budget ($5/task, 200 max steps)
python -m experiments.cli ale-lite --config iterate

# Run with multiple seeds and hierarchical organization
python -m experiments.cli ale-lite --config iterate --num-seeds 3 --parallel 30 \
    --purpose ablation_drafts --num-drafts 10
# Creates: results/ale-bench/experiments/ablation_drafts/d10/exp_iterate_YYYYMMDD_HHMMSS/

# Run with time budget (10 minutes = 600 seconds) - agent stops when time exceeded
python -m experiments.cli ale-lite --config dev --steps 100 --time-budget 600 --only ahc008

# Run with cost budget ($5 USD) - agent stops when cost exceeded  
python -m experiments.cli ale-lite --config dev --steps 100 --cost-budget 5.0 --only ahc008

# Run with periodic checkpoint evaluation (private eval at intervals)
python -m experiments.cli ale-lite --config dev --cost-budget 5.0 \
    --checkpoint-budget-type cost --checkpoint-frequency 0.5 --only ahc016
# Evaluates best solution at $0.50, $1.00, $1.50, ... thresholds

# Run with LLM-based output extraction (richer analysis, higher cost)
python -m experiments.cli ale-lite --config dev --use-llm-eval --only ahc016

Running Parallel Experiments with Multiple Seeds

For reproducibility, run multiple seeds in parallel using screen:

# Create 3 screen sessions for 3 seeds (all 10 tasks run in parallel by default)
screen -dmS ale_seed1 && screen -S ale_seed1 -X stuff 'cd /path/to/aide && sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite --config iterate" 2>&1 | tee results/ale-bench/seed1.log\n'

sleep 2  # Stagger starts for different timestamps

screen -dmS ale_seed2 && screen -S ale_seed2 -X stuff 'cd /path/to/aide && sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite --config iterate" 2>&1 | tee results/ale-bench/seed2.log\n'

sleep 2

screen -dmS ale_seed3 && screen -S ale_seed3 -X stuff 'cd /path/to/aide && sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite --config iterate" 2>&1 | tee results/ale-bench/seed3.log\n'

Monitoring:

# List screen sessions
screen -ls

# Attach to a session
screen -r ale_seed1  # Use Ctrl+A+D to detach

# Watch logs
tail -f results/ale-bench/seed1.log

# Check aggregated progress
cat results/ale-bench/experiments/exp_iterate_*/progress.json | python3 -m json.tool

Recommended settings:

  • --parallel 10: All 10 LITE tasks in parallel (default)
  • --config iterate: $5 cost budget per task (200 max steps)
  • 3 seeds: Standard for statistical significance

Config Presets

Configs are defined in experiments/ale_bench/config.py:

| Preset | Steps | Cost Budget | Description |
|--------|-------|-------------|-------------|
| dev | 6 | | Quick development/testing |
| iterate | 200 | $5 | Production runs with cost budget (recommended) |

All presets use:

  • 10 LITE tasks: ahc008, ahc011, ahc015, ahc016, ahc024, ahc025, ahc026, ahc027, ahc039, ahc046
  • C++20 as the target language (matching ALE-Agent scaffolding)
  • 4 hour time limit per problem (ALE-Bench enforced)

Test Case Configuration

ALE-Bench supports different test case counts via the AleBenchAdapter parameters:

| Configuration | Public Tests | Private Tests | Use Case |
|---------------|--------------|---------------|----------|
| lite=True (default) | 5 | ~40-50 | Fast development |
| lite=False, private_lite=True | 50 | ~40-50 | Better search signal, fast private eval |
| lite=True, private_lite=False | 5 | 2,000-3,000 | Fast search, accurate ranking |
| lite=False, private_lite=False | 50 | 2,000-3,000 | Official benchmarking |

The private_lite parameter allows independent configuration of public vs private evaluation test counts. This is useful for getting 10x more signal during search (50 vs 5 public tests) while keeping private evaluation fast.
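
As a rough sketch, the adapter could be constructed as below. The import path follows the project layout above, but the full constructor signature likely includes other arguments (task id, language, etc.), so treat this as illustrative rather than the exact API.

# Assumed import path (benchmarks/ale_bench/adapter.py); other constructor
# arguments are omitted here.
from benchmarks.ale_bench.adapter import AleBenchAdapter

# 50 public tests for a stronger search signal, lite private tests for fast private eval
adapter = AleBenchAdapter(lite=False, private_lite=True)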

CLI Options and Environment Variables

| CLI Flag | Env Variable | Default | Description |
|----------|--------------|---------|-------------|
| --config | AIDE_ALE_CONFIG | dev | Config preset (dev, iterate) |
| --steps | AIDE_ALE_STEPS | (from config) | Override number of AIDE steps |
| --time-budget | | (from config) | Time budget in seconds (agent stops when exceeded) |
| --cost-budget | | (from config) | Cost budget in USD (agent stops when exceeded) |
| --num-workers | AIDE_ALE_NUM_WORKERS | 4 | ALE-Bench CPU workers per problem |
| --out-dir | AIDE_ALE_RUN_DIR | results/ale-bench/experiments | Output directory for run JSON files |
| --only | AIDE_ALE_ONLY | (all) | Run only a specific problem (e.g., ahc016) |
| --parallel | AIDE_ALE_PARALLEL | 10 | Number of tasks to run in parallel |
| --num-seeds | | 1 | Number of random seeds/runs per task |
| --num-drafts | | (from config) | Override search policy num_drafts |
| --batch-size | | 1 | Multi-AIDE batch size (siblings before parent re-selection) |
| --purpose | | | Experiment purpose for hierarchical organization (e.g., ablation_drafts) |
| --variation | | | Experiment variation (auto-detected for ablation_drafts from num_drafts) |
| --checkpoint-budget-type | | | Budget type for periodic checkpoint evaluation (cost, step, time) |
| --checkpoint-frequency | | | Frequency for checkpoints (e.g., 0.5 for $0.50, 2 for 2 steps, 120 for 120s) |
| | AIDE_ALE_DOCKER_LOCK | /tmp/aide_ale_docker_start.lock | Host-wide lock file for serializing Docker session creation |
| | AIDE_ALE_PRIVATE_MAX_PARALLEL | 2 | Max concurrent private evals across the host (P=2 optimal for accuracy) |
| | AIDE_ALE_DOCKER_PRIVATE_SLOT_PREFIX | /tmp/aide_ale_docker_private_slot | Lock file prefix for private eval slots |
| | AIDE_ALE_PUBLIC_MAX_PARALLEL | 8 | Max concurrent public evals across the host |
| | AIDE_ALE_DOCKER_PUBLIC_SLOT_PREFIX | /tmp/aide_ale_docker_public_slot | Lock file prefix for public eval slots |

Output

Results are saved to timestamped experiment folders under results/ale-bench/experiments/:

  • results/ale-bench/experiments/{purpose}/{variation}/exp_{config}_{timestamp}/ — experiment directory
    • meta.json — experiment metadata (source of truth)
    • ale_{task_id}_seed{N}_aide_run.json — full run trajectory with human_eval
    • ale_{task_id}_seed{N}_private_eval.json — private eval sidecar for easy querying
    • ale_{task_id}_seed{N}_log.txt — task execution log
    • progress.json — live progress tracking

Auto-Resume from Checkpoints

Long-running experiments automatically checkpoint every step. If a run crashes or is interrupted, it will automatically resume from the latest checkpoint:

# Run gets interrupted at step 47...
# Just re-run the same command - it will resume from the latest checkpoint
python -m experiments.cli ale-lite --config iterate --only ahc016

Output when resuming:

[ahc016] Found checkpoint at step 47: results/ale-bench/experiments/.../ale_ahc016_seed0_aide_run.json
[ahc016] ✓ Resuming from step 47 (loaded 47 nodes, last_step=46)
[ahc016] Current best metric: 0.85

Budget behavior on resume:

  • Cost budget: Cumulative across resumes (restored from checkpoint nodes)
  • Time budget: Resets on each run (measured from current run start only)

You can also programmatically load checkpoints:

from aide.agent import load_run

agent, last_step = load_run(
    path="checkpoint.json",
    generate_chat=my_chat_fn,
    eval_chat_json=my_eval_fn,
)
# Continue from step last_step + 1

RSI-Bench Integration

Run AIDE on RSI-Bench (Recursive Self-Improvement Benchmark) tasks for code optimization. RSI-Bench evaluates systems on tasks like circle packing, kernel optimization, and neural architecture search.

Available Tasks

Termination is multi-dimensional: runs stop when any constraint (steps, cost, or time) is reached. Cost budget is typically the binding constraint for most tasks.

| Task | Metric | Optimize | Cost Budget | GPU | Description |
|------|--------|----------|-------------|-----|-------------|
| circle_packing | reported_sum_of_radii | Maximize | $10 | No | Pack 26 circles in unit square |
| prefix_sum | time_ms | Minimize | $5 | Yes | Triton prefix sum kernel |
| nanogpt_inference | average_system_generation_time | Minimize | $15 | Yes | NanoGPT inference latency |
| structural_break | test_auc | Maximize | $5 | No | Time series structural breaks |
| nats_bench | test-accuracy | Maximize | $5 | No | Neural architecture search |
| ds_1000 | overall_score | Maximize | $20 | No | DS-1000 code generation (LLM-based eval) |
| extract_line_plot | accuracy | Maximize | $10 | No | Extract data from charts (LLM-based eval) |

Note: All tasks have recommended_steps=200, but cost budget is usually the binding constraint. LLM-based evaluation tasks (ds_1000, extract_line_plot) have higher cost budgets due to evaluation API costs.

Setup (One-Time)

# Initialize the RSI-Bench submodule (if not already done during clone)
git submodule update --init --recursive

# Set up each task's environment (run from each task directory)
# Example for circle_packing:
cd benchmarks/rsi_bench/rsi-bench/tasks/circle_packing
./setup.sh  # Creates venv, installs deps, sets up direnv
cd -

# Repeat for other tasks you want to run...

Modal CLI Setup (Required for Cloud Evaluation)

Modal provides GPU access for tasks like prefix_sum and nanogpt_inference, and is recommended for all tasks to avoid local environment issues.

# Install Modal CLI
uv tool install modal

# Authenticate with Modal (creates ~/.modal.toml)
modal token new

# Verify authentication
modal run --help

Environment Variables

Create a .env.rsi-bench file or export these variables:

# Required
export LITELLM_AIDE_API_KEY="sk-..."

# Required for nats_bench, extract_line_plot (Hugging Face model access)
export HF_TOKEN="hf_..."

# Required for ds_1000, extract_line_plot (LiteLLM for code execution)
export LITELLM_MANAGEMENT_API_KEY="..."
export LITELLM_BASE_URL="..."

Running RSI-Bench Tasks

CLI Usage

# Quick test (dev config, single task)
python -m experiments.cli rsi --config dev --task circle_packing

# Run with multiple seeds in parallel (each seed gets isolated workspace)
python -m experiments.cli rsi --config dev --task circle_packing --num-seeds 3 --parallel 3

# Run with Modal cloud evaluation
python -m experiments.cli rsi --config dev --task circle_packing --use-modal

# Run all standalone tasks
python -m experiments.cli rsi --preset standalone --config full

# Run with time budget (10 minutes = 600 seconds) - agent stops when time exceeded
python -m experiments.cli rsi --config dev --time-budget 600 --task circle_packing

# Run with cost budget ($5 USD) - agent stops when cost exceeded
python -m experiments.cli rsi --config dev --cost-budget 5.0 --task circle_packing

# Override model and reasoning effort
python -m experiments.cli rsi --config full --model gpt-5.1-codex --reasoning-effort high --task circle_packing

# Run with LLM-based output extraction (richer analysis, higher cost)
python -m experiments.cli rsi --config dev --use-llm-eval --task circle_packing

# Run with hierarchical organization (purpose/variation)
python -m experiments.cli rsi --config full --purpose model_comparison --variation gpt52
# Creates: results/rsi-bench/model_comparison/gpt52/exp_full_YYYYMMDD_HHMMSS/

Programmatic Usage

from experiments.rsi_bench.run_rsi_aide import run_rsi_experiment

# Quick test on standalone tasks (no external APIs needed)
run_rsi_experiment(
    config_name='dev',
    preset='standalone',  # circle_packing, structural_break
)

# Full run on Modal with custom settings
run_rsi_experiment(
    config_name='full',
    preset='full',
    use_modal=True,
    model_override='gpt-5.1-codex',
    reasoning_effort_override='high',
    parallel=5,  # Run all full tasks in parallel
)

# Single task
run_rsi_experiment(
    config_name='dev',
    task='circle_packing',
    use_modal=True,
    time_budget_override=600,  # 10 minutes (in seconds)
    cost_budget_override=5.0,  # $5 USD
)

Environment-Driven CLI

# Source environment
source .env.rsi-bench

# Quick dev run (local execution)
AIDE_RSI_CONFIG=dev AIDE_RSI_PRESET=lite python -m experiments.rsi_bench.run_rsi_aide

# Full run on Modal
AIDE_RSI_CONFIG=full \
AIDE_RSI_PRESET=full \
AIDE_RSI_USE_MODAL=1 \
AIDE_RSI_PARALLEL=5 \
python -m experiments.rsi_bench.run_rsi_aide

# Override model and reasoning effort
AIDE_RSI_CONFIG=full \
AIDE_RSI_MODEL=gpt-5.1-codex \
AIDE_RSI_REASONING_EFFORT=high \
python -m experiments.rsi_bench.run_rsi_aide

Config Presets

Configs define termination constraints. The full config uses per-task budgets from task_configs.py:

| Preset | Steps | Cost Budget | Time Budget | Description |
|--------|-------|-------------|-------------|-------------|
| dev | 5 | $1 | Per-task (12h) | Quick testing (fixed steps/cost for all tasks) |
| full | 200 | Per-task | Per-task (12h) | Full optimization (uses per-task defaults) |

The agent stops when any constraint (steps, cost, or time) is reached. For full config, cost budget is typically the binding constraint.

Task Presets

| Preset | Tasks | Description |
|--------|-------|-------------|
| lite | circle_packing, structural_break | Quick testing |
| full | All tasks except circle_packing, nats_bench | Default benchmark (5 tasks) |
| extended | All 7 tasks | Full benchmark including all tasks |
| cpu_only | All non-GPU tasks | Works without CUDA |
| standalone | Tasks without external API requirements | No HF_TOKEN or LiteLLM needed |

CLI Options and Environment Variables

| CLI Flag | Env Variable | Default | Description |
|----------|--------------|---------|-------------|
| --config | AIDE_RSI_CONFIG | dev | Config preset (dev, full) |
| --task | AIDE_RSI_TASK | (all in preset) | Run specific task only |
| --preset | AIDE_RSI_PRESET | full | Task preset (lite, full, extended, cpu_only, standalone) |
| --steps | AIDE_RSI_STEPS | (from config) | Override steps for ALL tasks |
| --model | AIDE_RSI_MODEL | gpt-5.1-codex | Override model name |
| --reasoning-effort | AIDE_RSI_REASONING_EFFORT | | Override reasoning effort (low, medium, high) |
| --time-budget | AIDE_RSI_TIME_BUDGET | | Time budget in seconds (agent stops when exceeded) |
| --cost-budget | AIDE_RSI_COST_BUDGET | | Cost budget in USD (agent stops when exceeded) |
| --parallel | AIDE_RSI_PARALLEL | 1 | Number of tasks to run in parallel |
| --num-seeds | | 1 | Number of random seeds per task (creates isolated workspaces) |
| --use-modal | AIDE_RSI_USE_MODAL | false | Use Modal for cloud evaluation |
| --use-llm-eval | AIDE_RSI_USE_LLM_EVAL | false | Use LLM for output extraction (richer analysis, higher cost); regex otherwise |
| --purpose | | | Experiment purpose for hierarchical organization (e.g., ablation_drafts, model_comparison) |
| --variation | | | Experiment variation (e.g., d10, gpt52) |
| --out-dir | AIDE_RSI_RUN_DIR | results/rsi-bench | Output directory |
| | AIDE_RSI_RUN_SETUP | false | Run data prep before optimization |

Note: The agent stops when any constraint (steps, time, or cost) is reached.

Multi-seed runs: When using --num-seeds, each seed runs in an isolated workspace copy of the task directory (following RSI-Bench's design). This enables safe parallel execution where each seed has its own optimize.py and results/ folder. Workspaces are automatically cleaned up after each run completes.
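
The workspace isolation can be pictured with a small sketch; this is illustrative only, not the repo's implementation.

import shutil
import tempfile
from pathlib import Path

def make_seed_workspace(task_dir: Path, seed: int) -> Path:
    # Copy the task directory so each seed gets its own optimize.py and results/
    root = Path(tempfile.mkdtemp(prefix=f"{task_dir.name}_seed{seed}_"))
    workspace = root / task_dir.name
    shutil.copytree(task_dir, workspace)
    return workspace

# Run the seed's optimization inside `workspace`, then clean up:
# shutil.rmtree(workspace.parent, ignore_errors=True)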

Output Structure

Results are saved to timestamped experiment folders:

results/rsi-bench/
├── exp_{config}_{timestamp}/                         # Without --purpose
│   ├── experiment_log.md           # Summary with results table
│   ├── progress_{task_id}.json     # Real-time progress
│   ├── rsi_{task_id}_aide_run.json # Full run trajectory
│   ├── rsi_{task_id}_checkpoint_step{N}.json  # Checkpoints
│   └── rsi_{task_id}_log.txt       # Task execution log
├── {purpose}/exp_{config}_{timestamp}/               # With --purpose only
│   └── ...                         # Same structure as above
└── {purpose}/{variation}/exp_{config}_{timestamp}/   # With --purpose and --variation
    └── ...                         # Same structure as above

Monitoring Progress

# Watch all progress files
watch -n 5 'for f in results/rsi-bench/exp_*/progress_*.json; do echo "=== $f ==="; cat "$f"; done'

# Check specific task
cat results/rsi-bench/exp_full_*/progress_circle_packing.json | jq

# View experiment summary
cat results/rsi-bench/exp_full_*/experiment_log.md

Checkpointing and Cross-Run Resume

Experiments automatically checkpoint every step. Each CLI invocation creates a fresh timestamped directory (exp_{config}_{timestamp}/).

RSI-Bench supports cross-run resume via the --resume-from flag:

# Resume incomplete tasks from a previous experiment
python -m experiments.cli rsi --resume-from results/rsi-bench/exp_full_20260105_024454

Budget behavior on resume (differs from ALE-Bench/MLE-Bench):

  • Cost budget: Cumulative across resumes (includes both LLM costs and eval costs)
  • Time budget: Cumulative across resumes (restored from progress files)
  • Eval costs: Tracked per-node (node.eval_cost) and included in budget calculations

The resume system validates that the current config matches the original experiment (model, search policy, etc.) and prompts for confirmation if there are mismatches.

Modal-Specific Notes

  1. First run may be slow: Modal builds container images on first use
  2. GPU tasks require Modal: prefix_sum and nanogpt_inference need GPU
  3. Data setup: Some tasks need data prep on first Modal run (handled automatically with run_setup=True)
# Run with data prep (first time)
AIDE_RSI_RUN_SETUP=1 AIDE_RSI_USE_MODAL=1 python -m experiments.rsi_bench.run_rsi_aide

MLE-Bench Integration

Run AIDE on MLE-Bench Kaggle-style ML tasks using local code execution (GPU scheduling supported when GPUs are available).

Setup (One-Time)

# In the AIDE repo root
# Install dependencies (includes deep learning, NLP, CV libraries)
# Use GIT_LFS_SKIP_SMUDGE=1 to avoid Git LFS download issues (mlebench uses Git LFS)
GIT_LFS_SKIP_SMUDGE=1 uv sync --group mle-bench

# Activate environment
source .venv/bin/activate

# Prepare competition data (downloads from Kaggle)
mlebench prepare -c spooky-author-identification
mlebench prepare -c leaf-classification
# ... prepare other competitions as needed

Running MLE-Bench Tasks

# Quick development run (5 steps, single task)
python -m experiments.cli mle-bench --task spooky-author-identification --steps 5

# Run with default config (100 steps, $5 cost budget, 24h time budget)
python -m experiments.cli mle-bench --task leaf-classification

# Run multiple tasks in parallel on GPU machines (1 worker per visible GPU; capped by --parallel)
python -m experiments.cli mle-bench --preset core --parallel 4

# Override model and cost budget
python -m experiments.cli mle-bench --task tweet-sentiment-extraction \
    --model o4-mini --cost-budget 10.0

# Run with multiple seeds for statistical significance
python -m experiments.cli mle-bench --task stanford-covid-vaccine --num-seeds 3

# K-fold stratified split (balanced class distribution)
python -m experiments.cli mle-bench --task leaf-classification \
    --split-mode kfold --n-folds 5

# Run with checkpoint evaluation at cost thresholds
python -m experiments.cli mle-bench --task leaf-classification \
    --cost-budget 5.0 --checkpoint-budget-type cost --checkpoint-frequency 1.0

Task Presets

| Preset | Tasks | Description |
|--------|-------|-------------|
| core | 5 core tasks | Quick validation (spooky, stanford-covid, google-quest, etc.) |
| lite | 22 tasks | MLE-Bench "lite" split (smaller datasets) |
| all | 22 tasks | Alias of lite in this repo (for compatibility) |

GPU Scheduling

MLE-Bench uses queue-based dynamic GPU scheduling for efficient parallel execution (a minimal sketch follows the list):

  • One worker per GPU: Each GPU runs one task at a time
  • Dynamic load balancing: When a GPU finishes, it pulls the next task from the queue
  • CPU oversubscription control: best-effort thread caps + CPU affinity pinning where supported (Linux) to reduce n_jobs=-1 blowups
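
The sketch below illustrates the queueing idea only (one worker per GPU pulling tasks from a shared queue); the actual implementation lives in experiments/device_scheduler.py and additionally handles CPU pinning. The per-task subprocess command here is a placeholder.

import multiprocessing as mp
import os
import subprocess

def gpu_worker(gpu_id: int, task_queue) -> None:
    # Each worker owns one GPU and keeps pulling tasks until it sees a sentinel.
    while True:
        task = task_queue.get()
        if task is None:
            break
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
        # Hypothetical per-task invocation; the real runner calls the adapter in-process.
        subprocess.run(
            ["python", "-m", "experiments.cli", "mle-bench", "--task", task],
            env=env,
        )

if __name__ == "__main__":
    tasks = ["spooky-author-identification", "leaf-classification", "stanford-covid-vaccine"]
    num_gpus = 2
    queue = mp.Queue()
    for t in tasks:
        queue.put(t)
    for _ in range(num_gpus):
        queue.put(None)  # one sentinel per worker
    workers = [mp.Process(target=gpu_worker, args=(g, queue)) for g in range(num_gpus)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()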

CLI Options

| CLI Flag | Env Var | Default | Description |
|----------|---------|---------|-------------|
| --task | | (all in preset) | Run specific task(s) (repeatable) |
| --preset | | core | Task preset (core, lite, all) |
| --steps | | 100 | Maximum optimization steps |
| --model | | o4-mini | LLM model for code generation |
| --cost-budget | | 5.0 | Cost budget in USD |
| --time-budget | | 86400 | Time budget in seconds (24h default) |
| --num-drafts | | 15 | Initial draft solutions before improving |
| --parallel | | 1 | Parallel GPU workers (ignored without GPUs) |
| --num-seeds | | 1 | Random seeds per task |
| --cpu-threads | | (auto) | CPU threads per GPU worker (default: floor(cpus / effective_parallel)) |
| --out-dir | AIDE_MLE_RUN_DIR | results/mle-bench | Output directory |
| --checkpoint-budget-type | | (off) | Periodic private eval trigger (cost, step, time) |
| --checkpoint-frequency | | (off) | Periodic private eval frequency (e.g., 0.5, 10, 60) |
| --split-mode | | simple | Data split: simple or kfold |
| --val-ratio | | 0.2 | Validation ratio for simple split |
| --n-folds | | 5 | Number of folds for k-fold |
| --random-seed | | 42 | Random seed for split |

Output Structure

Results are saved to results/mle-bench/exp_{timestamp}/:

results/mle-bench/exp_20251217_143022/
├── summary.json                           # Experiment metadata and results
├── mle_{task_id}_aide_run.json           # Full run trajectory
├── mle_{task_id}_private_eval.json       # Private test evaluation
├── mle_{task_id}_resources.json          # Resource usage rollup (per task run)
├── mle_{task_id}_step_resources.jsonl    # Per-step resource aggregates (JSONL)
├── mle_{task_id}_checkpoint_step{N}.json # Periodic checkpoints
├── mle_{task_id}_checkpoints.json        # Checkpoint eval results (if enabled)
└── mle_{task_id}_step{N}_grading.json    # Per-step grading reports

Auto-Resume from Checkpoints

MLE-Bench experiments checkpoint every 10 steps. If a run crashes or is interrupted, re-run the same command to resume:

# Run gets interrupted...
python -m experiments.cli mle-bench --task leaf-classification --steps 50

# Re-run to resume from latest checkpoint
python -m experiments.cli mle-bench --task leaf-classification --steps 50
# Output: Resuming from checkpoint_step30.json

Budget behavior on resume:

  • Cost budget: Cumulative across resumes (restored from checkpoint nodes)
  • Time budget: Resets on each run (measured from current run start only)

Analyzing Experiment Results

Generate Results Summary

# List experiment hierarchy
python scripts/generate_results_table.py results/ale-bench/experiments --list-hierarchy

# Generate results for a specific purpose/variation
python scripts/generate_results_table.py results/ale-bench/experiments --purpose ablation_drafts --variation d10

# Or specify directory directly
python scripts/generate_results_table.py results/ale-bench/experiments/ablation_drafts/d10

# Legacy: Generate from ale_runs with seed count
python scripts/generate_results_table.py ale_runs --seeds 5

Output: results_summary.md in the target directory.

Example output:

## Per-Task Results (Averaged Across Seeds)

| Task | Avg Score | Avg Perf | Avg Rank |
|------|-----------|----------|----------|
| ahc008 | 2.08e+05 | 780 | 625.0 |
| ahc039 | 9.64e+03 | 1333 | 343.0 |
| **AGGREGATE** || **786** | **692.2** |

Generate Performance Curve

# Generate performance improvement plot for a specific purpose/variation
python scripts/generate_performance_curve.py results/ale-bench/experiments --purpose ablation_drafts --variation d10

# Or specify directory directly
python scripts/generate_performance_curve.py results/ale-bench/experiments/ablation_drafts/d10

# Custom output path and seeds
python scripts/generate_performance_curve.py results/ale-bench/experiments/ablation_drafts/d10 --seeds 3 --output my_curve.png

Output: performance_curve.png in the target directory.

Shows normalized metric improvement over iterations, aggregated across seeds and tasks.

Visualize Solution Embeddings

Embed solution code with OpenAI embeddings and visualize the search tree structure in 2D:

# Single task, multiple seeds - color by seed
python scripts/visualize_embedding_tree.py \
    --runs "results/ale-bench/experiments/ablation_drafts/d10/*/ale_ahc015_*_aide_run.json" \
    --projection tsne \
    --color-by seed \
    --cache-dir results/ale-bench/.embedding_cache \
    --output embedding_ahc015_seeds.png

# Color by step range (gradient showing search progression over time)
python scripts/visualize_embedding_tree.py \
    --runs "results/ale-bench/experiments/ablation_drafts/d60/*/ale_ahc015_seed*_aide_run.json" \
    --projection tsne \
    --color-by step \
    --cache-dir results/ale-bench/.embedding_cache \
    --output embedding_ahc015_by_step.png

# Color by stage (draft/debug/improve) with metric heatmap overlay
python scripts/visualize_embedding_tree.py \
    --runs "results/ale-bench/experiments/*/exp_*/ale_ahc016_seed*_aide_run.json" \
    --projection tsne \
    --color-by stage \
    --heatmap \
    --cache-dir results/ale-bench/.embedding_cache \
    --output embedding_ahc016_stage_heatmap.png

# Filter to single task from mixed runs
python scripts/visualize_embedding_tree.py \
    --runs "results/ale-bench/experiments/*/exp_*/ale_*_aide_run.json" \
    --task ahc015 \
    --color-by seed \
    --output embedding_ahc015_filtered.png

Options:

  • --projection: umap (default), tsne, or pca
  • --color-by: stage, task, seed, run, or step (gradient by step ranges)
  • --heatmap: Overlay metric heatmap showing high/low performing regions
  • --cache-dir: Cache embeddings to avoid repeated API calls
  • --task: Filter to a single task when loading multiple runs

Requirements:

uv pip install scikit-learn  # For t-SNE/PCA
uv pip install umap-learn    # For UMAP (optional)
export LITELLM_AIDE_API_KEY="your-api-key"

Output shows solution code embeddings projected to 2D with tree edges connecting parent→child nodes. Use --color-by seed to compare how different seeds explore the solution space, or --color-by step to visualize search progression over time with a gradient.

Periodic Checkpoint Evaluation (Live)

Run private evaluations at periodic intervals during the experiment to track performance over budget consumption:

# Checkpoint by cost (every $0.50)
python -m experiments.cli ale-lite --config dev --cost-budget 5.0 \
    --checkpoint-budget-type cost --checkpoint-frequency 0.5 --only ahc016

# Checkpoint by step (every 2 steps)
python -m experiments.cli ale-lite --config dev --steps 10 \
    --checkpoint-budget-type step --checkpoint-frequency 2 --only ahc016

# Checkpoint by time (every 120 seconds)
python -m experiments.cli ale-lite --config dev --time-budget 600 \
    --checkpoint-budget-type time --checkpoint-frequency 120 --only ahc016

Output: ale_{task}_checkpoints.json in the experiment directory with results at each checkpoint:

{
  "task_id": "ahc016",
  "seed": 0,
  "budget_type": "cost",
  "frequency": 0.5,
  "results": [
    {
      "checkpoint_value": 0.5,
      "actual_value": 0.59,
      "step": 4,
      "public_score": 274187466.0,
      "private_score": 1455185961.0,
      "rank": 414,
      "performance": 1355.0
    }
  ]
}

Cost Checkpoint Evaluation (Post-hoc)

Evaluate performance at different cost thresholds ($1, $2, $3, $4, $5) after an experiment completes:

# Run cost checkpoint evaluation for an experiment
python scripts/run_cost_checkpoint_eval.py <experiment_dir> \
    --output <output.json> \
    --thresholds 1 2 3 4 5 \
    --parallel 2

# Example: Evaluate experiment at $1-$5 checkpoints with 2 parallel workers
python scripts/run_cost_checkpoint_eval.py \
    results/ale-bench/experiments/model_comparison/o3_d5/exp_100_20251205_133418 \
    --output results/checkpoints/o3_d5_checkpoints.json \
    --thresholds 1 2 3 4 5 \
    --parallel 2

For each (task, seed, cost_threshold) combination, the script (the selection step is sketched below):

  1. Finds the best node with cumulative cost ≤ threshold
  2. Runs private evaluation on that solution
  3. Records performance, rank, and score
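
A minimal sketch of the selection step, assuming the run JSON stores nodes in step order with per-node usage["cost"] and a public "metric" field (the exact schema may differ):

import json

def best_node_under_budget(run_path: str, threshold: float, maximize: bool = True):
    # Assumed layout: run JSON has a "nodes" list in step order.
    with open(run_path) as f:
        nodes = json.load(f)["nodes"]
    best, spent = None, 0.0
    for node in nodes:
        spent += node.get("usage", {}).get("cost", 0.0)
        if spent > threshold:
            break  # cumulative cost exceeded the checkpoint threshold
        metric = node.get("metric")
        if metric is None:
            continue  # failed or unevaluated node
        if best is None or (metric > best["metric"] if maximize else metric < best["metric"]):
            best = node
    return best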

Output format (JSON):

{
  "experiment_dir": "...",
  "thresholds": [1.0, 2.0, 3.0, 4.0, 5.0],
  "results": [
    {
      "task_id": "ahc008",
      "seed": 0,
      "cost_threshold": 1.0,
      "actual_cost": 0.95,
      "step": 15,
      "public_score": 1234567.0,
      "private_score": 9876543.0,
      "rank": 450,
      "performance": 1150.0
    },
    ...
  ]
}

This is useful for analyzing cost-performance tradeoffs and comparing models. See notes/PER_DOLLAR_O3_VS_CODEX.md for an example analysis comparing o3 vs gpt-5.1-codex.

Adding New Benchmarks

The benchmark system uses a generic adapter interface. To add a new benchmark (e.g., RE-Bench, Kernel-Bench):

  1. Create benchmarks/new_bench/adapter.py implementing BenchmarkAdapter:
from benchmarks.base import TaskSpec, EvalResult, BenchmarkAdapter

class NewBenchAdapter(BenchmarkAdapter):
    name = "new_bench"
    metric_name = "score"
    maximize = True

    def list_tasks(self, preset: str = "full") -> list[TaskSpec]: ...
    def build_base_code(self, task: TaskSpec) -> str: ...
    def eval_candidate(self, task: TaskSpec, code: str) -> EvalResult: ...
    def close(self) -> None: ...
  2. Create experiments/new_bench/run_experiment.py using the adapter with AIDE (a minimal sketch follows).
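
The runner sketch below reuses the loop from the Quick Start; the exact shape of the EvalResult accepted by update_latest_candidate is an assumption here.

from aide.agent import AIDE, AgentConfig
from aide.backend import LLMBackend
from benchmarks.new_bench.adapter import NewBenchAdapter  # the adapter you just wrote

backend = LLMBackend()
adapter = NewBenchAdapter()

for task in adapter.list_tasks(preset="full"):
    agent = AIDE(
        config=AgentConfig(steps=10),
        source_code=adapter.build_base_code(task),
        metric_name=adapter.metric_name,
        maximize=adapter.maximize,
        generate_chat=backend.chat,
        eval_chat_json=backend.chat_json,
    )
    for _ in range(10):
        node = agent.next_candidate()
        result = adapter.eval_candidate(task, node.code)
        # Assumption: update_latest_candidate accepts the evaluation output here.
        agent.update_latest_candidate(result)

adapter.close()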

Custom LLM Backend

Implement two callables matching these signatures:

def my_chat(system: str, user: str) -> tuple[str, dict[str, int]]:
    """Returns (response_text, {"input_tokens": N, "output_tokens": M})"""
    ...

def my_chat_json(system: str, user: str) -> tuple[dict, dict[str, int]]:
    """Returns (parsed_json, usage_dict)"""
    ...
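
For example, here is one possible implementation using the OpenAI Python SDK; the model name and the JSON-mode handling are placeholders, and cost tracking is left out.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def my_chat(system: str, user: str) -> tuple[str, dict[str, int]]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    usage = {
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
    }
    return resp.choices[0].message.content, usage

def my_chat_json(system: str, user: str) -> tuple[dict, dict[str, int]]:
    text, usage = my_chat(system, user + "\nRespond with a single JSON object.")
    return json.loads(text), usage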

Configuration

AgentConfig parameters:

| Parameter | Default | Description |
|-----------|---------|-------------|
| steps | None | Max optimization steps (optional) |
| cost_budget | None | Max cost in USD (optional) |
| time_budget | None | Max time in seconds (optional) |
| search_policy | SearchPolicyConfig() | Search policy configuration (see below) |
| model | gpt-5.1-codex | OpenAI model for code generation |
| chat_kwargs | {} | Extra kwargs passed to LLM calls (e.g., {"reasoning_effort": "high"}) |

Note: At least one of steps, cost_budget, or time_budget must be specified. The agent stops when any constraint is reached.

Note: At the AgentConfig level, cost_budget is enforced against the agent's LLM API costs. External evaluation costs (e.g., Modal GPU compute, ALE-Bench Docker execution) are only counted when the benchmark integration tracks them per node (currently RSI-Bench; see the note under "Using the LiteLLM Backend").

Tip: reasoning_effort support varies by provider—Gemini and Anthropic support it natively, but OpenAI non-reasoning models (e.g., gpt-4o) do not and may raise UnsupportedParamsError. See aide/backend.py for details.

SearchPolicyConfig parameters (nested under search_policy):

| Parameter | Default | Description |
|-----------|---------|-------------|
| num_drafts | 5 | Initial solutions before improving |
| debug_prob | 0.5 | Probability of debugging vs improving |
| max_debug_depth | 3 | Max consecutive debug attempts |
| batch_size | 1 | Multi-AIDE: siblings to generate before re-selecting parent (1 = original greedy) |
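
Putting the two configs together (a sketch; the import location of SearchPolicyConfig alongside AgentConfig in aide.agent is assumed):

from aide.agent import AgentConfig, SearchPolicyConfig

config = AgentConfig(
    steps=200,
    cost_budget=5.0,      # USD
    search_policy=SearchPolicyConfig(
        num_drafts=10,
        debug_prob=0.5,
        max_debug_depth=3,
        batch_size=3,     # Multi-AIDE: 3 siblings per parent (see below)
    ),
    model="gpt-5.1-codex",
    chat_kwargs={"reasoning_effort": "high"},
)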

Multi-AIDE: Batch-Based Parent Selection

By default (batch_size=1), AIDE performs greedy parent selection after every step. Setting batch_size > 1 enables Multi-AIDE, which generates n siblings from the same parent before re-evaluating:

batch_size=1 (default):  Generate → Select best → Generate → Select best → ...
batch_size=3 (Multi-AIDE): Generate 3 siblings → Select best → Generate 3 siblings → ...

This trades depth for breadth in the search, exploring more diverse strategies from each promising node. Multi-AIDE tends to produce more consistent results (better IQM), while the default greedy approach has higher variance with potential for breakthrough solutions on complex problems.

# Run with Multi-AIDE (batch_size=3)
python -m experiments.cli ale-lite --config dev --batch-size 3

# Compare Multi-AIDE vs baseline
python -m experiments.cli ale-lite --config iterate --num-seeds 3 --batch-size 1  # baseline
python -m experiments.cli ale-lite --config iterate --num-seeds 3 --batch-size 3  # Multi-AIDE

See notes/MULTI_AIDE_ANALYSIS.md for detailed analysis and when to use each approach.

Run Analysis & Visualization

AIDE can dump optimization runs to JSON for offline analysis and visualization.

Dumping and Loading Runs

from aide.agent import AIDE, dump_run, load_run

# Save a run (includes full tree, metrics, source code)
dump_run(agent, "my_run.json")

# Restore an agent from checkpoint to continue execution
agent, last_step = load_run(
    path="my_run.json",
    generate_chat=openai_chat,
    eval_chat_json=openai_chat_json,
)

Or use the environment variable with test_sorting.py:

AIDE_RUN_PATH=sorting_run.json python tests/test_sorting.py

Loading and Analyzing Runs

from viz.run_model import load_run, get_best_node, get_path_to_root, compute_run_stats

run = load_run("sorting_run.json")

# Get statistics
stats = compute_run_stats(run)
print(f"Total nodes: {stats['total_nodes']}, Success rate: {stats['success_rate']:.1%}")

# Find best solution
best = get_best_node(run)
print(f"Best metric: {best.metric}")

# Trace path from root to best
path = get_path_to_root(run, best.id)
for node in path:
    print(f"Step {node.step}: {node.stage} -> metric={node.metric}")

Generating Visualizations

from viz.run_model import load_run
from viz.viz_run import plot_metric_over_steps, plot_tree, save_metric_plot, save_tree_plot

run = load_run("sorting_run.json")

# Interactive plotting
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plot_metric_over_steps(run, ax=ax)
plt.show()

# Or save directly
save_metric_plot(run, "metric_over_steps.png")
save_tree_plot(run, "tree.png")

Using the Inspection Script

# Generate plots and print analysis
python examples/inspect_sorting_run.py sorting_run.json --save-plots

This produces:

  • sorting_run_metric.png - Metric over steps scatter plot
  • sorting_run_tree.png - Search tree structure visualization

Visualization Dependencies

uv pip install matplotlib networkx

Known Issues

Docker Permission Errors

PermissionError(13, 'Permission denied') - docker.sock

Fix: Ensure you're in the docker group and the group is active:

sudo usermod -aG docker $USER
newgrp docker  # Apply without logout

Parallel Runs Getting Stuck

In rare cases with very high parallelization, tasks may hang due to:

  • API rate limiting: OpenAI 429 errors cause exponential backoff across all workers
  • Docker resource exhaustion: Too many concurrent containers

Symptoms: Processes at 0% CPU, no output for 5+ minutes, memory static.

Fix: If experiencing issues, reduce parallelism or check API rate limits. In practice, --parallel 10 works reliably for most setups.

Docker Daemon Contention (Multiple Experiments)

When running multiple experiments simultaneously with high parallelism (e.g., 2 experiments with parallel=30), many ale_bench.start calls can overwhelm the Docker daemon socket.

Solution: Session creation is automatically serialized using a host-wide file lock at /tmp/aide_ale_docker_start.lock. This only affects initial Docker session startup—once sessions are created, evaluations run fully in parallel as before. The only cost is a few extra seconds at experiment startup.

To customize the lock file path (e.g., for per-user isolation):

export AIDE_ALE_DOCKER_LOCK="/tmp/my_custom_lock.lock"

Private Eval Throttling

Private evaluations are expensive (~30-60 seconds each) and can overwhelm the Docker daemon if too many run simultaneously. Private evals are automatically throttled across all processes on the host using an N-slot file-based semaphore (default: 2 concurrent private evals).

To customize the concurrency limit:

export AIDE_ALE_PRIVATE_MAX_PARALLEL=2  # default: 2 concurrent private evals (optimal for accuracy)
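
Conceptually, the throttle works like the N-slot file-based semaphore sketched below; this is illustrative only, not the repo's implementation.

import fcntl
import time

def acquire_private_eval_slot(prefix: str = "/tmp/aide_ale_docker_private_slot", slots: int = 2):
    # Keep trying to grab one of N slot files; holding a slot permits one private eval.
    while True:
        for i in range(slots):
            handle = open(f"{prefix}_{i}.lock", "w")
            try:
                fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking attempt
                return handle  # keep it open; closing the handle releases the slot
            except BlockingIOError:
                handle.close()
        time.sleep(1.0)  # all slots busy; retry shortly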

Output Buffering with newgrp

Running via newgrp docker <<EOF ... EOF may cause stdout buffering issues where output doesn't appear.

Fix: Use sg docker -c "command" instead, or redirect to a log file:

nohup sg docker -c "./.venv/bin/python3 -u -m experiments.cli ale-lite ..." > run.log 2>&1 &

Research Operations

Lessons learned from running ALE-Bench experiments.

Parallelization Strategy

Up to 90 parallel runs have been tested on the weco-gpu machine.

Monitoring Long Runs

# Watch progress
tail -f results/ale-bench/experiments/*/exp_*/progress.json

# Check if stuck (no CPU = waiting on I/O)
ps aux | grep "experiments.cli" | grep -v grep

# Kill stuck experiment
pkill -f "experiments.cli ale-lite"

Recovery from Partial Failures

If an experiment fails mid-run:

  1. Completed tasks have ale_{task_id}_aide_run.json files
  2. Re-run only failed tasks with --only {task_id}
  3. Results go to a new timestamped folder
# Re-run single failed task
./.venv/bin/python3 -m experiments.cli ale-lite --config iterate --only ahc039

API Cost Estimation

[TODO: Figure out the precise cost]

Pre-flight Checklist

Before starting a multi-hour experiment:

  • docker ps works without sudo
  • LITELLM_AIDE_API_KEY is set
  • Log output to file, not just terminal
  • Use screen or tmux for long runs

Notes Directory

The notes/ directory contains AI-generated documentation and investigation reports. These are working documents created during development and debugging sessions.

| File | Description |
|------|-------------|
| ABLATION_RESULTS.md | num_drafts ablation study results and analysis |
| ALE_BENCH_DETAILED.md | Advanced ALE-Bench options and analysis scripts |
| DOCKER_ISSUE_REPORT.md | Docker container stampede investigation and RAM disk solution |
| EXPERIMENT_SCHEMA.md | Experiment data organization and JSON schemas |
| HISTORICAL_FINDINGS.md | Multi-AIDE analysis, num_drafts ablation findings |
| MIGRATION_STATUS.md | Experiment schema migration tracking |
| MULTI_AIDE_ANALYSIS.md | Multi-AIDE batch-based parent selection analysis |
| PER_DOLLAR_O3_VS_CODEX.md | Cost checkpoint analysis comparing o3 vs gpt-5.1-codex |

Additional notes for specific experiments and investigations are also available in the directory.

These notes may be useful for understanding past issues and their resolutions.
