diff --git a/M0_trace_bench_plan.md b/M0_trace_bench_plan.md
new file mode 100644
index 0000000..fc45092
--- /dev/null
+++ b/M0_trace_bench_plan.md
@@ -0,0 +1,476 @@
+# M0 — Trace‑Bench Technical Plan (Two Approaches + Explicit Acceptance)
+
+## Milestone definitions (fixed)
+
+- **M0**: Technical plan + locked contracts (**this document**).
+- **M1**: **Full Trace‑Bench API implementation** + **minimal runnable coverage** for each bench type, trainer type, and parameter/optimization-target type.
+- **M2**: **Full coverage** across benches/trainers/tasks/parameters with **efficient** matrix execution + aggregation (meet coverage targets).
+- **M3**: **UI + MLflow + TensorBoard + Gradio** (operational wiring, not placeholders).
+- **M4**: **GEPA + Curriculum** trainers integration.
+
+---
+
+## Executive Summary
+
+This M0 provides two implementation approaches (A/B) and locks the core contracts needed to de-risk M1–M4:
+
+- Run/job identity (no overwrites; deterministic job IDs)
+- Matrix semantics (tasks × trainers × parameter variants × seeds)
+- Canonical artifacts schema (the filesystem is the source of truth)
+- Task & trainer discovery contracts (stable IDs + validation)
+- Parameter pass‑through contracts (compatible with the Trace/OpenTrace Trainer/Optimizer/Guide/Logger APIs)
+- Security rules for environment capture (allowlist + redaction)
+- All validation logic/config/notebooks are delivered and reviewed via **PRs to the Trace-Bench repo** (no out-of-band validation).
+
+Trace‑Bench **does not implement trainers**: it **consumes** trainer algorithms from Trace/OpenTrace and focuses on orchestration, validation, reproducibility, and reporting.
+
+---
+
+## 0) Purpose of M0
+
+M0 is **not** an implementation milestone. It is a **contract-locking** milestone:
+
+- Define the plan variants, acceptance targets, and validation approach
+- Lock the run/job/matrix/artifacts/discovery/security contracts so M1 can implement with minimal ambiguity
+- Ensure the plan is aligned with current Trace‑Bench + Trace code realities (the trainer.train API, task loaders, LLM4AD runner knobs)
+
+---
+
+## 1) Two Plan Variants (Pick A or B)
+
+Both variants respect the fixed milestone definitions above. The difference is how much breadth is proven **early in M1** vs deferred to M2.
+
+### Plan A — Lean / staged (recommended: “Plan A+”, a Pareto pick from Plan B)
+Plan A+ keeps the smaller M1 surface but **adds a bounded compatibility harness** (low cost / high gain).
+
+- **M1**: Implement the full API + prove minimal runnable coverage **and** a bounded matrix smoke:
+  - 1 internal example task bundle
+  - 1 LLM4AD task bundle
+  - 1 VeriBench task bundle **if** the entrypoint is available (otherwise “skipped with reason” is valid)
+  - OpenTrace examples smoke (import/`--help`) wired in CI
+  - Run each supported trainer at least once with at least one non-default parameter
+  - **Minimal matrix smoke (bounded):** 2 tasks × 2 trainers × 1 seed (4 jobs) end-to-end
+  - **Edge hardening:** `validate --strict` fails fast on unknown kwargs, missing trainable params, and task build errors
+- **M2**: Expand to full coverage targets + efficiency improvements (parallelism, resume, aggregation)
+- **M3**: UI + MLflow/TB + Gradio
+- **M4**: GEPA + Curriculum
+
+### Plan B — Compatibility‑first (more breadth earlier, higher integration risk)
+- **M1**: Everything in Plan A+, plus higher-risk breadth:
+  - broader early discovery/coverage (especially the VeriBench task inventory)
+  - larger matrices (more tasks/trainers/seeds) earlier
+- **M2–M4**: same as Plan A
+
+**Trade-off**: Plan A+ minimizes early integration complexity while still catching most incompatibilities early (via strict validate + the 2×2 smoke). Plan B buys earlier breadth at the cost of significantly higher M1 integration risk.
+
+---
+
+## 2) Scope and Coverage Targets (Explicit)
+
+### LLM4AD coverage (M2 acceptance target)
+- Target: **≥80% functional in Real mode** over the *currently discovered LLM4AD task inventory*.
+- “Functional” means:
+  - the job runs to completion (or hits the configured timeout) **without crashing the runner**
+  - at least one training/optimization step is executed
+  - canonical artifacts are written (`results.json`, `events.jsonl`, `results.csv`, `summary.json`)
+- “Optimizing” means (M2):
+  - on a defined subset of tasks, **best_score > initial_score + ε** (ε default: 1e‑9) OR a non‑finite → finite score transition.
+
+Proposed “optimizing” subset (10 tasks; tunable):
+- `circle_packing`
+- `optimization_knapsack_construct`
+- `optimization_bp_1d_construct`
+- `optimization_tsp_construct`
+- `optimization_cvrp_construct`
+- `optimization_jssp_construct`
+- `optimization_qap_construct`
+- `optimization_set_cover_construct`
+- `optimization_vrptw_construct`
+- `optimization_cflp_construct`
+
+### VeriBench coverage (M2 acceptance target)
+- Target: **≥80% functional in Real mode** over the *currently discovered VeriBench task inventory*.
+- Dependency: the Trace team provides the canonical entrypoint/task index (kept in scope).
+- If the entrypoint is not available in a given environment, jobs must be marked `skipped` with a structured reason (not a crash).
+
+### OpenTrace examples (CI enforced from M1; acceptance target in M2)
+- Target (SMART): **no unexpected failures in CI / 100% of smoke tests pass**.
+- Rule: every example must be either:
+  - **PASS**: it imports (and `--help` works, if applicable) under `TRACE_BENCH_SMOKE=1`, or
+  - **EXCEPTIONAL SKIP (explicit)**: listed in a small `smoke_skip_allowlist.yaml` with a clear reason (missing optional dependency/dataset/credential).
+- This prevents “silent skips” while keeping CI lightweight.
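+
+As an illustrative sketch only (the allowlist schema is not locked by this plan; the file name comes from the rule above, but the `skips`/`example`/`reason` keys and the listed paths are assumptions):
+
+```yaml
+# smoke_skip_allowlist.yaml — hypothetical shape: one entry per allowed skip,
+# each with an explicit, reviewable reason. Keys and paths are illustrative.
+skips:
+  - example: examples/some_rag_example.py      # hypothetical path
+    reason: "optional dependency: faiss not installed in CI"
+  - example: examples/some_azure_example.py    # hypothetical path
+    reason: "missing credential: AZURE_OPENAI_API_KEY"
+```
+
+Under this shape, any example not listed must PASS the import/`--help` smoke under `TRACE_BENCH_SMOKE=1`, so unlisted skips surface as CI failures rather than silent skips.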
+
+- CI requirement:
+  - import every script in `OpenTrace/examples/` in a bounded subprocess
+  - for argparse scripts: run `python 
\n",
        "    \n",
        "\n",
        "\n",
        "    \n",
        "    \n"
       ],
       "application/vnd.google.colaboratory.intrinsic+json": {
        "type": "dataframe",
        "summary": "{\n \"name\": \"pd\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"timestamp\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"2026-02-09T10:30:50.025240Z\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"task\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"example:greeting_stub\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"trainer\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"PrioritySearch\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"status\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"train_error\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 1.0,\n \"max\": 1.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"feedback\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"Correct\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
       }
      },
      "metadata": {},
      "execution_count": 12
     }
    ],
    "source": [
     "# Inspect latest run artifacts\n",
     "import glob, json, pathlib, pandas as pd\n",
     "\n",
     "latest = sorted(glob.glob(f\"{RUNS_DIR}/*\"))[-1]\n",
     "p = pathlib.Path(latest)\n",
     "print(p)\n",
     "\n",
     "print((p / \"config.snapshot.yaml\").read_text()[:400])\n",
     "print(json.loads((p / 
\"env.json\").read_text()).keys())\n", + "\n", + "pd.read_csv(p / \"results.csv\").head()" + ], + "id": "1tLDrSpE8uav" + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "mjSpgCdw8uaw", + "outputId": "a8ca2f75-6410-4f20-f077-14f735fce522", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Warning: You are using PrioritySearch trainer, which is an experimental feature. Please report any issues you encounter.\n" + ] + } + ], + "source": [ + "%%bash\n", + "cd /content/Trace-Bench\n", + "\n", + "# Optional: external LLM4AD smoke (may yield low score if template fails)\n", + "cat > configs/smoke_llm4ad.yaml <<'YAML'\n", + "runs_dir: runs\n", + "mode: stub\n", + "seed: 123\n", + "tasks:\n", + " - circle_packing\n", + "trainers:\n", + " - PrioritySearch\n", + "YAML\n", + "\n", + "PYTHONPATH=/content/OpenTrace:$PYTHONPATH python -m trace_bench run --config configs/smoke_llm4ad.yaml --runs-dir \"$RUNS_DIR\"" + ], + "id": "mjSpgCdw8uaw" + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zRzh_Gpr8uaw" + }, + "source": [ + "## Real LLM (requires API key)\n", + "\n", + "Add `OPENAI_API_KEY` in **Colab Secrets** and run the cells below." 
+ ], + "id": "zRzh_Gpr8uaw" + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "1trklSUW8uax" + }, + "outputs": [], + "source": [ + "# Load API key from Colab Secrets\n", + "from google.colab import userdata\n", + "import os\n", + "\n", + "key = userdata.get(\"OPENAI_API_KEY\")\n", + "if not key:\n", + " raise RuntimeError(\"Missing OPENAI_API_KEY secret in Colab\")\n", + "\n", + "os.environ[\"OPENAI_API_KEY\"] = key\n", + "os.environ[\"TRACE_DEFAULT_LLM_BACKEND\"] = \"LiteLLM\"\n", + "os.environ[\"TRACE_LITELLM_MODEL\"] = \"gpt-4o-mini\"" + ], + "id": "1trklSUW8uax" + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "eVjhvIwF8uax", + "outputId": "c11ca237-b974-45d1-808d-cfca56f0feef", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Warning: You are using PrioritySearch trainer, which is an experimental feature. Please report any issues you encounter.\n" + ] + } + ], + "source": [ + "%%bash\n", + "cd /content/Trace-Bench\n", + "\n", + "# Real-LLM smoke (internal example task)\n", + "PYTHONPATH=/content/OpenTrace:$PYTHONPATH python -m trace_bench run --config configs/smoke_real.yaml --runs-dir \"$RUNS_DIR\"" + ], + "id": "eVjhvIwF8uax" + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "I_mtiks-8uax", + "outputId": "97c370e7-8fbf-4477-e48a-9a7372091f19", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "......................... 
[100%]\n", + "=============================== warnings summary ===============================\n", + "tests/test_lite_optimize_llm4ad.py::test_lite_optimize_llm4ad_task[task0]\n", + "tests/test_lite_optimize_llm4ad.py::test_lite_optimize_llm4ad_task[task1]\n", + " /usr/local/lib/python3.12/dist-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:\n", + " PydanticSerializationUnexpectedValue(Expected 9 fields but got 6: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='{\\n\"reas...: None}, annotations=[]), input_type=Message])\n", + " PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...ider_specific_fields={}), input_type=Choices])\n", + " return self.__pydantic_serializer__.to_python(\n", + "\n", + "-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html\n", + "25 passed, 2 warnings in 13.44s\n" + ] + } + ], + "source": [ + "%%bash\n", + "cd /content/Trace-Bench\n", + "\n", + "# Pytest (LLM4AD optimizer test runs only if OPENAI_API_KEY is set)\n", + "PYTHONPATH=/content/OpenTrace:$PYTHONPATH python -m pytest -q" + ], + "id": "I_mtiks-8uax" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10" + }, + "colab": { + "provenance": [] + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file