diff --git a/M0_trace_bench_plan.md b/M0_trace_bench_plan.md
new file mode 100644
index 0000000..fc45092
--- /dev/null
+++ b/M0_trace_bench_plan.md
@@ -0,0 +1,476 @@
+# M0 — Trace‑Bench Technical Plan (Two Approaches + Explicit Acceptance)
+
+## Milestone definitions (fixed)
+
+- **M0**: Technical plan + locked contracts (**this document**).
+- **M1**: **Full Trace‑Bench API implementation** + **minimal runnable coverage** for each bench type, trainer type, and parameter/optimization-target type.
+- **M2**: **Full coverage** across benches/trainers/tasks/parameters with **efficient** matrix execution + aggregation (meet coverage targets).
+- **M3**: **UI + MLflow + TensorBoard + Gradio** (operational wiring, not placeholders).
+- **M4**: **GEPA + Curriculum** trainers integration.
+
+---
+
+## Executive Summary
+
+This M0 provides two implementation approaches (A/B) and locks the core contracts needed to de-risk M1–M4:
+
+- Run/job identity (no overwrites; deterministic job IDs)
+- Matrix semantics (tasks × trainers × parameter variants × seeds)
+- Canonical artifacts schema (the filesystem is the source of truth)
+- Task & trainer discovery contracts (stable IDs + validation)
+- Parameter pass‑through contracts (compatible with the Trace/OpenTrace Trainer/Optimizer/Guide/Logger APIs)
+- Security rules for environment capture (allowlist + redaction)
+- All validation logic/config/notebooks are delivered and reviewed via **PRs to the Trace-Bench repo** (no out-of-band validation).
+
+Trace‑Bench **does not implement trainers**: it **consumes** trainer algorithms from Trace/OpenTrace and focuses on orchestration, validation, reproducibility, and reporting.
+
+---
+
+## 0) Purpose of M0
+
+M0 is **not** an implementation milestone. It is a **contract-locking** milestone:
+
+- Define the plan variants, acceptance targets, and validation approach
+- Lock the run/job/matrix/artifacts/discovery/security contracts so M1 can implement with minimal ambiguity
+- Ensure the plan is aligned with current Trace‑Bench + Trace code realities (the trainer.train API, task loaders, LLM4AD runner knobs)
+
+---
+
+## 1) Two Plan Variants (Pick A or B)
+
+Both variants respect the fixed milestone definitions above. The difference is how much breadth is proven **early in M1** vs deferred to M2.
+
+### Plan A — Lean / staged (recommended: “Plan A+”, a Pareto pick from Plan B)
+Plan A+ keeps the smaller M1 surface but **adds a bounded compatibility harness** (low cost / high gain).
+
+- **M1**: Implement the full API + prove minimal runnable coverage **and** a bounded matrix smoke:
+  - 1 internal example task bundle
+  - 1 LLM4AD task bundle
+  - 1 VeriBench task bundle **if** the entrypoint is available (otherwise “skipped with reason” is valid)
+  - OpenTrace examples smoke (import/`--help`) wired in CI
+  - Run each supported trainer at least once with at least one non-default parameter
+  - **Minimal matrix smoke (bounded):** 2 tasks × 2 trainers × 1 seed (4 jobs) end-to-end
+  - **Edge hardening:** `validate --strict` fails fast on unknown kwargs, missing trainable params, and task build errors
+- **M2**: Expand to full coverage targets + efficiency improvements (parallelism, resume, aggregation)
+- **M3**: UI + MLflow/TB + Gradio
+- **M4**: GEPA + Curriculum
+
+### Plan B — Compatibility‑first (more breadth earlier, higher integration risk)
+- **M1**: Everything in Plan A+, plus higher-risk breadth:
+  - broader early discovery/coverage (especially the VeriBench task inventory)
+  - larger matrices (more tasks/trainers/seeds) earlier
+- **M2–M4**: same as Plan A
+
+**Trade-off**: Plan A+ minimizes early integration complexity while still catching most incompatibilities early (via strict validate + the 2×2 smoke). Plan B buys earlier breadth at the cost of significantly higher M1 integration risk.
+
+---
+
+## 2) Scope and Coverage Targets (Explicit)
+
+### LLM4AD coverage (M2 acceptance target)
+- Target: **≥80% functional in Real mode** over the *currently discovered LLM4AD task inventory*.
+- “Functional” means:
+  - the job runs to completion (or hits the configured timeout) **without crashing the runner**
+  - at least one training/optimization step is executed
+  - canonical artifacts are written (`results.json`, `events.jsonl`, `results.csv`, `summary.json`)
+- “Optimizing” means (M2):
+  - on a defined subset of tasks, **best_score > initial_score + ε** (ε default: 1e‑9) OR a non‑finite → finite score transition.
+
+Proposed “optimizing” subset (10 tasks; tunable):
+- `circle_packing`
+- `optimization_knapsack_construct`
+- `optimization_bp_1d_construct`
+- `optimization_tsp_construct`
+- `optimization_cvrp_construct`
+- `optimization_jssp_construct`
+- `optimization_qap_construct`
+- `optimization_set_cover_construct`
+- `optimization_vrptw_construct`
+- `optimization_cflp_construct`
+
+### VeriBench coverage (M2 acceptance target)
+- Target: **≥80% functional in Real mode** over the *currently discovered VeriBench task inventory*.
+- Dependency: the Trace team provides the canonical entrypoint/task index (kept in scope).
+- If the entrypoint is not available in a given environment, jobs must be marked `skipped` with a structured reason (not a crash).
+
+### OpenTrace examples (CI enforced from M1; acceptance target in M2)
+- Target (SMART): **no unexpected failures in CI / 100% of smoke tests pass**.
+- Rule: every example must be either:
+  - **PASS**: it imports (and `--help` works, if applicable) under `TRACE_BENCH_SMOKE=1`, or
+  - **EXCEPTIONAL SKIP (explicit)**: listed in a small `smoke_skip_allowlist.yaml` with a clear reason (missing optional dependency/dataset/credential).
+- This prevents “silent skips” while keeping CI lightweight.
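+
+As an illustrative sketch only (the allowlist schema is not locked by this plan; the file name comes from the rule above, but the `skips`/`example`/`reason` keys and the listed paths are assumptions):
+
+```yaml
+# smoke_skip_allowlist.yaml — hypothetical shape: one entry per allowed skip,
+# each with an explicit, reviewable reason. Keys and paths are illustrative.
+skips:
+  - example: examples/some_rag_example.py      # hypothetical path
+    reason: "optional dependency: faiss not installed in CI"
+  - example: examples/some_azure_example.py    # hypothetical path
+    reason: "missing credential: AZURE_OPENAI_API_KEY"
+```
+
+Under this shape, any example not listed must PASS the import/`--help` smoke under `TRACE_BENCH_SMOKE=1`, so unlisted skips surface as CI failures rather than silent skips.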
+
+- CI requirement:
+  - import every script in `OpenTrace/examples/` in a bounded subprocess
+  - for argparse scripts: run `python 
\n",
        "    \n",
        "\n",
        "\n",
        "    \n",
        "    \n"
       ],
       "application/vnd.google.colaboratory.intrinsic+json": {
        "type": "dataframe",
        "summary": "{\n \"name\": \"pd\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"timestamp\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"2026-02-09T10:30:50.025240Z\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"task\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"example:greeting_stub\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"trainer\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"PrioritySearch\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"status\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"train_error\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 1.0,\n \"max\": 1.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"feedback\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"Correct\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
       }
      },
      "metadata": {},
      "execution_count": 12
     }
    ],
    "source": [
     "# Inspect latest run artifacts\n",
     "import glob, json, pathlib, pandas as pd\n",
     "\n",
     "latest = sorted(glob.glob(f\"{RUNS_DIR}/*\"))[-1]\n",
     "p = pathlib.Path(latest)\n",
     "print(p)\n",
     "\n",
     "print((p / \"config.snapshot.yaml\").read_text()[:400])\n",
     "print(json.loads((p / 
\"env.json\").read_text()).keys())\n", + "\n", + "pd.read_csv(p / \"results.csv\").head()" + ], + "id": "1tLDrSpE8uav" + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "mjSpgCdw8uaw", + "outputId": "a8ca2f75-6410-4f20-f077-14f735fce522", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Warning: You are using PrioritySearch trainer, which is an experimental feature. Please report any issues you encounter.\n" + ] + } + ], + "source": [ + "%%bash\n", + "cd /content/Trace-Bench\n", + "\n", + "# Optional: external LLM4AD smoke (may yield low score if template fails)\n", + "cat > configs/smoke_llm4ad.yaml <<'YAML'\n", + "runs_dir: runs\n", + "mode: stub\n", + "seed: 123\n", + "tasks:\n", + " - circle_packing\n", + "trainers:\n", + " - PrioritySearch\n", + "YAML\n", + "\n", + "PYTHONPATH=/content/OpenTrace:$PYTHONPATH python -m trace_bench run --config configs/smoke_llm4ad.yaml --runs-dir \"$RUNS_DIR\"" + ], + "id": "mjSpgCdw8uaw" + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zRzh_Gpr8uaw" + }, + "source": [ + "## Real LLM (requires API key)\n", + "\n", + "Add `OPENAI_API_KEY` in **Colab Secrets** and run the cells below." 
+ ], + "id": "zRzh_Gpr8uaw" + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "1trklSUW8uax" + }, + "outputs": [], + "source": [ + "# Load API key from Colab Secrets\n", + "from google.colab import userdata\n", + "import os\n", + "\n", + "key = userdata.get(\"OPENAI_API_KEY\")\n", + "if not key:\n", + " raise RuntimeError(\"Missing OPENAI_API_KEY secret in Colab\")\n", + "\n", + "os.environ[\"OPENAI_API_KEY\"] = key\n", + "os.environ[\"TRACE_DEFAULT_LLM_BACKEND\"] = \"LiteLLM\"\n", + "os.environ[\"TRACE_LITELLM_MODEL\"] = \"gpt-4o-mini\"" + ], + "id": "1trklSUW8uax" + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "eVjhvIwF8uax", + "outputId": "c11ca237-b974-45d1-808d-cfca56f0feef", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Warning: You are using PrioritySearch trainer, which is an experimental feature. Please report any issues you encounter.\n" + ] + } + ], + "source": [ + "%%bash\n", + "cd /content/Trace-Bench\n", + "\n", + "# Real-LLM smoke (internal example task)\n", + "PYTHONPATH=/content/OpenTrace:$PYTHONPATH python -m trace_bench run --config configs/smoke_real.yaml --runs-dir \"$RUNS_DIR\"" + ], + "id": "eVjhvIwF8uax" + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "I_mtiks-8uax", + "outputId": "97c370e7-8fbf-4477-e48a-9a7372091f19", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "......................... 
[100%]\n", + "=============================== warnings summary ===============================\n", + "tests/test_lite_optimize_llm4ad.py::test_lite_optimize_llm4ad_task[task0]\n", + "tests/test_lite_optimize_llm4ad.py::test_lite_optimize_llm4ad_task[task1]\n", + " /usr/local/lib/python3.12/dist-packages/pydantic/main.py:464: UserWarning: Pydantic serializer warnings:\n", + " PydanticSerializationUnexpectedValue(Expected 9 fields but got 6: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='{\\n\"reas...: None}, annotations=[]), input_type=Message])\n", + " PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...ider_specific_fields={}), input_type=Choices])\n", + " return self.__pydantic_serializer__.to_python(\n", + "\n", + "-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html\n", + "25 passed, 2 warnings in 13.44s\n" + ] + } + ], + "source": [ + "%%bash\n", + "cd /content/Trace-Bench\n", + "\n", + "# Pytest (LLM4AD optimizer test runs only if OPENAI_API_KEY is set)\n", + "PYTHONPATH=/content/OpenTrace:$PYTHONPATH python -m pytest -q" + ], + "id": "I_mtiks-8uax" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10" + }, + "colab": { + "provenance": [] + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file