M1: Runner API, canonical artifacts, CLI, and notebook #5

guru-code-expert · 2026-02-10T06:17:04Z

Implements the M1 milestone for Trace-Bench:

CLI surface:

- trace-bench list-tasks, list-trainers, validate --config --strict, run, ui
- Strict validation: trainer kwarg checking, optimizer/guide/logger resolution, trainable parameter detection, matrix expansion with manifest output
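Matrix expansion can be pictured as a Cartesian product of tasks and trainers. The sketch below is only an illustration of that idea: `expand_matrix` and the manifest shape are hypothetical names chosen for this example, not the actual Trace-Bench API. The task and trainer names are taken from the coverage lists in this description.

```python
import itertools
import json


def expand_matrix(tasks, trainers):
    """Return one job spec per (task, trainer) pair, in a stable order.

    Hypothetical helper: the real validator may carry more fields per job.
    """
    return [
        {"task": task, "trainer": trainer}
        for task, trainer in itertools.product(tasks, trainers)
    ]


# A 2x2 matrix, mirroring the notebook's smoke run.
jobs = expand_matrix(
    tasks=["trace_examples:greeting_stub", "llm4ad:circle_packing"],
    trainers=["PrioritySearch", "GEPA-Base"],
)

# Assumed manifest shape: a single top-level "jobs" list.
manifest = json.dumps({"jobs": jobs}, indent=2)
print(len(jobs), "jobs expanded")
```

Because `itertools.product` iterates the rightmost input fastest, the expanded order is stable for a given config, which keeps the manifest diffable between runs.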

Runner & training:

- BenchRunner with deterministic SHA256-based job IDs
- Algorithm-aware kwarg mapping (PrioritySearch vs GEPA-Base/UCB/Beam)
- DummyLLM stub mode for offline testing
- Training error capture in feedback field
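A minimal sketch of how a deterministic SHA256-based job ID and error capture could work, assuming the ID is derived from a canonical JSON encoding of the job spec. The function names, the 12-character truncation, and the result fields other than `feedback` are assumptions for illustration, not the BenchRunner implementation.

```python
import hashlib
import json


def job_id(task, trainer, params, length=12):
    """Hash a canonical JSON encoding of the job spec (hypothetical scheme)."""
    payload = json.dumps(
        {"task": task, "trainer": trainer, "params": params},
        sort_keys=True,            # key order must not affect the ID
        separators=(",", ":"),     # no whitespace variation
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


def run_job(train_fn):
    """Run a training callable and record any exception in `feedback`."""
    try:
        return {"status": "ok", "score": train_fn(), "feedback": ""}
    except Exception as exc:  # keep the run alive; the error goes in the record
        return {
            "status": "failed",
            "score": None,
            "feedback": f"{type(exc).__name__}: {exc}",
        }


def failing_trainer():
    raise RuntimeError("boom")


same_a = job_id("llm4ad:circle_packing", "GEPA-Base", {"a": 1, "b": 2})
same_b = job_id("llm4ad:circle_packing", "GEPA-Base", {"b": 2, "a": 1})
failed = run_job(failing_trainer)
```

Sorting keys before hashing means two configs that differ only in key order map to the same ID, so re-running an unchanged matrix lands in the same job directories.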

Canonical artifact layout:

- meta/config.snapshot.yaml, manifest.json, env.json (redacted), git.json
- Per-job: job_meta.json, results.json, events.jsonl, artifacts/, tb/
- Run-level: results.csv (16 columns) + summary.json
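As an illustration of the redacted environment snapshot, the sketch below masks variables whose names look sensitive before writing `env.json`. It assumes `env.json` lives under `meta/` and that redaction is name-based; the pattern, the `<redacted>` placeholder, and the helper names are hypothetical, and the real runner may redact differently.

```python
import json
import re
import tempfile
from pathlib import Path

# Assumed heuristic: any variable name containing one of these words is secret.
SENSITIVE = re.compile(r"(KEY|TOKEN|SECRET|PASSWORD)", re.IGNORECASE)


def redact_env(env):
    """Return a sorted copy of env with sensitive values masked."""
    return {
        name: ("<redacted>" if SENSITIVE.search(name) else value)
        for name, value in sorted(env.items())
    }


def write_env_snapshot(run_dir, env):
    """Write the redacted snapshot to <run_dir>/meta/env.json (assumed path)."""
    meta_dir = Path(run_dir) / "meta"
    meta_dir.mkdir(parents=True, exist_ok=True)
    path = meta_dir / "env.json"
    path.write_text(json.dumps(redact_env(env), indent=2))
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = write_env_snapshot(
        tmp, {"OPENROUTER_API_KEY": "sk-123", "HOME": "/home/user"}
    )
    snapshot = json.loads(path.read_text())
```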

Task coverage:

- 4 internal types (code_param, numeric_param, multi_param, non_trainable)
- trace_examples:greeting_stub
- llm4ad:circle_packing (bounded timeout)
- veribench:smoke_placeholder (NotImplementedError stub)

Trainer coverage:

- PrioritySearch + GEPA-Base exercised in real mode
- GEPA-UCB + GEPA-Beam configured (M4 scope)

Tests: 30 passed, 2 skipped. Suites: m0 smoke, m1 artifacts, matrix e2e, internal tasks, opentrace examples, trainer config, veribench CLI.

Notebook: 01_m1_minimal_api.ipynb with Colab badge, auto-detect API key (real/stub mode), 2x2 matrix smoke (4/4 ok), executed outputs committed.
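The real/stub auto-detection can be sketched as a check on the environment. `OPENROUTER_API_KEY` is the variable named in this PR's commits; the `detect_mode` helper and the `"real"`/`"stub"` return values are assumptions for illustration, not the notebook's actual code.

```python
import os


def detect_mode(env=None):
    """Return "real" when an API key is present and non-blank, else "stub"."""
    env = os.environ if env is None else env
    return "real" if env.get("OPENROUTER_API_KEY", "").strip() else "stub"


mode = detect_mode()
print(f"running in {mode} mode")
```

Treating a blank value the same as a missing one avoids attempting real calls with an empty key, for example when a Colab secret is defined but left unset.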



guru-code-expert added 8 commits February 10, 2026 11:12

- f2858e5 notebook: use OPENROUTER_API_KEY
- 8374498 m1: align validation, veribench skip, and trainer discovery
- 51622f2 Update 01_m1_minimal_api.ipynb
- 61713b9 Revert "Update 01_m1_minimal_api.ipynb" (This reverts commit 51622f2.)
- 6c588da FIX M1-critical items
- bd1188e Update 01_m1_minimal_api.ipynb
- cade4ea Update 01_m1_minimal_api.ipynb
