@guru-code-expert
Implements the M1 milestone for Trace-Bench:

CLI surface:

  • trace-bench list-tasks, list-trainers, validate --config --strict, run, ui
  • Strict validation: trainer kwarg checking, optimizer/guide/logger resolution, trainable parameter detection, matrix expansion with manifest output
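
For reviewers unfamiliar with the matrix expansion step, here is a minimal sketch of the idea, assuming the config lists tasks and trainers and every (task, trainer) pair becomes one job. The function name `expand_matrix` and the manifest keys are hypothetical illustrations, not the actual Trace-Bench schema.

```python
import itertools
import json


def expand_matrix(tasks, trainers):
    """Return one job dict per (task, trainer) pair, in a stable order.

    Hypothetical helper: the real expansion logic and key names may differ.
    """
    return [
        {"task": task, "trainer": trainer}
        for task, trainer in itertools.product(tasks, trainers)
    ]


# A 2x2 matrix expands to 4 jobs.
jobs = expand_matrix(
    ["trace_examples:greeting_stub", "llm4ad:circle_packing"],
    ["PrioritySearch", "GEPA-Base"],
)

# Serialize the expanded jobs the way a manifest file might store them.
manifest = json.dumps({"jobs": jobs}, indent=2)
```

Using `itertools.product` keeps the ordering deterministic, which is convenient when the manifest is diffed between runs.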

Runner & training:

  • BenchRunner with deterministic SHA256-based job IDs
  • Algorithm-aware kwarg mapping (PrioritySearch vs GEPA-Base/UCB/Beam)
  • DummyLLM stub mode for offline testing
  • Training error capture in feedback field
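
A deterministic SHA256-based job ID can be sketched roughly as below. This is an illustration only: the hashed fields (`task`, `trainer`, `params`, `seed`), the function name `job_id`, and the 12-character truncation are assumptions, not necessarily what BenchRunner does.

```python
import hashlib
import json


def job_id(task, trainer, params, seed, length=12):
    """Derive a stable ID from the job definition.

    Hypothetical helper: field names and ID length are assumed.
    """
    # Canonical JSON (sorted keys, fixed separators) so that dict ordering
    # does not change the hash.
    payload = json.dumps(
        {"task": task, "trainer": trainer, "params": params, "seed": seed},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


id_a = job_id("llm4ad:circle_packing", "PrioritySearch", {"a": 1, "b": 2}, seed=0)
id_b = job_id("llm4ad:circle_packing", "PrioritySearch", {"b": 2, "a": 1}, seed=0)
id_c = job_id("llm4ad:circle_packing", "PrioritySearch", {"a": 1, "b": 2}, seed=1)
```

Because the payload is serialized canonically, `id_a` and `id_b` match even though the parameter dicts were built in a different key order, while changing the seed yields a different ID.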

Canonical artifact layout:

  • meta/config.snapshot.yaml, manifest.json, env.json (redacted), git.json
  • Per-job: job_meta.json, results.json, events.jsonl, artifacts/, tb/
  • Run-level: results.csv (16 columns) + summary.json
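
To show what a redacted `env.json` could look like, here is a small sketch that masks values of environment variables whose names look sensitive. The name pattern, the `<redacted>` marker, and the `redact_env` helper are assumptions for illustration; the actual redaction rules in this PR may differ.

```python
import json
import re

# Assumed heuristic: treat names containing these words as secrets.
SENSITIVE = re.compile(r"(KEY|TOKEN|SECRET|PASSWORD)", re.IGNORECASE)


def redact_env(env):
    """Return a copy of env with sensitive values masked (hypothetical helper)."""
    return {
        name: ("<redacted>" if SENSITIVE.search(name) else value)
        for name, value in sorted(env.items())
    }


# Example input; in practice this would be dict(os.environ).
snapshot = redact_env({"OPENAI_API_KEY": "sk-example", "HOME": "/home/user"})
env_json = json.dumps(snapshot, indent=2)
```

Matching on the variable name rather than the value keeps the snapshot useful for debugging (the variable is visibly set) without writing the secret to disk.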

Task coverage:

  • 4 internal types (code_param, numeric_param, multi_param, non_trainable)
  • trace_examples:greeting_stub
  • llm4ad:circle_packing (bounded timeout)
  • veribench:smoke_placeholder (NotImplementedError stub)
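
One way to bound a long-running task such as circle packing is to wait on it with a timeout and record the outcome instead of raising. The sketch below assumes a thread-based approach and a hypothetical `run_with_timeout` helper; it is not taken from the Trace-Bench code, and the status dict shape is invented for illustration.

```python
import concurrent.futures
import time


def run_with_timeout(fn, timeout_s):
    """Run fn() in a worker thread and report ok/timeout as a dict."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return {"status": "ok", "value": future.result(timeout=timeout_s)}
    except concurrent.futures.TimeoutError:
        return {"status": "timeout", "value": None}
    finally:
        # Do not block on the worker here. A thread cannot be force-killed,
        # so a real runner might prefer a subprocess for hard limits.
        executor.shutdown(wait=False)


fast = run_with_timeout(lambda: 42, timeout_s=1.0)
slow = run_with_timeout(lambda: time.sleep(0.3), timeout_s=0.05)
```

Here `fast` completes within its limit, while `slow` is reported as a timeout because the worker outlives the 0.05 s budget.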

Trainer coverage:

  • PrioritySearch + GEPA-Base exercised in real mode
  • GEPA-UCB + GEPA-Beam configured (M4 scope)

Tests: 30 passed, 2 skipped. Suites covered: m0 smoke, m1 artifacts, matrix e2e, internal tasks, opentrace examples, trainer config, veribench CLI.

Notebook: 01_m1_minimal_api.ipynb with a Colab badge, automatic API-key detection to select real or stub mode, a 2x2 matrix smoke run (4/4 ok), and executed outputs committed.
