9 changes: 8 additions & 1 deletion .gitignore
@@ -4,4 +4,11 @@ __pycache__/
external/*
**/uv.lock
*.egg-info/
**/.venv/
**/.venv/
.env
runs/
runs_test/
notebooks/01_smoke_runner_with_output.ipynb
notebooks/01_m1_minimal_api_with_output.ipynb
/.tmp_runs_run
/.tmp_runs_validate
68 changes: 66 additions & 2 deletions README.md
@@ -5,6 +5,70 @@ Currently, we are adding problems/domains one folder at a time.

The instructions to run each task are located inside the task folder.

## Quick Start (Runner/CLI)

```bash
# M1 review checklist (recommended order)
# 1) List tasks (LLM4AD + example stubs)
trace-bench list-tasks --root LLM4AD/benchmark_tasks

# 2) Validate a config
trace-bench validate --config configs/smoke.yaml

# 3) Run Stub smoke (deterministic, no keys)
trace-bench run --config configs/smoke.yaml --runs-dir runs

# 4) Run Real smoke (requires OPENAI_API_KEY)
trace-bench run --config configs/smoke_real.yaml --runs-dir runs

# 5) Run tests (disable external plugin autoload)
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q

# List tasks (LLM4AD + example stubs)
trace-bench list-tasks --root LLM4AD/benchmark_tasks

# Validate a config
trace-bench validate --config configs/smoke.yaml

# Run a smoke benchmark
trace-bench run --config configs/smoke.yaml

# Launch UI (stub)
trace-bench ui --runs-dir runs
```

Expected run artifacts:
- `runs/<run_id>/config.snapshot.yaml`
- `runs/<run_id>/env.json`
- `runs/<run_id>/results.csv`
- `runs/<run_id>/events.jsonl`
- `runs/<run_id>/summary.json`
- `runs/<run_id>/tb/`
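
One way to confirm that a run produced the artifacts listed above is a small check like the following. This is a minimal sketch and not part of `trace-bench`: the helper name `missing_artifacts` is hypothetical, and it assumes `tb/` is a directory while the other entries are regular files.

```python
from pathlib import Path

# Artifact names copied from the list above; `tb` is assumed to be a directory.
EXPECTED_FILES = [
    "config.snapshot.yaml",
    "env.json",
    "results.csv",
    "events.jsonl",
    "summary.json",
]
EXPECTED_DIRS = ["tb"]


def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from run_dir."""
    run_dir = Path(run_dir)
    missing = [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (run_dir / name).is_dir()]
    return missing
```

Calling `missing_artifacts("runs/<run_id>")` with a real run ID returns an empty list when every artifact is present.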

## M1 Dependencies (Required for Full Pass)

System:
- Graphviz (system package)

Python:
- `graphviz`, `pyyaml`, `pytest`, `numpy`, `matplotlib`, `litellm==1.75.0`

OpenTrace examples strict smoke (for 100% pass):
- `datasets`, `textgrad`, `dspy`, `autogen`, `python-dotenv`

## OpenTrace Examples Smoke (100% Pass Mode)

To enforce 100% example smoke in CI, run:
```bash
TRACE_BENCH_STRICT_EXAMPLES=1 PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q
```
Without strict mode, the smoke test skips only when optional deps are missing.
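
The skip-versus-fail behavior described above can be sketched as follows. This is an illustration only: the helper names `missing_modules` and `smoke_decision` are hypothetical, and treating the value `"1"` as the only way to enable strict mode is an assumption, since the test suite's actual parsing of `TRACE_BENCH_STRICT_EXAMPLES` is not shown here.

```python
import importlib.util
import os


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def smoke_decision(missing, env=None):
    """Return 'run', 'skip', or 'fail' for a smoke test given its missing deps."""
    env = os.environ if env is None else env
    # Assumption: strict mode is enabled only by the exact value "1".
    strict = env.get("TRACE_BENCH_STRICT_EXAMPLES") == "1"
    if not missing:
        return "run"
    # Missing optional deps are tolerated unless strict mode is on.
    return "fail" if strict else "skip"
```

With this shape, a CI job that sets the variable turns every missing optional dependency into a failure, while a local run without it reports a skip.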

## VeriBench Status (In Scope, Pending Input)

VeriBench is in scope but requires the Trace team to provide the task entrypoint/task list.
CLI flags are ready (`--bench veribench`); when the entrypoint is unavailable, tasks are skipped with a structured reason rather than raising.
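
To illustrate what "skipped with a structured reason rather than raising" could look like, here is a minimal sketch. Everything in it is hypothetical: the `TaskOutcome` record, the `load_bench_tasks` helper, and the `entrypoint_unavailable` reason string are illustrative names, not the project's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TaskOutcome:
    task_id: str
    status: str                   # "ok" or "skipped"
    reason: Optional[str] = None  # set only when status == "skipped"


def load_bench_tasks(bench: str,
                     entrypoint: Optional[Callable[[], List[str]]]) -> List[TaskOutcome]:
    """Resolve a bench's tasks, or return a single skip record if it has no entrypoint."""
    if entrypoint is None:
        # No exception: the caller receives a record it can log or write to results.
        return [TaskOutcome(task_id=f"{bench}:*", status="skipped",
                            reason="entrypoint_unavailable")]
    return [TaskOutcome(task_id=f"{bench}:{name}", status="ok") for name in entrypoint()]
```

Returning a record instead of raising lets the rest of a run continue and keeps the skip visible in the results.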

## Problem Sets

### General Problem Sets
@@ -27,9 +91,9 @@ Current implementation of graph is a single node.

**Supported Algorithms:** PrioritySearch, GEPA-Base, GEPA-UCB, GEPA-Beam

📖 **[See detailed usage guide →](LM4AD/readme.md)**
**See detailed usage guide:** `LM4AD/readme.md`

## Agent Architecture
- ReAct agent

All the libraries from other repos are stored and managed in the `external` folder -- this folder will be created if one of the `install.sh` script is run inside the task folder.
All the libraries from other repos are stored and managed in the `external` folder -- this folder will be created if one of the `install.sh` script is run inside the task folder.
24 changes: 24 additions & 0 deletions configs/m1_matrix_smoke.yaml
@@ -0,0 +1,24 @@
runs_dir: runs
mode: stub
seeds: [123]
max_workers: 1
fail_fast: false

tasks:
  - id: internal:numeric_param
  - id: llm4ad:circle_packing
    eval_kwargs:
      timeout_seconds: 10

trainers:
  - id: PrioritySearch
    params_variants:
      - ps_steps: 1
        ps_batches: 1

  - id: GEPA-Base
    params_variants:
      - gepa_iters: 1
        gepa_train_bs: 2
        gepa_merge_every: 2
        gepa_pareto_subset: 2
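
The file name suggests this config describes a matrix of jobs. The sketch below shows one plausible reading, in which the runner takes the cross product of tasks, trainers, parameter variants, and seeds. That expansion rule is an assumption, the `expand_jobs` helper is hypothetical, and the config is written as a Python dict so the example has no YAML dependency.

```python
from itertools import product

# Mirror of configs/m1_matrix_smoke.yaml as a plain dict.
config = {
    "seeds": [123],
    "tasks": [
        {"id": "internal:numeric_param"},
        {"id": "llm4ad:circle_packing", "eval_kwargs": {"timeout_seconds": 10}},
    ],
    "trainers": [
        {"id": "PrioritySearch",
         "params_variants": [{"ps_steps": 1, "ps_batches": 1}]},
        {"id": "GEPA-Base",
         "params_variants": [{"gepa_iters": 1, "gepa_train_bs": 2,
                              "gepa_merge_every": 2, "gepa_pareto_subset": 2}]},
    ],
}


def expand_jobs(cfg):
    """Expand a config into one job per (task, trainer, seed, params variant)."""
    jobs = []
    for task, trainer, seed in product(cfg["tasks"], cfg["trainers"], cfg["seeds"]):
        # A trainer without variants still yields one job with empty params.
        for params in trainer.get("params_variants") or [{}]:
            jobs.append({
                "task": task["id"],
                "trainer": trainer["id"],
                "seed": seed,
                "params": dict(params),
                "eval_kwargs": dict(task.get("eval_kwargs", {})),
            })
    return jobs


jobs = expand_jobs(config)
print(len(jobs))  # 4: 2 tasks x 2 trainers x 1 variant x 1 seed
```

Under this reading, adding a second seed or a second entry under `params_variants` doubles the number of jobs for the affected trainers.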
55 changes: 55 additions & 0 deletions configs/m1_validation.yaml
@@ -0,0 +1,55 @@
runs_dir: runs
mode: stub
seeds: [123]
max_workers: 1
fail_fast: false

tasks:
  - id: internal:code_param
  - id: internal:numeric_param
  - id: internal:multi_param
  - id: internal:non_trainable
  - id: trace_examples:greeting_stub
  - id: llm4ad:circle_packing
    eval_kwargs:
      timeout_seconds: 10
  - id: veribench:smoke_placeholder

trainers:
  - id: PrioritySearch
    params_variants:
      - threads: 2
        ps_steps: 1
        ps_batches: 1
        ps_candidates: 2
        ps_proposals: 2
        ps_mem_update: 1

- id: GEPA-Base
params_variants:
- threads: 2
gepa_iters: 1
gepa_train_bs: 2
gepa_merge_every: 2
gepa_pareto_subset: 2
optimizer: OPROv2
optimizer_kwargs: {}

  - id: GEPA-UCB
    params_variants:
      - threads: 2
        gepa_iters: 1
        gepa_train_bs: 2
        gepa_merge_every: 2
        gepa_pareto_subset: 2

  - id: GEPA-Beam
    params_variants:
      - threads: 2
        gepa_iters: 1
        gepa_train_bs: 2
        gepa_merge_every: 2
        gepa_pareto_subset: 2

eval_kwargs:
  timeout_seconds: 10
12 changes: 12 additions & 0 deletions configs/smoke.yaml
@@ -0,0 +1,12 @@
runs_dir: runs
mode: stub
seeds: [123]

tasks:
  - id: internal:numeric_param

trainers:
  - id: PrioritySearch
    params_variants:
      - ps_steps: 1
        ps_batches: 1
12 changes: 12 additions & 0 deletions configs/smoke_real.yaml
@@ -0,0 +1,12 @@
runs_dir: runs
mode: real
seeds: [123]

tasks:
  - id: trace_examples:greeting_stub

trainers:
  - id: PrioritySearch
    params_variants:
      - ps_steps: 1
        ps_batches: 1