| version | last_updated |
|---|---|
| 1.0 | 2025-12-17 |
Evaluate a directory of agent inference results.
```bash
# Evaluate all problems in a run directory
slop-code eval outputs/my_run

# Evaluate specific problems
slop-code eval outputs/my_run --problem file_backup --problem trajectory_api

# Evaluate with parallel workers
slop-code eval outputs/my_run --num-workers 4
```

Usage:

```bash
slop-code eval [OPTIONS] RUN_DIR
```

| Argument | Required | Description |
|---|---|---|
| `RUN_DIR` | Yes | Path to the run directory (`outputs/<model_name>/<run_name>`) |
| Option | Type | Default | Description |
|---|---|---|---|
| `--problem` | string | (all) | Name of a specific problem to evaluate (repeatable) |
| `--pass-policy` | enum | `ALL_CASES` | Policy that determines whether a checkpoint passed |
| `-e, --env-config` | path | `<run>/environment.yaml` | Path to the environment configuration |
| `--live-progress/--no-live-progress` | flag | false | Enable the live progress display |
| `-proc, --num-workers` | int | 1 | Number of parallel evaluation workers |
| `--overwrite` | flag | false | Re-evaluate problems with existing results |
Accepted values for `--pass-policy`:

| Value | Description |
|---|---|
| `any` | Pass if at least one case passes |
| `any-case` | Same as `any` |
| `all-cases` | Pass only if all test cases pass |
| `all-non-error-cases` | Pass if all non-error cases pass |
| `core-cases` | Pass if all core cases pass |
| `any-core-cases` | Pass if any core case passes |
| `all-core-cases` | Same as `core-cases` |
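The policies above reduce to simple any/all checks over different subsets of test cases. A minimal Python sketch of that reduction follows; the `Case` structure and its `errored`/`core` flags are assumptions for illustration, not the tool's actual internals.

```python
from dataclasses import dataclass

@dataclass
class Case:
    """Hypothetical test-case result: pass/fail, error status, core flag."""
    passed: bool
    errored: bool = False
    core: bool = False

def checkpoint_passed(policy: str, cases: list[Case]) -> bool:
    """Apply a pass policy (as named in the table above) to a case list."""
    core_cases = [c for c in cases if c.core]
    if policy in ("any", "any-case"):
        return any(c.passed for c in cases)
    if policy == "all-cases":
        return all(c.passed for c in cases)
    if policy == "all-non-error-cases":
        return all(c.passed for c in cases if not c.errored)
    if policy in ("core-cases", "all-core-cases"):
        return all(c.passed for c in core_cases)
    if policy == "any-core-cases":
        return any(c.passed for c in core_cases)
    raise ValueError(f"unknown pass policy: {policy}")
```

For example, a checkpoint with one passing core case and one failing errored case passes under `any` and `all-core-cases` but fails under `all-cases`.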
The eval command:
- Discovers all problem directories within `AGENT_RUN_DIR`
- Skips problems that already have `evaluation.json` files (unless `--overwrite` is set)
- Re-evaluates a problem if its configuration has changed since the last evaluation
- Writes evaluation results to each checkpoint directory
- Generates a `checkpoint_results.jsonl` report at the run level
When no `--problem` flags are specified and `--overwrite` is not set, the command automatically skips problems where:
- All checkpoints have `evaluation.json` files
- The problem configuration hasn't changed since the last evaluation

To force re-evaluation, use `--overwrite` or specify the problem explicitly with `--problem`.
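The file-based half of that skip check can be sketched as below. This is an illustration, not the tool's implementation: checkpoint layout is assumed to be one subdirectory per checkpoint, and the configuration-change check is omitted because it depends on internals the docs don't describe.

```python
from pathlib import Path

def should_skip(problem_dir: Path, overwrite: bool = False) -> bool:
    """Skip a problem only when every checkpoint already has an
    evaluation.json and --overwrite was not given. (Sketch only;
    omits the config-change check described in the docs.)"""
    if overwrite:
        return False
    checkpoints = [d for d in problem_dir.iterdir() if d.is_dir()]
    return bool(checkpoints) and all(
        (d / "evaluation.json").exists() for d in checkpoints
    )
```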
After evaluation, each checkpoint directory contains:
- `evaluation.json` - Detailed evaluation results
- Test case reports

At the run level:
- `checkpoint_results.jsonl` - Consolidated report with one line per checkpoint
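Because the run-level report is JSON Lines (one JSON object per checkpoint), it is easy to post-process. A minimal reader sketch follows; the `problem` and `passed` field names are assumptions, so inspect your own `checkpoint_results.jsonl` for the actual schema.

```python
import json
from pathlib import Path

def pass_rates(report_path: Path) -> dict[str, tuple[int, int]]:
    """Return {problem: (passed_checkpoints, total_checkpoints)}
    from a checkpoint_results.jsonl report."""
    rates: dict[str, tuple[int, int]] = {}
    for line in report_path.read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        passed, total = rates.get(rec["problem"], (0, 0))
        rates[rec["problem"]] = (passed + int(bool(rec.get("passed"))), total + 1)
    return rates
```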
Basic evaluation:

```bash
slop-code eval outputs/claude_code_run_20251217
```

Evaluate with a custom environment:

```bash
slop-code eval outputs/my_run -e configs/environments/docker-python3.12-uv.yaml
```

Force re-evaluation of all problems:

```bash
slop-code eval outputs/my_run --overwrite
```

Parallel evaluation with progress:

```bash
slop-code eval outputs/my_run --num-workers 8 --live-progress
```

- eval-problem - Evaluate a single problem
- eval-snapshot - Evaluate a single snapshot
- run - Run agents (includes automatic evaluation)