| version | last_updated |
|---|---|
| 1.0 | 2025-12-22 |

This section documents all commands available in the slop-code CLI.

| Command | Description |
|---|---|
| run | Run agents on benchmarks with unified config system |
| eval | Evaluate a directory of agent inference results |
| eval-problem | Evaluate a single problem directory |
| eval-snapshot | Evaluate a single snapshot directory |
| infer-problem | Run inference on a single problem |
| metrics | Calculate metrics (static, judge, carry-forward, variance) |
| utils | Utility commands for maintenance and data processing |
| docker | Docker image building utilities |
| problems | Problem inspection and registry commands |
| tools | Interactive tools and case runners |
| viz | Visualization tools (diff viewer) |

These options are available on all commands:

| Option | Type | Default | Description |
|---|---|---|---|
| -v, --verbose | flag | 0 | Increase verbosity (repeatable) |
| --seed | int | 42 | Random seed |
| -p, --problem-path | path | problems/ | Path to problem directory |
| --overwrite | flag | false | Overwrite existing output directory |
| --debug | flag | false | Enable debugging mode |
| --snapshot-dir-name | string | snapshot | Name of the snapshot directory |

Running agents:

    # Run with config file
    slop-code run --config my_run.yaml --problem file_backup
    # Run with CLI flags
    slop-code run --model anthropic/sonnet-4.5 --problem file_backup

Evaluating results:

    # Evaluate all problems in a run
    slop-code eval outputs/my_run
    # Evaluate a single problem
    slop-code eval-problem outputs/my_run/file_backup
    # Evaluate a single snapshot
    slop-code eval-snapshot outputs/my_run/file_backup/checkpoint_1/snapshot \
      -o outputs/eval -p file_backup -c 1 -e configs/environments/docker-python3.12-uv.yaml

Calculating metrics:

    # Calculate static code quality metrics
    slop-code metrics static outputs/my_run
    # Run LLM judge evaluation
    slop-code metrics judge outputs/my_run -r configs/rubrics/slop.jsonl -m anthropic/sonnet-4.5
    # Compute variance across runs
    slop-code metrics variance base outputs/runs -o outputs/variance

Utilities:

    # Backfill reports for existing runs
    slop-code utils backfill-reports outputs/my_run
    # Combine results from multiple runs
    slop-code utils combine-results outputs/all_runs -o outputs/combined.jsonl

Docker images:

    # Build base image
    slop-code docker build-base configs/environments/docker-python3.12-uv.yaml
    # Build agent image
    slop-code docker build-agent configs/agents/claude_code-2.0.51.yaml configs/environments/docker-python3.12-uv.yaml

Problems:

    # List all problems
    slop-code problems ls
    # Check problem conversion status
    slop-code problems status file_backup

Tools:

    # Run pytest tests for a snapshot
    slop-code tools run-case -s outputs/snapshot -p file_backup -c 1 -e configs/environments/docker-python3.12-uv.yaml

Visualization:

    # Launch diff viewer for a run
    slop-code viz diff outputs/my_run

| Document | Description |
|---|---|
| run.md | Comprehensive guide to slop-code run with configuration system |
| eval.md | Evaluating run directories |
| eval-problem.md | Evaluating single problems |
| eval-snapshot.md | Evaluating single snapshots |
| infer-problem.md | Running inference on single problems |
| metrics.md | All metrics subcommands |
| utils.md | All utility subcommands |
| docker.md | Docker image management |
| problems.md | Problem inspection commands |
| tools.md | Interactive tools |
| viz.md | Visualization tools |
- Agent Configuration - Agent setup and configuration
- Evaluation System - How evaluation works
- Problem Authoring - Creating and configuring problems
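
As a rough sketch, the run, eval, and metrics static examples above can be chained in a small wrapper script. The `step` and `pipeline` helpers and the `DRY_RUN` variable are hypothetical names introduced here, not part of slop-code. The script also assumes that the run writes its results to `outputs/my_run`, which this page does not state. With `DRY_RUN=1` (the default in this sketch) each command is printed rather than executed.

```shell
#!/bin/sh
set -eu

# DRY_RUN is a hypothetical switch for this sketch, not a slop-code option.
# 1 (default) prints each command; any other value executes it.
DRY_RUN="${DRY_RUN:-1}"

# Print or execute a single command, depending on DRY_RUN.
step() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# pipeline CONFIG PROBLEM RUN_DIR
# RUN_DIR is assumed to be the directory the run writes to.
pipeline() {
  step slop-code run --config "$1" --problem "$2"
  step slop-code eval "$3"
  step slop-code metrics static "$3"
}

pipeline my_run.yaml file_backup outputs/my_run
```

Because `set -e` is active, the script stops at the first failing step when run with `DRY_RUN=0`, so evaluation and metrics are not attempted on an incomplete run.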