SlopCodeBench measures code erosion under iterative specification refinement: it quantifies how code quality degrades as AI agents iterate on programming problems, and thereby evaluates how agents handle evolving requirements.
- Python 3.12+
- Docker (installed and running)
- 8GB+ RAM recommended
- 10GB+ disk space for Docker images and workspaces
- API keys for the agents you want to test
First run takes 5-10 minutes because Docker images need to build. Subsequent runs are much faster (typically 1-2 minutes per problem depending on the agent and model).
Yes, Docker is required for isolated execution environments. This ensures reproducibility and prevents agents from affecting your local system.
This controls the extended thinking token budget for models that support extended thinking:
- `low` = 10,000 tokens
- `medium` = 20,000 tokens
- `high` = 40,000 tokens
Specifies the agent version to use (e.g., `version=2.0.51`). If omitted, the latest version is used. This is useful for reproducing results or testing specific agent versions.
Yes, you can specify multiple `--problem` flags:

```bash
uv run slop-code run --problem file_backup --problem execution_server
```

Use the `--agent` flag with one of the supported agents:
- `claude_code`
- `codex`
- `gemini`
- `miniswe`
- `opencode`
- `openhands`
See Agent Guide for configuration details.
Results are saved to `outputs/{model_name}/{agent}-{prompt}_{params}_{timestamp}/`.
Each run creates:
- `results.json` - Evaluation results
- `overall_quality.json` - Code quality metrics
- `submissions/` - Agent-generated code
- `workspaces/` - Execution environments
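For quick post-run analysis, `results.json` can be summarized with a few lines of Python. The schema sketched below (a top-level `checkpoints` list with boolean `passed` flags) is an assumption for illustration, not the documented format; adapt the keys to what your version actually emits:

```python
import json
from pathlib import Path

def summarize(results: dict) -> dict:
    """Count passing/failing checkpoints in a results payload.

    Assumes a top-level "checkpoints" list with boolean "passed" flags
    (hypothetical schema; adjust keys to the real results.json).
    """
    checkpoints = results.get("checkpoints", [])
    passed = sum(1 for c in checkpoints if c.get("passed"))
    return {"total": len(checkpoints), "passed": passed,
            "failed": len(checkpoints) - passed}

# Inline sample; for a real run, load the file instead, e.g.:
#   results = json.loads(Path("outputs/<run_dir>/results.json").read_text())
sample = {"checkpoints": [{"name": "cp1", "passed": True},
                          {"name": "cp2", "passed": False}]}
print(summarize(sample))  # {'total': 2, 'passed': 1, 'failed': 1}
```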
- Correctness: Does the solution pass the test cases? (Pass/Fail)
- Quality: How clean, maintainable, and well-structured is the code? (Metrics like complexity, duplication, etc.)
Pass policies determine what counts as "passing" an evaluation:
- `ALL_CASES` - All test cases must pass
- `ANY_CASE` - At least one test case must pass
- `MAJORITY` - More than 50% of test cases must pass
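The three policies can be sketched in Python (assumed semantics for illustration, not the benchmark's implementation):

```python
def passes(policy: str, case_results: list[bool]) -> bool:
    """Decide pass/fail for a set of test-case results under a pass policy."""
    if not case_results:
        return False
    if policy == "ALL_CASES":
        return all(case_results)          # every case must pass
    if policy == "ANY_CASE":
        return any(case_results)          # at least one case passes
    if policy == "MAJORITY":
        return sum(case_results) > len(case_results) / 2  # strictly > 50%
    raise ValueError(f"unknown pass policy: {policy}")

print(passes("MAJORITY", [True, True, False]))  # True
```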
Checkpoints represent stages of specification refinement. Problems have multiple checkpoints, each adding requirements or complexity. This tests how code quality evolves as specifications change.
Yes! See the Problem Tutorial for a step-by-step guide on creating problems with custom test cases.
Follow the Problem Tutorial which walks through creating a complete problem in ~30 minutes. Also see the Contributing Guide.
Good problems:
- Take ~40 hours to solve well (not just make it work)
- Have clear, incremental checkpoints
- Test realistic refactoring scenarios
- Encourage flexible, maintainable code over quick hacks
See Problem Design Guide for details.
It depends on your use case:
- Claude Code: Best for complex refactoring, supports extended thinking
- Codex: OpenAI's code model
- Gemini: Google's model with long context
- MiniSWE: Lightweight agent for quick testing
- OpenCode: DeepSeek-based agent
- OpenHands: Multi-tool agent
See agent-specific docs in `docs/agents/agents/`.
Set environment variables for your provider:
```bash
export ANTHROPIC_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
export GOOGLE_API_KEY="your-key"
```

Or see Credentials Guide for file-based configuration.
Yes! See the Agent Implementation Guide and Contributing Guide.
```bash
# Clean Docker cache and rebuild
docker system prune -a
uv run slop-code docker build-base --environment configs/environments/docker-python3.12-uv.yaml
```

Docker is not running or not installed. Install Docker Desktop and ensure it's running:
```bash
docker ps  # Should list running containers
```

Verify that your environment variable is set:
```bash
echo $ANTHROPIC_API_KEY  # Should print your key
```

Docker images can take significant disk space. Clean up:
```bash
docker system prune -a  # Remove unused images
docker volume prune     # Remove unused volumes
```

Check the expected outputs in the problem's test cases. Some test failures may be due to:
- Formatting differences (whitespace, newlines)
- Numeric precision (for floating point)
- Output order (for unordered results)
See Troubleshooting Guide for handling edge cases.
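As an illustration of these edge cases, a tolerant output comparison might normalize whitespace, ignore line order, and compare numbers within a tolerance. This is a sketch, not the benchmark's actual comparator:

```python
def outputs_match(expected: str, actual: str, float_tol: float = 1e-6) -> bool:
    """Tolerant comparison covering the failure modes listed above:
    surrounding whitespace, line order, and small float differences."""
    def norm(s: str) -> list[str]:
        # Strip each line and sort, so ordering and trailing spaces don't matter.
        return sorted(line.strip() for line in s.strip().splitlines())

    e_lines, a_lines = norm(expected), norm(actual)
    if len(e_lines) != len(a_lines):
        return False
    for e, a in zip(e_lines, a_lines):
        try:
            # Compare numerically when both lines parse as floats.
            if abs(float(e) - float(a)) > float_tol:
                return False
        except ValueError:
            if e != a:
                return False
    return True

print(outputs_match("1.0\nok", "ok \n1.0000001"))  # True
```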
- Correctness: Pass/fail for each test case
- Complexity: Cyclomatic complexity, nesting depth
- Code Quality: Duplication, code smells, maintainability
- Size: Lines of code, file count
- Deltas: Changes between checkpoints
See Metrics Guide for the complete list.
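The delta metrics amount to per-metric differences between checkpoint snapshots. A minimal sketch, with hypothetical metric names:

```python
def metric_deltas(before: dict, after: dict) -> dict:
    """Per-metric change between two checkpoint snapshots (shared keys only)."""
    return {k: after[k] - before[k] for k in before.keys() & after.keys()}

# Hypothetical snapshots for two consecutive checkpoints:
cp1 = {"loc": 120, "cyclomatic": 8}
cp2 = {"loc": 180, "cyclomatic": 15}
deltas = metric_deltas(cp1, cp2)
print(deltas["loc"], deltas["cyclomatic"])  # 60 7
```

A growing delta in lines of code or complexity between checkpoints is one concrete signal of the erosion the benchmark is designed to surface.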
```bash
slop-code metrics judge \
  --rubric configs/rubrics/llm_judge.jsonl \
  --model <model on openrouter> \
  --criteria-template configs/rubrics/templates/criteria_with_pn.j2
```

It depends on the problem and checkpoint. See Interpreting Results for guidance on what different metric values mean.
- Design the problem (see Design Philosophy)
- Write pytest tests (see Tutorial)
- Test thoroughly (see Checklist)
- Submit a PR (see Contributing Guide)
This is early-stage software. Review times vary. We'll work with you to iterate on the problem design.
Absolutely! Documentation improvements are very welcome. See the Contributing Guide.
Yes, you can run evaluations with multiple workers:
```bash
slop-code eval outputs/my_run --num-workers 4
```

This speeds up evaluation by running multiple problems or checkpoints in parallel.
Custom metrics require modifying the metrics system. This is not yet documented. See existing metrics in `src/slop_code/metrics/` for examples.
Yes, you can integrate SlopCodeBench into your CI/CD pipeline. Results are saved in JSON format for easy parsing.
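A minimal CI gate might parse `results.json` and fail the build if any problem failed. The field names used here (`problems`, `name`, `passed`) are assumptions for illustration; adapt them to the actual schema:

```python
import json
import sys
from pathlib import Path

def ci_gate(results_path: str) -> int:
    """Return a shell-style exit code: 0 if every problem passed, else 1.

    Assumes a top-level "problems" list with "name" and "passed" fields
    (hypothetical schema; adjust to the real results.json).
    """
    results = json.loads(Path(results_path).read_text())
    failed = [p["name"] for p in results.get("problems", [])
              if not p.get("passed")]
    if failed:
        print(f"Failing problems: {', '.join(failed)}")
        return 1
    print("All problems passed")
    return 0

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(ci_gate(sys.argv[1]))
```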
Use the same:
- Agent version (via `version=X.Y.Z`)
- Model
- Environment configuration
- Problem set
- Random seed (if applicable)
- Check the full documentation
- Open a GitHub Issue
- See Known Issues for current limitations