Welcome to the SlopCodeBench documentation. This guide will help you navigate the documentation and find what you need.
New to SlopCodeBench? Start with FAQ.md for an overview and answers to common questions.
What do you want to do?
- Run an agent on problems → agents/ → commands/run.md
- Create a new problem → contributing-problems/ → problems/tutorial.md
- Evaluate results → evaluation/ → commands/eval.md
- Understand code quality metrics → metrics/
- Troubleshoot issues → FAQ.md → KNOWN_ISSUES.md
The documentation is organized into focused groups, each covering a specific aspect of SlopCodeBench.
agents/ - Configure and run AI agents with API credentials, models, and providers
Learn how to set up agents, configure API credentials, select models, and run benchmarks. Includes detailed documentation for six agent implementations (Claude Code, Codex, Gemini, MiniSWE, OpenCode, OpenHands).
Key files:
- agents/README.md - Quick start and configuration flow
- agents/providers.md - Provider definitions and credential sources
- agents/models.md - Model catalog and pricing
- agents/agents/ - Individual agent implementation guides
commands/ - Reference documentation for all CLI commands
Complete reference for SlopCodeBench's command-line interface, including agent execution, evaluation, metrics, utilities, Docker image building, and visualization tools.
Key files:
- commands/README.md - Command summary and index
- commands/run.md - Agent execution with the unified config system
- commands/eval.md - Evaluate run directories
- commands/metrics.md - Metrics subcommands (static, judge, variance)
- commands/viz.md - Visualization tools (diff viewer)
problems/ - Guide to problem authoring, structure, and configuration
Everything you need to create effective benchmark problems, from basic structure to advanced features like checkpoints, test case groups, and regression testing.
Key files:
- problems/README.md - Main problem authoring guide
- problems/tutorial.md - Step-by-step tutorial for your first problem
- problems/config-schema.md - Complete field-by-field config reference
- problems/examples/ - Worked examples with walkthroughs
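To give a feel for what a problem config covers, here is a sketch. The field names below are invented for illustration only; the authoritative field-by-field reference is problems/config-schema.md.

```yaml
# Hypothetical sketch - these field names are illustrative, not the real schema.
# See problems/config-schema.md for the authoritative reference.
name: calculator
description: Implement a four-function calculator
checkpoints:
  - id: basic-ops          # first milestone
    tests: tests/test_basic.py
  - id: error-handling     # later milestone; earlier tests can rerun as regressions
    tests: tests/test_errors.py
```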
contributing-problems/ - Design philosophy and validation for problem contributions
Understand best practices for creating high-quality benchmark problems, including design philosophy, worked examples, and a pre-submission checklist.
Key files:
- contributing-problems/README.md - Problem design philosophy
- contributing-problems/example_process.md - Worked example (calculator problem)
- contributing-problems/checklist.md - Pre-submission validation checklist
evaluation/ - Pytest-based test execution and results reporting
Deep dive into how SlopCodeBench executes pytest tests against submissions, categorizes results by type, and generates structured reports.
Key files:
- evaluation/README.md - Quick start and overview
- evaluation/architecture.md - Pytest runner and execution flow
- evaluation/configuration.md - Problem and checkpoint configuration
- evaluation/reporting.md - Results, pass policies, and export formats
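As a rough mental model of the reporting step: the runner collects one outcome per test and aggregates the outcomes into a structured summary. The sketch below is illustrative only; the result shape here is an assumption, and the real pass policies and export formats are documented in evaluation/reporting.md.

```python
from collections import Counter

def summarize(results):
    """Aggregate (test_name, outcome) pairs into a structured report.

    `results` is a hypothetical shape for illustration; SlopCodeBench's
    actual report format is described in evaluation/reporting.md.
    """
    counts = Counter(outcome for _, outcome in results)
    total = sum(counts.values())
    return {
        "counts": dict(counts),
        "pass_rate": counts.get("passed", 0) / total if total else 0.0,
    }

report = summarize([
    ("test_add", "passed"),
    ("test_sub", "passed"),
    ("test_div", "failed"),
])
```

The point of categorizing by outcome (rather than keeping a flat pass/fail count) is that policies can then treat, say, errors and failures differently.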
execution/ - Workspace management, runtime execution, and snapshots
Understand how SlopCodeBench manages workspaces, executes code in different environments (local, Docker), handles assets, and captures snapshots.
Key files:
- execution/README.md - Overview of Workspace, Session, and SubmissionRuntime
- execution/manager.md - Session lifecycle and workspace orchestration
- execution/docker.md - Container execution and networking
- execution/snapshots.md - Capturing and comparing workspace state
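Conceptually, a workspace snapshot can be as simple as a map from file path to content hash, with changes computed set-wise between two snapshots. The following is a minimal sketch of that idea, not SlopCodeBench's actual snapshot format (see execution/snapshots.md for that).

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot(root):
    """Map each file under `root` to a SHA-256 digest of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(root).rglob("*"))
        if p.is_file()
    }

def diff(before, after):
    """Report paths added, removed, or changed between two snapshots."""
    changed = {p for p in before.keys() & after.keys() if before[p] != after[p]}
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(changed),
    }

# Tiny demo: edit one file, add another, and diff the snapshots.
with tempfile.TemporaryDirectory() as work:
    Path(work, "a.txt").write_text("hello")
    before = snapshot(work)
    Path(work, "a.txt").write_text("world")
    Path(work, "b.txt").write_text("new file")
    after = snapshot(work)
    delta = diff(before, after)
```

Hashing contents (rather than comparing mtimes) makes the comparison robust to touch-without-change, at the cost of reading every file.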
metrics/ - Code quality measurement and analysis
Learn about SlopCodeBench's 7-category metric system for measuring code quality, including interpretation, configuration, and output formats.
Key files:
- metrics/README.md - Overview and metric categories
- metrics/interpreting-results.md - Explanation of each metric
- metrics/configuration.md - Thresholds and AST-grep rules
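To make the idea of a static metric concrete, here is a toy check, not one of SlopCodeBench's actual seven categories, that flags over-long functions using Python's `ast` module. The threshold is arbitrary; real thresholds and AST-grep rules are configured as described in metrics/configuration.md.

```python
import ast

def long_functions(source, limit=10):
    """Return names of functions spanning more than `limit` source lines.

    `limit` is an arbitrary threshold chosen for this toy example.
    """
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno/lineno give the inclusive source span of the def.
            if node.end_lineno - node.lineno + 1 > limit:
                flagged.append(node.name)
    return flagged

# A short function and a deliberately padded one.
sample = "def tiny():\n    return 1\n\ndef big():\n" + "\n".join(
    f"    step_{i} = {i}" for i in range(12)
)
```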
FAQ.md - Frequently asked questions covering getting started, running agents, evaluation, metrics, troubleshooting, and contributing. Start here if you're new to SlopCodeBench.
KNOWN_ISSUES.md - Current limitations and known bugs, organized by problem. Check this if you encounter unexpected behavior.
To create a new problem:
- Read contributing-problems/README.md for design philosophy
- Follow problems/tutorial.md for step-by-step guidance
- Reference problems/config-schema.md for configuration details
- Review contributing-problems/checklist.md before submission
To run agents and evaluate the results:
- Review FAQ.md for a system overview
- Set up credentials following agents/README.md
- Pick a model from the catalog in agents/models.md
- Run with commands/run.md
- Evaluate runs using commands/eval.md
- Understand the evaluation system in evaluation/
- Analyze code quality with metrics/
- Interpret results using metrics/interpreting-results.md
For advanced customization:
- Write custom pytest tests following problems/pytest/
- Add new execution environments with execution/extending.md
- Implement custom agents following agents/agent-class.md
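As an illustration of the custom-agent idea only: the real base-class interface and required methods are defined in agents/agent-class.md, and every name below is an invented stand-in.

```python
class EchoAgent:
    """Toy stand-in for a custom agent; not the real interface."""

    name = "echo"

    def run(self, prompt: str) -> str:
        # A real agent would call a model provider and edit the workspace here.
        return f"# TODO: solve this task\n# {prompt}"

result = EchoAgent().run("Implement a calculator")
```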
A few conventions hold throughout:
- README.md files in each directory provide quick-start guides and navigation
- Detailed guides cover specific topics in depth
- Cross-references link related documentation
- Examples are provided throughout with real-world scenarios
- Troubleshooting sections appear in most modules for common issues
If something isn't working:
- Check FAQ.md for common questions
- Review KNOWN_ISSUES.md for current limitations
- Look for troubleshooting sections in relevant documentation groups
- Check evaluation/troubleshooting.md for pytest-related issues