Overview of how SCBench uses pytest to evaluate agent submissions.
SCBench uses pytest as its evaluation framework because:
- Standard tooling - Familiar to Python developers
- Rich ecosystem - Fixtures, parametrization, markers, plugins
- Flexible assertions - Clear failure messages
- Isolated execution - Tests run independently
```
┌─────────────────────────────────────────────────────────────┐
│                        PytestRunner                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Copy tests from problem/tests/ to workspace             │
│  2. Generate pytest.ini with markers                        │
│  3. Execute via uvx for isolation                           │
│  4. Parse CTRF + pytest-json-report                         │
│  5. Categorize results by GroupType                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Test Execution                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  pytest tests/                                              │
│    --entrypoint="python main.py"                            │
│    --checkpoint=checkpoint_1                                │
│    --ctrf=.scbench/ctrf-report.json                         │
│    --json-report                                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Test Results                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  TestResult(                                                │
│      name="test_basic_case",                                │
│      passed=True,                                           │
│      group_type=GroupType.CORE,                             │
│      duration=0.5,                                          │
│  )                                                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
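The `TestResult` shape in the diagram can be sketched as a small dataclass. Field names come from the diagram and the enum members from the marker table below; the enum values and this layout are illustrative assumptions, not SCBench's actual source:

```python
from dataclasses import dataclass
from enum import Enum


class GroupType(Enum):
    # Members taken from the marker table; string values are assumed.
    CORE = "core"
    FUNCTIONALITY = "functionality"
    ERROR = "error"
    REGRESSION = "regression"


@dataclass
class TestResult:
    name: str
    passed: bool
    group_type: GroupType
    duration: float  # seconds, as reported by pytest


# The example from the diagram above:
result = TestResult(
    name="test_basic_case",
    passed=True,
    group_type=GroupType.CORE,
    duration=0.5,
)
```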
Every problem's `tests/conftest.py` must provide:

```python
import shlex

import pytest


def pytest_addoption(parser):
    # The runner passes these options on the command line; pytest requires
    # them to be registered before fixtures can read them via getoption().
    parser.addoption("--entrypoint", action="store", default="")
    parser.addoption("--checkpoint", action="store", default="")


@pytest.fixture(scope="session")
def entrypoint_argv(request):
    """Command to invoke the submission."""
    return shlex.split(request.config.getoption("--entrypoint"))


@pytest.fixture(scope="session")
def checkpoint_name(request):
    """Current checkpoint being evaluated."""
    return request.config.getoption("--checkpoint")
```

Tests are categorized using pytest markers:
| Marker | GroupType | Purpose |
|---|---|---|
| (none) | CORE | Must pass - essential functionality |
| `@pytest.mark.functionality` | FUNCTIONALITY | Nice to have - advanced features |
| `@pytest.mark.error` | ERROR | Error handling - edge cases |
| `@pytest.mark.regression` | REGRESSION | Prior checkpoint tests |
Tests must follow the naming convention:

- `test_checkpoint_1.py` for checkpoint 1
- `test_checkpoint_2.py` for checkpoint 2
- etc.

This naming is used to:

- Determine which tests belong to which checkpoint
- Include prior checkpoint tests when `include_prior_tests: true` is set
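The convention makes checkpoint selection a matter of parsing file names. A minimal sketch, where the function names are illustrative rather than SCBench's actual API:

```python
import re
from pathlib import Path
from typing import Optional


def checkpoint_of(path: str) -> Optional[int]:
    """Return the checkpoint number encoded in a test file name, or None."""
    m = re.fullmatch(r"test_checkpoint_(\d+)\.py", Path(path).name)
    return int(m.group(1)) if m else None


def select_tests(paths, current, include_prior):
    """Select files for the current checkpoint, optionally with earlier ones."""
    selected = []
    for p in paths:
        n = checkpoint_of(p)
        if n is not None and (n == current or (include_prior and n < current)):
            selected.append(p)
    return selected
```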
The runner proceeds through five stages:

1. Session Creation
   - Create a workspace with the agent's submission
   - Copy test files from the problem directory
2. Test Selection
   - Include `test_checkpoint_N.py` for the current checkpoint
   - If `include_prior_tests: true`, also include earlier checkpoints' tests
3. pytest.ini Generation
   - Register built-in markers (error, functionality, regression)
   - Register custom markers from `config.yaml`
4. Test Execution
   - Run via `uvx` for environment isolation
   - Pass the `--entrypoint` and `--checkpoint` options
   - Generate CTRF and pytest-json reports
5. Result Parsing
   - Parse test outcomes from the reports
   - Categorize by GroupType based on markers
   - Prior checkpoint tests become REGRESSION type
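The result-parsing stage above can be sketched as a small categorization function. Group names come from the marker table and the prior-checkpoint override from the steps above; the precedence order among markers is an assumption, not the runner's actual code:

```python
def categorize(markers, from_prior_checkpoint):
    """Map a test's markers (report 'tags') to its GroupType name."""
    if from_prior_checkpoint:
        # Tests pulled in from earlier checkpoints always become REGRESSION.
        return "REGRESSION"
    if "regression" in markers:
        return "REGRESSION"
    if "error" in markers:
        return "ERROR"
    if "functionality" in markers:
        return "FUNCTIONALITY"
    # No recognized marker: the test is essential.
    return "CORE"
```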
Tests run in an isolated environment:
- uvx: Ensures clean Python environment
- Session scope: Fixtures shared within session
- Workspace: Agent submission in isolated directory
Two report formats are produced. CTRF (`.scbench/ctrf-report.json`):

```json
{
  "results": {
    "tests": [
      {
        "name": "test_basic_case",
        "status": "passed",
        "duration": 0.5,
        "filePath": "tests/test_checkpoint_1.py",
        "tags": ["functionality"]
      }
    ]
  }
}
```

pytest-json-report:

```json
{
  "tests": [
    {
      "nodeid": "tests/test_checkpoint_1.py::test_basic_case",
      "outcome": "passed",
      "duration": 0.5,
      "markers": ["functionality"]
    }
  ]
}
```
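A parsing sketch for the two report shapes above. The function names are hypothetical; SCBench's actual runner code will differ:

```python
import json


def load_ctrf(path):
    """Read a CTRF report into a list of (name, passed, tags) tuples."""
    with open(path) as f:
        report = json.load(f)
    return [
        (t["name"], t["status"] == "passed", t.get("tags", []))
        for t in report["results"]["tests"]
    ]


def load_pytest_json(path):
    """Read a pytest-json-report file into (nodeid, passed, markers) tuples."""
    with open(path) as f:
        report = json.load(f)
    return [
        (t["nodeid"], t["outcome"] == "passed", t.get("markers", []))
        for t in report["tests"]
    ]
```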
- conftest Patterns - Fixture patterns and examples
- Markers - Built-in and custom markers
- Test Data - Organizing test data
- Runner Internals - Technical reference
- Quick Reference - Templates and commands
- Tutorial - Create your first problem
- Examples - Real problem walkthroughs