This guide covers how to create evaluation problems for SCBench using pytest-based tests.
A problem is a specification-driven coding challenge that agents must solve. Each problem:
- **Defines checkpoints** - progressive milestones that build on each other
- **Uses pytest** - standard Python testing framework for evaluation
- **Tests agent behavior** - not just final output, but how agents handle evolving requirements
Create a problem in three steps - lay out the files, fill in `config.yaml`, and write the tests:
```
problems/my_problem/
├── config.yaml             # Problem metadata
├── checkpoint_1.md         # Spec for checkpoint 1
├── checkpoint_2.md         # Spec for checkpoint 2
└── tests/
    ├── conftest.py         # Required fixtures
    ├── test_checkpoint_1.py
    └── test_checkpoint_2.py
```
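The layout above can be stamped out by hand, or with a small helper. The sketch below is a hypothetical convenience function (not part of SCBench) that creates the same skeleton with empty files to fill in:

```python
from pathlib import Path


def scaffold_problem(root: str, name: str, checkpoints: int = 2) -> Path:
    """Create the problem skeleton shown above (hypothetical helper)."""
    problem = Path(root) / name
    tests = problem / "tests"
    tests.mkdir(parents=True, exist_ok=True)  # also creates problem/
    (problem / "config.yaml").touch()
    (tests / "conftest.py").touch()
    for i in range(1, checkpoints + 1):
        (problem / f"checkpoint_{i}.md").touch()
        (tests / f"test_checkpoint_{i}.py").touch()
    return problem
```

For example, `scaffold_problem("problems", "my_problem")` produces the tree above; every file still has to be written by hand.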
config.yaml:

```yaml
version: 1
name: my_problem
description: Short description of the problem
entry_file: main.py
timeout: 20
tags:
  - cli
checkpoints:
  checkpoint_1:
    version: 1
    order: 1
    state: Core Tests
```

conftest.py (required fixtures):
```python
import shlex

import pytest


def pytest_addoption(parser):
    parser.addoption("--entrypoint", required=True)
    parser.addoption("--checkpoint", required=True)


@pytest.fixture(scope="session")
def entrypoint_argv(request):
    return shlex.split(request.config.getoption("--entrypoint"))


@pytest.fixture(scope="session")
def checkpoint_name(request):
    return request.config.getoption("--checkpoint")
```

test_checkpoint_1.py:
```python
import subprocess

import pytest


def test_basic(entrypoint_argv):
    """Core test - must pass."""
    result = subprocess.run(entrypoint_argv, capture_output=True, text=True)
    assert result.returncode == 0


@pytest.mark.functionality
def test_optional_feature(entrypoint_argv):
    """Functionality test - nice to have."""
    ...


@pytest.mark.error
def test_invalid_input(entrypoint_argv):
    """Error test - must handle gracefully."""
    ...
```

The full problem structure, including the optional data and asset directories:

```
problems/{problem_name}/
├── config.yaml                 # Problem metadata and checkpoint definitions
├── checkpoint_1.md             # Specification for checkpoint 1
├── checkpoint_2.md             # Specification for checkpoint 2
├── tests/
│   ├── conftest.py             # Pytest fixtures (entrypoint_argv, checkpoint_name)
│   ├── test_checkpoint_1.py    # Tests for checkpoint 1
│   ├── test_checkpoint_2.py    # Tests for checkpoint 2
│   ├── data/                   # Test case data (optional)
│   │   └── checkpoint_1/
│   │       ├── core/
│   │       └── errors/
│   └── assets/                 # Static files for tests (optional)
└── static_assets/              # Files available to tests (optional)
```
Tests are categorized using pytest markers:
| Marker | GroupType | Description |
|---|---|---|
| (none) | CORE | Must pass - essential functionality |
| `@pytest.mark.functionality` | FUNCTIONALITY | Nice to have - advanced features |
| `@pytest.mark.error` | ERROR | Error handling - edge cases |
| `@pytest.mark.regression` | REGRESSION | Prior checkpoint tests |
Prior checkpoint tests automatically become REGRESSION tests.
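Custom markers like these normally have to be registered so pytest does not emit unknown-mark warnings. One way to do that - a sketch; SCBench's harness may already handle registration for you - is a `pytest_configure` hook in `conftest.py`:

```python
# conftest.py (sketch): declare the custom markers used above
def pytest_configure(config):
    for name, description in [
        ("functionality", "nice-to-have feature tests"),
        ("error", "error-handling tests"),
        ("regression", "tests inherited from prior checkpoints"),
    ]:
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, a run can select groups with pytest's standard `-m` expression, e.g. `pytest -m "not functionality"` to skip the nice-to-have tests.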
- Quick Reference - Copy-paste templates and commands
- Tutorial - Create your first problem step-by-step
- Structure - Directory layout and file purposes
- Config Schema - Complete configuration reference
- Checkpoints - Designing checkpoint progression
- Overview - How the pytest evaluation works
- conftest Patterns - Fixture patterns
- Markers - Built-in and custom markers
- Test Data - Organizing test data
- Runner Internals - Technical reference
- Advanced Fixtures - Advanced fixture patterns
- Stateful Testing - State across checkpoints
- Complex Parametrization - Advanced case loading
- Debugging Workflows - Test debugging and troubleshooting
- CLI Testing - Testing command-line tools
- Stream Testing - Testing stdin/stdout
- Parametrized Cases - Loading external test cases
- Regression Testing - Testing backward compatibility
- file_backup - CLI with parametrized cases
- etl_pipeline - Stream-based inline tests
- layered_config_synthesizer - Advanced patterns
- Common Issues - Debugging test failures