Overview of how SCBench uses pytest to evaluate agent submissions.
SCBench uses pytest as its evaluation framework because:
- Standard tooling - Familiar to Python developers
- Rich ecosystem - Fixtures, parametrization, markers, plugins
- Flexible assertions - Clear failure messages
- Isolated execution - Tests run independently
```
┌─────────────────────────────────────────────────────────────┐
│                        PytestRunner                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Copy tests from problem/tests/ to workspace             │
│  2. Generate pytest.ini with markers                        │
│  3. Execute via uvx for isolation                           │
│  4. Parse CTRF + pytest-json-report                         │
│  5. Categorize results by GroupType                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Test Execution                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  pytest tests/                                              │
│    --entrypoint="python main.py"                            │
│    --checkpoint=checkpoint_1                                │
│    --ctrf=.scbench/ctrf-report.json                         │
│    --json-report                                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Test Results                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  TestResult(                                                │
│      name="test_basic_case",                                │
│      passed=True,                                           │
│      group_type=GroupType.CORE,                             │
│      duration=0.5,                                          │
│  )                                                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
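The `TestResult` shape in the diagram can be sketched as a small dataclass. Field names come from the diagram and the enum members from the marker table below; the enum values and this layout are illustrative assumptions, not SCBench's actual source:

```python
from dataclasses import dataclass
from enum import Enum


class GroupType(Enum):
    # Members taken from the marker table; string values are assumed.
    CORE = "core"
    FUNCTIONALITY = "functionality"
    ERROR = "error"
    REGRESSION = "regression"


@dataclass
class TestResult:
    name: str
    passed: bool
    group_type: GroupType
    duration: float  # seconds, as reported by pytest


# The example from the diagram above:
result = TestResult(
    name="test_basic_case",
    passed=True,
    group_type=GroupType.CORE,
    duration=0.5,
)
```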
Every problem's `tests/conftest.py` must provide:

```python
import shlex

import pytest


def pytest_addoption(parser):
    # The runner passes these options on the command line; pytest requires
    # them to be registered before fixtures can read them via getoption().
    parser.addoption("--entrypoint", action="store", default="")
    parser.addoption("--checkpoint", action="store", default="")


@pytest.fixture(scope="session")
def entrypoint_argv(request):
    """Command to invoke the submission."""
    return shlex.split(request.config.getoption("--entrypoint"))


@pytest.fixture(scope="session")
def checkpoint_name(request):
    """Current checkpoint being evaluated."""
    return request.config.getoption("--checkpoint")
```

Tests are categorized using pytest markers:
| Marker | GroupType | Purpose |
|---|---|---|
| (none) | CORE | Must pass - essential functionality |
| `@pytest.mark.functionality` | FUNCTIONALITY | Nice to have - advanced features |
| `@pytest.mark.error` | ERROR | Error handling - edge cases |
| `@pytest.mark.regression` | REGRESSION | Prior checkpoint tests |
Tests must follow the naming convention:

- `test_checkpoint_1.py` for checkpoint 1
- `test_checkpoint_2.py` for checkpoint 2
- etc.

This naming is used to:

- Determine which tests belong to which checkpoint
- Include prior checkpoint tests when `include_prior_tests: true` is set
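The convention makes checkpoint selection a matter of parsing file names. A minimal sketch, where the function names are illustrative rather than SCBench's actual API:

```python
import re
from pathlib import Path
from typing import Optional


def checkpoint_of(path: str) -> Optional[int]:
    """Return the checkpoint number encoded in a test file name, or None."""
    m = re.fullmatch(r"test_checkpoint_(\d+)\.py", Path(path).name)
    return int(m.group(1)) if m else None


def select_tests(paths, current, include_prior):
    """Select files for the current checkpoint, optionally with earlier ones."""
    selected = []
    for p in paths:
        n = checkpoint_of(p)
        if n is not None and (n == current or (include_prior and n < current)):
            selected.append(p)
    return selected
```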
The runner proceeds through five stages:

1. Session Creation
   - Create a workspace with the agent's submission
   - Copy test files from the problem directory
2. Test Selection
   - Include `test_checkpoint_N.py` for the current checkpoint
   - If `include_prior_tests: true`, also include earlier checkpoints' tests
3. pytest.ini Generation
   - Register built-in markers (error, functionality, regression)
   - Register custom markers from `config.yaml`
4. Test Execution
   - Run via `uvx` for environment isolation
   - Pass the `--entrypoint` and `--checkpoint` options
   - Generate CTRF and pytest-json reports
5. Result Parsing
   - Parse test outcomes from the reports
   - Categorize by GroupType based on markers
   - Prior checkpoint tests become REGRESSION type
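The result-parsing stage above can be sketched as a small categorization function. Group names come from the marker table and the prior-checkpoint override from the steps above; the precedence order among markers is an assumption, not the runner's actual code:

```python
def categorize(markers, from_prior_checkpoint):
    """Map a test's markers (report 'tags') to its GroupType name."""
    if from_prior_checkpoint:
        # Tests pulled in from earlier checkpoints always become REGRESSION.
        return "REGRESSION"
    if "regression" in markers:
        return "REGRESSION"
    if "error" in markers:
        return "ERROR"
    if "functionality" in markers:
        return "FUNCTIONALITY"
    # No recognized marker: the test is essential.
    return "CORE"
```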
Tests run in an isolated environment:
- uvx: Ensures clean Python environment
- Session scope: Fixtures shared within session
- Workspace: Agent submission in isolated directory
Two report formats are produced. CTRF (`.scbench/ctrf-report.json`):

```json
{
  "results": {
    "tests": [
      {
        "name": "test_basic_case",
        "status": "passed",
        "duration": 0.5,
        "filePath": "tests/test_checkpoint_1.py",
        "tags": ["functionality"]
      }
    ]
  }
}
```

pytest-json-report:

```json
{
  "tests": [
    {
      "nodeid": "tests/test_checkpoint_1.py::test_basic_case",
      "outcome": "passed",
      "duration": 0.5,
      "markers": ["functionality"]
    }
  ]
}
```
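A parsing sketch for the two report shapes above. The function names are hypothetical; SCBench's actual runner code will differ:

```python
import json


def load_ctrf(path):
    """Read a CTRF report into a list of (name, passed, tags) tuples."""
    with open(path) as f:
        report = json.load(f)
    return [
        (t["name"], t["status"] == "passed", t.get("tags", []))
        for t in report["results"]["tests"]
    ]


def load_pytest_json(path):
    """Read a pytest-json-report file into (nodeid, passed, markers) tuples."""
    with open(path) as f:
        report = json.load(f)
    return [
        (t["nodeid"], t["outcome"] == "passed", t.get("markers", []))
        for t in report["tests"]
    ]
```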
- conftest Patterns - Fixture patterns and examples
- Markers - Built-in and custom markers
- Test Data - Organizing test data
- Runner Internals - Technical reference
- Quick Reference - Templates and commands
- Tutorial - Create your first problem
- Examples - Real problem walkthroughs