This guide covers how to create evaluation problems for SCBench using pytest-based tests.
A problem is a specification-driven coding challenge that agents must solve. Each problem:
- **Defines checkpoints** - progressive milestones that build on each other
- **Uses pytest** - standard Python testing framework for evaluation
- **Tests agent behavior** - not just final output, but how agents handle evolving requirements
Create a problem in three steps - lay out the files, fill in `config.yaml`, and write the tests:
```
problems/my_problem/
├── config.yaml             # Problem metadata
├── checkpoint_1.md         # Spec for checkpoint 1
├── checkpoint_2.md         # Spec for checkpoint 2
└── tests/
    ├── conftest.py         # Required fixtures
    ├── test_checkpoint_1.py
    └── test_checkpoint_2.py
```
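The layout above can be stamped out by hand, or with a small helper. The sketch below is a hypothetical convenience function (not part of SCBench) that creates the same skeleton with empty files to fill in:

```python
from pathlib import Path


def scaffold_problem(root: str, name: str, checkpoints: int = 2) -> Path:
    """Create the problem skeleton shown above (hypothetical helper)."""
    problem = Path(root) / name
    tests = problem / "tests"
    tests.mkdir(parents=True, exist_ok=True)  # also creates problem/
    (problem / "config.yaml").touch()
    (tests / "conftest.py").touch()
    for i in range(1, checkpoints + 1):
        (problem / f"checkpoint_{i}.md").touch()
        (tests / f"test_checkpoint_{i}.py").touch()
    return problem
```

For example, `scaffold_problem("problems", "my_problem")` produces the tree above; every file still has to be written by hand.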
config.yaml:

```yaml
version: 1
name: my_problem
description: Short description of the problem
entry_file: main.py
timeout: 20
tags:
  - cli
checkpoints:
  checkpoint_1:
    version: 1
    order: 1
    state: Core Tests
```

conftest.py (required fixtures):
```python
import shlex

import pytest


def pytest_addoption(parser):
    parser.addoption("--entrypoint", required=True)
    parser.addoption("--checkpoint", required=True)


@pytest.fixture(scope="session")
def entrypoint_argv(request):
    return shlex.split(request.config.getoption("--entrypoint"))


@pytest.fixture(scope="session")
def checkpoint_name(request):
    return request.config.getoption("--checkpoint")
```

test_checkpoint_1.py:
```python
import subprocess

import pytest


def test_basic(entrypoint_argv):
    """Core test - must pass."""
    result = subprocess.run(entrypoint_argv, capture_output=True, text=True)
    assert result.returncode == 0


@pytest.mark.functionality
def test_optional_feature(entrypoint_argv):
    """Functionality test - nice to have."""
    ...


@pytest.mark.error
def test_invalid_input(entrypoint_argv):
    """Error test - must handle gracefully."""
    ...
```

The full problem structure, including the optional data and asset directories:

```
problems/{problem_name}/
├── config.yaml                 # Problem metadata and checkpoint definitions
├── checkpoint_1.md             # Specification for checkpoint 1
├── checkpoint_2.md             # Specification for checkpoint 2
├── tests/
│   ├── conftest.py             # Pytest fixtures (entrypoint_argv, checkpoint_name)
│   ├── test_checkpoint_1.py    # Tests for checkpoint 1
│   ├── test_checkpoint_2.py    # Tests for checkpoint 2
│   ├── data/                   # Test case data (optional)
│   │   └── checkpoint_1/
│   │       ├── core/
│   │       └── errors/
│   └── assets/                 # Static files for tests (optional)
└── static_assets/              # Files available to tests (optional)
```
Tests are categorized using pytest markers:
| Marker | GroupType | Description |
|---|---|---|
| (none) | CORE | Must pass - essential functionality |
| `@pytest.mark.functionality` | FUNCTIONALITY | Nice to have - advanced features |
| `@pytest.mark.error` | ERROR | Error handling - edge cases |
| `@pytest.mark.regression` | REGRESSION | Prior checkpoint tests |
Prior checkpoint tests automatically become REGRESSION tests.
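Custom markers like these normally have to be registered so pytest does not emit unknown-mark warnings. One way to do that - a sketch; SCBench's harness may already handle registration for you - is a `pytest_configure` hook in `conftest.py`:

```python
# conftest.py (sketch): declare the custom markers used above
def pytest_configure(config):
    for name, description in [
        ("functionality", "nice-to-have feature tests"),
        ("error", "error-handling tests"),
        ("regression", "tests inherited from prior checkpoints"),
    ]:
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, a run can select groups with pytest's standard `-m` expression, e.g. `pytest -m "not functionality"` to skip the nice-to-have tests.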
- Quick Reference - Copy-paste templates and commands
- Tutorial - Create your first problem step-by-step
- Structure - Directory layout and file purposes
- Config Schema - Complete configuration reference
- Checkpoints - Designing checkpoint progression
- Overview - How the pytest evaluation works
- conftest Patterns - Fixture patterns
- Markers - Built-in and custom markers
- Test Data - Organizing test data
- Runner Internals - Technical reference
- Advanced Fixtures - Advanced fixture patterns
- Stateful Testing - State across checkpoints
- Complex Parametrization - Advanced case loading
- Debugging Workflows - Test debugging and troubleshooting
- CLI Testing - Testing command-line tools
- Stream Testing - Testing stdin/stdout
- Parametrized Cases - Loading external test cases
- Regression Testing - Testing backward compatibility
- file_backup - CLI with parametrized cases
- etl_pipeline - Stream-based inline tests
- layered_config_synthesizer - Advanced patterns
- Common Issues - Debugging test failures