Visual guide to problem directory structure with annotations explaining each file's purpose.
Every problem follows this hierarchy:
Problem
├── Configuration (config.yaml)
├── Specifications (checkpoint_N.md)
├── Tests (tests/)
│ ├── conftest.py (fixtures)
│ ├── test_checkpoint_N.py (test files)
│ └── data/ (optional test data)
└── Assets (optional shared files)
problems/{problem_name}/
│
├── config.yaml ─────────────────────────────
│ version: 1 │ Problem Configuration │
│ name: problem_name │ • Unique identifier │
│ entry_file: main.py │ • Entry point │
│ checkpoints: │ • Checkpoint definitions │
│ checkpoint_1: │ • Static assets │
│ version: 1 │ • Test dependencies │
│ order: 1 └─────────────────────────────
│ state: Core Tests
│
├── checkpoint_1.md ─────────────────────────────
│ # Checkpoint 1: Feature │ Specification │
│ Build a tool that... │ • What agents must build │
│ │ • Requirements │
│ │ • Examples │
│ └─────────────────────────────
│
├── checkpoint_2.md (Additional checkpoints)
│
├── tests/ ═════════════════════════════
│ │ ║ Pytest Test Directory ║
│ │ ═════════════════════════════
│ │
│ ├── conftest.py ─────────────────────────────
│ │ def pytest_addoption(): │ Pytest Configuration │
│ │ def entrypoint_argv(): │ • Required fixtures │
│ │ def checkpoint_name(): │ • Optional helpers │
│ │ └─────────────────────────────
│ │
│ ├── test_checkpoint_1.py ─────────────────────────────
│ │ def test_core(): │ Checkpoint 1 Tests │
│ │ @pytest.mark.error │ • Core tests (unmarked) │
│ │ def test_error(): │ • Functionality tests │
│ │ @pytest.mark.functionality│ • Error tests │
│ │ def test_feature(): └─────────────────────────────
│ │
│ ├── test_checkpoint_2.py (Additional test files)
│ │
│ ├── data/ ─────────────────────────────
│ │ └── checkpoint_1/ │ External Test Data │
│ │ ├── core/ │ • Optional │
│ │ │ ├── case_1/ │ • For parametrized tests │
│ │ │ │ ├── case.yaml│ • Input/expected pairs │
│ │ │ │ └── expected.json │
│ │ │ └── case_2/ └─────────────────────────────
│ │ └── errors/
│ │
│ └── assets/ ─────────────────────────────
│ └── fixtures.json │ Test Assets │
│ │ • Shared test files │
│ │ • Available via fixture │
│ └─────────────────────────────
│
├── static_assets/ ─────────────────────────────
│ └── reference_data/ │ Static Assets │
│ └── data.csv │ • Shared across tests │
│ │ • Defined in config.yaml │
│ │ • Mounted for all tests │
│ └─────────────────────────────
│
└── solutions/ ─────────────────────────────
└── reference/ │ Reference Solution │
└── main.py │ • For problem development│
│ • Not used in evaluation │
└─────────────────────────────
Problem metadata and checkpoint definitions.
version: 1
name: my_problem # Must match directory name
description: Short description
entry_file: main.py # Agent creates this file
timeout: 20 # Default test timeout
tags:
- cli
- json
checkpoints:
checkpoint_1:
version: 1 # Increment when tests change
order: 1 # Execution order
state: Core Tests # Development state
checkpoint_2:
version: 1
order: 2
state: Core Tests
include_prior_tests: true # Run checkpoint_1 tests too
# Optional
static_assets:
files:
path: static_assets/files
test_dependencies:
- pyyaml
- requests
markers:
slow:
description: slow tests
group: FUNCTIONALITYSpecification for each checkpoint. Agents receive this to understand what to build.
# Checkpoint N: Feature Name
Brief description of what to build.
## Requirements
1. Requirement one
2. Requirement two
## Interface
How to invoke the tool or API.
## Examples
Input/output examples help agents understand the expected behavior.
## Error Handling
What errors to handle and how.Pytest configuration with required fixtures.
"""Pytest configuration for my_problem evaluation."""
import shlex
from pathlib import Path
import pytest
def pytest_addoption(parser):
"""Register required CLI options."""
parser.addoption("--entrypoint", required=True)
parser.addoption("--checkpoint", required=True)
@pytest.fixture(scope="session")
def entrypoint_argv(request):
"""Submission command as argv list."""
return shlex.split(request.config.getoption("--entrypoint"))
@pytest.fixture(scope="session")
def checkpoint_name(request):
"""Current checkpoint name."""
return request.config.getoption("--checkpoint")
# Optional fixtures
@pytest.fixture(scope="session")
def assets_dir():
"""Path to test assets."""
return Path(__file__).parent / "assets"Test file for each checkpoint. Naming convention: test_{checkpoint_name}.py
"""Tests for checkpoint_1."""
import subprocess
import pytest
def test_core_feature(entrypoint_argv):
"""Core: Essential functionality."""
result = subprocess.run(entrypoint_argv, capture_output=True, text=True)
assert result.returncode == 0
@pytest.mark.functionality
def test_advanced_feature(entrypoint_argv):
"""Functionality: Nice to have."""
...
@pytest.mark.error
def test_error_handling(entrypoint_argv):
"""Error: Handles invalid input."""
...External test case data for parametrized tests.
tests/data/
└── checkpoint_1/
├── core/
│ ├── case_name/
│ │ ├── case.yaml # Test input
│ │ └── expected.json # Expected output
│ └── another_case/
│ ├── case.yaml
│ └── expected.json
└── errors/
└── invalid_input/
├── case.yaml
└── expected.yaml # Error expectations
| Marker | GroupType | Purpose |
|---|---|---|
| (none) | CORE | Must pass - essential functionality |
@pytest.mark.functionality |
FUNCTIONALITY | Nice to have - advanced features |
@pytest.mark.error |
ERROR | Error handling - edge cases |
@pytest.mark.regression |
REGRESSION | Prior checkpoint tests |
All test data defined in Python. Good for small test suites.
def test_example(entrypoint_argv):
input_data = {"key": "value"}
result = subprocess.run(
entrypoint_argv,
input=json.dumps(input_data),
capture_output=True,
text=True,
)
assert result.returncode == 0External test data loaded dynamically. Good for many test cases.
CHECKPOINT_DIR = Path(__file__).parent / "data" / "checkpoint_1"
def load_cases(group_dir):
cases = []
for case_dir in sorted(group_dir.iterdir()):
if case_dir.is_dir():
case = yaml.safe_load((case_dir / "case.yaml").read_text())
expected = json.loads((case_dir / "expected.json").read_text())
cases.append({"id": case_dir.name, "case": case, "expected": expected})
return cases
CORE_CASES = load_cases(CHECKPOINT_DIR / "core")
@pytest.mark.parametrize("case", CORE_CASES, ids=[c["id"] for c in CORE_CASES])
def test_core_cases(entrypoint_argv, case):
result = run_case(entrypoint_argv, case)
assert result == case["expected"]Common utilities in a separate module.
tests/
├── conftest.py
├── case_utils.py # Shared helpers
├── test_checkpoint_1.py
└── test_checkpoint_2.py
# case_utils.py
def run_command(entrypoint_argv, input_data):
return subprocess.run(
entrypoint_argv,
input=json.dumps(input_data),
capture_output=True,
text=True,
)
# test_checkpoint_1.py
from case_utils import run_command
def test_example(entrypoint_argv):
result = run_command(entrypoint_argv, {"key": "value"})
assert result.returncode == 0problems/etl_pipeline/
├── config.yaml
│ checkpoints:
│ checkpoint_1:
│ version: 1
│ order: 1
│ state: Core Tests
│ checkpoint_2:
│ version: 1
│ order: 2
│ state: Core Tests
│ include_prior_tests: true
│ checkpoint_3:
│ version: 1
│ order: 3
│ state: Core Tests
│ include_prior_tests: true
│
├── checkpoint_1.md # Basic parsing
├── checkpoint_2.md # Add filtering
├── checkpoint_3.md # Add aggregation
│
└── tests/
├── conftest.py
├── test_checkpoint_1.py # Parsing tests
├── test_checkpoint_2.py # Filtering tests
└── test_checkpoint_3.py # Aggregation tests
When evaluating checkpoint_3:
test_checkpoint_1.pyruns as REGRESSION teststest_checkpoint_2.pyruns as REGRESSION teststest_checkpoint_3.pyruns as CORE/FUNCTIONALITY/ERROR tests
- Problem:
snake_case(e.g.,file_backup,etl_pipeline) - Tests: always
tests/ - Data: always
data/
- Config: always
config.yaml - Specs:
checkpoint_N.md(e.g.,checkpoint_1.md) - Tests:
test_checkpoint_N.py(e.g.,test_checkpoint_1.py) - Fixtures: always
conftest.py
- Core:
test_*(no marker) - Functionality:
test_*with@pytest.mark.functionality - Error:
test_*with@pytest.mark.error
- Quick Reference - Templates and commands
- Config Schema - Full configuration reference
- Markers - Test categorization details
- Examples - Real problem walkthroughs