
Getting Started

CodeGraphTheory edited this page Jan 20, 2026 · 1 revision

Getting Started with mcpbr

Welcome! This guide takes you from installation to your first MCP server benchmark results in a handful of commands.

Prerequisites

Before you begin, ensure you have:

  • Python 3.11+ installed
  • Docker running on your machine
  • Anthropic API key (get one at console.anthropic.com)
  • Claude Code CLI installed (pip install claude-code)

Quick Environment Check

# Check Python version
python --version  # Should be 3.11 or higher

# Check Docker
docker info  # Should show Docker daemon info

# Check Claude CLI
which claude  # Should show path to claude binary

Step 1: Install mcpbr

pip install mcpbr

Verify installation:

mcpbr --version

Step 2: Set Your API Key

export ANTHROPIC_API_KEY="sk-ant-..."

💡 Tip: Add this to your ~/.bashrc or ~/.zshrc to persist across sessions.


Step 3: Initialize Configuration

mcpbr init

This creates mcpbr.yaml with an example configuration. Let's examine it:

mcp_server:
  command: "npx"
  args:
    - "-y"
    - "@modelcontextprotocol/server-filesystem"
    - "{workdir}"
  env: {}

provider: "anthropic"
agent_harness: "claude-code"
model: "sonnet"
dataset: "SWE-bench/SWE-bench_Lite"
sample_size: 10
timeout_seconds: 300
max_concurrent: 4
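
The {workdir} token in args is a placeholder that gets filled in with each task's workspace path before the server is launched. A minimal sketch of that substitution, assuming simple string-replace semantics (the helper name here is hypothetical, not part of mcpbr's API):

```python
# Sketch: expanding a "{workdir}" placeholder in server args.
# The args mirror mcpbr.yaml; expand_args is an illustrative helper.
def expand_args(args: list[str], workdir: str) -> list[str]:
    """Replace the {workdir} placeholder in each argument."""
    return [a.replace("{workdir}", workdir) for a in args]

args = ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
expanded = expand_args(args, "/tmp/task-42")
print(expanded)
```

This way a single config can be reused across tasks, each getting its own isolated workspace.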

Step 4: Run Your First Benchmark

mcpbr run -c mcpbr.yaml -n 5 -v

What this does:

  • -c mcpbr.yaml: Use the configuration file
  • -n 5: Run on 5 tasks (small sample for quick test)
  • -v: Verbose output to see progress

⏱️ Estimated time: 5-10 minutes for 5 tasks


Step 5: Understand the Output

You'll see real-time progress:

mcpbr Evaluation
  Config: mcpbr.yaml
  Provider: anthropic
  Model: sonnet
  Dataset: SWE-bench/SWE-bench_Lite
  Sample size: 5

Loading dataset: SWE-bench/SWE-bench_Lite
Evaluating 5 tasks
14:23:15 [MCP] Starting mcp run for astropy-12907:mcp
14:23:22 astropy-12907:mcp    > TodoWrite
14:23:26 astropy-12907:mcp    > Glob
...

And a final summary:

Evaluation Results

                 Summary
+-----------------+-----------+----------+
| Metric          | MCP Agent | Baseline |
+-----------------+-----------+----------+
| Resolved        | 2/5       | 1/5      |
| Resolution Rate | 40.0%     | 20.0%    |
+-----------------+-----------+----------+

Improvement: +100.0%

Step 6: Explore Results

View Detailed Results

mcpbr run -c mcpbr.yaml -n 5 -o results.json -r report.md

Outputs:

  • results.json: Complete results with all metrics
  • report.md: Human-readable markdown report

View Per-Task Logs

mcpbr run -c mcpbr.yaml -n 5 --log-dir logs/

Creates a logs/ directory with a detailed JSON log for each task run.
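
If you want to post-process results programmatically, a summary like the table above can be recomputed from results.json. The structure below is an assumption for illustration; check your actual file for the real field names:

```python
import json

# Hypothetical results.json structure; field names are assumptions,
# not mcpbr's documented schema.
results = json.loads("""
{"tasks": [
  {"id": "astropy-12907", "mcp_resolved": true,  "baseline_resolved": false},
  {"id": "django-11099",  "mcp_resolved": true,  "baseline_resolved": true}
]}
""")

tasks = results["tasks"]
mcp = sum(t["mcp_resolved"] for t in tasks)       # booleans sum as 0/1
base = sum(t["baseline_resolved"] for t in tasks)
print(f"MCP: {mcp}/{len(tasks)}  Baseline: {base}/{len(tasks)}")
```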


Next Steps

Try Different Benchmarks

# Run CyberGym (security vulnerabilities)
mcpbr run -c mcpbr.yaml --benchmark cybergym --level 2 -n 5

# Run TerminalBench (DevOps tasks)
mcpbr run -c mcpbr.yaml --benchmark terminalbench -n 5

# List available benchmarks
mcpbr benchmarks

Customize Your MCP Server

Edit mcpbr.yaml to test your own MCP server:

mcp_server:
  command: "python"
  args:
    - "-m"
    - "my_custom_mcp_server"
    - "--workspace"
    - "{workdir}"
  env:
    LOG_LEVEL: "debug"

Run Specific Tasks

# Run on specific task IDs
mcpbr run -c mcpbr.yaml -t astropy__astropy-12907 -t django__django-11099

Increase Sample Size

# Run on more tasks for statistically significant results
mcpbr run -c mcpbr.yaml -n 50

Common Issues

Docker Permission Denied

# On Linux, add your user to docker group
sudo usermod -aG docker $USER
# Log out and back in

Pre-built Image Not Found

  • This is normal; mcpbr builds the Docker image from scratch
  • First run takes longer (~5-10min per task)
  • Subsequent runs are faster

API Rate Limiting

  • Reduce max_concurrent in config
  • Add delays between runs
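
Concretely, lowering concurrency means editing mcpbr.yaml, for example:

```yaml
# Fewer parallel tasks reduces the request rate against the API.
max_concurrent: 1
```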

See Troubleshooting for more help.


Understanding Results

What Makes a Task "Resolved"?

For SWE-bench:

  • ✅ Agent generates a patch
  • ✅ Patch applies cleanly to repository
  • ✅ All "fail-to-pass" tests now pass
  • ✅ All "pass-to-pass" tests still pass

For CyberGym:

  • ✅ Agent generates a PoC exploit
  • ✅ PoC crashes the vulnerable version
  • ✅ PoC does NOT crash the patched version
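
The SWE-bench criteria above amount to a simple conjunction. A sketch of that check (an illustrative helper, not mcpbr's internals):

```python
# Sketch of the SWE-bench "resolved" decision; names are illustrative.
def is_resolved(patch_applied: bool,
                fail_to_pass: list[bool],
                pass_to_pass: list[bool]) -> bool:
    """A task is resolved only if the patch applies cleanly and
    every fail-to-pass and pass-to-pass test succeeds."""
    return patch_applied and all(fail_to_pass) and all(pass_to_pass)

print(is_resolved(True, [True, True], [True]))   # all criteria met
print(is_resolved(True, [True, False], [True]))  # a fail-to-pass test still fails
```

Note that a single failing test in either group marks the whole task unresolved; partial fixes get no credit.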

Interpreting Improvement Percentage

Improvement: +60.0%

This means the MCP agent's resolution rate is 60% higher than baseline.

Example:

  • Baseline: 20% (e.g., 5 resolved out of 25 tasks)
  • MCP: 32% (e.g., 8 resolved out of 25 tasks)
  • Improvement: (32 - 20) / 20 = +60%
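
The same computation, spelled out in code:

```python
# Relative improvement of the MCP agent's resolution rate over baseline.
baseline_rate = 20.0  # percent
mcp_rate = 32.0       # percent

improvement = (mcp_rate - baseline_rate) / baseline_rate * 100
print(f"Improvement: +{improvement:.1f}%")
```

Because the improvement is relative, a small absolute gain over a low baseline can produce a large percentage; report both numbers.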

Best Practices

1. Start Small

  • Test with n=5 first
  • Verify your setup works
  • Then scale up to n=25 or more

2. Use Version Control

  • Track your mcpbr.yaml configurations
  • Save results with git tags
  • Compare results over time

3. Run Multiple Iterations

  • Results can vary due to LLM non-determinism
  • Run 3-5 times and average results
  • Report confidence intervals
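
For example, averaging resolution rates from repeated runs with a simple normal-approximation confidence interval (stdlib only; the rates are illustrative, and for n this small a t-value would be more appropriate than 1.96):

```python
import math
import statistics

# Resolution rates (%) from five hypothetical runs of the same config.
rates = [40.0, 35.0, 45.0, 40.0, 38.0]

mean = statistics.mean(rates)
sem = statistics.stdev(rates) / math.sqrt(len(rates))  # standard error of the mean
ci95 = 1.96 * sem  # normal approximation to the 95% interval

print(f"Resolution rate: {mean:.1f}% ± {ci95:.1f}%")
```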

4. Monitor Costs

  • Each task uses API calls
  • SWE-bench tasks: ~$0.10-0.50 per task
  • Set budgets and monitor usage
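
A quick back-of-the-envelope estimate using the per-task range above:

```python
# Rough cost envelope for a run, using the approximate per-task range.
n_tasks = 50
low, high = 0.10, 0.50  # USD per SWE-bench task (approximate)

print(f"Estimated cost for {n_tasks} tasks: "
      f"${n_tasks * low:.2f} to ${n_tasks * high:.2f}")
```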

Ready to dive deeper? Check out Core Concepts to understand how mcpbr works under the hood!