
Getting Started

CodeGraphTheory edited this page Jan 20, 2026 · 1 revision

Getting Started with mcpbr

Welcome! This guide takes you from installation to your first MCP server benchmark results in a handful of commands.

Prerequisites

Before you begin, ensure you have:

  • Python 3.11+ installed
  • Docker running on your machine
  • Anthropic API key (get one at console.anthropic.com)
  • Claude Code CLI installed (pip install claude-code)

Quick Environment Check

# Check Python version
python --version  # Should be 3.11 or higher

# Check Docker
docker info  # Should show Docker daemon info

# Check Claude CLI
which claude  # Should show path to claude binary

Step 1: Install mcpbr

pip install mcpbr

Verify installation:

mcpbr --version

Step 2: Set Your API Key

export ANTHROPIC_API_KEY="sk-ant-..."

💡 Tip: Add this to your ~/.bashrc or ~/.zshrc to persist across sessions.


Step 3: Initialize Configuration

mcpbr init

This creates mcpbr.yaml with an example configuration. Let's examine it:

mcp_server:
  command: "npx"
  args:
    - "-y"
    - "@modelcontextprotocol/server-filesystem"
    - "{workdir}"
  env: {}

provider: "anthropic"
agent_harness: "claude-code"
model: "sonnet"
dataset: "SWE-bench/SWE-bench_Lite"
sample_size: 10
timeout_seconds: 300
max_concurrent: 4
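
The {workdir} token in args is a placeholder that gets filled in with each task's workspace path before the server is launched. A minimal sketch of that substitution, assuming simple string-replace semantics (the helper name here is hypothetical, not part of mcpbr's API):

```python
# Sketch: expanding a "{workdir}" placeholder in server args.
# The args mirror mcpbr.yaml; expand_args is an illustrative helper.
def expand_args(args: list[str], workdir: str) -> list[str]:
    """Replace the {workdir} placeholder in each argument."""
    return [a.replace("{workdir}", workdir) for a in args]

args = ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
expanded = expand_args(args, "/tmp/task-42")
print(expanded)
```

This way a single config can be reused across tasks, each getting its own isolated workspace.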

Step 4: Run Your First Benchmark

mcpbr run -c mcpbr.yaml -n 5 -v

What this does:

  • -c mcpbr.yaml: Use the configuration file
  • -n 5: Run on 5 tasks (small sample for quick test)
  • -v: Verbose output to see progress

⏱️ Estimated time: 5-10 minutes for 5 tasks


Step 5: Understand the Output

You'll see real-time progress:

mcpbr Evaluation
  Config: mcpbr.yaml
  Provider: anthropic
  Model: sonnet
  Dataset: SWE-bench/SWE-bench_Lite
  Sample size: 5

Loading dataset: SWE-bench/SWE-bench_Lite
Evaluating 5 tasks
14:23:15 [MCP] Starting mcp run for astropy-12907:mcp
14:23:22 astropy-12907:mcp    > TodoWrite
14:23:26 astropy-12907:mcp    > Glob
...

And a final summary:

Evaluation Results

                 Summary
+-----------------+-----------+----------+
| Metric          | MCP Agent | Baseline |
+-----------------+-----------+----------+
| Resolved        | 2/5       | 1/5      |
| Resolution Rate | 40.0%     | 20.0%    |
+-----------------+-----------+----------+

Improvement: +100.0%

Step 6: Explore Results

View Detailed Results

mcpbr run -c mcpbr.yaml -n 5 -o results.json -r report.md

Outputs:

  • results.json: Complete results with all metrics
  • report.md: Human-readable markdown report

View Per-Task Logs

mcpbr run -c mcpbr.yaml -n 5 --log-dir logs/

Creates a logs/ directory with a detailed JSON log for each task run.
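
If you want to post-process results programmatically, a summary like the table above can be recomputed from results.json. The structure below is an assumption for illustration; check your actual file for the real field names:

```python
import json

# Hypothetical results.json structure; field names are assumptions,
# not mcpbr's documented schema.
results = json.loads("""
{"tasks": [
  {"id": "astropy-12907", "mcp_resolved": true,  "baseline_resolved": false},
  {"id": "django-11099",  "mcp_resolved": true,  "baseline_resolved": true}
]}
""")

tasks = results["tasks"]
mcp = sum(t["mcp_resolved"] for t in tasks)       # booleans sum as 0/1
base = sum(t["baseline_resolved"] for t in tasks)
print(f"MCP: {mcp}/{len(tasks)}  Baseline: {base}/{len(tasks)}")
```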


Next Steps

Try Different Benchmarks

# Run CyberGym (security vulnerabilities)
mcpbr run -c mcpbr.yaml --benchmark cybergym --level 2 -n 5

# Run TerminalBench (DevOps tasks)
mcpbr run -c mcpbr.yaml --benchmark terminalbench -n 5

# List available benchmarks
mcpbr benchmarks

Customize Your MCP Server

Edit mcpbr.yaml to test your own MCP server:

mcp_server:
  command: "python"
  args:
    - "-m"
    - "my_custom_mcp_server"
    - "--workspace"
    - "{workdir}"
  env:
    LOG_LEVEL: "debug"

Run Specific Tasks

# Run on specific task IDs
mcpbr run -c mcpbr.yaml -t astropy__astropy-12907 -t django__django-11099

Increase Sample Size

# Run on more tasks for statistically significant results
mcpbr run -c mcpbr.yaml -n 50

Common Issues

Docker Permission Denied

# On Linux, add your user to docker group
sudo usermod -aG docker $USER
# Log out and back in

Pre-built Image Not Found

  • This is normal; mcpbr builds the Docker image from scratch
  • First run takes longer (~5-10min per task)
  • Subsequent runs are faster

API Rate Limiting

  • Reduce max_concurrent in config
  • Add delays between runs
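
Concretely, lowering concurrency means editing mcpbr.yaml, for example:

```yaml
# Fewer parallel tasks reduces the request rate against the API.
max_concurrent: 1
```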

See Troubleshooting for more help.


Understanding Results

What Makes a Task "Resolved"?

For SWE-bench:

  • ✅ Agent generates a patch
  • ✅ Patch applies cleanly to repository
  • ✅ All "fail-to-pass" tests now pass
  • ✅ All "pass-to-pass" tests still pass

For CyberGym:

  • ✅ Agent generates a PoC exploit
  • ✅ PoC crashes the vulnerable version
  • ✅ PoC does NOT crash the patched version
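
The SWE-bench criteria above amount to a simple conjunction. A sketch of that check (an illustrative helper, not mcpbr's internals):

```python
# Sketch of the SWE-bench "resolved" decision; names are illustrative.
def is_resolved(patch_applied: bool,
                fail_to_pass: list[bool],
                pass_to_pass: list[bool]) -> bool:
    """A task is resolved only if the patch applies cleanly and
    every fail-to-pass and pass-to-pass test succeeds."""
    return patch_applied and all(fail_to_pass) and all(pass_to_pass)

print(is_resolved(True, [True, True], [True]))   # all criteria met
print(is_resolved(True, [True, False], [True]))  # a fail-to-pass test still fails
```

Note that a single failing test in either group marks the whole task unresolved; partial fixes get no credit.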

Interpreting Improvement Percentage

Improvement: +60.0%

This means the MCP agent's resolution rate is 60% higher than baseline.

Example:

  • Baseline: 20% (e.g., 5 resolved out of 25 tasks)
  • MCP: 32% (e.g., 8 resolved out of 25 tasks)
  • Improvement: (32 - 20) / 20 = +60%
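
The same computation, spelled out in code:

```python
# Relative improvement of the MCP agent's resolution rate over baseline.
baseline_rate = 20.0  # percent
mcp_rate = 32.0       # percent

improvement = (mcp_rate - baseline_rate) / baseline_rate * 100
print(f"Improvement: +{improvement:.1f}%")
```

Because the improvement is relative, a small absolute gain over a low baseline can produce a large percentage; report both numbers.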

Best Practices

1. Start Small

  • Test with n=5 first
  • Verify your setup works
  • Then scale up to n=25 or more

2. Use Version Control

  • Track your mcpbr.yaml configurations
  • Save results with git tags
  • Compare results over time

3. Run Multiple Iterations

  • Results can vary due to LLM non-determinism
  • Run 3-5 times and average results
  • Report confidence intervals
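
For example, averaging resolution rates from repeated runs with a simple normal-approximation confidence interval (stdlib only; the rates are illustrative, and for n this small a t-value would be more appropriate than 1.96):

```python
import math
import statistics

# Resolution rates (%) from five hypothetical runs of the same config.
rates = [40.0, 35.0, 45.0, 40.0, 38.0]

mean = statistics.mean(rates)
sem = statistics.stdev(rates) / math.sqrt(len(rates))  # standard error of the mean
ci95 = 1.96 * sem  # normal approximation to the 95% interval

print(f"Resolution rate: {mean:.1f}% ± {ci95:.1f}%")
```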

4. Monitor Costs

  • Each task uses API calls
  • SWE-bench tasks: ~$0.10-0.50 per task
  • Set budgets and monitor usage
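
A quick back-of-the-envelope estimate using the per-task range above:

```python
# Rough cost envelope for a run, using the approximate per-task range.
n_tasks = 50
low, high = 0.10, 0.50  # USD per SWE-bench task (approximate)

print(f"Estimated cost for {n_tasks} tasks: "
      f"${n_tasks * low:.2f} to ${n_tasks * high:.2f}")
```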

Ready to dive deeper? Check out Core Concepts to understand how mcpbr works under the hood!