# Getting Started

*CodeGraphTheory edited this page Jan 20, 2026*
Welcome! This guide will help you run your first MCP server benchmark in under 5 minutes.
## Prerequisites

Before you begin, ensure you have:
- ✅ Python 3.11+ installed
- ✅ Docker running on your machine
- ✅ Anthropic API key (get one at console.anthropic.com)
- ✅ Claude Code CLI installed (`pip install claude-code`)
```bash
# Check Python version
python --version  # Should be 3.11 or higher

# Check Docker
docker info       # Should show Docker daemon info

# Check Claude CLI
which claude      # Should show path to claude binary
```

## Installation

```bash
pip install mcpbr
```

Verify installation:
```bash
mcpbr --version
```

## Set Your API Key

```bash
export ANTHROPIC_API_KEY="sk-ant-..."
```

💡 Tip: Add this to your `~/.bashrc` or `~/.zshrc` to persist across sessions.
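As a quick sanity check, a short Python snippet (not part of mcpbr itself) can confirm the key is actually visible to child processes before you start a run:

```python
import os

# Check that the Anthropic key is exported and has the expected prefix.
key = os.environ.get("ANTHROPIC_API_KEY", "")
if key.startswith("sk-ant-"):
    print("ANTHROPIC_API_KEY looks set")
else:
    print("ANTHROPIC_API_KEY is missing or malformed")
```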
## Initialize a Configuration

```bash
mcpbr init
```

This creates `mcpbr.yaml` with an example configuration. Let's examine it:

```yaml
mcp_server:
  command: "npx"
  args:
    - "-y"
    - "@modelcontextprotocol/server-filesystem"
    - "{workdir}"
  env: {}

provider: "anthropic"
agent_harness: "claude-code"
model: "sonnet"

dataset: "SWE-bench/SWE-bench_Lite"
sample_size: 10
timeout_seconds: 300
max_concurrent: 4
```

## Run Your First Benchmark

```bash
mcpbr run -c mcpbr.yaml -n 5 -v
```

What this does:
- `-c mcpbr.yaml`: Use the configuration file
- `-n 5`: Run on 5 tasks (small sample for a quick test)
- `-v`: Verbose output to see progress
⏱️ Estimated time: 5-10 minutes for 5 tasks
You'll see real-time progress:
```
mcpbr Evaluation

Config:      mcpbr.yaml
Provider:    anthropic
Model:       sonnet
Dataset:     SWE-bench/SWE-bench_Lite
Sample size: 5

Loading dataset: SWE-bench/SWE-bench_Lite
Evaluating 5 tasks

14:23:15 [MCP] Starting mcp run for astropy-12907:mcp
14:23:22 astropy-12907:mcp > TodoWrite
14:23:26 astropy-12907:mcp > Glob
...
```
And a final summary:
```
Evaluation Results

Summary
+-----------------+-----------+----------+
| Metric          | MCP Agent | Baseline |
+-----------------+-----------+----------+
| Resolved        | 2/5       | 1/5      |
| Resolution Rate | 40.0%     | 20.0%    |
+-----------------+-----------+----------+

Improvement: +100.0%
```
## Saving Results

```bash
mcpbr run -c mcpbr.yaml -n 5 -o results.json -r report.md
```

Outputs:

- `results.json`: Complete results with all metrics
- `report.md`: Human-readable markdown report
```bash
mcpbr run -c mcpbr.yaml -n 5 --log-dir logs/
```

This creates a `logs/` directory with detailed JSON logs for each task run.
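If you want to poke at the logs programmatically, here is a minimal sketch. The exact log schema may vary between mcpbr versions, so this only lists each file and its top-level JSON keys:

```python
import json
from pathlib import Path

# Print each per-task log file and its top-level JSON keys.
# Assumes each log file contains a single JSON object.
for log_file in sorted(Path("logs").glob("*.json")):
    with open(log_file) as f:
        data = json.load(f)
    print(f"{log_file.name}: {sorted(data.keys())}")
```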
## Other Benchmarks

```bash
# Run CyberGym (security vulnerabilities)
mcpbr run -c mcpbr.yaml --benchmark cybergym --level 2 -n 5

# Run TerminalBench (DevOps tasks)
mcpbr run -c mcpbr.yaml --benchmark terminalbench -n 5

# List available benchmarks
mcpbr benchmarks
```

## Testing Your Own MCP Server

Edit `mcpbr.yaml` to test your own MCP server:
```yaml
mcp_server:
  command: "python"
  args:
    - "-m"
    - "my_custom_mcp_server"
    - "--workspace"
    - "{workdir}"
  env:
    LOG_LEVEL: "debug"
```

```bash
# Run on specific task IDs
mcpbr run -c mcpbr.yaml -t astropy__astropy-12907 -t django__django-11099

# Run on more tasks for statistically significant results
mcpbr run -c mcpbr.yaml -n 50
```

## Troubleshooting

**Docker permission errors**

```bash
# On Linux, add your user to the docker group
sudo usermod -aG docker $USER
# Log out and back in
```

**Slow first run**

- This is normal - mcpbr will build from scratch
- The first run takes longer (~5-10 min per task)
- Subsequent runs are faster

**Rate limits**

- Reduce `max_concurrent` in the config
- Add delays between runs
See Troubleshooting for more help.
## What Counts as Resolved?

For SWE-bench:
- ✅ Agent generates a patch
- ✅ Patch applies cleanly to repository
- ✅ All "fail-to-pass" tests now pass
- ✅ All "pass-to-pass" tests still pass
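The SWE-bench criteria above boil down to a simple predicate. As a hypothetical illustration (this is not mcpbr's actual implementation):

```python
def is_resolved(patch_applied: bool,
                fail_to_pass: list[bool],
                pass_to_pass: list[bool]) -> bool:
    """A task is resolved only if the patch applies cleanly, every
    fail-to-pass test now passes, and every pass-to-pass test still passes."""
    return patch_applied and all(fail_to_pass) and all(pass_to_pass)

print(is_resolved(True, [True, True], [True]))   # all criteria met
print(is_resolved(True, [True, False], [True]))  # a fail-to-pass test still fails
```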
For CyberGym:
- ✅ Agent generates a PoC exploit
- ✅ PoC crashes the vulnerable version
- ✅ PoC does NOT crash the patched version
## Understanding the Improvement Metric

```
Improvement: +60.0%
```

This means the MCP agent's resolution rate is 60% higher than the baseline's.
Example:
- Baseline: 20% (1 out of 5 tasks)
- MCP: 32% (1.6 out of 5 tasks, rounded)
- Improvement: (32 - 20) / 20 = +60%
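The arithmetic above, as a one-liner you can check yourself:

```python
def improvement(mcp_rate: float, baseline_rate: float) -> float:
    """Relative improvement of the MCP agent over the baseline, in percent."""
    return (mcp_rate - baseline_rate) / baseline_rate * 100.0

print(improvement(32.0, 20.0))  # 60.0, matching the example above
print(improvement(40.0, 20.0))  # 100.0, matching the sample run earlier
```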
## Tips

**Start small**

- Test with `n=5` first
- Verify your setup works
- Then scale up to `n=25` or more

**Version your experiments**

- Track your `mcpbr.yaml` configurations
- Save results with git tags
- Compare results over time
**Expect variance**

- Results can vary due to LLM non-determinism
- Run 3-5 times and average results
- Report confidence intervals
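For example, averaging repeated runs with a rough normal-approximation confidence interval (the run values below are made up for illustration):

```python
import statistics

# Resolution rates (%) from four hypothetical repeated runs of one config.
rates = [40.0, 36.0, 44.0, 38.0]

mean = statistics.mean(rates)
# Standard error of the mean; 1.96 gives an approximate 95% interval.
sem = statistics.stdev(rates) / len(rates) ** 0.5
print(f"resolution rate: {mean:.1f}% +/- {1.96 * sem:.1f}% (95% CI)")
```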
**Watch costs**

- Each task uses API calls
- SWE-bench tasks: ~$0.10-0.50 per task
- Set budgets and monitor usage
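Using the rough per-task range above, a back-of-the-envelope estimate before launching a large run:

```python
# Back-of-the-envelope cost range for a run, using the rough
# $0.10-0.50 per-task figure quoted above (actual costs vary by model).
n_tasks = 50
low, high = 0.10, 0.50  # USD per SWE-bench task
print(f"{n_tasks} tasks: ${n_tasks * low:.2f} to ${n_tasks * high:.2f}")
```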
## Next Steps

- 📚 Configuration Reference - All config options
- 🛠️ CLI Reference - Complete command documentation
- 🏗️ Architecture - How mcpbr works
- 🎯 Best Practices - Advanced tips and tricks
- 💬 GitHub Discussions - Ask questions
- 🐛 Issue Tracker - Report problems
- 📖 FAQ - Common questions
Ready to dive deeper? Check out Core Concepts to understand how mcpbr works under the hood!