FAQ
mcpbr (Model Context Protocol Benchmark Runner) is a framework for evaluating MCP servers against real-world software engineering benchmarks. It provides apples-to-apples comparisons between tool-assisted and baseline agent performance.
mcpbr helps you:
- Prove your MCP server actually improves agent performance
- Measure improvement objectively with hard numbers
- Compare different MCP server configurations
- Benchmark against standard datasets (SWE-bench, CyberGym, etc.)
Yes! mcpbr is open-source under the MIT License. However, you'll need to pay for:
- API calls to LLM providers (e.g., Anthropic)
- Cloud compute if running in the cloud
Currently:
- ✅ SWE-bench: Bug fixing in real GitHub repos
- ✅ CyberGym: Security vulnerability PoC generation
- ✅ TerminalBench: DevOps and shell scripting tasks
- 🚧 MCPToolBench++: Coming soon (comprehensive MCP tool evaluation)
- Python: 3.11 or higher
- Docker: Latest stable version
- RAM: 8GB minimum (16GB recommended)
- Disk: 50GB free space (for Docker images)
- OS: Linux, macOS, or Windows (with WSL2)
```bash
pip install mcpbr
```
See Getting Started for detailed instructions.
Yes, you need an API key from your LLM provider:
- Anthropic: Get one at console.anthropic.com
- OpenAI: Coming soon
- Google Gemini: Coming soon
No, Docker is required. mcpbr uses Docker to create isolated, reproducible environments for each benchmark task.
Docker ensures:
- Reproducibility: Same environment every time
- Isolation: Tasks don't interfere with each other
- Accuracy: Dependencies match the original task requirements
- Safety: Prevents potential damage to your system
It varies:
- Small sample (n=5): 5-10 minutes
- Medium sample (n=25): 30-60 minutes
- Full dataset (n=300): Several hours to days
Factors:
- Sample size
- Task complexity
- Model speed
- Docker image availability (pre-built vs building from scratch)
- Concurrency setting
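As a rough mental model (plain arithmetic over the factors above, not an mcpbr feature), wall-clock time is roughly the number of concurrent batches times the per-task time:

```python
import math

def estimate_runtime_minutes(n_tasks: int, minutes_per_task: float, max_concurrent: int) -> float:
    """Back-of-envelope wall-clock estimate: tasks run in concurrent batches."""
    batches = math.ceil(n_tasks / max_concurrent)
    return batches * minutes_per_task

# 25 tasks at ~8 minutes each, 4 running concurrently
print(estimate_runtime_minutes(25, 8, 4))  # 56.0 -> within the 30-60 minute range above
```

The per-task minutes here are an assumed average; real tasks vary widely, so treat this as an order-of-magnitude estimate only.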
Approximate costs per task:
- SWE-bench: $0.10 - $0.50 per task
- CyberGym: $0.15 - $0.40 per task
- TerminalBench: $0.05 - $0.20 per task
Example: Running 25 SWE-bench tasks ≈ $2.50 - $12.50
Costs depend on:
- Model used (Opus > Sonnet > Haiku)
- Task complexity
- Number of iterations
- Whether MCP server helps or hinders
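For budgeting, the per-task ranges quoted above can be multiplied out directly. A minimal sketch (the ranges are the approximate figures from this FAQ, not guaranteed pricing):

```python
# Approximate per-task cost ranges (USD), as quoted in this FAQ.
COST_RANGES = {
    "swe-bench": (0.10, 0.50),
    "cybergym": (0.15, 0.40),
    "terminalbench": (0.05, 0.20),
}

def estimate_cost(benchmark: str, n_tasks: int) -> tuple[float, float]:
    """Return a (low, high) cost estimate for running n_tasks of a benchmark."""
    low, high = COST_RANGES[benchmark]
    return (round(n_tasks * low, 2), round(n_tasks * high, 2))

print(estimate_cost("swe-bench", 25))  # (2.5, 12.5), matching the example above
```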
Yes! Use the max_concurrent setting:
```yaml
max_concurrent: 4  # Run 4 tasks simultaneously
```
Recommendations:
- Start with 2-4 for initial testing
- Increase based on your machine's resources
- Watch Docker memory usage
- Higher concurrency = higher API rate limit risk
Use the `-t` flag:
```bash
mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099
```
Not automatically (yet). Workarounds:
- Use `--log-dir` to save progress
- Check which tasks completed
- Run remaining tasks with `-t` flags
Coming soon: Resume functionality (#TBD)
For SWE-bench:
- Agent generates a patch
- Patch applies cleanly
- All required tests pass
For CyberGym:
- Agent generates a PoC exploit
- PoC crashes vulnerable version
- PoC doesn't crash patched version
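The CyberGym criterion boils down to a simple predicate, sketched below (the actual grading lives in mcpbr's benchmark implementation; this is just the logic restated):

```python
def poc_succeeds(crashes_vulnerable: bool, crashes_patched: bool) -> bool:
    """A PoC counts as a success only if it crashes the vulnerable build
    and does NOT crash the patched build (ruling out generic, unrelated crashes)."""
    return crashes_vulnerable and not crashes_patched

assert poc_succeeds(True, False)       # exploits the actual vulnerability
assert not poc_succeeds(True, True)    # crashes both builds: not the patched bug
assert not poc_succeeds(False, False)  # no crash at all
```

Requiring the patched build to survive is what distinguishes a targeted exploit from a PoC that merely crashes the program for some unrelated reason.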
Improvement = (MCP Rate - Baseline Rate) / Baseline Rate
Example:
- Baseline: 20% (5/25 tasks)
- MCP: 32% (8/25 tasks)
- Improvement: (32 - 20) / 20 = +60%
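The formula above, expressed as a small helper (hypothetical, matching the worked example):

```python
def relative_improvement(baseline_solved: int, mcp_solved: int, n_tasks: int) -> float:
    """Relative improvement of the MCP resolve rate over baseline, as a percentage."""
    baseline_rate = baseline_solved / n_tasks
    mcp_rate = mcp_solved / n_tasks
    return (mcp_rate - baseline_rate) / baseline_rate * 100

# Baseline 5/25 (20%), MCP 8/25 (32%)
print(round(relative_improvement(5, 8, 25)))  # 60 -> +60%
```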
It depends on the benchmark:
- +20% to +50%: Good improvement
- +50% to +100%: Excellent improvement
- Over +100%: Outstanding improvement
- Negative: MCP server is hurting performance
Context matters:
- Some MCP servers excel at specific task types
- Cross-benchmark analysis helps identify strengths
Possible reasons:
- Tool overhead: MCP calls add latency
- Distraction: Too many tools confuse the agent
- Wrong tools: Tools not suited for these tasks
- Configuration: MCP server misconfigured
- Timeout: Agent spent too long on tools
Debugging tips:
- Check `--log-dir` output
- Look at tool usage patterns
- Compare task types (coding vs security vs DevOps)
- Try different MCP server configurations
- Minimum viable: n=10 (quick validation)
- Statistically sound: n=25-50 (recommended)
- Research grade: n=100+ (publication quality)
More tasks = more reliable results, but higher cost.
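To see why larger samples are more reliable: under a simple binomial model (an assumption about how to quantify "reliable"), the uncertainty in a measured resolve rate shrinks as 1/sqrt(n):

```python
import math

def resolve_rate_std_error(rate: float, n: int) -> float:
    """Standard error of a measured pass rate under a binomial model."""
    return math.sqrt(rate * (1 - rate) / n)

# Uncertainty when measuring a true ~30% resolve rate at different sample sizes:
for n in (10, 25, 100):
    print(n, round(resolve_rate_std_error(0.30, n), 3))
```

At n=10, the error bars on the resolve rate are wider than the typical baseline-vs-MCP gap, which is why small samples are best treated as smoke tests rather than conclusions.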
Linux:
```bash
sudo usermod -aG docker $USER
# Log out and back in
```
macOS/Windows: Ensure Docker Desktop is running
This is normal! mcpbr will build from scratch:
- First run: ~5-10 minutes per task
- Uses generic Python environment
- Still produces valid results
Solutions:
- Reduce `max_concurrent` in config
- Add delays between runs
- Upgrade your API plan
- Use a different model tier
Increase timeout in config:
```yaml
timeout_seconds: 600  # 10 minutes instead of 5
```
Or use a faster model (Sonnet instead of Opus).
Clean up:
```bash
# Remove mcpbr containers
mcpbr cleanup

# Remove all stopped containers
docker container prune

# Remove unused images
docker image prune -a
```
Install it:
```bash
pip install claude-code

# Verify
which claude
```
Not directly (yet). Coming in Phase 2:
- Custom benchmark YAML support
- See roadmap issue #TBD
Workaround: Extend the Benchmark protocol in your fork.
Currently only Anthropic is supported. Coming soon:
- OpenAI (GPT-4, GPT-3.5)
- Google Gemini
- Alibaba Qwen
See issue #229.
See Contributing Guide and the benchmark protocol in src/mcpbr/benchmarks/base.py.
Yes! Example GitHub Actions workflow:
```yaml
- name: Run mcpbr benchmark
  run: |
    mcpbr run -c config.yaml -n 5 -o results.json
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
Coming soon: Official GitHub Action (#TBD)
- Use Haiku model (cheaper, faster)
- Start with small samples (n=5-10)
- Reduce timeout for simple tasks
- Use pre-built images when available
- Cache Docker images locally
- Run during off-peak hours (if provider has pricing tiers)
- Increase max_concurrent
- Use Haiku model
- Enable Docker image caching (coming soon, #228)
- Use SSD storage for Docker
- Run on powerful hardware (more CPU/RAM)
- Search the FAQ (you're here!)
- Check Troubleshooting
- Search GitHub Issues
- Ask in Discussions
- Open a new issue if you found a bug
Open a feature request with:
- Clear description
- Use case
- Example or mockup (if applicable)
See Contributing Guide! We welcome:
- Bug reports and fixes
- Feature suggestions and implementations
- Documentation improvements
- Example configurations
- Benchmark datasets
mcpbr v1.0 Roadmap - 200+ planned features!
Coming soon! Follow the project for updates.
- 💬 GitHub Discussions
- 📧 Contact maintainers
- 🐛 Report an issue
Can't find your answer? Help us improve this FAQ by suggesting a question!