# Home
mcpbr (Model Context Protocol Benchmark Runner) is a comprehensive framework for evaluating MCP servers against real-world software engineering benchmarks. It provides apples-to-apples comparisons between tool-assisted and baseline agent performance.
## Learning
- Getting Started - Your first benchmark in 5 minutes
- Core Concepts - Understanding mcpbr's architecture
- Tutorials - Step-by-step guides for common workflows
## Building
- Contributing Guide - How to contribute to mcpbr
- Development Setup - Setting up your dev environment
- Architecture Deep Dive - How mcpbr works internally
## Using
- Configuration Reference - All configuration options
- CLI Reference - Complete command documentation
- Benchmark Guide - Working with different benchmarks
- Best Practices - Tips for effective benchmarking
## Community
- Code of Conduct - Our commitment to an inclusive community
- FAQ - Frequently asked questions
- Troubleshooting - Common issues and solutions
MCP servers promise to make LLMs better at coding tasks. But how do you prove it? How do you measure improvement objectively?
mcpbr runs controlled experiments:
- Same model, same tasks, same environment
- Only variable: your MCP server
- Real GitHub issues from SWE-bench (not toy examples)
- Reproducible results via Docker containers
Hard numbers showing whether your MCP server improves agent performance.
```text
Evaluation Results Summary
+-----------------+-----------+----------+
| Metric          | MCP Agent | Baseline |
+-----------------+-----------+----------+
| Resolved        | 8/25      | 5/25     |
| Resolution Rate | 32.0%     | 20.0%    |
+-----------------+-----------+----------+

Improvement: +60.0%
```
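The improvement figure is the relative gain over the baseline resolution rate. A minimal sketch of the arithmetic behind the summary above:

```python
# Resolution counts from the example run above (25 tasks per arm).
mcp_resolved, baseline_resolved, total = 8, 5, 25

mcp_rate = mcp_resolved / total            # 8/25  = 32.0%
baseline_rate = baseline_resolved / total  # 5/25  = 20.0%

# Relative improvement is measured against the baseline rate,
# not as an absolute percentage-point difference.
improvement = (mcp_rate - baseline_rate) / baseline_rate * 100
print(f"Improvement: {improvement:+.1f}%")  # Improvement: +60.0%
```

Note that this is a relative measure: the same +12 percentage points would read as a larger improvement against a weaker baseline.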
| Benchmark | Type | Focus | Status |
|---|---|---|---|
| SWE-bench | Bug Fixing | Real GitHub issues requiring patches | Stable |
| CyberGym | Security | PoC exploit generation for vulnerabilities | Stable |
| TerminalBench | DevOps | Shell scripting and system administration | Beta |
| MCPToolBench++ | Tool Use | MCP-specific tool evaluation | Planned |
We're building the de facto standard for MCP server benchmarking!
- GitHub Discussions - Ask questions, share ideas
- Issue Tracker - Report bugs, request features
- Roadmap - See what's coming next
- 200+ features planned for v1.0!
- Report Bugs - Found an issue? Open a bug report
- Suggest Features - Have an idea? Open a feature request
- Improve Docs - Help make our documentation better
- Submit PRs - Check out good first issues
- Star the Repo - Show your support!
- MCPToolBench++ integration planned - Comprehensive MCP tool use evaluation (see #223)
- Enhanced analytics coming - Tool coverage, error patterns, performance profiling
- Multi-benchmark runs - Run multiple benchmarks in one command
- Cross-benchmark analysis - Compare MCP effectiveness across different task types
- Added CyberGym and TerminalBench support
- Introduced benchmark abstraction layer
- Improved error handling and logging
```bash
# Install
pip install mcpbr

# Initialize configuration
mcpbr init

# Run your first benchmark
mcpbr run -c mcpbr.yaml -n 5 -v
```

See Getting Started for detailed instructions.
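The `mcpbr.yaml` passed via `-c` holds the experiment configuration. A rough sketch of what such a file might contain is shown below; every key name here is an illustrative assumption, not the verified schema, so consult the Configuration Reference for the actual options:

```yaml
# Hypothetical mcpbr.yaml sketch -- key names are assumptions,
# not the verified schema; see the Configuration Reference.
benchmark: swe-bench        # which benchmark to evaluate against
model: <model-name>         # same model is used for both agents
tasks: 5                    # task count (can also be set with -n)
mcp_server:
  command: node my-server/index.js   # hypothetical launch command for the server under test
```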
This project exists thanks to all the people who contribute!
Want to see your name here? Check out our Contributing Guide!
MIT License - see LICENSE for details.
Built with ❤️ by the mcpbr community

GitHub • PyPI • Documentation