CodeGraphTheory edited this page Jan 20, 2026 · 1 revision

Frequently Asked Questions (FAQ)

General Questions

What is mcpbr?

mcpbr (Model Context Protocol Benchmark Runner) is a framework for evaluating MCP servers against real-world software engineering benchmarks. It provides apples-to-apples comparisons between tool-assisted and baseline agent performance.

Why should I use mcpbr?

mcpbr helps you:

  • Prove your MCP server actually improves agent performance
  • Measure improvement objectively with hard numbers
  • Compare different MCP server configurations
  • Benchmark against standard datasets (SWE-bench, CyberGym, etc.)

Is mcpbr free to use?

Yes! mcpbr is open-source under the MIT License. However, you'll need to pay for:

  • API calls to LLM providers (e.g., Anthropic)
  • Cloud compute if running in the cloud

What benchmarks does mcpbr support?

Currently:

  • SWE-bench: Bug fixing in real GitHub repos
  • CyberGym: Security vulnerability PoC generation
  • TerminalBench: DevOps and shell scripting tasks
  • 🚧 MCPToolBench++: Coming soon (comprehensive MCP tool evaluation)

Setup and Installation

What are the system requirements?

  • Python: 3.11 or higher
  • Docker: Latest stable version
  • RAM: 8GB minimum (16GB recommended)
  • Disk: 50GB free space (for Docker images)
  • OS: Linux, macOS, or Windows (with WSL2)

How do I install mcpbr?

pip install mcpbr

See Getting Started for detailed instructions.

Do I need an API key?

Yes, you need an API key from your LLM provider (currently Anthropic; see "Can I use different LLM providers?" below).
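mcpbr picks the key up from the environment; `ANTHROPIC_API_KEY` is the variable used in the CI example later on this page:

```shell
# Export the provider key before running mcpbr.
# "sk-ant-..." is a placeholder -- substitute your real key.
export ANTHROPIC_API_KEY="sk-ant-..."
```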

Can I use mcpbr without Docker?

No, Docker is required. mcpbr uses Docker to create isolated, reproducible environments for each benchmark task.

Why is Docker required?

Docker ensures:

  • Reproducibility: Same environment every time
  • Isolation: Tasks don't interfere with each other
  • Accuracy: Dependencies match the original task requirements
  • Safety: Prevents potential damage to your system

Usage Questions

How long does a benchmark run take?

It varies:

  • Small sample (n=5): 5-10 minutes
  • Medium sample (n=25): 30-60 minutes
  • Full dataset (n=300): Several hours to days

Factors:

  • Sample size
  • Task complexity
  • Model speed
  • Docker image availability (pre-built vs building from scratch)
  • Concurrency setting
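For planning purposes, a back-of-envelope estimate is tasks ÷ concurrency × average minutes per task. The figures below are illustrative, not measured:

```shell
# Rough wall-clock estimate: tasks / concurrency * avg minutes per task
tasks=25; concurrency=4; mins_per_task=8
awk -v t="$tasks" -v c="$concurrency" -v m="$mins_per_task" \
    'BEGIN { printf "~%.0f minutes\n", t / c * m }'
# ~50 minutes
```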

How much does it cost to run benchmarks?

Approximate costs per task:

  • SWE-bench: $0.10 - $0.50 per task
  • CyberGym: $0.15 - $0.40 per task
  • TerminalBench: $0.05 - $0.20 per task

Example: Running 25 SWE-bench tasks ≈ $2.50 - $12.50
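The arithmetic behind that example, reusable for any sample size:

```shell
# Cost range for n SWE-bench tasks at $0.10 - $0.50 per task
tasks=25
awk -v n="$tasks" 'BEGIN { printf "$%.2f - $%.2f\n", n * 0.10, n * 0.50 }'
# $2.50 - $12.50
```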

Costs depend on:

  • Model used (Opus > Sonnet > Haiku)
  • Task complexity
  • Number of iterations
  • Whether MCP server helps or hinders

Can I run benchmarks in parallel?

Yes! Use the max_concurrent setting:

max_concurrent: 4  # Run 4 tasks simultaneously

Recommendations:

  • Start with 2-4 for initial testing
  • Increase based on your machine's resources
  • Watch Docker memory usage
  • Higher concurrency = higher API rate limit risk

How do I run on specific tasks?

Use the -t flag:

mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099

Can I resume a failed run?

Not automatically (yet). Workarounds:

  1. Use --log-dir to save progress
  2. Check which tasks completed
  3. Run remaining tasks with -t flags
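A minimal sketch of steps 2-3, assuming `--log-dir` leaves one subdirectory per completed task under `logs/` (the exact layout may differ; adjust to what your run produces):

```shell
# List task IDs from a previous run that have no log directory yet.
mkdir -p logs/astropy__astropy-12907   # simulate one completed task
for task in astropy__astropy-12907 django__django-11099; do
  [ -d "logs/$task" ] || echo "pending: $task"
done
# pending: django__django-11099
```

Feed each pending ID back into a new run with `-t <task-id>`.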

Coming soon: Resume functionality (#TBD)


Results and Interpretation

What does "resolved" mean?

For SWE-bench:

  • Agent generates a patch
  • Patch applies cleanly
  • All required tests pass

For CyberGym:

  • Agent generates a PoC exploit
  • PoC crashes vulnerable version
  • PoC doesn't crash patched version

How is improvement calculated?

Improvement = (MCP Rate - Baseline Rate) / Baseline Rate

Example:

  • Baseline: 20% (5/25 tasks)
  • MCP: 32% (8/25 tasks)
  • Improvement: (32 - 20) / 20 = +60%
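The same calculation as a one-liner, using the example numbers above:

```shell
# Relative improvement = (MCP rate - baseline rate) / baseline rate
baseline=20; mcp=32
awk -v b="$baseline" -v m="$mcp" 'BEGIN { printf "%+.0f%%\n", (m - b) / b * 100 }'
# +60%
```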

What's a good improvement percentage?

It depends on the benchmark:

  • +20% to +50%: Good improvement
  • +50% to +100%: Excellent improvement
  • +100%+: Outstanding improvement
  • Negative: MCP server is hurting performance

Context matters:

  • Some MCP servers excel at specific task types
  • Cross-benchmark analysis helps identify strengths

Why did my MCP agent perform worse than baseline?

Possible reasons:

  1. Tool overhead: MCP calls add latency
  2. Distraction: Too many tools confuse the agent
  3. Wrong tools: Tools not suited for these tasks
  4. Configuration: MCP server misconfigured
  5. Timeout: Agent spent too long on tools

Debugging tips:

  • Check --log-dir output
  • Look at tool usage patterns
  • Compare task types (coding vs security vs DevOps)
  • Try different MCP server configurations

How many tasks should I run for reliable results?

  • Minimum viable: n=10 (quick validation)
  • Statistically sound: n=25-50 (recommended)
  • Research grade: n=100+ (publication quality)

More tasks = more reliable results, but higher cost.
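For intuition on why n=25-50 is the recommended range: assuming tasks are independent, the margin of error on an observed resolve rate p shrinks as sqrt(p(1-p)/n). For an illustrative p = 30%:

```shell
# Standard error of a resolve rate p over n tasks, in percentage points
awk 'BEGIN {
  p = 0.30
  split("10 25 50 100", ns, " ")
  for (i = 1; i <= 4; i++) {
    n = ns[i]
    printf "n=%d: +/- %.1f pts\n", n, 100 * sqrt(p * (1 - p) / n)
  }
}'
# n=10: +/- 14.5 pts
# n=25: +/- 9.2 pts
# n=50: +/- 6.5 pts
# n=100: +/- 4.6 pts
```

At n=10 the noise can swamp a real effect; by n=25-50 a +50% improvement is usually distinguishable from chance.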


Troubleshooting

"Docker permission denied" error

Linux:

sudo usermod -aG docker $USER
# Log out and back in

macOS/Windows: Ensure Docker Desktop is running

"Pre-built image not found" message

This is normal! mcpbr will build from scratch:

  • First run: ~5-10 minutes per task
  • Uses generic Python environment
  • Still produces valid results

"API rate limit exceeded" error

Solutions:

  1. Reduce max_concurrent in config
  2. Add delays between runs
  3. Upgrade your API plan
  4. Use a different model tier

Tasks timeout frequently

Increase timeout in config:

timeout_seconds: 600  # 10 minutes instead of 5

Or use a faster model (Sonnet instead of Opus).

Docker out of disk space

Clean up:

# Remove mcpbr containers
mcpbr cleanup

# Remove all stopped containers
docker container prune

# Remove unused images
docker image prune -a

Can't find Claude CLI

Install it:

npm install -g @anthropic-ai/claude-code

# Verify
which claude

Advanced Topics

Can I use my own benchmarks?

Not directly (yet). Coming in Phase 2:

  • Custom benchmark YAML support
  • See roadmap issue #TBD

Workaround: Extend the Benchmark protocol in your fork.

Can I use different LLM providers?

Currently only Anthropic is supported. Coming soon:

  • OpenAI (GPT-4, GPT-3.5)
  • Google Gemini
  • Alibaba Qwen

See issue #229.

How do I contribute a new benchmark?

See Contributing Guide and the benchmark protocol in src/mcpbr/benchmarks/base.py.

Can I run mcpbr in CI/CD?

Yes! Example GitHub Actions workflow:

- name: Run mcpbr benchmark
  run: |
    mcpbr run -c config.yaml -n 5 -o results.json
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Coming soon: Official GitHub Action (#TBD)

How do I optimize for cost?

  1. Use Haiku model (cheaper, faster)
  2. Start with small samples (n=5-10)
  3. Reduce timeout for simple tasks
  4. Use pre-built images when available
  5. Cache Docker images locally
  6. Run during off-peak hours (if provider has pricing tiers)

How do I optimize for speed?

  1. Increase max_concurrent
  2. Use Haiku model
  3. Enable Docker image caching (coming soon, #228)
  4. Use SSD storage for Docker
  5. Run on powerful hardware (more CPU/RAM)

Community and Support

How do I get help?

  1. Search the FAQ (you're here!)
  2. Check Troubleshooting
  3. Search GitHub Issues
  4. Ask in Discussions
  5. Open a new issue if you found a bug

How do I request a feature?

Open a feature request with:

  • Clear description
  • Use case
  • Example or mockup (if applicable)

How can I contribute?

See Contributing Guide! We welcome:

  • Bug reports and fixes
  • Feature suggestions and implementations
  • Documentation improvements
  • Example configurations
  • Benchmark datasets

Where's the roadmap?

mcpbr v1.0 Roadmap - 200+ planned features!

Is there a Discord/Slack?

Coming soon! Follow the project for updates.


Still have questions?

Can't find your answer? Help us improve this FAQ by suggesting a question!
