CodeGraphTheory edited this page Jan 20, 2026 · 1 revision

Frequently Asked Questions (FAQ)

General Questions

What is mcpbr?

mcpbr (Model Context Protocol Benchmark Runner) is a framework for evaluating MCP servers against real-world software engineering benchmarks. It provides apples-to-apples comparisons between tool-assisted and baseline agent performance.

Why should I use mcpbr?

mcpbr helps you:

  • Prove your MCP server actually improves agent performance
  • Measure improvement objectively with hard numbers
  • Compare different MCP server configurations
  • Benchmark against standard datasets (SWE-bench, CyberGym, etc.)

Is mcpbr free to use?

Yes! mcpbr is open-source under the MIT License. However, you'll need to pay for:

  • API calls to LLM providers (e.g., Anthropic)
  • Cloud compute if running in the cloud

What benchmarks does mcpbr support?

Currently:

  • SWE-bench: Bug fixing in real GitHub repos
  • CyberGym: Security vulnerability PoC generation
  • TerminalBench: DevOps and shell scripting tasks
  • 🚧 MCPToolBench++: Coming soon (comprehensive MCP tool evaluation)

Setup and Installation

What are the system requirements?

  • Python: 3.11 or higher
  • Docker: Latest stable version
  • RAM: 8GB minimum (16GB recommended)
  • Disk: 50GB free space (for Docker images)
  • OS: Linux, macOS, or Windows (with WSL2)

How do I install mcpbr?

pip install mcpbr

See Getting Started for detailed instructions.

Do I need an API key?

Yes, you need an API key from your LLM provider (currently Anthropic; see "Can I use different LLM providers?" below).
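mcpbr picks the key up from the environment; `ANTHROPIC_API_KEY` is the variable used in the CI example later on this page:

```shell
# Export the provider key before running mcpbr.
# "sk-ant-..." is a placeholder -- substitute your real key.
export ANTHROPIC_API_KEY="sk-ant-..."
```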

Can I use mcpbr without Docker?

No, Docker is required. mcpbr uses Docker to create isolated, reproducible environments for each benchmark task.

Why is Docker required?

Docker ensures:

  • Reproducibility: Same environment every time
  • Isolation: Tasks don't interfere with each other
  • Accuracy: Dependencies match the original task requirements
  • Safety: Prevents potential damage to your system

Usage Questions

How long does a benchmark run take?

It varies:

  • Small sample (n=5): 5-10 minutes
  • Medium sample (n=25): 30-60 minutes
  • Full dataset (n=300): Several hours to days

Factors:

  • Sample size
  • Task complexity
  • Model speed
  • Docker image availability (pre-built vs building from scratch)
  • Concurrency setting
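For planning purposes, a back-of-envelope estimate is tasks ÷ concurrency × average minutes per task. The figures below are illustrative, not measured:

```shell
# Rough wall-clock estimate: tasks / concurrency * avg minutes per task
tasks=25; concurrency=4; mins_per_task=8
awk -v t="$tasks" -v c="$concurrency" -v m="$mins_per_task" \
    'BEGIN { printf "~%.0f minutes\n", t / c * m }'
# ~50 minutes
```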

How much does it cost to run benchmarks?

Approximate costs per task:

  • SWE-bench: $0.10 - $0.50 per task
  • CyberGym: $0.15 - $0.40 per task
  • TerminalBench: $0.05 - $0.20 per task

Example: Running 25 SWE-bench tasks ≈ $2.50 - $12.50
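The arithmetic behind that example, reusable for any sample size:

```shell
# Cost range for n SWE-bench tasks at $0.10 - $0.50 per task
tasks=25
awk -v n="$tasks" 'BEGIN { printf "$%.2f - $%.2f\n", n * 0.10, n * 0.50 }'
# $2.50 - $12.50
```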

Costs depend on:

  • Model used (Opus > Sonnet > Haiku)
  • Task complexity
  • Number of iterations
  • Whether MCP server helps or hinders

Can I run benchmarks in parallel?

Yes! Use the max_concurrent setting:

max_concurrent: 4  # Run 4 tasks simultaneously

Recommendations:

  • Start with 2-4 for initial testing
  • Increase based on your machine's resources
  • Watch Docker memory usage
  • Higher concurrency = higher API rate limit risk

How do I run on specific tasks?

Use the -t flag:

mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099

Can I resume a failed run?

Not automatically (yet). Workarounds:

  1. Use --log-dir to save progress
  2. Check which tasks completed
  3. Run remaining tasks with -t flags
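A minimal sketch of steps 2-3, assuming `--log-dir` leaves one subdirectory per completed task under `logs/` (the exact layout may differ; adjust to what your run produces):

```shell
# List task IDs from a previous run that have no log directory yet.
mkdir -p logs/astropy__astropy-12907   # simulate one completed task
for task in astropy__astropy-12907 django__django-11099; do
  [ -d "logs/$task" ] || echo "pending: $task"
done
# pending: django__django-11099
```

Feed each pending ID back into a new run with `-t <task-id>`.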

Coming soon: Resume functionality (#TBD)


Results and Interpretation

What does "resolved" mean?

For SWE-bench:

  • Agent generates a patch
  • Patch applies cleanly
  • All required tests pass

For CyberGym:

  • Agent generates a PoC exploit
  • PoC crashes vulnerable version
  • PoC doesn't crash patched version

How is improvement calculated?

Improvement = (MCP Rate - Baseline Rate) / Baseline Rate

Example:

  • Baseline: 20% (5/25 tasks)
  • MCP: 32% (8/25 tasks)
  • Improvement: (32 - 20) / 20 = +60%
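The same calculation as a one-liner, using the example numbers above:

```shell
# Relative improvement = (MCP rate - baseline rate) / baseline rate
baseline=20; mcp=32
awk -v b="$baseline" -v m="$mcp" 'BEGIN { printf "%+.0f%%\n", (m - b) / b * 100 }'
# +60%
```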

What's a good improvement percentage?

It depends on the benchmark:

  • +20% to +50%: Good improvement
  • +50% to +100%: Excellent improvement
  • +100%+: Outstanding improvement
  • Negative: MCP server is hurting performance

Context matters:

  • Some MCP servers excel at specific task types
  • Cross-benchmark analysis helps identify strengths

Why did my MCP agent perform worse than baseline?

Possible reasons:

  1. Tool overhead: MCP calls add latency
  2. Distraction: Too many tools confuse the agent
  3. Wrong tools: Tools not suited for these tasks
  4. Configuration: MCP server misconfigured
  5. Timeout: Agent spent too long on tools

Debugging tips:

  • Check --log-dir output
  • Look at tool usage patterns
  • Compare task types (coding vs security vs DevOps)
  • Try different MCP server configurations

How many tasks should I run for reliable results?

  • Minimum viable: n=10 (quick validation)
  • Statistically sound: n=25-50 (recommended)
  • Research grade: n=100+ (publication quality)

More tasks = more reliable results, but higher cost.
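For intuition on why n=25-50 is the recommended range: assuming tasks are independent, the margin of error on an observed resolve rate p shrinks as sqrt(p(1-p)/n). For an illustrative p = 30%:

```shell
# Standard error of a resolve rate p over n tasks, in percentage points
awk 'BEGIN {
  p = 0.30
  split("10 25 50 100", ns, " ")
  for (i = 1; i <= 4; i++) {
    n = ns[i]
    printf "n=%d: +/- %.1f pts\n", n, 100 * sqrt(p * (1 - p) / n)
  }
}'
# n=10: +/- 14.5 pts
# n=25: +/- 9.2 pts
# n=50: +/- 6.5 pts
# n=100: +/- 4.6 pts
```

At n=10 the noise can swamp a real effect; by n=25-50 a +50% improvement is usually distinguishable from chance.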


Troubleshooting

"Docker permission denied" error

Linux:

sudo usermod -aG docker $USER
# Log out and back in

macOS/Windows: Ensure Docker Desktop is running

"Pre-built image not found" message

This is normal! mcpbr will build from scratch:

  • First run: ~5-10 minutes per task
  • Uses generic Python environment
  • Still produces valid results

"API rate limit exceeded" error

Solutions:

  1. Reduce max_concurrent in config
  2. Add delays between runs
  3. Upgrade your API plan
  4. Use a different model tier

Tasks timeout frequently

Increase timeout in config:

timeout_seconds: 600  # 10 minutes instead of 5

Or use a faster model (Sonnet instead of Opus).

Docker out of disk space

Clean up:

# Remove mcpbr containers
mcpbr cleanup

# Remove all stopped containers
docker container prune

# Remove unused images
docker image prune -a

Can't find Claude CLI

Install it:

npm install -g @anthropic-ai/claude-code

# Verify
which claude

Advanced Topics

Can I use my own benchmarks?

Not directly (yet). Coming in Phase 2:

  • Custom benchmark YAML support
  • See roadmap issue #TBD

Workaround: Extend the Benchmark protocol in your fork.

Can I use different LLM providers?

Currently only Anthropic is supported. Coming soon:

  • OpenAI (GPT-4, GPT-3.5)
  • Google Gemini
  • Alibaba Qwen

See issue #229.

How do I contribute a new benchmark?

See Contributing Guide and the benchmark protocol in src/mcpbr/benchmarks/base.py.

Can I run mcpbr in CI/CD?

Yes! Example GitHub Actions workflow:

- name: Run mcpbr benchmark
  run: |
    mcpbr run -c config.yaml -n 5 -o results.json
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Coming soon: Official GitHub Action (#TBD)

How do I optimize for cost?

  1. Use Haiku model (cheaper, faster)
  2. Start with small samples (n=5-10)
  3. Reduce timeout for simple tasks
  4. Use pre-built images when available
  5. Cache Docker images locally
  6. Run during off-peak hours (if provider has pricing tiers)

How do I optimize for speed?

  1. Increase max_concurrent
  2. Use Haiku model
  3. Enable Docker image caching (coming soon, #228)
  4. Use SSD storage for Docker
  5. Run on powerful hardware (more CPU/RAM)

Community and Support

How do I get help?

  1. Search the FAQ (you're here!)
  2. Check Troubleshooting
  3. Search GitHub Issues
  4. Ask in Discussions
  5. Open a new issue if you found a bug

How do I request a feature?

Open a feature request with:

  • Clear description
  • Use case
  • Example or mockup (if applicable)

How can I contribute?

See Contributing Guide! We welcome:

  • Bug reports and fixes
  • Feature suggestions and implementations
  • Documentation improvements
  • Example configurations
  • Benchmark datasets

Where's the roadmap?

mcpbr v1.0 Roadmap - 200+ planned features!

Is there a Discord/Slack?

Coming soon! Follow the project for updates.


Still have questions?

Can't find your answer? Help us improve this FAQ by suggesting a question!
