You changed a prompt. Swapped a model. Updated a tool. Did anything break? Run EvalView. Know for sure.
```bash
pip install evalview && evalview demo   # No API key needed
```

Like it? Give us a ⭐ — it helps more devs discover EvalView.
| Status | What it means | What you do |
|---|---|---|
| ✅ PASSED | Agent behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Agent is calling different tools | Review the diff |
| ⚠️ | Same tools, output quality shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |
```
1. Your agent works correctly
   → evalview run --save-golden   # Save it as your baseline

2. You change something (prompt, model, tools)
   → evalview run --diff          # Compare against baseline

3. EvalView tells you exactly what changed
   → REGRESSION: score 85 → 71
   → TOOLS_CHANGED: +web_search, -calculator

   ...or, if nothing changed:
   → Agent healthy. No regressions detected.
```
That's it. Deterministic proof, no LLM-as-judge required, no API keys needed.
```bash
pip install evalview
evalview quickstart   # Working example in 2 minutes
```

Or try the demo first (zero setup):

```bash
evalview demo   # See regression detection in action
```

Want LLM-as-judge scoring too?

```bash
export OPENAI_API_KEY='your-key'
evalview run   # Adds output quality scoring
```

Prefer local/free evaluation?

```bash
evalview run --judge-provider ollama --judge-model llama3.2
```

| | Observability (LangSmith) | Benchmarks (Braintrust) | EvalView |
|---|---|---|---|
| Answers | "What did my agent do?" | "How good is my agent?" | "Did my agent change?" |
| Detects regressions | ❌ | | ✅ Automatic |
| Golden baseline diffing | ❌ | ❌ | ✅ |
| Works without API keys | ❌ | ❌ | ✅ |
| Free & open source | ❌ | ❌ | ✅ |
| Works offline (Ollama) | ❌ | ❌ | ✅ |
Use observability tools to see what happened. Use EvalView to prove it didn't break.
Talk to your tests. Debug failures. Compare runs.
```bash
evalview chat
```

```
You: run the calculator test
🤖 Running calculator test...
   ✅ Passed (score: 92.5)

You: compare to yesterday
🤖 Score: 92.5 → 87.2 (-5.3)
   Tools: +1 added (validator)
   Cost: $0.003 → $0.005 (+67%)
```
Slash commands: `/run`, `/test`, `/compare`, `/traces`, `/skill`, `/adapters`
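As a rough sketch, the same kinds of requests can presumably also be issued as slash commands instead of natural language; the exact argument syntax is not shown in this README, so treat the invocations below as illustrative assumptions rather than documented usage:

```
You: /traces
You: /compare
```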
Practice agent eval patterns with guided exercises.
```bash
evalview gym
```

```bash
evalview init --ci   # Generates workflow file
```

Or add manually:
```yaml
# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.2.5
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          diff: true
          fail-on: 'REGRESSION'
```

PRs with regressions get blocked. Add a PR comment showing exactly what changed:
```yaml
- run: evalview ci comment
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

| Agent | E2E Testing | Trace Capture |
|---|---|---|
| Claude Code | ✅ | ✅ |
| OpenAI Codex | ✅ | ✅ |
| LangGraph | ✅ | ✅ |
| CrewAI | ✅ | ✅ |
| OpenAI Assistants | ✅ | ✅ |
| Custom (any CLI/API) | ✅ | ✅ |
Also works with: AutoGen • Dify • Ollama • HuggingFace • Any HTTP API
| Feature | Description | Docs |
|---|---|---|
| Golden Traces | Save baselines, detect regressions with `--diff` | → |
| Chat Mode | AI assistant: `/run`, `/test`, `/compare` | → |
| Tool Categories | Match by intent, not exact tool names | → |
| Statistical Mode | Handle flaky LLMs with `--runs N` and pass@k | → |
| Cost & Latency | Automatic threshold enforcement | → |
| HTML Reports | Interactive Plotly charts | → |
| Test Generation | Generate 1000 tests from 1 | → |
| Suite Types | Separate capability vs regression tests | → |
| Difficulty Levels | Filter by `--difficulty hard`, benchmark by tier | → |
| Behavior Coverage | Track tasks, tools, paths tested | → |
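For instance, the flags listed above could be combined on a single run; whether these particular combinations are supported is not stated in this README, so the following is a sketch under that assumption:

```bash
# Sketch only: combining these flags is an assumption, not documented behavior
evalview run --runs 5 --diff            # statistical mode (pass@k) + golden-baseline diff
evalview run --difficulty hard --diff   # diff only the hard-tier tests
```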
Test that your agent's code actually works — not just that the output looks right. Best for teams maintaining SKILL.md workflows for Claude Code or Codex.
```yaml
tests:
  - name: creates-working-api
    input: "Create an express server with /health endpoint"
    expected:
      files_created: ["index.js", "package.json"]
      build_must_pass:
        - "npm install"
        - "npm run lint"
      smoke_tests:
        - command: "node index.js"
          background: true
          health_check: "http://localhost:3000/health"
          expected_status: 200
          timeout: 10
      no_sudo: true
      git_clean: true
```

```bash
evalview skill test tests.yaml --agent claude-code
evalview skill test tests.yaml --agent codex
evalview skill test tests.yaml --agent langgraph
```

| Check | What it catches |
|---|---|
| `build_must_pass` | Code that doesn't compile, missing dependencies |
| `smoke_tests` | Runtime crashes, wrong ports, failed health checks |
| `git_clean` | Uncommitted files, dirty working directory |
| `no_sudo` | Privilege escalation attempts |
| `max_tokens` | Cost blowouts, verbose outputs |
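`max_tokens` appears in the checks table but not in the YAML example above. A minimal sketch of where it would plausibly go, assuming it nests under `expected` alongside the other guards (the key placement and the 4000 budget are assumptions, not confirmed schema):

```yaml
# Sketch: assumes max_tokens sits under `expected` like no_sudo and git_clean
expected:
  max_tokens: 4000   # illustrative budget; fail the test if the agent's output exceeds it
  no_sudo: true
  git_clean: true
```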
| | |
|---|---|
| Getting Started | CLI Reference |
| Golden Traces | CI/CD Integration |
| Tool Categories | Statistical Mode |
| Chat Mode | Evaluation Metrics |
| Skills Testing | Debugging |
| FAQ | |
Guides: Testing LangGraph in CI • Detecting Hallucinations
| Framework | Link |
|---|---|
| Claude Code (E2E) | examples/agent-test/ |
| LangGraph | examples/langgraph/ |
| CrewAI | examples/crewai/ |
| Anthropic Claude | examples/anthropic/ |
| Dify | examples/dify/ |
| Ollama (Local) | examples/ollama/ |
Node.js? See @evalview/node
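Assuming `@evalview/node` follows standard npm conventions, installation would be the usual scoped-package install:

```bash
npm install --save-dev @evalview/node   # assumption: standard npm install; see the package docs for exact setup
```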
- Questions? GitHub Discussions
- Bugs? GitHub Issues
- Want setup help? Email hidai@evalview.com — happy to help configure your first tests
Shipped: Golden traces • Tool categories • Statistical mode • Difficulty levels • Partial sequence credit • Skills validation • E2E agent testing • Build & smoke tests • Health checks • Safety guards (no_sudo, git_clean) • Claude Code & Codex adapters • Opus 4.6 cost tracking • MCP servers • HTML reports • Interactive chat mode • EvalView Gym
Coming: Agent Teams trace analysis • Multi-turn conversations • Grounded hallucination detection • Error compounding metrics • Container isolation
Contributions welcome! See CONTRIBUTING.md.
License: Apache 2.0
Proof that your agent still works.
Get started →
EvalView is an independent open-source project, not affiliated with LangGraph, CrewAI, OpenAI, Anthropic, or any other third party.
