You changed a prompt. Swapped a model. Updated a tool. Did anything break? Run EvalView. Know for sure.
```bash
pip install evalview && evalview demo   # No API key needed
```

Like it? Give us a ⭐ — it helps more devs discover EvalView.
| Status | What it means | What you do |
|---|---|---|
| ✅ PASSED | Agent behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Agent is calling different tools | Review the diff |
| ⚠️ | Same tools, output quality shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |
```
1. Your agent works correctly
   → evalview run --save-golden   # Save it as your baseline

2. You change something (prompt, model, tools)
   → evalview run --diff          # Compare against baseline

3. EvalView tells you exactly what changed
   → REGRESSION: score 85 → 71
   → TOOLS_CHANGED: +web_search, -calculator

   ...or, if nothing changed:
   → Agent healthy. No regressions detected.
```
That's it. Deterministic proof, no LLM-as-judge required, no API keys needed.
```bash
pip install evalview
evalview quickstart   # Working example in 2 minutes
```

Or try the demo first (zero setup):

```bash
evalview demo   # See regression detection in action
```

Want LLM-as-judge scoring too?

```bash
export OPENAI_API_KEY='your-key'
evalview run   # Adds output quality scoring
```

Prefer local/free evaluation?

```bash
evalview run --judge-provider ollama --judge-model llama3.2
```

| | Observability (LangSmith) | Benchmarks (Braintrust) | EvalView |
|---|---|---|---|
| Answers | "What did my agent do?" | "How good is my agent?" | "Did my agent change?" |
| Detects regressions | ❌ | | ✅ Automatic |
| Golden baseline diffing | ❌ | ❌ | ✅ |
| Works without API keys | ❌ | ❌ | ✅ |
| Free & open source | ❌ | ❌ | ✅ |
| Works offline (Ollama) | ❌ | ❌ | ✅ |
Use observability tools to see what happened. Use EvalView to prove it didn't break.
Talk to your tests. Debug failures. Compare runs.
```bash
evalview chat
```

```
You: run the calculator test
🤖 Running calculator test...
   ✅ Passed (score: 92.5)

You: compare to yesterday
🤖 Score: 92.5 → 87.2 (-5.3)
   Tools: +1 added (validator)
   Cost: $0.003 → $0.005 (+67%)
```
Slash commands: `/run`, `/test`, `/compare`, `/traces`, `/skill`, `/adapters`
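As a rough sketch, the same kinds of requests can presumably also be issued as slash commands instead of natural language; the exact argument syntax is not shown in this README, so treat the invocations below as illustrative assumptions rather than documented usage:

```
You: /traces
You: /compare
```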
Practice agent eval patterns with guided exercises.
```bash
evalview gym
```

```bash
evalview init --ci   # Generates workflow file
```

Or add manually:
```yaml
# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.2.5
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          diff: true
          fail-on: 'REGRESSION'
```

PRs with regressions get blocked. Add a PR comment showing exactly what changed:
```yaml
- run: evalview ci comment
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

| Agent | E2E Testing | Trace Capture |
|---|---|---|
| Claude Code | ✅ | ✅ |
| OpenAI Codex | ✅ | ✅ |
| LangGraph | ✅ | ✅ |
| CrewAI | ✅ | ✅ |
| OpenAI Assistants | ✅ | ✅ |
| Custom (any CLI/API) | ✅ | ✅ |
Also works with: AutoGen • Dify • Ollama • HuggingFace • Any HTTP API
| Feature | Description | Docs |
|---|---|---|
| Golden Traces | Save baselines, detect regressions with `--diff` | → |
| Chat Mode | AI assistant: `/run`, `/test`, `/compare` | → |
| Tool Categories | Match by intent, not exact tool names | → |
| Statistical Mode | Handle flaky LLMs with `--runs N` and pass@k | → |
| Cost & Latency | Automatic threshold enforcement | → |
| HTML Reports | Interactive Plotly charts | → |
| Test Generation | Generate 1000 tests from 1 | → |
| Suite Types | Separate capability vs regression tests | → |
| Difficulty Levels | Filter by `--difficulty hard`, benchmark by tier | → |
| Behavior Coverage | Track tasks, tools, paths tested | → |
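For instance, the flags listed above could be combined on a single run; whether these particular combinations are supported is not stated in this README, so the following is a sketch under that assumption:

```bash
# Sketch only: combining these flags is an assumption, not documented behavior
evalview run --runs 5 --diff            # statistical mode (pass@k) + golden-baseline diff
evalview run --difficulty hard --diff   # diff only the hard-tier tests
```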
Test that your agent's code actually works — not just that the output looks right. Best for teams maintaining SKILL.md workflows for Claude Code or Codex.
```yaml
tests:
  - name: creates-working-api
    input: "Create an express server with /health endpoint"
    expected:
      files_created: ["index.js", "package.json"]
      build_must_pass:
        - "npm install"
        - "npm run lint"
      smoke_tests:
        - command: "node index.js"
          background: true
          health_check: "http://localhost:3000/health"
          expected_status: 200
          timeout: 10
      no_sudo: true
      git_clean: true
```

```bash
evalview skill test tests.yaml --agent claude-code
evalview skill test tests.yaml --agent codex
evalview skill test tests.yaml --agent langgraph
```

| Check | What it catches |
|---|---|
| `build_must_pass` | Code that doesn't compile, missing dependencies |
| `smoke_tests` | Runtime crashes, wrong ports, failed health checks |
| `git_clean` | Uncommitted files, dirty working directory |
| `no_sudo` | Privilege escalation attempts |
| `max_tokens` | Cost blowouts, verbose outputs |
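`max_tokens` appears in the checks table but not in the YAML example above. A minimal sketch of where it would plausibly go, assuming it nests under `expected` alongside the other guards (the key placement and the 4000 budget are assumptions, not confirmed schema):

```yaml
# Sketch: assumes max_tokens sits under `expected` like no_sudo and git_clean
expected:
  max_tokens: 4000   # illustrative budget; fail the test if the agent's output exceeds it
  no_sudo: true
  git_clean: true
```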
| | |
|---|---|
| Getting Started | CLI Reference |
| Golden Traces | CI/CD Integration |
| Tool Categories | Statistical Mode |
| Chat Mode | Evaluation Metrics |
| Skills Testing | Debugging |
| FAQ | |
Guides: Testing LangGraph in CI • Detecting Hallucinations
| Framework | Link |
|---|---|
| Claude Code (E2E) | examples/agent-test/ |
| LangGraph | examples/langgraph/ |
| CrewAI | examples/crewai/ |
| Anthropic Claude | examples/anthropic/ |
| Dify | examples/dify/ |
| Ollama (Local) | examples/ollama/ |
Node.js? See @evalview/node
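Assuming `@evalview/node` follows standard npm conventions, installation would be the usual scoped-package install:

```bash
npm install --save-dev @evalview/node   # assumption: standard npm install; see the package docs for exact setup
```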
- Questions? GitHub Discussions
- Bugs? GitHub Issues
- Want setup help? Email hidai@evalview.com — happy to help configure your first tests
Shipped: Golden traces • Tool categories • Statistical mode • Difficulty levels • Partial sequence credit • Skills validation • E2E agent testing • Build & smoke tests • Health checks • Safety guards (no_sudo, git_clean) • Claude Code & Codex adapters • Opus 4.6 cost tracking • MCP servers • HTML reports • Interactive chat mode • EvalView Gym
Coming: Agent Teams trace analysis • Multi-turn conversations • Grounded hallucination detection • Error compounding metrics • Container isolation
Contributions welcome! See CONTRIBUTING.md.
License: Apache 2.0
Proof that your agent still works.
Get started →
EvalView is an independent open-source project, not affiliated with LangGraph, CrewAI, OpenAI, Anthropic, or any other third party.
