EvalView — Proof that your agent still works.

You changed a prompt. Swapped a model. Updated a tool. Did anything break? Run EvalView. Know for sure.

EvalView Demo

pip install evalview && evalview demo   # No API key needed


Like it? Give us a ⭐ — it helps more devs discover EvalView.


What EvalView Catches

| Status | What it means | What you do |
|---|---|---|
| ✅ PASSED | Agent behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Agent is calling different tools | Review the diff |
| ⚠️ OUTPUT_CHANGED | Same tools, output quality shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |

How It Works

1. Your agent works correctly
   → evalview run --save-golden          # Save it as your baseline

2. You change something (prompt, model, tools)
   → evalview run --diff                  # Compare against baseline

3. EvalView tells you exactly what changed
   → REGRESSION: score 85 → 71
   → TOOLS_CHANGED: +web_search, -calculator
   → Agent healthy. No regressions detected.

That's it. Deterministic proof, no LLM-as-judge required, no API keys needed.
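
A test case is just a small YAML file describing an input and what you expect back. The sketch below is illustrative only; the field names (tools, contains) are assumptions modeled on the skills example further down, so check the getting started guide for the real schema.

# tests/weather.yaml (hypothetical sketch, not the official schema)
tests:
  - name: weather-lookup
    input: "What's the weather in Paris today?"
    expected:
      tools: ["web_search"]      # tool calls you expect the agent to make (assumed field)
      contains: "Paris"          # substring the final answer should include (assumed field)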


Quick Start

pip install evalview
evalview quickstart                 # Working example in 2 minutes

Or try the demo first (zero setup):

evalview demo                       # See regression detection in action

Want LLM-as-judge scoring too?

export OPENAI_API_KEY='your-key'
evalview run                        # Adds output quality scoring

Prefer local/free evaluation?

evalview run --judge-provider ollama --judge-model llama3.2

Full getting started guide →


Why EvalView?

Observability tools (LangSmith) answer "What did my agent do?" Benchmark platforms (Braintrust) answer "How good is my agent?" EvalView answers "Did my agent change?" It detects regressions automatically rather than through manual review, diffs against golden baselines, needs no API keys, runs offline with Ollama, and is free and open source.

Use observability tools to see what happened. Use EvalView to prove it didn't break.


Explore & Learn

Interactive Chat

Talk to your tests. Debug failures. Compare runs.

evalview chat
You: run the calculator test
🤖 Running calculator test...
✅ Passed (score: 92.5)

You: compare to yesterday
🤖 Score: 92.5 → 87.2 (-5.3)
   Tools: +1 added (validator)
   Cost: $0.003 → $0.005 (+67%)

Slash commands: /run, /test, /compare, /traces, /skill, /adapters

Chat mode docs →

EvalView Gym

Practice agent eval patterns with guided exercises.

evalview gym

Automate It

GitHub Actions

evalview init --ci    # Generates workflow file

Or add manually:

# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.2.5
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          diff: true
          fail-on: 'REGRESSION'

PRs with regressions get blocked. Add a PR comment showing exactly what changed:

      - run: evalview ci comment
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
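
Putting the two snippets together, a complete job might look like the sketch below. One assumption: it presumes the evalview CLI is on the runner's PATH after the action step; if it is not, add a pip install evalview step before the comment.

# Sketch: action + PR comment in one job (see assumptions above)
name: Agent Health Check
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.2.5
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          diff: true
          fail-on: 'REGRESSION'
      - run: evalview ci comment
        if: always()    # still post the comment when the check above fails
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}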

Full CI/CD setup →


Supported Agents & Frameworks

E2E testing and trace capture: Claude Code • OpenAI Codex • LangGraph • CrewAI • OpenAI Assistants • Custom (any CLI/API)

Also works with: AutoGen • Dify • Ollama • HuggingFace • Any HTTP API

Compatibility details →


Features

| Feature | Description |
|---|---|
| Golden Traces | Save baselines, detect regressions with --diff |
| Chat Mode | AI assistant: /run, /test, /compare |
| Tool Categories | Match by intent, not exact tool names |
| Statistical Mode | Handle flaky LLMs with --runs N and pass@k (example below) |
| Cost & Latency | Automatic threshold enforcement |
| HTML Reports | Interactive Plotly charts |
| Test Generation | Generate 1000 tests from 1 |
| Suite Types | Separate capability vs regression tests |
| Difficulty Levels | Filter by --difficulty hard, benchmark by tier |
| Behavior Coverage | Track tasks, tools, paths tested |
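
As one concrete example of the statistical and difficulty options above (combining these flags this way is an assumption; see the CLI reference for exact usage):

# Repeat each test 5 times to smooth over flaky LLM output, then diff against the golden baseline
evalview run --runs 5 --diff

# Only run the hard tier
evalview run --difficulty hard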

Advanced: Skills Testing

Test that your agent's code actually works — not just that the output looks right. Best for teams maintaining SKILL.md workflows for Claude Code or Codex.

tests:
  - name: creates-working-api
    input: "Create an express server with /health endpoint"
    expected:
      files_created: ["index.js", "package.json"]
      build_must_pass:
        - "npm install"
        - "npm run lint"
      smoke_tests:
        - command: "node index.js"
          background: true
          health_check: "http://localhost:3000/health"
          expected_status: 200
          timeout: 10
      no_sudo: true
      git_clean: true

evalview skill test tests.yaml --agent claude-code
evalview skill test tests.yaml --agent codex
evalview skill test tests.yaml --agent langgraph

| Check | What it catches |
|---|---|
| build_must_pass | Code that doesn't compile, missing dependencies |
| smoke_tests | Runtime crashes, wrong ports, failed health checks |
| git_clean | Uncommitted files, dirty working directory |
| no_sudo | Privilege escalation attempts |
| max_tokens | Cost blowouts, verbose outputs |

Skills testing docs →


Documentation

Getting Started • CLI Reference • Golden Traces • CI/CD Integration • Tool Categories • Statistical Mode • Chat Mode • Evaluation Metrics • Skills Testing • Debugging • FAQ

Guides: Testing LangGraph in CI • Detecting Hallucinations


Examples

| Framework | Link |
|---|---|
| Claude Code (E2E) | examples/agent-test/ |
| LangGraph | examples/langgraph/ |
| CrewAI | examples/crewai/ |
| Anthropic Claude | examples/anthropic/ |
| Dify | examples/dify/ |
| Ollama (Local) | examples/ollama/ |

Node.js? See @evalview/node


Get Help


Roadmap

Shipped: Golden traces • Tool categories • Statistical mode • Difficulty levels • Partial sequence credit • Skills validation • E2E agent testing • Build & smoke tests • Health checks • Safety guards (no_sudo, git_clean) • Claude Code & Codex adapters • Opus 4.6 cost tracking • MCP servers • HTML reports • Interactive chat mode • EvalView Gym

Coming: Agent Teams trace analysis • Multi-turn conversations • Grounded hallucination detection • Error compounding metrics • Container isolation

Vote on features →


Contributing

Contributions welcome! See CONTRIBUTING.md.

License: Apache 2.0


Proof that your agent still works.
Get started →


EvalView is an independent open-source project, not affiliated with LangGraph, CrewAI, OpenAI, Anthropic, or any other third party.
