
Agent Evaluation & Testing Framework #14

@antiv

Description


Summary

Add a built-in evaluation system: define test suites per agent from input/expected-output pairs, run evaluations on demand or whenever the agent's config changes, and track quality scores over time. Support multiple scorer types.
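To make the per-agent suite shape concrete, here is a hypothetical suite definition; the field names follow the input/expected_output/scorer_type triple from the Scope below, while the agent name, suite name, and dict-based format are invented for illustration and are not a committed schema:

```python
# Illustrative suite definition. The agent/suite names and the dict layout
# are assumptions; only the per-case field names come from the issue.
suite = {
    "agent": "support-bot",   # hypothetical agent name
    "name": "smoke",          # hypothetical suite name
    "cases": [
        {
            "input": "What is your refund policy?",
            "expected_output": "refund",
            "scorer_type": "contains",
        },
        {
            "input": "Give me a ticket id.",
            "expected_output": r"TKT-\d+",
            "scorer_type": "regex",
        },
    ],
}

# Every case carries the three fields the issue specifies.
required = {"input", "expected_output", "scorer_type"}
assert all(required <= set(case) for case in suite["cases"])
```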

Motivation

W&B Weave, Strongly.AI, and LangSmith all offer evals. MATE has 134 unit tests but no agent-behavior testing. Without evals, users can't measure whether a config change improved or degraded the agent.

Scope

  • New agent_evaluations and evaluation_results tables
  • Test suite definition: a per-agent list of (input, expected_output, scorer_type) cases
  • Scorer types: exact match, regex, contains, semantic similarity, LLM-as-judge
  • Run evaluations: on demand, on config save (optional), via API
  • Dashboard eval results page: pass/fail rates, score trends over time, per-case details
  • CI hook: python -m mate.eval --agent <name> --suite <suite> for pipeline integration
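A minimal sketch of the first three scorer types listed above, assuming a simple (output, expected) signature; the function and registry names are illustrative, not the final MATE API (semantic similarity and LLM-as-judge are omitted since they need model calls):

```python
import re

# Hypothetical scorer registry for the deterministic scorer types.
# Each scorer takes the agent's output and the case's expected_output
# and returns True on pass.

def score_exact(output: str, expected: str) -> bool:
    """Pass only when the output matches expected exactly."""
    return output == expected

def score_regex(output: str, expected: str) -> bool:
    """Treat expected as a regular expression; pass on any match."""
    return re.search(expected, output) is not None

def score_contains(output: str, expected: str) -> bool:
    """Pass when expected appears as a substring of the output."""
    return expected in output

SCORERS = {
    "exact": score_exact,
    "regex": score_regex,
    "contains": score_contains,
}

def run_case(output: str, expected: str, scorer_type: str) -> bool:
    """Dispatch a single eval case to the configured scorer."""
    return SCORERS[scorer_type](output, expected)
```

Keeping scorers behind a registry like this would let the semantic-similarity and LLM-as-judge variants plug in later under the same `scorer_type` key.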

Acceptance Criteria

  • Define evaluation suites via dashboard
  • Run evaluations and view results with pass/fail per case
  • At least 3 scorer types implemented
  • Score trend charts over time
  • CLI command for CI/CD integration

Metadata


    Labels

    P1 (High priority) · enhancement (New feature or request) · testing (Testing and evaluation)
