Labels
P1 (High priority) · enhancement (New feature or request) · testing (Testing and evaluation)
Description
Summary
Built-in evaluation system: define test suites per agent with input/expected-output pairs, run evaluations on demand or on config change, and track quality scores over time. Support multiple scorer types.
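To make the proposed suite shape concrete, here is a minimal sketch of a per-agent suite as plain Python data. All field names (`agent`, `cases`, `scorer_type`, etc.) are illustrative assumptions, not MATE's actual schema:

```python
# Hypothetical evaluation suite definition; field names are illustrative,
# not MATE's actual schema. Each case pairs an input with an expected
# output and names the scorer used to compare them.
suite = {
    "agent": "support-bot",  # hypothetical agent name
    "cases": [
        {"input": "What is 2+2?", "expected_output": "4", "scorer_type": "exact"},
        {"input": "Reset my password", "expected_output": r"password\s+reset", "scorer_type": "regex"},
        {"input": "Say hello", "expected_output": "hello", "scorer_type": "contains"},
    ],
}
```

A suite like this would be stored per agent (the proposed `agent_evaluations` table) and each run's per-case scores recorded separately (the proposed `evaluation_results` table).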
Motivation
W&B Weave, Strongly.AI, and LangSmith all offer evals. MATE has 134 unit tests but no agent behavior testing. Without evals, users can't measure whether a config change improved or degraded the agent.
Scope
- New `agent_evaluations` and `evaluation_results` tables
- Test suite definition: a list of (input, expected_output, scorer_type) entries per agent
- Scorer types: exact match, regex, contains, semantic similarity, LLM-as-judge
- Run evaluations: on demand, on config save (optional), via API
- Dashboard eval results page: pass/fail rates, score trends over time, per-case details
- CI hook: `python -m mate.eval --agent <name> --suite <suite>` for pipeline integration
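The three deterministic scorer types in the list above (exact match, regex, contains) could be sketched as a single dispatch function; semantic similarity and LLM-as-judge would additionally need a model call. This is a sketch of the proposed behavior, not MATE's implementation:

```python
import re

def score(scorer_type: str, expected: str, actual: str) -> bool:
    """Hypothetical scorer dispatch for the three deterministic scorer
    types proposed in this issue. Returns True on a passing case."""
    if scorer_type == "exact":
        # Whitespace-insensitive literal comparison.
        return actual.strip() == expected.strip()
    if scorer_type == "regex":
        # expected is treated as a pattern searched anywhere in actual.
        return re.search(expected, actual) is not None
    if scorer_type == "contains":
        # Case-insensitive substring check.
        return expected.lower() in actual.lower()
    raise ValueError(f"unknown scorer_type: {scorer_type}")
```

Returning a boolean keeps pass/fail aggregation trivial; the similarity-based scorers would instead yield a float score with a pass threshold.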
Acceptance Criteria
- Define evaluation suites via dashboard
- Run evaluations and view results with pass/fail per case
- At least 3 scorer types implemented
- Score trend charts over time
- CLI command for CI/CD integration