Conversation

@AnilSorathiya AnilSorathiya commented Dec 17, 2025

Pull Request Description

Add agentic metrics from deepeval as scorers

What and why?

This PR adds four new agentic evaluation scorers from the deepeval library to enable comprehensive evaluation of LLM agents. These metrics evaluate different aspects of agent behavior:

New Scorers Added:

  1. ArgumentCorrectness - Evaluates whether agents generate correct arguments for tool calls. This is important because selecting the right tool with wrong arguments is as problematic as selecting the wrong tool entirely. This metric is fully LLM-based and referenceless.

  2. PlanAdherence - Evaluates whether agents follow their own plan during execution. Creating a good plan is only half the battle—an agent that deviates from its strategy mid-execution undermines its own reasoning.

  3. PlanQuality - Evaluates whether the plan generated by the agent is logical, complete, and efficient for accomplishing the given task. It extracts the task and plan from the agent's trace and uses an LLM judge to assess plan quality.

  4. ToolCorrectness - Evaluates whether the agent called the expected tools in a task, and whether argument and response information matches ground truth expectations. This metric compares the tools the agent actually called to the list of expected tools.
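
  As a minimal sketch of the underlying deepeval metric for the tool-correctness case (the validmind scorers wrap these metrics; the wrapper API itself is not shown here, and this assumes a recent deepeval release where ToolCall is the expected element type, with a hypothetical banking tool name):

    from deepeval.metrics import ToolCorrectnessMetric
    from deepeval.test_case import LLMTestCase, ToolCall

    # Compare the tools the agent actually called against the expected tools.
    test_case = LLMTestCase(
        input="What is my current account balance?",
        actual_output="Your balance is $1,250.",
        tools_called=[ToolCall(name="get_account_balance")],   # hypothetical tool name
        expected_tools=[ToolCall(name="get_account_balance")],
    )

    metric = ToolCorrectnessMetric()
    metric.measure(test_case)
    print(metric.score)  # expected to be 1.0 here, since called and expected tools match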

Before: Only TaskCompletion and AnswerRelevancy were available for agent evaluation, providing limited coverage of agent behavior.

After: Users now have access to a comprehensive suite of agentic evaluation metrics covering:

  • Action layer evaluation (ArgumentCorrectness, ToolCorrectness)
  • Reasoning layer evaluation (PlanAdherence, PlanQuality)
  • Task completion evaluation (existing TaskCompletion)

Additional Changes:

  • Added helper functions in __init__.py to extract tool calls from agent outputs (extract_tool_calls_from_agent_output, _extract_tool_responses, _extract_tool_calls_from_message, _convert_to_tool_call_list); see the illustrative sketch after this list
  • Updated TaskCompletion.py to include _trace_dict in test cases for better traceability
  • Updated demo notebooks to showcase the new metrics
  • Enhanced agent_dataset.py to better support the new metrics
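
  The actual helpers live in the package's __init__.py and operate on the agent's trace. Purely as an illustration of the extraction idea (not the PR's implementation), collecting tool calls from LangChain-style messages might look like this, assuming each AI message exposes a tool_calls list of dicts as LangChain's AIMessage does:

    from typing import Any, Dict, List

    def extract_tool_calls(messages: List[Any]) -> List[Dict[str, Any]]:
        """Collect every tool call (name and arguments) emitted by the agent."""
        calls: List[Dict[str, Any]] = []
        for message in messages:
            # Messages without tool calls (human/tool messages) are skipped.
            for call in getattr(message, "tool_calls", None) or []:
                calls.append({"name": call["name"], "args": call.get("args", {})})
        return calls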

How to test

  1. Install dependencies:

    pip install validmind[llm]
  2. Run the demo notebook:

    • Open notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb
    • Execute all cells to see the new scorers in action
    • The notebook demonstrates all four new metrics with a banking agent use case
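
  Optionally, the notebook can also be executed non-interactively (assuming Jupyter and nbconvert are available in the environment):

    jupyter nbconvert --to notebook --execute \
      notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb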

What needs special review?

Dependencies, breaking changes, and deployment notes

Dependencies:

  • Requires deepeval package (already included in validmind[llm] extra)
  • No new external dependencies beyond what's already in the llm extra

Breaking changes:

  • None. This is a purely additive change.

Deployment notes:

  • No special deployment considerations
  • The new scorers are automatically available when validmind[llm] is installed
  • Existing code is unaffected

Release notes

Enhancement: Added four new agentic evaluation metrics from deepeval to enable comprehensive evaluation of LLM agents.

This release introduces four new scorers for evaluating agent behavior:

  • ArgumentCorrectness: Evaluates whether agents generate correct arguments for tool calls
  • PlanAdherence: Evaluates whether agents follow their own execution plans
  • PlanQuality: Evaluates the logical quality, completeness, and efficiency of agent-generated plans
  • ToolCorrectness: Evaluates whether agents call the expected tools with correct arguments

These metrics complement existing agent evaluation capabilities and provide coverage for both the action layer (tool usage) and reasoning layer (planning) of agent behavior. All metrics are available through the validmind.scorer.llm.deepeval module and require the validmind[llm] installation.
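
  For orientation only, the scorers would be imported along these lines. Note that the namespace was renamed from scorer to scorers during review (see the discussion below), so the plural path shown here is an assumption about the merged layout:

    # Import sketch; adjust the module path to match the installed version.
    from validmind.scorers.llm.deepeval import (
        ArgumentCorrectness,
        PlanAdherence,
        PlanQuality,
        ToolCorrectness,
    )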

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@AnilSorathiya added the enhancement (New feature or request) label on Dec 17, 2025

cachafla commented Jan 5, 2026

Anyone using scorers in prod, @rmcmen-vm? I assume not?

@AnilSorathiya thoughts about renaming the scorer directory to scorers (plural)?

This would make the file/test paths more consistent with everything else we already have, so it would look like this:

validmind.scorers.llm.deepeval.ArgumentCorrectness

@rmcmen-vm

Replying to "in prod": nothing in production with customers. The "Agents" demo notebook is the only spot, but nothing crazy.

@cachafla cachafla left a comment

Great! Small request: let's add agents or agentic to the list of supported tags for the deepeval tests. Currently it looks like this:

@tags("llm", "ArgumentCorrectness", "deepeval", "agent_evaluation", "action_layer")

But I think a more dedicated "agents"/"agentic" tag could help a lot.
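
Applying the suggestion would just mean appending the new tag to the decorator, for example (with "agentic" chosen here as one of the two proposed tag names):

    @tags("llm", "ArgumentCorrectness", "deepeval", "agent_evaluation", "action_layer", "agentic")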


github-actions bot commented Jan 8, 2026

PR Summary

This PR performs a comprehensive refactor of the scorer functionality by renaming all imports, usage, and identifiers from the old scorer namespace to the new scorers namespace. The changes affect multiple modules and test files, including adjustments in the provider and dataset handling code. In addition to the renaming, the PR updates LLM evaluation metrics and integrations for DeepEval, including new implementations for metrics such as ArgumentCorrectness, PlanAdherence, PlanQuality, and ToolCorrectness. The notebooks and agent demo scripts have been updated to use the new LLM model (e.g., changed from gpt-4o-mini to gpt-5-mini with additional reasoning parameters) and to sample a smaller dataset for testing. Minor adjustments in configuration files (e.g., in pyproject.toml and poetry.lock) reflect dependency updates and are unrelated to formatting changes. Overall, the functional changes principally enhance consistency in scoring API naming and improve the integration of comprehensive LLM-based evaluation metrics for agents.

Test Suggestions

  • Run the full test suite to ensure that all unit, integration, and scoring tests pass with the new 'scorers' naming convention.
  • Verify that the agent evaluation notebook performs correctly, especially checking that new model parameters and tool calls parse as expected.
  • Test the LLM evaluation metrics (ArgumentCorrectness, PlanAdherence, PlanQuality, ToolCorrectness) with both valid and edge-case datasets to validate results and error handling.
  • Perform regression tests for previous scoring functionalities to confirm backward compatibility and correct score column naming.

@AnilSorathiya AnilSorathiya merged commit de66e58 into main Jan 8, 2026
16 checks passed
@AnilSorathiya AnilSorathiya deleted the anilsorathiya/sc-13625/add-agentic-metrics-from-deepeval-as-scorers branch January 8, 2026 11:39