Conversation

@AnilSorathiya AnilSorathiya commented Dec 17, 2025

Pull Request Description

Add agentic metrics from deepeval as scorers

What and why?

This PR adds four new agentic evaluation scorers from the deepeval library to enable comprehensive evaluation of LLM agents. These metrics evaluate different aspects of agent behavior:

New Scorers Added:

  1. ArgumentCorrectness - Evaluates whether agents generate correct arguments for tool calls. This is important because selecting the right tool with wrong arguments is as problematic as selecting the wrong tool entirely. This metric is fully LLM-based and referenceless.

  2. PlanAdherence - Evaluates whether agents follow their own plan during execution. Creating a good plan is only half the battle—an agent that deviates from its strategy mid-execution undermines its own reasoning.

  3. PlanQuality - Evaluates whether the plan generated by the agent is logical, complete, and efficient for accomplishing the given task. It extracts the task and plan from the agent's trace and uses an LLM judge to assess plan quality.

  4. ToolCorrectness - Evaluates whether the agent called the expected tools in a task, and whether argument and response information matches ground truth expectations. This metric compares the tools the agent actually called to the list of expected tools.
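
  As a minimal sketch of the underlying deepeval metric for the tool-correctness case (the validmind scorers wrap these metrics; the wrapper API itself is not shown here, and this assumes a recent deepeval release where ToolCall is the expected element type, with a hypothetical banking tool name):

    from deepeval.metrics import ToolCorrectnessMetric
    from deepeval.test_case import LLMTestCase, ToolCall

    # Compare the tools the agent actually called against the expected tools.
    test_case = LLMTestCase(
        input="What is my current account balance?",
        actual_output="Your balance is $1,250.",
        tools_called=[ToolCall(name="get_account_balance")],   # hypothetical tool name
        expected_tools=[ToolCall(name="get_account_balance")],
    )

    metric = ToolCorrectnessMetric()
    metric.measure(test_case)
    print(metric.score)  # expected to be 1.0 here, since called and expected tools match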

Before: Only TaskCompletion and AnswerRelevancy were available for agent evaluation, providing limited coverage of agent behavior.

After: Users now have access to a comprehensive suite of agentic evaluation metrics covering:

  • Action layer evaluation (ArgumentCorrectness, ToolCorrectness)
  • Reasoning layer evaluation (PlanAdherence, PlanQuality)
  • Task completion evaluation (existing TaskCompletion)

Additional Changes:

  • Added helper functions in __init__.py to extract tool calls from agent outputs (extract_tool_calls_from_agent_output, _extract_tool_responses, _extract_tool_calls_from_message, _convert_to_tool_call_list); see the illustrative sketch after this list
  • Updated TaskCompletion.py to include _trace_dict in test cases for better traceability
  • Updated demo notebooks to showcase the new metrics
  • Enhanced agent_dataset.py to better support the new metrics
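
  The actual helpers live in the package's __init__.py and operate on the agent's trace. Purely as an illustration of the extraction idea (not the PR's implementation), collecting tool calls from LangChain-style messages might look like this, assuming each AI message exposes a tool_calls list of dicts as LangChain's AIMessage does:

    from typing import Any, Dict, List

    def extract_tool_calls(messages: List[Any]) -> List[Dict[str, Any]]:
        """Collect every tool call (name and arguments) emitted by the agent."""
        calls: List[Dict[str, Any]] = []
        for message in messages:
            # Messages without tool calls (human/tool messages) are skipped.
            for call in getattr(message, "tool_calls", None) or []:
                calls.append({"name": call["name"], "args": call.get("args", {})})
        return calls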

How to test

  1. Install dependencies:

    pip install validmind[llm]
  2. Run the demo notebook:

    • Open notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb
    • Execute all cells to see the new scorers in action
    • The notebook demonstrates all four new metrics with a banking agent use case
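
  Optionally, the notebook can also be executed non-interactively (assuming Jupyter and nbconvert are available in the environment):

    jupyter nbconvert --to notebook --execute \
      notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb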

What needs special review?

Dependencies, breaking changes, and deployment notes

Dependencies:

  • Requires deepeval package (already included in validmind[llm] extra)
  • No new external dependencies beyond what's already in the llm extra

Breaking changes:

  • None. This is a purely additive change.

Deployment notes:

  • No special deployment considerations
  • The new scorers are automatically available when validmind[llm] is installed
  • Existing code is unaffected

Release notes

Enhancement: Added four new agentic evaluation metrics from deepeval to enable comprehensive evaluation of LLM agents.

This release introduces four new scorers for evaluating agent behavior:

  • ArgumentCorrectness: Evaluates whether agents generate correct arguments for tool calls
  • PlanAdherence: Evaluates whether agents follow their own execution plans
  • PlanQuality: Evaluates the logical quality, completeness, and efficiency of agent-generated plans
  • ToolCorrectness: Evaluates whether agents call the expected tools with correct arguments

These metrics complement existing agent evaluation capabilities and provide coverage for both the action layer (tool usage) and reasoning layer (planning) of agent behavior. All metrics are available through the validmind.scorer.llm.deepeval module and require the validmind[llm] installation.
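
  For orientation only, the scorers would be imported along these lines. Note that the namespace was renamed from scorer to scorers during review (see the discussion below), so the plural path shown here is an assumption about the merged layout:

    # Import sketch; adjust the module path to match the installed version.
    from validmind.scorers.llm.deepeval import (
        ArgumentCorrectness,
        PlanAdherence,
        PlanQuality,
        ToolCorrectness,
    )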

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@AnilSorathiya added the enhancement (New feature or request) label on Dec 17, 2025

cachafla commented Jan 5, 2026

Anyone using scorers in prod, @rmcmen-vm? I assume not?

@AnilSorathiya thoughts about renaming the scorer directory to scorers (plural)?

This would make the file/test paths more consistent with everything else we already have, so it would look like this:

validmind.scorers.llm.deepeval.ArgumentCorrectness

@rmcmen-vm

Replying to "in prod": nothing in production with customers. The "Agents" demo notebook is the only spot, but nothing crazy.

@cachafla cachafla left a comment

Great! Small request: let's add agents or agentic to the list of supported tags for the deepeval tests. Currently it looks like this:

@tags("llm", "ArgumentCorrectness", "deepeval", "agent_evaluation", "action_layer")

But I think a more dedicated "agents"/"agentic" tag could help a lot.
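
Applying the suggestion would just mean appending the new tag to the decorator, for example (with "agentic" chosen here as one of the two proposed tag names):

    @tags("llm", "ArgumentCorrectness", "deepeval", "agent_evaluation", "action_layer", "agentic")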


github-actions bot commented Jan 8, 2026

PR Summary

This PR performs a comprehensive refactor of the scorer functionality by renaming all imports, usage, and identifiers from the old scorer namespace to the new scorers namespace. The changes affect multiple modules and test files, including adjustments in the provider and dataset handling code. In addition to the renaming, the PR updates LLM evaluation metrics and integrations for DeepEval, including new implementations for metrics such as ArgumentCorrectness, PlanAdherence, PlanQuality, and ToolCorrectness. The notebooks and agent demo scripts have been updated to use the new LLM model (e.g., changed from gpt-4o-mini to gpt-5-mini with additional reasoning parameters) and to sample a smaller dataset for testing. Minor adjustments in configuration files (e.g., in pyproject.toml and poetry.lock) reflect dependency updates and are unrelated to formatting changes. Overall, the functional changes principally enhance consistency in scoring API naming and improve the integration of comprehensive LLM-based evaluation metrics for agents.

Test Suggestions

  • Run the full test suite to ensure that all unit, integration, and scoring tests pass with the new 'scorers' naming convention.
  • Verify that the agent evaluation notebook performs correctly, especially checking that new model parameters and tool calls parse as expected.
  • Test the LLM evaluation metrics (ArgumentCorrectness, PlanAdherence, PlanQuality, ToolCorrectness) with both valid and edge-case datasets to validate results and error handling.
  • Perform regression tests for previous scoring functionalities to confirm backward compatibility and correct score column naming.

@AnilSorathiya AnilSorathiya merged commit de66e58 into main Jan 8, 2026
16 checks passed
@AnilSorathiya AnilSorathiya deleted the anilsorathiya/sc-13625/add-agentic-metrics-from-deepeval-as-scorers branch January 8, 2026 11:39