Add agentic metrics from deepeval as scorers #460
Conversation
Anyone using scorers in prod @rmcmen-vm? I assume not? @AnilSorathiya, thoughts about renaming the scorer paths? This would make the file/test paths more consistent with everything else we already have, so it would look like this:
Replying to "in prod": nothing in production with customers. The "Agents" demo notebook is the only spot, and nothing crazy there.
cachafla left a comment:
Great! Small request: let's add agents or agentic to the list of supported tags for the deepeval tests. Currently it looks like this:
`@tags("llm", "ArgumentCorrectness", "deepeval", "agent_evaluation", "action_layer")`
But I think a more dedicated "agents"/"agentic" tag could help a lot.
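For illustration, the requested change could be as small as appending the new tag to that decorator. This is a sketch only: it assumes the existing `@tags` helper from validmind, and the function name and signature are placeholders.

```python
from validmind import tags

# Same tag list as above, plus a dedicated "agentic" tag for filtering
@tags("llm", "ArgumentCorrectness", "deepeval", "agent_evaluation", "action_layer", "agentic")
def ArgumentCorrectness(model, dataset):  # placeholder signature, not the real scorer
    ...
```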
PR Summary
This PR performs a comprehensive refactor of the scorer functionality, renaming all imports, usage, and identifiers from the old naming scheme.
Test Suggestions
Pull Request Description
Add agentic metrics from deepeval as scorers
What and why?
This PR adds four new agentic evaluation scorers from the deepeval library to enable comprehensive evaluation of LLM agents. These metrics evaluate different aspects of agent behavior:
New Scorers Added:
ArgumentCorrectness - Evaluates whether agents generate correct arguments for tool calls. This is important because selecting the right tool with wrong arguments is as problematic as selecting the wrong tool entirely. This metric is fully LLM-based and referenceless.
PlanAdherence - Evaluates whether agents follow their own plan during execution. Creating a good plan is only half the battle—an agent that deviates from its strategy mid-execution undermines its own reasoning.
PlanQuality - Evaluates whether the plan generated by the agent is logical, complete, and efficient for accomplishing the given task. It extracts the task and plan from the agent's trace and uses an LLM judge to assess plan quality.
ToolCorrectness - Evaluates whether the agent called the expected tools in a task, and whether argument and response information matches ground truth expectations. This metric compares the tools the agent actually called to the list of expected tools.
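For orientation, here is a minimal sketch of the underlying deepeval metric that the ToolCorrectness scorer wraps, following deepeval's documented test-case pattern. The example values are made up, and the validmind wrapper's actual interface may differ.

```python
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

# Illustrative test case: compare the tools the agent actually called
# against the ground-truth tools we expected it to call.
test_case = LLMTestCase(
    input="What is my checking account balance?",
    actual_output="Your checking account balance is $2,310.",
    tools_called=[ToolCall(name="get_account_balance")],
    expected_tools=[ToolCall(name="get_account_balance")],
)

metric = ToolCorrectnessMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score)  # 1.0 when the called tools match the expected tools
```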
Before: Only `TaskCompletion` and `AnswerRelevancy` were available for agent evaluation, providing limited coverage of agent behavior.
After: Users now have access to a comprehensive suite of agentic evaluation metrics covering both the action layer (tool usage) and the reasoning layer (planning) of agent behavior.
Additional Changes:
- Added helper functions to `__init__.py` to extract tool calls from agent outputs (`extract_tool_calls_from_agent_output`, `_extract_tool_responses`, `_extract_tool_calls_from_message`, `_convert_to_tool_call_list`)
- Updated `TaskCompletion.py` to include `_trace_dict` in test cases for better traceability
- Updated `agent_dataset.py` to better support the new metrics
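As a rough sketch of what tool-call extraction could look like for a LangGraph-style agent output (the message shape, field names, and output format here are assumptions; the PR's actual helpers may differ):

```python
from typing import Any, Dict, List

def extract_tool_calls_from_agent_output(agent_output: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect tool calls from the messages in an agent's output (sketch only)."""
    tool_calls: List[Dict[str, Any]] = []
    for message in agent_output.get("messages", []):
        # LangChain-style AI messages expose structured tool calls via `.tool_calls`
        for call in getattr(message, "tool_calls", None) or []:
            tool_calls.append({"name": call.get("name"), "arguments": call.get("args", {})})
    return tool_calls
```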
How to test
Install dependencies:
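Presumably this amounts to installing the llm extra, e.g. `pip install "validmind[llm]"`, which already pulls in `deepeval` per the dependency notes below.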
Run the demo notebook:
`notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb`
What needs special review?
Dependencies, breaking changes, and deployment notes
Dependencies:
- `deepeval` package (already included in the `validmind[llm]` extra)
Breaking changes:
Deployment notes:
- Ensure `validmind[llm]` is installed
Release notes
Enhancement: Added four new agentic evaluation metrics from deepeval to enable comprehensive evaluation of LLM agents.
This release introduces four new scorers for evaluating agent behavior: ArgumentCorrectness, PlanAdherence, PlanQuality, and ToolCorrectness.
These metrics complement existing agent evaluation capabilities and provide coverage for both the action layer (tool usage) and the reasoning layer (planning) of agent behavior. All metrics are available through the `validmind.scorer.llm.deepeval` module and require the `validmind[llm]` installation.
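For orientation, usage would presumably start with imports along these lines; the exact exported names and call signatures are not shown in this description, so treat this as a placeholder sketch.

```python
# Hypothetical imports; confirm the module's actual exports before relying on them
from validmind.scorer.llm.deepeval import (
    ArgumentCorrectness,
    PlanAdherence,
    PlanQuality,
    ToolCorrectness,
)
```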
Checklist