
Conversation

@salma-elshafey
Contributor

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

salma-elshafey requested a review from a team as a code owner on February 12, 2026 at 12:37.
A Copilot AI review was requested due to automatic review settings on February 12, 2026 at 12:37.
The github-actions bot added the Evaluation label (issues related to the client library for Azure AI Evaluation).

Copilot AI left a comment


Pull request overview

This pull request changes the error handling behavior across multiple evaluator classes in the azure-ai-evaluation SDK. Previously, when evaluators received unparseable LLM output (non-dictionary format), they would return fallback values (NaN for scored evaluators, 0 for binary evaluators) and log a warning. Now, they raise an EvaluationException with standardized error information.

Changes:

  • Replaced fallback return values with exception raising when LLM output is not parseable (a caller-side sketch of the new behavior follows this list)
  • Standardized error messages to "Evaluator returned invalid output." across all affected evaluators
  • Changed error target from evaluator-specific targets to the generic ErrorTarget.EVALUATE for consistency with the base class pattern
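
To illustrate the caller-facing impact, here is a minimal, hedged sketch of how calling code that previously checked for a NaN score might now guard against the raised exception. It uses RelevanceEvaluator purely as an example; the model_config values are placeholders, the "relevance" result key is an assumption, and EvaluationException is imported from the private azure.ai.evaluation._exceptions module, whose path may differ between releases.

import math

from azure.ai.evaluation import RelevanceEvaluator
# Private module path, assumed here for illustration; it may change between releases.
from azure.ai.evaluation._exceptions import EvaluationException

# Placeholder model configuration; real values depend on your Azure OpenAI deployment.
model_config = {
    "azure_endpoint": "https://<your-endpoint>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

evaluator = RelevanceEvaluator(model_config=model_config)

try:
    result = evaluator(
        query="What is the capital of France?",
        response="Paris is the capital of France.",
    )
    # Old behavior: unparseable LLM output surfaced here as a NaN score.
    if math.isnan(result.get("relevance", math.nan)):
        print("Evaluator could not parse the LLM output.")
except EvaluationException as exc:
    # New behavior (this PR): unparseable LLM output raises instead of returning NaN/0.
    print(f"Evaluation failed: {exc}")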

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Summary per file

  • _tool_selection/_tool_selection.py: raises an exception instead of returning a fallback value when LLM output is not a dictionary
  • _tool_output_utilization/_tool_output_utilization.py: replaces the warning log + NaN return with exception raising for invalid output
  • _tool_input_accuracy/_tool_input_accuracy.py: raises an exception instead of returning a fallback value when LLM output is not a dictionary
  • _tool_call_success/_tool_call_success.py: replaces the warning log + NaN return with exception raising for invalid output
  • _tool_call_accuracy/_tool_call_accuracy.py: raises an exception instead of returning a fallback value when LLM output is not a dictionary
  • _task_completion/_task_completion.py: replaces the warning log + 0 return with exception raising for invalid output
  • _task_adherence/_task_adherence.py: replaces the warning log + 0 return with exception raising for invalid output
  • _response_completeness/_response_completeness.py: replaces the warning log + NaN return with exception raising for invalid output
  • _relevance/_relevance.py: replaces the warning log + NaN return with exception raising for invalid output
  • _intent_resolution/_intent_resolution.py: replaces the warning log + NaN return with exception raising for invalid output
  • _common/_base_prompty_eval.py: adds exception raising (already had this pattern; consolidated with the other changes)

Comment on lines +235 to +240
raise EvaluationException(
    message="Evaluator returned invalid output.",
    blame=ErrorBlame.SYSTEM_ERROR,
    category=ErrorCategory.FAILED_EXECUTION,
    target=ErrorTarget.EVALUATE,
)

Copilot AI Feb 12, 2026


This is a breaking behavioral change that should be documented in the CHANGELOG. Previously, when the evaluator received invalid (non-parseable) output, it would return a default value (NaN or 0). Now it raises an EvaluationException. This change affects multiple evaluators: ToolSelectionEvaluator, ToolOutputUtilizationEvaluator, ToolInputAccuracyEvaluator, ToolCallSuccessEvaluator, ToolCallAccuracyEvaluator, TaskCompletionEvaluator, TaskAdherenceEvaluator, ResponseCompletenessEvaluator, RelevanceEvaluator, and IntentResolutionEvaluator. The CHANGELOG should document this under a "Breaking Changes" section.
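
As an illustration only (the wording here is hypothetical, not the actual CHANGELOG text), such an entry could look roughly like this, following the usual Azure SDK "Breaking Changes" changelog section:

### Breaking Changes

- Prompty-based evaluators (ToolSelectionEvaluator, ToolOutputUtilizationEvaluator, ToolInputAccuracyEvaluator, ToolCallSuccessEvaluator, ToolCallAccuracyEvaluator, TaskCompletionEvaluator, TaskAdherenceEvaluator, ResponseCompletenessEvaluator, RelevanceEvaluator, and IntentResolutionEvaluator) now raise EvaluationException when the LLM returns output that cannot be parsed, instead of returning NaN/0 and logging a warning.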

Suggested change
-raise EvaluationException(
-    message="Evaluator returned invalid output.",
-    blame=ErrorBlame.SYSTEM_ERROR,
-    category=ErrorCategory.FAILED_EXECUTION,
-    target=ErrorTarget.EVALUATE,
-)
+if logger:
+    logger.warning(
+        "LLM output is not a dictionary; returning NaN for the score and empty reason."
+    )
+return {
+    f"{self._result_key}": math.nan,
+    f"{self._result_key}_reason": "",
+    f"{self._result_key}_result": "fail",
+    f"{self._result_key}_threshold": self._threshold,
+    f"{self._result_key}_prompt_tokens": prompty_output_dict.get("input_token_count", 0),
+    f"{self._result_key}_completion_tokens": prompty_output_dict.get("output_token_count", 0),
+    f"{self._result_key}_total_tokens": prompty_output_dict.get("total_token_count", 0),
+    f"{self._result_key}_finish_reason": prompty_output_dict.get("finish_reason", ""),
+    f"{self._result_key}_model": prompty_output_dict.get("model_id", ""),
+    f"{self._result_key}_sample_input": prompty_output_dict.get("sample_input", ""),
+    f"{self._result_key}_sample_output": prompty_output_dict.get("sample_output", ""),
+}

Comment on lines +197 to +202
raise EvaluationException(
    message="Evaluator returned invalid output.",
    blame=ErrorBlame.SYSTEM_ERROR,
    category=ErrorCategory.FAILED_EXECUTION,
    target=ErrorTarget.EVALUATE,
)

Copilot AI Feb 12, 2026


This is a breaking change. The existing test at lines 230-244 in test_task_completion_evaluator.py expects this evaluator to return 0 when the LLM output is not a dictionary, but the new code raises an exception instead. The test needs to be updated to expect an EvaluationException.
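
A possible shape for the updated test, sketched under the assumption that the test module uses pytest; the fixture name and the mocking of the prompty flow are hypothetical stand-ins for whatever the real test file already does, and EvaluationException is imported from the private azure.ai.evaluation._exceptions module, whose path may differ.

import pytest

# Private module path, assumed here for illustration.
from azure.ai.evaluation._exceptions import EvaluationException


def test_task_completion_invalid_llm_output_raises(task_completion_evaluator):
    # `task_completion_evaluator` is a hypothetical fixture standing in for an
    # evaluator instance whose prompty flow is mocked to return non-dictionary output.
    with pytest.raises(EvaluationException) as exc_info:
        task_completion_evaluator(query="...", response="...")
    # The assertion checks the standardized message introduced by this PR.
    assert "Evaluator returned invalid output." in str(exc_info.value)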

