Reasoning support for evaluators #42482

Open
wants to merge 79 commits into base: main
Changes from 74 commits
Commits
79 commits
4318329
Prepare evals SDK Release
May 28, 2025
192b980
Fix bug
May 28, 2025
758adb4
Fix for ADV_CONV for FDP projects
May 29, 2025
de09fd1
Update release date
May 29, 2025
ef60fe6
Merge branch 'main' into main
nagkumar91 May 29, 2025
8ca51d0
Merge branch 'Azure:main' into main
nagkumar91 May 30, 2025
98bfc3a
Merge branch 'Azure:main' into main
nagkumar91 Jun 2, 2025
a5f32e8
Merge branch 'Azure:main' into main
nagkumar91 Jun 9, 2025
5fd88b6
Merge branch 'Azure:main' into main
nagkumar91 Jun 10, 2025
51f2b44
Merge branch 'Azure:main' into main
nagkumar91 Jun 10, 2025
a5be8b5
Merge branch 'Azure:main' into main
nagkumar91 Jun 16, 2025
75965b7
Merge branch 'Azure:main' into main
nagkumar91 Jun 25, 2025
d0c5e53
Merge branch 'Azure:main' into main
nagkumar91 Jun 25, 2025
b790276
Merge branch 'Azure:main' into main
nagkumar91 Jun 26, 2025
d5ca243
Merge branch 'Azure:main' into main
nagkumar91 Jun 26, 2025
8d62e36
re-add pyrit to matrix
Jun 26, 2025
59a70f2
Change grader ids
Jun 26, 2025
4d146d7
Merge branch 'Azure:main' into main
nagkumar91 Jun 26, 2025
f7a4c83
Update unit test
Jun 27, 2025
79e3a40
replace all old grader IDs in tests
Jun 27, 2025
588cbec
Merge branch 'main' into main
nagkumar91 Jun 30, 2025
7514472
Update platform-matrix.json
nagkumar91 Jun 30, 2025
28b2513
Update test to ensure everything is mocked
Jul 1, 2025
8603e0e
tox/black fixes
Jul 1, 2025
895f226
Skip that test with issues
Jul 1, 2025
b4b2daf
Merge branch 'Azure:main' into main
nagkumar91 Jul 1, 2025
023f07f
update grader ID according to API View feedback
Jul 1, 2025
45b5f5d
Update test
Jul 2, 2025
1ccb4db
remove string check for grader ID
Jul 2, 2025
6fd9aa5
Merge branch 'Azure:main' into main
nagkumar91 Jul 2, 2025
f871855
Update changelog and officialy start freeze
Jul 2, 2025
59ac230
update the enum according to suggestions
Jul 2, 2025
794a2c4
update the changelog
Jul 2, 2025
b33363c
Finalize logic
Jul 2, 2025
464e2dd
Merge branch 'Azure:main' into main
nagkumar91 Jul 3, 2025
4585b14
Merge branch 'Azure:main' into main
nagkumar91 Jul 7, 2025
89c2988
Initial plan
Copilot Jul 7, 2025
6805018
Fix client request ID headers in azure-ai-evaluation
Copilot Jul 7, 2025
aad48df
Fix client request ID header format in rai_service.py
Copilot Jul 7, 2025
db75552
Merge pull request #5 from nagkumar91/copilot/fix-4
nagkumar91 Jul 10, 2025
b8eebf3
Merge branch 'Azure:main' into main
nagkumar91 Jul 10, 2025
2899ad4
Merge branch 'Azure:main' into main
nagkumar91 Jul 10, 2025
c431563
Merge branch 'Azure:main' into main
nagkumar91 Jul 17, 2025
79ed63c
Merge branch 'Azure:main' into main
nagkumar91 Jul 18, 2025
a3be3fc
Merge branch 'Azure:main' into main
nagkumar91 Jul 21, 2025
056ac4d
Passing threshold in AzureOpenAIScoreModelGrader
Jul 21, 2025
1779059
Add changelog
Jul 21, 2025
43fecff
Adding the self.pass_threshold instead of pass_threshold
Jul 21, 2025
b0c102b
Merge branch 'Azure:main' into main
nagkumar91 Jul 22, 2025
7bf5f1f
Add the python grader
Jul 22, 2025
3248ad0
Remove redundant test
Jul 22, 2025
d76f59b
Add class to exception list and format code
Jul 23, 2025
4d60e43
Merge branch 'main' into feature/python_grader
nagkumar91 Jul 24, 2025
98d1626
Merge branch 'Azure:main' into main
nagkumar91 Jul 24, 2025
9248c38
Add properties to evaluation upload run for FDP
Jul 24, 2025
74b760f
Remove debug
Jul 24, 2025
23dbc85
Merge branch 'feature/python_grader'
Jul 24, 2025
467ccb6
Remove the redundant property
Jul 24, 2025
c2beee8
Merge branch 'Azure:main' into main
nagkumar91 Jul 24, 2025
be9a19a
Fix changelog
Jul 24, 2025
de3a1e1
Fix the multiple features added section
Jul 24, 2025
f9faa61
removed the properties in update
Jul 24, 2025
69e783a
Merge branch 'Azure:main' into main
nagkumar91 Jul 28, 2025
8ebea2a
Merge branch 'Azure:main' into main
nagkumar91 Jul 31, 2025
3f9c818
Merge branch 'Azure:main' into main
nagkumar91 Aug 1, 2025
3b3159c
Merge branch 'Azure:main' into main
nagkumar91 Aug 5, 2025
d78b834
Merge branch 'Azure:main' into main
nagkumar91 Aug 6, 2025
ae3fc52
Merge branch 'Azure:main' into main
nagkumar91 Aug 8, 2025
19cce75
evaluation: support is_reasoning_model across all prompty-based evalu…
Aug 8, 2025
e59ca7f
evaluation: docs(Preview) + groundedness feature-detection + is_reaso…
Aug 8, 2025
98b4618
evaluation: revert _proxy_completion_model.py to origin/main version
Aug 8, 2025
706c042
Merge branch 'Azure:main' into main
nagkumar91 Aug 11, 2025
c418513
Merge remote-tracking branch 'origin/main' into diff-20250811-171736
Aug 12, 2025
86f24ba
Restore files that shouldn't have been modified
Aug 12, 2025
a1e55b4
Update sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evalua…
nagkumar91 Aug 12, 2025
bd6809f
Update the groundedness based on comments
Aug 12, 2025
3ae37cb
Add changelog to bug fix and link issue
Aug 12, 2025
6b8d4ce
Fix docstring
Aug 12, 2025
733ee1a
lint fixes
Aug 12, 2025
6 changes: 6 additions & 0 deletions sdk/evaluation/azure-ai-evaluation/CHANGELOG.md
@@ -7,6 +7,12 @@
### Features Added
- Added support for user-supplied tags in the `evaluate` function. Tags are key-value pairs that can be used for experiment tracking, A/B testing, filtering, and organizing evaluation runs. The function accepts a `tags` parameter.
- Enhanced `GroundednessEvaluator` to support AI agent evaluation with tool calls. The evaluator now accepts agent response data containing tool calls and can extract context from `file_search` tool results for groundedness assessment. This enables evaluation of AI agents that use tools to retrieve information and generate responses. Note: Agent groundedness evaluation is currently supported only when the `file_search` tool is used.
- Preview: Added `is_reasoning_model` keyword parameter to all prompty-based evaluators
(`SimilarityEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`,
`RetrievalEvaluator`, `GroundednessEvaluator`, `IntentResolutionEvaluator`,
`ResponseCompletenessEvaluator`, `TaskAdherenceEvaluator`, `ToolCallAccuracyEvaluator`).
When set, evaluator prompty configuration is adjusted appropriately for reasoning models.
`QAEvaluator` now propagates this parameter to its prompty-based child evaluators.
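
As an illustration of the features listed above, a minimal sketch of how the new `tags` and `is_reasoning_model` parameters might be used together; the endpoint, deployment, API key, and dataset path are placeholders, not values from this PR:

from azure.ai.evaluation import CoherenceEvaluator, evaluate

# Placeholder Azure OpenAI model configuration.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

# (Preview) adjust the prompty configuration for a reasoning-model deployment.
coherence = CoherenceEvaluator(model_config=model_config, is_reasoning_model=True)

# Tags are free-form key-value pairs attached to the evaluation run.
results = evaluate(
    data="<path-to-data>.jsonl",
    evaluators={"coherence": coherence},
    tags={"experiment": "reasoning-models", "variant": "preview"},
)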

### Bugs Fixed

@@ -12,17 +12,22 @@

class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
"""
Evaluates coherence score for a given query and response or a multi-turn conversation, including reasoning.
Evaluates coherence for a given query and response or a multi-turn
conversation, including reasoning.

The coherence measure assesses the ability of the language model to generate text that reads naturally,
flows smoothly, and resembles human-like language in its responses. Use it when assessing the readability
and user-friendliness of a model's generated responses in real-world applications.
The coherence measure assesses the model's ability to generate text that
reads naturally, flows smoothly, and resembles human-like language. Use it
when assessing the readability and user-friendliness of responses.

:param model_config: Configuration for the Azure OpenAI model.
:type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
:type model_config:
Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
~azure.ai.evaluation.OpenAIModelConfiguration]
:param threshold: The threshold for the coherence evaluator. Default is 3.
:type threshold: int
:keyword is_reasoning_model: (Preview) Adjusts prompty config
for reasoning models when True.
:paramtype is_reasoning_model: bool

.. admonition:: Example:

@@ -31,7 +36,8 @@ class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
:end-before: [END coherence_evaluator]
:language: python
:dedent: 8
:caption: Initialize and call CoherenceEvaluator using azure.ai.evaluation.AzureAIProject
:caption: Initialize and call CoherenceEvaluator using
azure.ai.evaluation.AzureAIProject

.. admonition:: Example using Azure AI Project URL:

@@ -40,7 +46,8 @@ class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
:end-before: [END coherence_evaluator]
:language: python
:dedent: 8
:caption: Initialize and call CoherenceEvaluator using Azure AI Project URL in following format
:caption: Initialize and call CoherenceEvaluator using Azure AI
Project URL in following format
https://{resource_name}.services.ai.azure.com/api/projects/{project_name}

.. admonition:: Example with Threshold:
@@ -50,23 +57,24 @@ class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
:end-before: [END threshold_coherence_evaluator]
:language: python
:dedent: 8
:caption: Initialize with threshold and call a CoherenceEvaluator with a query and response.
:caption: Initialize with threshold and call a CoherenceEvaluator
with a query and response.

.. note::

To align with our support of a diverse set of models, an output key without the `gpt_` prefix has been added.
To maintain backwards compatibility, the old key with the `gpt_` prefix is still be present in the output;
however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.
To align with support of diverse models, an output key without the
`gpt_` prefix has been added. The old key with the `gpt_` prefix is
still present for compatibility; however, it will be deprecated.
"""

_PROMPTY_FILE = "coherence.prompty"
_RESULT_KEY = "coherence"

id = "azureai://built-in/evaluators/coherence"
"""Evaluator identifier, experimental and to be used only with evaluation in cloud."""
"""Evaluator identifier for cloud evaluation."""

@override
def __init__(self, model_config, *, threshold=3):
def __init__(self, model_config, *, threshold=3, **kwargs):
current_dir = os.path.dirname(__file__)
prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
self._threshold = threshold
@@ -77,6 +85,7 @@ def __init__(self, model_config, *, threshold=3):
result_key=self._RESULT_KEY,
threshold=threshold,
_higher_is_better=self._higher_is_better,
**kwargs,
)

@overload
@@ -104,9 +113,11 @@ def __call__(
) -> Dict[str, Union[float, Dict[str, List[Union[str, float]]]]]:
"""Evaluate coherence for a conversation

:keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
key "messages", and potentially a global context under the key "context". Conversation turns are expected
to be dictionaries with keys "content", "role", and possibly "context".
:keyword conversation: The conversation to evaluate. Expected to
contain a list of conversation turns under the key "messages",
and optionally a global context under the key "context". Turns are
dictionaries with keys "content", "role", and possibly
"context".
:paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
:return: The coherence score.
:rtype: Dict[str, Union[float, Dict[str, List[float]]]]
@@ -118,19 +129,22 @@ def __call__( # pylint: disable=docstring-missing-param
*args,
**kwargs,
):
"""Evaluate coherence. Accepts either a query and response for a single evaluation,
or a conversation for a potentially multi-turn evaluation. If the conversation has more than one pair of
turns, the evaluator will aggregate the results of each turn.
"""Evaluate coherence.

Accepts a query/response for a single evaluation, or a conversation
for a multi-turn evaluation. If the conversation has more than one
pair of turns, results are aggregated.

:keyword query: The query to be evaluated.
:paramtype query: str
:keyword response: The response to be evaluated.
:paramtype response: Optional[str]
:keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
key "messages". Conversation turns are expected
to be dictionaries with keys "content" and "role".
:keyword conversation: The conversation to evaluate. Expected to
contain conversation turns under the key "messages" as
dictionaries with keys "content" and "role".
:paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
:return: The coherence score.
:rtype: Union[Dict[str, float], Dict[str, Union[float, Dict[str, List[float]]]]]
:rtype: Union[Dict[str, float], Dict[str, Union[float, Dict[str,
List[float]]]]]
"""
return super().__call__(*args, **kwargs)
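
Putting the updated CoherenceEvaluator signature and docstrings above into a short usage sketch — the model configuration values are placeholders, and `is_reasoning_model` is the new preview keyword forwarded through `**kwargs`:

from azure.ai.evaluation import CoherenceEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

evaluator = CoherenceEvaluator(model_config=model_config, threshold=3, is_reasoning_model=True)

# Single query/response evaluation.
single_result = evaluator(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)

# Multi-turn conversation evaluation; results for each turn are aggregated.
conversation = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris is the capital of France."},
        {"role": "user", "content": "And of Italy?"},
        {"role": "assistant", "content": "Rome is the capital of Italy."},
    ]
}
conversation_result = evaluator(conversation=conversation)
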
@@ -121,14 +121,18 @@ def __init__(
not_singleton_inputs: List[str] = ["conversation", "kwargs"],
eval_last_turn: bool = False,
conversation_aggregation_type: _AggregationType = _AggregationType.MEAN,
conversation_aggregator_override: Optional[Callable[[List[float]], float]] = None,
conversation_aggregator_override: Optional[
Callable[[List[float]], float]
] = None,
_higher_is_better: Optional[bool] = True,
):
self._not_singleton_inputs = not_singleton_inputs
self._eval_last_turn = eval_last_turn
self._singleton_inputs = self._derive_singleton_inputs()
self._async_evaluator = AsyncEvaluatorBase(self._real_call)
self._conversation_aggregation_function = GetAggregator(conversation_aggregation_type)
self._conversation_aggregation_function = GetAggregator(
conversation_aggregation_type
)
self._higher_is_better = _higher_is_better
self._threshold = threshold
if conversation_aggregator_override is not None:
@@ -190,7 +194,10 @@ def _derive_singleton_inputs(self) -> List[List[str]]:
overload_inputs = []
for call_signature in call_signatures:
params = call_signature.parameters
if any(not_singleton_input in params for not_singleton_input in self._not_singleton_inputs):
if any(
not_singleton_input in params
for not_singleton_input in self._not_singleton_inputs
):
continue
# exclude self since it is not a singleton input
overload_inputs.append([p for p in params if p != "self"])
@@ -234,7 +241,11 @@ def _get_matching_overload_inputs(self, **kwargs) -> List[str]:
best_match = inputs

# Return the best match or the first overload as fallback
return best_match if best_match is not None else (overload_inputs[0] if overload_inputs else [])
return (
best_match
if best_match is not None
else (overload_inputs[0] if overload_inputs else [])
)

def _get_all_singleton_inputs(self) -> List[str]:
"""Get a flattened list of all possible singleton inputs across all overloads.
@@ -345,12 +356,16 @@ def multi_modal_converter(conversation: Dict) -> List[Dict[str, Any]]:
if len(user_messages) != len(assistant_messages):
raise EvaluationException(
message="Mismatched number of user and assistant messages.",
internal_message=("Mismatched number of user and assistant messages."),
internal_message=(
"Mismatched number of user and assistant messages."
),
)
if len(assistant_messages) > 1:
raise EvaluationException(
message="Conversation can have only one assistant message.",
internal_message=("Conversation can have only one assistant message."),
internal_message=(
"Conversation can have only one assistant message."
),
)
eval_conv_inputs = []
for user_msg, assist_msg in zip(user_messages, assistant_messages):
Expand All @@ -359,12 +374,16 @@ def multi_modal_converter(conversation: Dict) -> List[Dict[str, Any]]:
conv_messages.append(system_messages[0])
conv_messages.append(user_msg)
conv_messages.append(assist_msg)
eval_conv_inputs.append({"conversation": Conversation(messages=conv_messages)})
eval_conv_inputs.append(
{"conversation": Conversation(messages=conv_messages)}
)
return eval_conv_inputs

return multi_modal_converter

def _convert_kwargs_to_eval_input(self, **kwargs) -> Union[List[Dict], List[DerivedEvalInput], Dict[str, Any]]:
def _convert_kwargs_to_eval_input(
self, **kwargs
) -> Union[List[Dict], List[DerivedEvalInput], Dict[str, Any]]:
"""Convert an arbitrary input into a list of inputs for evaluators.
It is assumed that evaluators generally make use of their inputs in one of two ways.
Either they receive a collection of keyname inputs that are all single values
@@ -414,7 +433,9 @@ def _convert_kwargs_to_eval_input(self, **kwargs) -> Union[List[Dict], List[Deri
matching_inputs = self._get_matching_overload_inputs(**kwargs)
if matching_inputs:
# Check if all required inputs for this overload are provided
required_singletons = {key: kwargs.get(key, None) for key in matching_inputs}
required_singletons = {
key: kwargs.get(key, None) for key in matching_inputs
}
required_singletons = remove_optional_singletons(self, required_singletons)
if all(value is not None for value in required_singletons.values()):
return [singletons]
Expand All @@ -438,11 +459,17 @@ def _is_multi_modal_conversation(self, conversation: Dict) -> bool:
if "content" in message:
content = message.get("content", "")
if isinstance(content, list):
if any(item.get("type") == "image_url" and "url" in item.get("image_url", {}) for item in content):
if any(
item.get("type") == "image_url"
and "url" in item.get("image_url", {})
for item in content
):
return True
return False
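
For context on what `_is_multi_modal_conversation` detects, a hypothetical conversation fragment that would make the reformatted check above return True; only the `messages`/`content`/`image_url` structure matters, and the other fields are illustrative:

# Hypothetical multi-modal conversation: a message whose "content" is a list
# containing an item of type "image_url" with a "url" key.
multi_modal_conversation = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        },
        {"role": "assistant", "content": "A cat sitting on a windowsill."},
    ]
}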

def _aggregate_results(self, per_turn_results: List[DoEvalResult[T_EvalValue]]) -> AggregateResult[T_EvalValue]:
def _aggregate_results(
self, per_turn_results: List[DoEvalResult[T_EvalValue]]
) -> AggregateResult[T_EvalValue]:
"""Aggregate the evaluation results of each conversation turn into a single result.

Exact implementation might need to vary slightly depending on the results produced.
@@ -472,7 +499,9 @@ def _aggregate_results(self, per_turn_results: List[DoEvalResult[T_EvalValue]])
# Find and average all numeric values
for metric, values in evaluation_per_turn.items():
if all(isinstance(value, (int, float)) for value in values):
aggregated[metric] = self._conversation_aggregation_function(cast(List[Union[int, float]], values))
aggregated[metric] = self._conversation_aggregation_function(
cast(List[Union[int, float]], values)
)
# Slap the per-turn results back in.
aggregated["evaluation_per_turn"] = evaluation_per_turn
return aggregated
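
A minimal standalone sketch of the mean aggregation performed by the reformatted code above, outside the evaluator class; the metric name and scores are illustrative:

from typing import Dict, List, Union

def aggregate_mean(per_turn_results: List[Dict[str, Union[int, float, str]]]) -> Dict:
    # Collect per-turn values for each metric, then average the numeric ones,
    # mirroring _aggregate_results with the default MEAN aggregator.
    evaluation_per_turn: Dict[str, list] = {}
    for turn_result in per_turn_results:
        for metric, value in turn_result.items():
            evaluation_per_turn.setdefault(metric, []).append(value)

    aggregated: Dict[str, Union[float, Dict[str, list]]] = {}
    for metric, values in evaluation_per_turn.items():
        if all(isinstance(value, (int, float)) for value in values):
            aggregated[metric] = sum(values) / len(values)
    # Put the per-turn results back in, as the original method does.
    aggregated["evaluation_per_turn"] = evaluation_per_turn
    return aggregated

# Two turns scored 4 and 2 -> {"coherence": 3.0, "evaluation_per_turn": {"coherence": [4, 2]}}
print(aggregate_mean([{"coherence": 4}, {"coherence": 2}]))
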
@@ -489,17 +518,28 @@ def _parse_tools_from_response(self, response):
if isinstance(response, list):
for message in response:
# Extract tool calls from assistant messages
if message.get("role") == "assistant" and isinstance(message.get("content"), list):
if message.get("role") == "assistant" and isinstance(
message.get("content"), list
):
for content_item in message.get("content"):
if isinstance(content_item, dict) and content_item.get("type") == "tool_call":
if (
isinstance(content_item, dict)
and content_item.get("type") == "tool_call"
):
tool_calls.append(content_item)

# Extract tool results from tool messages
elif message.get("role") == "tool" and message.get("tool_call_id"):
tool_call_id = message.get("tool_call_id")
if isinstance(message.get("content"), list) and len(message.get("content")) > 0:
if (
isinstance(message.get("content"), list)
and len(message.get("content")) > 0
):
result_content = message.get("content")[0]
if isinstance(result_content, dict) and result_content.get("type") == "tool_result":
if (
isinstance(result_content, dict)
and result_content.get("type") == "tool_result"
):
tool_results_map[tool_call_id] = result_content

# Attach results to their corresponding calls
@@ -510,7 +550,9 @@ def _parse_tools_from_response(self, response):

return tool_calls
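
A hypothetical agent response showing the message shapes `_parse_tools_from_response` inspects: an assistant message whose content list carries a `tool_call` item, and a `tool` message whose first content item is a `tool_result` matched by `tool_call_id`. Field names beyond `role`, `content`, `type`, and `tool_call_id` are invented for illustration:

agent_response = [
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_001",            # assumed identifier field
                "name": "file_search",                 # illustrative tool name
                "arguments": {"query": "Q2 revenue"},  # illustrative arguments
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "call_001",
        "content": [
            {"type": "tool_result", "tool_result": "Revenue grew 12% quarter over quarter."}
        ],
    },
]
# The parser collects the tool_call item, maps the tool_result by tool_call_id,
# and attaches each result to its corresponding call before returning the list.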

async def _real_call(self, **kwargs) -> Union[DoEvalResult[T_EvalValue], AggregateResult[T_EvalValue]]:
async def _real_call(
self, **kwargs
) -> Union[DoEvalResult[T_EvalValue], AggregateResult[T_EvalValue]]:
"""The asynchronous call where real end-to-end evaluation logic is performed.

:keyword kwargs: The inputs to evaluate.
@@ -563,7 +605,9 @@ def _to_async(self) -> "AsyncEvaluatorBase":

@experimental
@final
def _set_conversation_aggregation_type(self, conversation_aggregation_type: _AggregationType) -> None:
def _set_conversation_aggregation_type(
self, conversation_aggregation_type: _AggregationType
) -> None:
"""Input a conversation aggregation type to re-assign the aggregator function used by this evaluator for
multi-turn conversations. This aggregator is used to combine numeric outputs from each evaluation of a
multi-turn conversation into a single top-level result.
@@ -572,11 +616,15 @@ def _set_conversation_aggregation_type(self, conversation_aggregation_type: _Agg
results of a conversation to produce a single result.
:type conversation_aggregation_type: ~azure.ai.evaluation._AggregationType
"""
self._conversation_aggregation_function = GetAggregator(conversation_aggregation_type)
self._conversation_aggregation_function = GetAggregator(
conversation_aggregation_type
)

@experimental
@final
def _set_conversation_aggregator(self, aggregator: Callable[[List[float]], float]) -> None:
def _set_conversation_aggregator(
self, aggregator: Callable[[List[float]], float]
) -> None:
"""Set the conversation aggregator function directly. This function will be applied to all numeric outputs
of an evaluator when it evaluates a multi-turn conversation and thus ends up with multiple results per
evaluation that need to coalesce into a single result. Use when built-in aggregators do not
@@ -606,7 +654,9 @@ class AsyncEvaluatorBase:
to ensure that no one ever needs to extend or otherwise modify this class directly.
"""

def __init__(self, real_call): # DO NOT ADD TYPEHINT PROMPT FLOW WILL SCREAM AT YOU ABOUT META GENERATION
def __init__(
self, real_call
): # DO NOT ADD TYPEHINT PROMPT FLOW WILL SCREAM AT YOU ABOUT META GENERATION
self._real_call = real_call

# Don't look at my shame. Nothing to see here....