Skip to content

Commit 765f880

Browse files
MilesHollandNagkumar Arkalgudnagkumar91
authored
Mar25/evals/aoai integration (Azure#40630)
* add typespec autogen files * initial integration * add sub grader classes * remove extra print statement * change polling interval * add name to id property dictionary * column mapping logic * Add a that maps to pass or fail * better error handling and timeout logic * nits * Update sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_aoai/aoai_grader.py Co-authored-by: Nagkumar Arkalgud <[email protected]> * default AOAI version and experimental tag * log nits * testing * recordings * rename graders * CL * cspell * more changes to c spell * fix tests * fix tests * update requirements * rename aoai to azure open ai externally --------- Co-authored-by: Nagkumar Arkalgud <[email protected]> Co-authored-by: Nagkumar Arkalgud <[email protected]>
1 parent 98a16f6 commit 765f880

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

44 files changed

+8517
-45
lines changed

sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,11 @@
44

55
### Features Added
66
- New `<evaluator>.binary_aggregate` field added to evaluation result metrics. This field contains the aggregated binary evaluation results for each evaluator, providing a summary of the evaluation outcomes.
7+
- Added support for Azure Open AI evaluation via 4 new 'grader' classes, which serve as wrappers around Azure Open AI grader configurations. These new grader objects can be supplied to the main `evaluate` method as if they were normal callable evaluators. The new classes are:
8+
- AzureOpenAIGrader (general class for experienced users)
9+
- AzureOpenAILabelGrader
10+
- AzureOpenAIStringCheckGrader
11+
- AzureOpenAITextSimilarityGrader
712

813
### Breaking Changes
914

sdk/evaluation/azure-ai-evaluation/assets.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,5 +2,5 @@
22
"AssetsRepo": "Azure/azure-sdk-assets",
33
"AssetsRepoPrefixPath": "python",
44
"TagPrefix": "python/evaluation/azure-ai-evaluation",
5-
"Tag": "python/evaluation/azure-ai-evaluation_e33b6c53d7"
5+
"Tag": "python/evaluation/azure-ai-evaluation_497634c2bf"
66
}

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/__init__.py

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,11 @@
4040
Message,
4141
OpenAIModelConfiguration,
4242
)
43+
from ._aoai.aoai_grader import AzureOpenAIGrader
44+
from ._aoai.label_grader import AzureOpenAILabelGrader
45+
from ._aoai.string_check_grader import AzureOpenAIStringCheckGrader
46+
from ._aoai.text_similarity_grader import AzureOpenAITextSimilarityGrader
47+
4348

4449
_patch_all = []
4550

@@ -89,6 +94,10 @@
8994
"CodeVulnerabilityEvaluator",
9095
"UngroundedAttributesEvaluator",
9196
"ToolCallAccuracyEvaluator",
97+
"AzureOpenAIGrader",
98+
"AzureOpenAILabelGrader",
99+
"AzureOpenAIStringCheckGrader",
100+
"AzureOpenAITextSimilarityGrader",
92101
]
93102

94103
__all__.extend([p for p in _patch_all if p not in __all__])
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
# ---------------------------------------------------------
2+
# Copyright (c) Microsoft Corporation. All rights reserved.
3+
# ---------------------------------------------------------
4+
5+
6+
from .aoai_grader import AzureOpenAIGrader
7+
8+
__all__ = [
9+
"AzureOpenAIGrader",
10+
]
Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,106 @@
1+
# ---------------------------------------------------------
2+
# Copyright (c) Microsoft Corporation. All rights reserved.
3+
# ---------------------------------------------------------
4+
from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
5+
6+
from azure.ai.evaluation._constants import DEFAULT_AOAI_API_VERSION
7+
from azure.ai.evaluation._exceptions import ErrorBlame, ErrorCategory, ErrorTarget, EvaluationException
8+
from typing import Any, Dict, Union
9+
from azure.ai.evaluation._common._experimental import experimental
10+
11+
12+
@experimental
13+
class AzureOpenAIGrader():
14+
"""
15+
Base class for Azure OpenAI grader wrappers, recommended only for use by experienced OpenAI API users.
16+
Combines a model configuration and any grader configuration
17+
into a singular object that can be used in evaluations.
18+
19+
Supplying an AzureOpenAIGrader to the `evaluate` method will cause an asynchronous request to evaluate
20+
the grader via the OpenAI API. The results of the evaluation will then be merged into the standard
21+
evaluation results.
22+
23+
:param model_config: The model configuration to use for the grader.
24+
:type model_config: Union[
25+
~azure.ai.evaluation.AzureOpenAIModelConfiguration,
26+
~azure.ai.evaluation.OpenAIModelConfiguration
27+
]
28+
:param grader_config: The grader configuration to use for the grader. This is expected
29+
to be formatted as a dictionary that matches the specifications of the sub-types of
30+
the TestingCriterion alias specified in (OpenAI's SDK)[https://github.com/openai/openai-python/blob/ed53107e10e6c86754866b48f8bd862659134ca8/src/openai/types/eval_create_params.py#L151].
31+
:type grader_config: Dict[str, Any]
32+
:param kwargs: Additional keyword arguments to pass to the grader.
33+
:type kwargs: Dict[str, Any]
34+
35+
36+
"""
37+
38+
id = "aoai://general"
39+
40+
def __init__(self, model_config : Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration], grader_config: Dict[str, Any], **kwargs: Dict[str, Any]):
41+
self._model_config = model_config
42+
self._grader_config = grader_config
43+
44+
if kwargs.get("validate", True):
45+
self._validate_model_config()
46+
self._validate_grader_config()
47+
48+
49+
50+
def _validate_model_config(self) -> None:
51+
"""Validate the model configuration that this grader wrapper is using."""
52+
if "api_key" not in self._model_config or not self._model_config.get("api_key"):
53+
msg = f"{type(self).__name__}: Requires an api_key in the supplied model_config."
54+
raise EvaluationException(
55+
message=msg,
56+
blame=ErrorBlame.USER_ERROR,
57+
category=ErrorCategory.INVALID_VALUE,
58+
target=ErrorTarget.AOAI_GRADER,
59+
)
60+
61+
def _validate_grader_config(self) -> None:
62+
"""Validate the grader configuration that this grader wrapper is using."""
63+
64+
return
65+
66+
67+
def get_model_config(self) -> AzureOpenAIModelConfiguration:
68+
"""Get the model configuration that this grader wrapper is using.
69+
70+
:return: The model configuration.
71+
:rtype: AzureOpenAIModelConfiguration
72+
"""
73+
return self._model_config
74+
75+
def get_grader_config(self) -> Any:
76+
"""Get the grader configuration that this grader wrapper is using.
77+
78+
:return: The grader configuration.
79+
:rtype: Any
80+
"""
81+
return self._grader_config
82+
83+
def get_client(self) -> Any:
84+
"""Construct an appropriate OpenAI client using this grader's model configuration.
85+
Returns a slightly different client depending on whether or not this grader's model
86+
configuration is for Azure OpenAI or OpenAI.
87+
88+
:return: The OpenAI client.
89+
:rtype: [~openai.OpenAI, ~openai.AzureOpenAI]
90+
"""
91+
if "azure_endpoint" in self._model_config:
92+
from openai import AzureOpenAI
93+
# TODO set default values?
94+
return AzureOpenAI(
95+
azure_endpoint=self._model_config["azure_endpoint"],
96+
api_key=self._model_config.get("api_key", None), # Default-style access to appease linters.
97+
api_version=self._model_config.get("api_version", DEFAULT_AOAI_API_VERSION),
98+
azure_deployment=self._model_config.get("azure_deployment", ""),
99+
)
100+
from openai import OpenAI
101+
# TODO add default values for base_url and organization?
102+
return OpenAI(
103+
api_key=self._model_config["api_key"],
104+
base_url=self._model_config.get("base_url", ""),
105+
organization=self._model_config.get("organization", ""),
106+
)
Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,65 @@
1+
# ---------------------------------------------------------
2+
# Copyright (c) Microsoft Corporation. All rights reserved.
3+
# ---------------------------------------------------------
4+
from typing import Any, Dict, Union, List
5+
6+
from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
7+
from openai.types.eval_create_params import TestingCriterionLabelModel
8+
from azure.ai.evaluation._common._experimental import experimental
9+
10+
from .aoai_grader import AzureOpenAIGrader
11+
12+
@experimental
13+
class AzureOpenAILabelGrader(AzureOpenAIGrader):
14+
"""
15+
Wrapper class for OpenAI's label model graders.
16+
17+
Supplying a LabelGrader to the `evaluate` method will cause an asynchronous request to evaluate
18+
the grader via the OpenAI API. The results of the evaluation will then be merged into the standard
19+
evaluation results.
20+
21+
:param model_config: The model configuration to use for the grader.
22+
:type model_config: Union[
23+
~azure.ai.evaluation.AzureOpenAIModelConfiguration,
24+
~azure.ai.evaluation.OpenAIModelConfiguration
25+
]
26+
:param input: The list of label-based testing criterion for this grader. Individual
27+
values of this list are expected to be dictionaries that match the format of any of the valid
28+
(TestingCriterionLabelModelInput)[https://github.com/openai/openai-python/blob/ed53107e10e6c86754866b48f8bd862659134ca8/src/openai/types/eval_create_params.py#L125C1-L125C32]
29+
subtypes.
30+
:type input: List[Dict[str, str]]
31+
:param labels: A list of strings representing the classification labels of this grader.
32+
:type labels: List[str]
33+
:param model: The model to use for the evaluation. Must support structured outputs.
34+
:type model: str
35+
:param name: The name of the grader.
36+
:type name: str
37+
:param passing_labels: The labels that indicate a passing result. Must be a subset of labels.
38+
:type passing_labels: List[str]
39+
:param kwargs: Additional keyword arguments to pass to the grader.
40+
:type kwargs: Dict[str, Any]
41+
42+
43+
"""
44+
45+
id = "aoai://label_model"
46+
47+
def __init__(
48+
self,
49+
model_config : Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration],
50+
input: List[Dict[str, str]],
51+
labels: List[str],
52+
model: str,
53+
name: str,
54+
passing_labels: List[str],
55+
**kwargs: Dict[str, Any]
56+
):
57+
grader = TestingCriterionLabelModel(
58+
input=input,
59+
labels=labels,
60+
model=model,
61+
name=name,
62+
passing_labels=passing_labels,
63+
type="label_model",
64+
)
65+
super().__init__(model_config, grader, **kwargs)
Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,64 @@
1+
# ---------------------------------------------------------
2+
# Copyright (c) Microsoft Corporation. All rights reserved.
3+
# ---------------------------------------------------------
4+
from typing import Any, Dict, Union
5+
from typing_extensions import Literal
6+
7+
from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
8+
from openai.types.eval_string_check_grader import EvalStringCheckGrader
9+
from azure.ai.evaluation._common._experimental import experimental
10+
11+
from .aoai_grader import AzureOpenAIGrader
12+
13+
@experimental
14+
class AzureOpenAIStringCheckGrader(AzureOpenAIGrader):
15+
"""
16+
Wrapper class for OpenAI's string check graders.
17+
18+
Supplying a StringCheckGrader to the `evaluate` method will cause an asynchronous request to evaluate
19+
the grader via the OpenAI API. The results of the evaluation will then be merged into the standard
20+
evaluation results.
21+
22+
:param model_config: The model configuration to use for the grader.
23+
:type model_config: Union[
24+
~azure.ai.evaluation.AzureOpenAIModelConfiguration,
25+
~azure.ai.evaluation.OpenAIModelConfiguration
26+
]
27+
:param input: The input text. This may include template strings.
28+
:type input: str
29+
:param name: The name of the grader.
30+
:type name: str
31+
:param operation: The string check operation to perform. One of `eq`, `ne`, `like`, or `ilike`.
32+
:type operation: Literal["eq", "ne", "like", "ilike"]
33+
:param reference: The reference text. This may include template strings.
34+
:type reference: str
35+
:param kwargs: Additional keyword arguments to pass to the grader.
36+
:type kwargs: Dict[str, Any]
37+
38+
39+
"""
40+
41+
id = "aoai://string_check"
42+
43+
def __init__(
44+
self,
45+
model_config : Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration],
46+
input: str,
47+
name: str,
48+
operation: Literal[
49+
"eq",
50+
"ne",
51+
"like",
52+
"ilike",
53+
],
54+
reference: str,
55+
**kwargs: Dict[str, Any]
56+
):
57+
grader = EvalStringCheckGrader(
58+
input=input,
59+
name=name,
60+
operation=operation,
61+
reference=reference,
62+
type="string_check",
63+
)
64+
super().__init__(model_config, grader, **kwargs)
Lines changed: 87 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,87 @@
1+
# ---------------------------------------------------------
2+
# Copyright (c) Microsoft Corporation. All rights reserved.
3+
# ---------------------------------------------------------
4+
from typing import Any, Dict, Union
5+
from typing_extensions import Literal
6+
7+
from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
8+
from openai.types.eval_text_similarity_grader import EvalTextSimilarityGrader
9+
from azure.ai.evaluation._common._experimental import experimental
10+
11+
from .aoai_grader import AzureOpenAIGrader
12+
13+
@experimental
14+
class AzureOpenAITextSimilarityGrader(AzureOpenAIGrader):
15+
"""
16+
Wrapper class for OpenAI's string check graders.
17+
18+
Supplying a StringCheckGrader to the `evaluate` method will cause an asynchronous request to evaluate
19+
the grader via the OpenAI API. The results of the evaluation will then be merged into the standard
20+
evaluation results.
21+
22+
:param model_config: The model configuration to use for the grader.
23+
:type model_config: Union[
24+
~azure.ai.evaluation.AzureOpenAIModelConfiguration,
25+
~azure.ai.evaluation.OpenAIModelConfiguration
26+
]
27+
:param evaluation_metric: The evaluation metric to use.
28+
:type evaluation_metric: Literal[
29+
"fuzzy_match",
30+
"bleu",
31+
"gleu",
32+
"meteor",
33+
"rouge_1",
34+
"rouge_2",
35+
"rouge_3",
36+
"rouge_4",
37+
"rouge_5",
38+
"rouge_l",
39+
"cosine",
40+
]
41+
:param input: The text being graded.
42+
:type input: str
43+
:param pass_threshold: A float score where a value greater than or equal indicates a passing grade.
44+
:type pass_threshold: float
45+
:param reference: The text being graded against.
46+
:type reference: str
47+
:param name: The name of the grader.
48+
:type name: str
49+
:param kwargs: Additional keyword arguments to pass to the grader.
50+
:type kwargs: Dict[str, Any]
51+
52+
53+
"""
54+
55+
id = "aoai://text_similarity"
56+
57+
def __init__(
58+
self,
59+
model_config : Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration],
60+
evaluation_metric: Literal[
61+
"fuzzy_match",
62+
"bleu",
63+
"gleu",
64+
"meteor",
65+
"rouge_1",
66+
"rouge_2",
67+
"rouge_3",
68+
"rouge_4",
69+
"rouge_5",
70+
"rouge_l",
71+
"cosine",
72+
],
73+
input: str,
74+
pass_threshold: float,
75+
reference: str,
76+
name: str,
77+
**kwargs: Dict[str, Any]
78+
):
79+
grader = EvalTextSimilarityGrader(
80+
evaluation_metric=evaluation_metric,
81+
input=input,
82+
pass_threshold=pass_threshold,
83+
name=name,
84+
reference=reference,
85+
type="text_similarity",
86+
)
87+
super().__init__(model_config, grader, **kwargs)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_constants.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -62,6 +62,7 @@ class EvaluationRunProperties:
6262
RUN_TYPE = "runType"
6363
EVALUATION_RUN = "_azureml.evaluation_run"
6464
EVALUATION_SDK = "_azureml.evaluation_sdk_name"
65+
NAME_MAP = "_azureml.evaluation_name_map"
6566

6667

6768
@experimental
@@ -102,3 +103,7 @@ class _AggregationType(enum.Enum):
102103

103104
DEFAULT_MAX_COMPLETION_TOKENS_REASONING_MODELS = 60000
104105
BINARY_AGGREGATE_SUFFIX = "binary_aggregate"
106+
107+
AOAI_COLUMN_NAME = "aoai"
108+
DEFAULT_OAI_EVAL_RUN_NAME = "AI_SDK_EVAL_RUN"
109+
DEFAULT_AOAI_API_VERSION = "2025-04-01-preview" # Unfortunately relying on preview version for now.

0 commit comments

Comments
 (0)