Commit c58b5a1

Azure OpenAI python grader (#42170)

Authored by Nagkumar Arkalgud (nagkumar91) and Copilot

* Prepare evals SDK Release
* Fix bug
* Fix for ADV_CONV for FDP projects
* Update release date
* Re-add pyrit to matrix
* Change grader ids
* Update unit test
* Replace all old grader IDs in tests
* Update platform-matrix.json: add pyrit without removing the existing entry
* Update test to ensure everything is mocked
* tox/black fixes
* Skip the test with issues
* Update grader ID according to APIView feedback
* Update test
* Remove string check for grader ID
* Update changelog and officially start freeze
* Update the enum according to suggestions
* Update the changelog
* Finalize logic
* Initial plan
* Fix client request ID headers in azure-ai-evaluation
* Fix client request ID header format in rai_service.py
* Pass threshold in AzureOpenAIScoreModelGrader
* Add changelog
* Store self.pass_threshold instead of pass_threshold
* Add the Python grader
* Remove redundant test
* Add class to exception list and format code

Co-authored-by: Nagkumar Arkalgud <[email protected]>
Co-authored-by: Nagkumar Arkalgud <[email protected]>
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: nagkumar91 <[email protected]>

1 parent 89c53e7 · commit c58b5a1

File tree: 7 files changed, +216 −1 lines changed

sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -6,6 +6,10 @@
 
 - Added `evaluate_query` parameter to all RAI service evaluators that can be passed as a keyword argument. This parameter controls whether queries are included in evaluation data when evaluating query-response pairs. Previously, queries were always included in evaluations. When set to `True`, both query and response will be evaluated; when set to `False` (default), only the response will be evaluated. This parameter is available across all RAI service evaluators including `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `IndirectAttackEvaluator`, `CodeVulnerabilityEvaluator`, `UngroundedAttributesEvaluator`, `GroundednessProEvaluator`, and `EciEvaluator`. Existing code that relies on queries being evaluated will need to explicitly set `evaluate_query=True` to maintain the previous behavior.
 
+### Features Added
+
+- Added support for the Azure OpenAI Python grader via the `AzureOpenAIPythonGrader` class, which serves as a wrapper around Azure OpenAI Python grader configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator.
+
 ### Bugs Fixed
 
 - Fixed red team scan `output_path` issue where individual evaluation results were overwriting each other instead of being preserved as separate files. Individual evaluations now create unique files while the user's `output_path` is reserved for final aggregated results.
```
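
A minimal sketch of the flow the changelog entry describes — constructing the grader and handing it to `evaluate` like any callable evaluator. This condenses the full sample added later in this commit; the environment variables and the `data.jsonl` file are placeholders.

```python
import os

from azure.ai.evaluation import AzureOpenAIPythonGrader, evaluate
from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration

# Grader that scores 1.0 on an exact response/ground-truth match, else 0.0.
grader = AzureOpenAIPythonGrader(
    model_config=AzureOpenAIModelConfiguration(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version=os.environ["AZURE_OPENAI_API_VERSION"],
        azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
    ),
    name="exact_match",
    image_tag="2025-05-08",
    pass_threshold=1.0,
    source=(
        "def grade(sample: dict, item: dict) -> float:\n"
        "    return 1.0 if item.get('response') == item.get('ground_truth') else 0.0\n"
    ),
)

# Supplied to evaluate() exactly like a normal callable evaluator.
result = evaluate(data="data.jsonl", evaluators={"exact_match": grader})
```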

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/__init__.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -46,6 +46,7 @@
 from ._aoai.string_check_grader import AzureOpenAIStringCheckGrader
 from ._aoai.text_similarity_grader import AzureOpenAITextSimilarityGrader
 from ._aoai.score_model_grader import AzureOpenAIScoreModelGrader
+from ._aoai.python_grader import AzureOpenAIPythonGrader
 
 
 _patch_all = []
@@ -135,6 +136,7 @@ def lazy_import():
     "AzureOpenAIStringCheckGrader",
     "AzureOpenAITextSimilarityGrader",
     "AzureOpenAIScoreModelGrader",
+    "AzureOpenAIPythonGrader",
     # Include lazy imports in __all__ so they appear as available
     "AIAgentConverter",
     "SKAgentConverter",
```

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_aoai/python_grader.py

Lines changed: 84 additions & 0 deletions (new file)

```python
# ---------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# ---------------------------------------------------------
from typing import Any, Dict, Union, Optional

from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
from openai.types.graders import PythonGrader
from azure.ai.evaluation._common._experimental import experimental

from .aoai_grader import AzureOpenAIGrader


@experimental
class AzureOpenAIPythonGrader(AzureOpenAIGrader):
    """
    Wrapper class for OpenAI's Python code graders.

    Enables custom Python-based evaluation logic with flexible scoring and
    pass/fail thresholds. The grader executes user-provided Python code
    to evaluate outputs against custom criteria.

    Supplying a PythonGrader to the `evaluate` method will cause an
    asynchronous request to evaluate the grader via the OpenAI API. The
    results of the evaluation will then be merged into the standard
    evaluation results.

    :param model_config: The model configuration to use for the grader.
    :type model_config: Union[
        ~azure.ai.evaluation.AzureOpenAIModelConfiguration,
        ~azure.ai.evaluation.OpenAIModelConfiguration
    ]
    :param name: The name of the grader.
    :type name: str
    :param image_tag: The image tag for the Python execution environment.
    :type image_tag: str
    :param pass_threshold: Score threshold for pass/fail classification.
        Scores >= threshold are considered passing.
    :type pass_threshold: float
    :param source: Python source code containing the grade function.
        Must define: def grade(sample: dict, item: dict) -> float
    :type source: str
    :param kwargs: Additional keyword arguments to pass to the grader.
    :type kwargs: Any


    .. admonition:: Example:

        .. literalinclude:: ../samples/evaluation_samples_common.py
            :start-after: [START python_grader_example]
            :end-before: [END python_grader_example]
            :language: python
            :dedent: 8
            :caption: Using AzureOpenAIPythonGrader for custom evaluation logic.
    """

    id = "azureai://built-in/evaluators/azure-openai/python_grader"

    def __init__(
        self,
        *,
        model_config: Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration],
        name: str,
        image_tag: str,
        pass_threshold: float,
        source: str,
        **kwargs: Any,
    ):
        # Validate pass_threshold
        if not 0.0 <= pass_threshold <= 1.0:
            raise ValueError("pass_threshold must be between 0.0 and 1.0")

        # Store pass_threshold as instance attribute for potential future use
        self.pass_threshold = pass_threshold

        # Create OpenAI PythonGrader instance
        grader = PythonGrader(
            name=name,
            image_tag=image_tag,
            pass_threshold=pass_threshold,
            source=source,
            type="python",
        )

        super().__init__(model_config=model_config, grader_config=grader, **kwargs)
```
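
Since the `source` string must define `def grade(sample: dict, item: dict) -> float`, it can be handy to exercise it locally before submitting. The sketch below does so with `exec` — purely a client-side authoring convenience, not how the service runs the code (the service executes it remotely in the `image_tag` environment).

```python
# Local sanity check for a grader `source` string before sending it to the
# service. Illustrative only: exec() here just validates the grade() contract
# client-side; the service runs the code remotely in the image_tag sandbox.
SOURCE = """
def grade(sample: dict, item: dict) -> float:
    output = (item.get("response") or "").lower()
    label = (item.get("ground_truth") or "").lower()
    if not output or not label:
        return 0.0
    return 1.0 if output == label else 0.0
"""

namespace: dict = {}
exec(SOURCE, namespace)  # defines grade() in `namespace`

score = namespace["grade"]({}, {"response": "Paris", "ground_truth": "paris"})
assert isinstance(score, float) and 0.0 <= score <= 1.0
print(score)  # 1.0
```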

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -353,6 +353,7 @@ def _get_grader_class(model_id: str) -> Type[AzureOpenAIGrader]:
         AzureOpenAIStringCheckGrader,
         AzureOpenAITextSimilarityGrader,
         AzureOpenAIScoreModelGrader,
+        AzureOpenAIPythonGrader,
     )
 
     id_map = {
@@ -361,6 +362,7 @@ def _get_grader_class(model_id: str) -> Type[AzureOpenAIGrader]:
         AzureOpenAIStringCheckGrader.id: AzureOpenAIStringCheckGrader,
         AzureOpenAITextSimilarityGrader.id: AzureOpenAITextSimilarityGrader,
         AzureOpenAIScoreModelGrader.id: AzureOpenAIScoreModelGrader,
+        AzureOpenAIPythonGrader.id: AzureOpenAIPythonGrader,
     }
 
     for key in id_map.keys():
```
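
This hunk only registers the new grader's `id`. As a rough illustration of the dispatch pattern it extends — a sketch under the assumption that `_get_grader_class` simply looks the id up in `id_map`, which is all the visible context shows:

```python
from typing import Dict, Type


class DemoGrader:
    """Stand-in for AzureOpenAIGrader subclasses in this sketch."""

    id = "azureai://built-in/evaluators/azure-openai/python_grader"


ID_MAP: Dict[str, Type[DemoGrader]] = {DemoGrader.id: DemoGrader}


def get_grader_class(model_id: str) -> Type[DemoGrader]:
    # Mirrors the `for key in id_map.keys()` loop visible in the diff.
    for key in ID_MAP:
        if key == model_id:
            return ID_MAP[key]
    raise ValueError(f"Unknown grader id: {model_id!r}")


print(get_grader_class("azureai://built-in/evaluators/azure-openai/python_grader"))
```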

sdk/evaluation/azure-ai-evaluation/samples/evaluation_samples_common.py

Lines changed: 69 additions & 1 deletion

```diff
@@ -10,7 +10,7 @@
 """
 DESCRIPTION:
     These samples demonstrate usage of various classes and methods commonly used in the azure-ai-evaluation library.
-
+
 USAGE:
     python evaluation_samples_common.py
 """
@@ -50,6 +50,74 @@ def evaluation_common_classes_methods(self):
 
         # [END create_azure_ai_project_object]
 
+        # [START python_grader_example]
+        from azure.ai.evaluation import AzureOpenAIPythonGrader, evaluate
+        from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration
+        import os
+
+        # Configure your Azure OpenAI connection
+        model_config = AzureOpenAIModelConfiguration(
+            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
+            api_key=os.environ["AZURE_OPENAI_API_KEY"],
+            api_version=os.environ["AZURE_OPENAI_API_VERSION"],
+            azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
+        )
+
+        # Create a Python grader with custom evaluation logic
+        python_grader = AzureOpenAIPythonGrader(
+            model_config=model_config,
+            name="custom_accuracy",
+            image_tag="2025-05-08",
+            pass_threshold=0.8,  # 80% threshold for passing
+            source="""
+def grade(sample: dict, item: dict) -> float:
+    \"\"\"
+    Custom grading logic that compares model output to expected label.
+
+    Args:
+        sample: Dictionary that is typically empty in Azure AI Evaluation
+        item: Dictionary containing ALL the data including model output and ground truth
+
+    Returns:
+        Float score between 0.0 and 1.0
+    \"\"\"
+    # Important: In Azure AI Evaluation, all data is in 'item', not 'sample'
+    # The 'sample' parameter is typically an empty dictionary
+
+    # Get the model's response/output from item
+    output = item.get("response", "") or item.get("output", "") or item.get("output_text", "")
+    output = output.lower()
+
+    # Get the expected label/ground truth from item
+    label = item.get("ground_truth", "") or item.get("label", "") or item.get("expected", "")
+    label = label.lower()
+
+    # Handle empty cases
+    if not output or not label:
+        return 0.0
+
+    # Exact match gets full score
+    if output == label:
+        return 1.0
+
+    # Partial match logic (customize as needed)
+    if output in label or label in output:
+        return 0.5
+
+    return 0.0
+""",
+        )
+
+        # Run evaluation
+        evaluation_result = evaluate(
+            data="evaluation_data.jsonl",  # JSONL file with columns: query, response, ground_truth, etc.
+            evaluators={"custom_accuracy": python_grader},
+        )
+
+        # Access results
+        print(f"Pass rate: {evaluation_result['metrics']['custom_accuracy.pass_rate']}")
+        # [END python_grader_example]
+
 
 if __name__ == "__main__":
     print("Loading samples in evaluation_samples_common.py")
```

Lines changed: 54 additions & 0 deletions (new unit-test file)

```python
import pytest
from unittest.mock import MagicMock, patch

from azure.ai.evaluation import AzureOpenAIPythonGrader
from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration


class TestAzureOpenAIPythonGrader:
    """Test cases for AzureOpenAIPythonGrader."""

    def test_init_valid(self):
        """Test valid initialization."""
        model_config = AzureOpenAIModelConfiguration(
            azure_endpoint="https://test.openai.azure.com",
            api_key="test-key",
            azure_deployment="test-deployment",
        )

        source_code = """
def grade(sample: dict, item: dict) -> float:
    output = sample.get("output_text")
    label = item.get("label")
    return 1.0 if output == label else 0.0
"""

        grader = AzureOpenAIPythonGrader(
            model_config=model_config,
            name="python_test",
            image_tag="2025-05-08",
            pass_threshold=0.5,
            source=source_code,
        )

        assert grader.pass_threshold == 0.5
        assert grader.id == "azureai://built-in/evaluators/azure-openai/python_grader"

    def test_invalid_pass_threshold(self):
        """Test invalid pass_threshold values."""
        model_config = AzureOpenAIModelConfiguration(
            azure_endpoint="https://test.openai.azure.com",
            api_key="test-key",
            azure_deployment="test-deployment",
        )

        source_code = "def grade(sample: dict, item: dict) -> float:\n    return 1.0"

        with pytest.raises(ValueError, match="pass_threshold must be between 0.0 and 1.0"):
            AzureOpenAIPythonGrader(
                model_config=model_config,
                name="python_test",
                image_tag="2025-05-08",
                pass_threshold=1.5,
                source=source_code,
            )
```
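
The constructor validates with inclusive bounds (`0.0 <= pass_threshold <= 1.0`), so a boundary-value case one might add — a sketch, not part of this commit:

```python
# Sketch of an additional boundary-value test (not part of this commit):
# both inclusive endpoints should construct successfully.
from azure.ai.evaluation import AzureOpenAIPythonGrader
from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration


def test_pass_threshold_inclusive_bounds():
    model_config = AzureOpenAIModelConfiguration(
        azure_endpoint="https://test.openai.azure.com",
        api_key="test-key",
        azure_deployment="test-deployment",
    )
    source = "def grade(sample: dict, item: dict) -> float:\n    return 1.0"

    for threshold in (0.0, 1.0):  # inclusive endpoints are valid
        grader = AzureOpenAIPythonGrader(
            model_config=model_config,
            name="python_test",
            image_tag="2025-05-08",
            pass_threshold=threshold,
            source=source,
        )
        assert grader.pass_threshold == threshold
```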

sdk/evaluation/azure-ai-evaluation/tests/unittests/test_save_eval.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -43,6 +43,7 @@ class TestSaveEval:
         "AzureOpenAIScoreModelGrader",
         "AzureOpenAIStringCheckGrader",
         "AzureOpenAITextSimilarityGrader",
+        "AzureOpenAIPythonGrader",
     ],
 )
 
```
