Commit c58b5a1

Azure OpenAI python grader (#42170)

Authored by Nagkumar Arkalgud (nagkumar91) and Copilot

* Prepare evals SDK Release
* Fix bug
* Fix for ADV_CONV for FDP projects
* Update release date
* Re-add pyrit to matrix
* Change grader ids
* Update unit test
* Replace all old grader IDs in tests
* Update platform-matrix.json: add pyrit without removing the existing entry
* Update test to ensure everything is mocked
* tox/black fixes
* Skip the test with issues
* Update grader ID according to APIView feedback
* Update test
* Remove string check for grader ID
* Update changelog and officially start freeze
* Update the enum according to suggestions
* Update the changelog
* Finalize logic
* Initial plan
* Fix client request ID headers in azure-ai-evaluation
* Fix client request ID header format in rai_service.py
* Pass threshold in AzureOpenAIScoreModelGrader
* Add changelog
* Store self.pass_threshold instead of pass_threshold
* Add the Python grader
* Remove redundant test
* Add class to exception list and format code

Co-authored-by: Nagkumar Arkalgud <[email protected]>
Co-authored-by: Nagkumar Arkalgud <[email protected]>
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: nagkumar91 <[email protected]>

1 parent 89c53e7 · commit c58b5a1

File tree: 7 files changed, +216 −1 lines changed

sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -6,6 +6,10 @@
 
 - Added `evaluate_query` parameter to all RAI service evaluators that can be passed as a keyword argument. This parameter controls whether queries are included in evaluation data when evaluating query-response pairs. Previously, queries were always included in evaluations. When set to `True`, both query and response will be evaluated; when set to `False` (default), only the response will be evaluated. This parameter is available across all RAI service evaluators including `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `IndirectAttackEvaluator`, `CodeVulnerabilityEvaluator`, `UngroundedAttributesEvaluator`, `GroundednessProEvaluator`, and `EciEvaluator`. Existing code that relies on queries being evaluated will need to explicitly set `evaluate_query=True` to maintain the previous behavior.
 
+### Features Added
+
+- Added support for the Azure OpenAI Python grader via the `AzureOpenAIPythonGrader` class, which serves as a wrapper around Azure OpenAI Python grader configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator.
+
 ### Bugs Fixed
 
 - Fixed red team scan `output_path` issue where individual evaluation results were overwriting each other instead of being preserved as separate files. Individual evaluations now create unique files while the user's `output_path` is reserved for final aggregated results.
```
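
A minimal sketch of the flow the changelog entry describes — constructing the grader and handing it to `evaluate` like any callable evaluator. This condenses the full sample added later in this commit; the environment variables and the `data.jsonl` file are placeholders.

```python
import os

from azure.ai.evaluation import AzureOpenAIPythonGrader, evaluate
from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration

# Grader that scores 1.0 on an exact response/ground-truth match, else 0.0.
grader = AzureOpenAIPythonGrader(
    model_config=AzureOpenAIModelConfiguration(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version=os.environ["AZURE_OPENAI_API_VERSION"],
        azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
    ),
    name="exact_match",
    image_tag="2025-05-08",
    pass_threshold=1.0,
    source=(
        "def grade(sample: dict, item: dict) -> float:\n"
        "    return 1.0 if item.get('response') == item.get('ground_truth') else 0.0\n"
    ),
)

# Supplied to evaluate() exactly like a normal callable evaluator.
result = evaluate(data="data.jsonl", evaluators={"exact_match": grader})
```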

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/__init__.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -46,6 +46,7 @@
 from ._aoai.string_check_grader import AzureOpenAIStringCheckGrader
 from ._aoai.text_similarity_grader import AzureOpenAITextSimilarityGrader
 from ._aoai.score_model_grader import AzureOpenAIScoreModelGrader
+from ._aoai.python_grader import AzureOpenAIPythonGrader
 
 
 _patch_all = []
@@ -135,6 +136,7 @@ def lazy_import():
     "AzureOpenAIStringCheckGrader",
     "AzureOpenAITextSimilarityGrader",
     "AzureOpenAIScoreModelGrader",
+    "AzureOpenAIPythonGrader",
     # Include lazy imports in __all__ so they appear as available
     "AIAgentConverter",
     "SKAgentConverter",
```

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_aoai/python_grader.py

Lines changed: 84 additions & 0 deletions (new file)

```python
# ---------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# ---------------------------------------------------------
from typing import Any, Dict, Union, Optional

from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
from openai.types.graders import PythonGrader
from azure.ai.evaluation._common._experimental import experimental

from .aoai_grader import AzureOpenAIGrader


@experimental
class AzureOpenAIPythonGrader(AzureOpenAIGrader):
    """
    Wrapper class for OpenAI's Python code graders.

    Enables custom Python-based evaluation logic with flexible scoring and
    pass/fail thresholds. The grader executes user-provided Python code
    to evaluate outputs against custom criteria.

    Supplying a PythonGrader to the `evaluate` method will cause an
    asynchronous request to evaluate the grader via the OpenAI API. The
    results of the evaluation will then be merged into the standard
    evaluation results.

    :param model_config: The model configuration to use for the grader.
    :type model_config: Union[
        ~azure.ai.evaluation.AzureOpenAIModelConfiguration,
        ~azure.ai.evaluation.OpenAIModelConfiguration
    ]
    :param name: The name of the grader.
    :type name: str
    :param image_tag: The image tag for the Python execution environment.
    :type image_tag: str
    :param pass_threshold: Score threshold for pass/fail classification.
        Scores >= threshold are considered passing.
    :type pass_threshold: float
    :param source: Python source code containing the grade function.
        Must define: def grade(sample: dict, item: dict) -> float
    :type source: str
    :param kwargs: Additional keyword arguments to pass to the grader.
    :type kwargs: Any


    .. admonition:: Example:

        .. literalinclude:: ../samples/evaluation_samples_common.py
            :start-after: [START python_grader_example]
            :end-before: [END python_grader_example]
            :language: python
            :dedent: 8
            :caption: Using AzureOpenAIPythonGrader for custom evaluation logic.
    """

    id = "azureai://built-in/evaluators/azure-openai/python_grader"

    def __init__(
        self,
        *,
        model_config: Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration],
        name: str,
        image_tag: str,
        pass_threshold: float,
        source: str,
        **kwargs: Any,
    ):
        # Validate pass_threshold
        if not 0.0 <= pass_threshold <= 1.0:
            raise ValueError("pass_threshold must be between 0.0 and 1.0")

        # Store pass_threshold as instance attribute for potential future use
        self.pass_threshold = pass_threshold

        # Create OpenAI PythonGrader instance
        grader = PythonGrader(
            name=name,
            image_tag=image_tag,
            pass_threshold=pass_threshold,
            source=source,
            type="python",
        )

        super().__init__(model_config=model_config, grader_config=grader, **kwargs)
```
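
Since the `source` string must define `def grade(sample: dict, item: dict) -> float`, it can be handy to exercise it locally before submitting. The sketch below does so with `exec` — purely a client-side authoring convenience, not how the service runs the code (the service executes it remotely in the `image_tag` environment).

```python
# Local sanity check for a grader `source` string before sending it to the
# service. Illustrative only: exec() here just validates the grade() contract
# client-side; the service runs the code remotely in the image_tag sandbox.
SOURCE = """
def grade(sample: dict, item: dict) -> float:
    output = (item.get("response") or "").lower()
    label = (item.get("ground_truth") or "").lower()
    if not output or not label:
        return 0.0
    return 1.0 if output == label else 0.0
"""

namespace: dict = {}
exec(SOURCE, namespace)  # defines grade() in `namespace`

score = namespace["grade"]({}, {"response": "Paris", "ground_truth": "paris"})
assert isinstance(score, float) and 0.0 <= score <= 1.0
print(score)  # 1.0
```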

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -353,6 +353,7 @@ def _get_grader_class(model_id: str) -> Type[AzureOpenAIGrader]:
         AzureOpenAIStringCheckGrader,
         AzureOpenAITextSimilarityGrader,
         AzureOpenAIScoreModelGrader,
+        AzureOpenAIPythonGrader,
     )
 
     id_map = {
@@ -361,6 +362,7 @@ def _get_grader_class(model_id: str) -> Type[AzureOpenAIGrader]:
         AzureOpenAIStringCheckGrader.id: AzureOpenAIStringCheckGrader,
         AzureOpenAITextSimilarityGrader.id: AzureOpenAITextSimilarityGrader,
         AzureOpenAIScoreModelGrader.id: AzureOpenAIScoreModelGrader,
+        AzureOpenAIPythonGrader.id: AzureOpenAIPythonGrader,
     }
 
     for key in id_map.keys():
```
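
This hunk only registers the new grader's `id`. As a rough illustration of the dispatch pattern it extends — a sketch under the assumption that `_get_grader_class` simply looks the id up in `id_map`, which is all the visible context shows:

```python
from typing import Dict, Type


class DemoGrader:
    """Stand-in for AzureOpenAIGrader subclasses in this sketch."""

    id = "azureai://built-in/evaluators/azure-openai/python_grader"


ID_MAP: Dict[str, Type[DemoGrader]] = {DemoGrader.id: DemoGrader}


def get_grader_class(model_id: str) -> Type[DemoGrader]:
    # Mirrors the `for key in id_map.keys()` loop visible in the diff.
    for key in ID_MAP:
        if key == model_id:
            return ID_MAP[key]
    raise ValueError(f"Unknown grader id: {model_id!r}")


print(get_grader_class("azureai://built-in/evaluators/azure-openai/python_grader"))
```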

sdk/evaluation/azure-ai-evaluation/samples/evaluation_samples_common.py

Lines changed: 69 additions & 1 deletion

```diff
@@ -10,7 +10,7 @@
 """
 DESCRIPTION:
     These samples demonstrate usage of various classes and methods commonly used in the azure-ai-evaluation library.
-
+
 USAGE:
     python evaluation_samples_common.py
 """
@@ -50,6 +50,74 @@ def evaluation_common_classes_methods(self):
 
         # [END create_azure_ai_project_object]
 
+        # [START python_grader_example]
+        from azure.ai.evaluation import AzureOpenAIPythonGrader, evaluate
+        from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration
+        import os
+
+        # Configure your Azure OpenAI connection
+        model_config = AzureOpenAIModelConfiguration(
+            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
+            api_key=os.environ["AZURE_OPENAI_API_KEY"],
+            api_version=os.environ["AZURE_OPENAI_API_VERSION"],
+            azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
+        )
+
+        # Create a Python grader with custom evaluation logic
+        python_grader = AzureOpenAIPythonGrader(
+            model_config=model_config,
+            name="custom_accuracy",
+            image_tag="2025-05-08",
+            pass_threshold=0.8,  # 80% threshold for passing
+            source="""
+def grade(sample: dict, item: dict) -> float:
+    \"\"\"
+    Custom grading logic that compares model output to expected label.
+
+    Args:
+        sample: Dictionary that is typically empty in Azure AI Evaluation
+        item: Dictionary containing ALL the data including model output and ground truth
+
+    Returns:
+        Float score between 0.0 and 1.0
+    \"\"\"
+    # Important: In Azure AI Evaluation, all data is in 'item', not 'sample'
+    # The 'sample' parameter is typically an empty dictionary
+
+    # Get the model's response/output from item
+    output = item.get("response", "") or item.get("output", "") or item.get("output_text", "")
+    output = output.lower()
+
+    # Get the expected label/ground truth from item
+    label = item.get("ground_truth", "") or item.get("label", "") or item.get("expected", "")
+    label = label.lower()
+
+    # Handle empty cases
+    if not output or not label:
+        return 0.0
+
+    # Exact match gets full score
+    if output == label:
+        return 1.0
+
+    # Partial match logic (customize as needed)
+    if output in label or label in output:
+        return 0.5
+
+    return 0.0
+""",
+        )
+
+        # Run evaluation
+        evaluation_result = evaluate(
+            data="evaluation_data.jsonl",  # JSONL file with columns: query, response, ground_truth, etc.
+            evaluators={"custom_accuracy": python_grader},
+        )
+
+        # Access results
+        print(f"Pass rate: {evaluation_result['metrics']['custom_accuracy.pass_rate']}")
+        # [END python_grader_example]
+
 
 if __name__ == "__main__":
     print("Loading samples in evaluation_samples_common.py")
```

Lines changed: 54 additions & 0 deletions (new unit-test file)

```python
import pytest
from unittest.mock import MagicMock, patch

from azure.ai.evaluation import AzureOpenAIPythonGrader
from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration


class TestAzureOpenAIPythonGrader:
    """Test cases for AzureOpenAIPythonGrader."""

    def test_init_valid(self):
        """Test valid initialization."""
        model_config = AzureOpenAIModelConfiguration(
            azure_endpoint="https://test.openai.azure.com",
            api_key="test-key",
            azure_deployment="test-deployment",
        )

        source_code = """
def grade(sample: dict, item: dict) -> float:
    output = sample.get("output_text")
    label = item.get("label")
    return 1.0 if output == label else 0.0
"""

        grader = AzureOpenAIPythonGrader(
            model_config=model_config,
            name="python_test",
            image_tag="2025-05-08",
            pass_threshold=0.5,
            source=source_code,
        )

        assert grader.pass_threshold == 0.5
        assert grader.id == "azureai://built-in/evaluators/azure-openai/python_grader"

    def test_invalid_pass_threshold(self):
        """Test invalid pass_threshold values."""
        model_config = AzureOpenAIModelConfiguration(
            azure_endpoint="https://test.openai.azure.com",
            api_key="test-key",
            azure_deployment="test-deployment",
        )

        source_code = "def grade(sample: dict, item: dict) -> float:\n    return 1.0"

        with pytest.raises(ValueError, match="pass_threshold must be between 0.0 and 1.0"):
            AzureOpenAIPythonGrader(
                model_config=model_config,
                name="python_test",
                image_tag="2025-05-08",
                pass_threshold=1.5,
                source=source_code,
            )
```
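
The constructor validates with inclusive bounds (`0.0 <= pass_threshold <= 1.0`), so a boundary-value case one might add — a sketch, not part of this commit:

```python
# Sketch of an additional boundary-value test (not part of this commit):
# both inclusive endpoints should construct successfully.
from azure.ai.evaluation import AzureOpenAIPythonGrader
from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration


def test_pass_threshold_inclusive_bounds():
    model_config = AzureOpenAIModelConfiguration(
        azure_endpoint="https://test.openai.azure.com",
        api_key="test-key",
        azure_deployment="test-deployment",
    )
    source = "def grade(sample: dict, item: dict) -> float:\n    return 1.0"

    for threshold in (0.0, 1.0):  # inclusive endpoints are valid
        grader = AzureOpenAIPythonGrader(
            model_config=model_config,
            name="python_test",
            image_tag="2025-05-08",
            pass_threshold=threshold,
            source=source,
        )
        assert grader.pass_threshold == threshold
```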

sdk/evaluation/azure-ai-evaluation/tests/unittests/test_save_eval.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -43,6 +43,7 @@ class TestSaveEval:
         "AzureOpenAIScoreModelGrader",
         "AzureOpenAIStringCheckGrader",
         "AzureOpenAITextSimilarityGrader",
+        "AzureOpenAIPythonGrader",
     ],
 )
 
```
