
Commit d270703

Authored by nagkumar91 (Nagkumar Arkalgud) and Copilot
Passing threshold in AzureOpenAIScoreModelGrader (#42136)
* Prepare evals SDK Release
* Fix bug
* Fix for ADV_CONV for FDP projects
* Update release date
* re-add pyrit to matrix
* Change grader ids
* Update unit test
* replace all old grader IDs in tests
* Update platform-matrix.json
  Add pyrit and not remove the other one
* Update test to ensure everything is mocked
* tox/black fixes
* Skip that test with issues
* update grader ID according to API View feedback
* Update test
* remove string check for grader ID
* Update changelog and officialy start freeze
* update the enum according to suggestions
* update the changelog
* Finalize logic
* Initial plan
* Fix client request ID headers in azure-ai-evaluation
  Co-authored-by: nagkumar91 <[email protected]>
* Fix client request ID header format in rai_service.py
  Co-authored-by: nagkumar91 <[email protected]>
* Passing threshold in AzureOpenAIScoreModelGrader
* Add changelog
* Adding the self.pass_threshold instead of pass_threshold

---------

Co-authored-by: Nagkumar Arkalgud <[email protected]>
Co-authored-by: Nagkumar Arkalgud <[email protected]>
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: nagkumar91 <[email protected]>
1 parent f374054 commit d270703

File tree

2 files changed: +2 lines, -0 lines

sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@
 - Fixes and improvements to ToolCallAccuracy evaluator. New version has less variance. and now works on all tool calls that happen in a turn at once. Previously, it worked on each tool call independently without having context on the other tool calls that happen in the same turn, and then aggregated the results to a score in the range [0-1]. The score range is now [1-5].
 - Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)
 - Added a new enum `ADVERSARIAL_QA_DOCUMENTS` which moves all the "file_content" type prompts away from `ADVERSARIAL_QA` to the new enum
+- `AzureOpenAIScoreModelGrader` evaluator now supports `pass_threshold` parameter to set the minimum score required for a response to be considered passing. This allows users to define custom thresholds for evaluation results, enhancing flexibility in grading AI model responses.

 ## 1.8.0 (2025-05-29)
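For context, a minimal usage sketch of the new parameter (not taken from this commit): only `pass_threshold`, `range`, and `sampling_params` appear in the diff; the import path, `model_config` shape, and the other constructor arguments shown here are assumptions about the grader's typical inputs.

```python
# Illustrative sketch only; arguments other than pass_threshold, range, and
# sampling_params are assumed, not confirmed by this commit.
from azure.ai.evaluation import AzureOpenAIScoreModelGrader

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",  # assumed shape
    "api_key": "<api-key>",
    "azure_deployment": "<deployment-name>",
}

grader = AzureOpenAIScoreModelGrader(
    model_config=model_config,
    name="helpfulness",                  # assumed display name
    model="gpt-4o",                      # assumed model/deployment id
    input=[{"role": "user", "content": "Score the response from 0 to 5."}],  # assumed prompt shape
    range=[0.0, 5.0],                    # forwarded to grader_kwargs when provided
    sampling_params={"temperature": 0},  # forwarded to grader_kwargs when provided
    pass_threshold=3.0,                  # new: minimum score counted as passing
)
```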

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_aoai/score_model_grader.py

Lines changed: 1 addition & 0 deletions
@@ -84,6 +84,7 @@ def __init__(
             grader_kwargs["range"] = range
         if sampling_params is not None:
             grader_kwargs["sampling_params"] = sampling_params
+        grader_kwargs["pass_threshold"] = self.pass_threshold

         grader = ScoreModelGrader(**grader_kwargs)
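The threshold forwarded here is what downstream result processing can compare the model-assigned score against. A purely illustrative helper (not code from this commit or the SDK) showing that comparison:

```python
def score_to_result(score: float, pass_threshold: float) -> dict:
    """Illustrative only: map a numeric grader score to a pass/fail outcome
    using the configured threshold."""
    return {"score": score, "passed": score >= pass_threshold}

# With pass_threshold=3.0, a score of 3.5 passes and 2.0 does not.
assert score_to_result(3.5, 3.0)["passed"] is True
assert score_to_result(2.0, 3.0)["passed"] is False
```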

0 commit comments