
Commit bb37432

Nagkumar Arkalgud (nagkumar91) authored
AzureOpenAI model grader support in evals (#41599)

* Prepare evals SDK release
* Fix bug
* Fix for ADV_CONV for FDP projects
* Update release date
* feat: Add AzureOpenAIScoreModelGrader for continuous scoring evaluation
  - Implement AzureOpenAIScoreModelGrader in _aoai/score_model_grader.py
  - Update module exports in _aoai/__init__.py and __init__.py
  - Register the grader in the _evaluate/_evaluate_aoai.py grader registry
  - Add a comprehensive sample script with real-credential support
  - Include integration plan documentation
  - Support conversation-style input, score ranges, and sampling parameters
  - Handle template variables using {{ item.field }} syntax
  - Provide a fallback demo mode for configuration testing
* Add tests
* Remove the plan md
* Add evaluator to exceptions for save eval e2e test
* Update changelog and sample

Co-authored-by: Nagkumar Arkalgud <[email protected]>
Co-authored-by: Nagkumar Arkalgud <[email protected]>
1 parent a352284 commit bb37432

File tree

9 files changed: +1317 −13 lines changed


sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -4,6 +4,9 @@

 ### Features Added

+- Added support for Azure OpenAI evaluation via the `AzureOpenAIScoreModelGrader` class, which serves as a wrapper around Azure OpenAI score model configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator.
+
 ### Bugs Fixed

 - Significant improvements to the IntentResolution evaluator. The new version has less variance, is nearly 2x faster, and consumes fewer tokens.
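
The entry above describes the grader as a drop-in callable evaluator. A minimal sketch of that usage, with placeholder endpoint, key, deployment, and data-file values (none of these values come from the commit; the grader arguments mirror the class signature added below):

from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration,
    AzureOpenAIScoreModelGrader,
    evaluate,
)

# Placeholder resource details -- substitute your own.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<resource>.openai.azure.com",
    api_key="<api-key>",
    azure_deployment="<deployment>",
)

grader = AzureOpenAIScoreModelGrader(
    model_config=model_config,
    name="response_quality",
    model="gpt-4o-mini",
    input=[
        {"role": "system", "content": "Score the answer from 0.0 (poor) to 1.0 (excellent)."},
        {"role": "user", "content": "{{ item.response }}"},
    ],
    range=[0.0, 1.0],
)

# The grader is supplied exactly like a normal callable evaluator.
result = evaluate(data="data.jsonl", evaluators={"response_quality": grader})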

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -45,6 +45,7 @@
 from ._aoai.label_grader import AzureOpenAILabelGrader
 from ._aoai.string_check_grader import AzureOpenAIStringCheckGrader
 from ._aoai.text_similarity_grader import AzureOpenAITextSimilarityGrader
+from ._aoai.score_model_grader import AzureOpenAIScoreModelGrader


 _patch_all = []
@@ -102,6 +103,7 @@
     "AzureOpenAILabelGrader",
     "AzureOpenAIStringCheckGrader",
     "AzureOpenAITextSimilarityGrader",
+    "AzureOpenAIScoreModelGrader",
 ]

 __all__.extend([p for p in _patch_all if p not in __all__])
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_aoai/score_model_grader.py

Lines changed: 90 additions & 0 deletions

@@ -0,0 +1,90 @@
# ---------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# ---------------------------------------------------------
from typing import Any, Dict, Union, List, Optional

from azure.ai.evaluation._model_configurations import AzureOpenAIModelConfiguration, OpenAIModelConfiguration
from openai.types.graders import ScoreModelGrader
from azure.ai.evaluation._common._experimental import experimental

from .aoai_grader import AzureOpenAIGrader


@experimental
class AzureOpenAIScoreModelGrader(AzureOpenAIGrader):
    """
    Wrapper class for OpenAI's score model graders.

    Enables continuous scoring evaluation with custom prompts and flexible
    conversation-style inputs. Supports configurable score ranges and
    pass thresholds for binary classification.

    Supplying a ScoreModelGrader to the `evaluate` method will cause an
    asynchronous request to evaluate the grader via the OpenAI API. The
    results of the evaluation will then be merged into the standard
    evaluation results.

    :param model_config: The model configuration to use for the grader.
    :type model_config: Union[
        ~azure.ai.evaluation.AzureOpenAIModelConfiguration,
        ~azure.ai.evaluation.OpenAIModelConfiguration
    ]
    :param input: The input messages for the grader. List of conversation
        messages with role and content.
    :type input: List[Dict[str, str]]
    :param model: The model to use for the evaluation.
    :type model: str
    :param name: The name of the grader.
    :type name: str
    :param range: The range of the score. Defaults to [0, 1].
    :type range: Optional[List[float]]
    :param pass_threshold: Score threshold for pass/fail classification.
        Defaults to midpoint of range.
    :type pass_threshold: Optional[float]
    :param sampling_params: The sampling parameters for the model.
    :type sampling_params: Optional[Dict[str, Any]]
    :param kwargs: Additional keyword arguments to pass to the grader.
    :type kwargs: Any
    """

    id = "aoai://score_model"

    def __init__(
        self,
        *,
        model_config: Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration],
        input: List[Dict[str, str]],
        model: str,
        name: str,
        range: Optional[List[float]] = None,
        pass_threshold: Optional[float] = None,
        sampling_params: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ):
        # Validate range and pass_threshold
        if range is not None:
            if len(range) != 2 or range[0] >= range[1]:
                raise ValueError("range must be a list of two numbers [min, max] where min < max")
        else:
            range = [0.0, 1.0]  # Default range

        if pass_threshold is not None:
            if range and (pass_threshold < range[0] or pass_threshold > range[1]):
                raise ValueError(f"pass_threshold {pass_threshold} must be within range {range}")
        else:
            pass_threshold = (range[0] + range[1]) / 2  # Default to midpoint

        # Store pass_threshold as instance attribute
        self.pass_threshold = pass_threshold

        # Create OpenAI ScoreModelGrader instance
        grader_kwargs = {"input": input, "model": model, "name": name, "type": "score_model"}

        if range is not None:
            grader_kwargs["range"] = range
        if sampling_params is not None:
            grader_kwargs["sampling_params"] = sampling_params

        grader = ScoreModelGrader(**grader_kwargs)

        super().__init__(model_config=model_config, grader_config=grader, **kwargs)
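
For reference, the constructor's defaulting and validation logic above implies the behavior sketched here; the values and the `model_config` binding are illustrative, not part of the commit:

# Assuming `model_config` was built elsewhere (e.g., AzureOpenAIModelConfiguration(...)).
# Omitting `pass_threshold` defaults it to the midpoint of `range`;
# omitting `range` defaults it to [0.0, 1.0].
grader = AzureOpenAIScoreModelGrader(
    model_config=model_config,
    name="relevance",
    model="gpt-4o-mini",
    input=[{"role": "user", "content": "{{ item.query }}"}],
    range=[1.0, 5.0],  # illustrative five-point scale
)
assert grader.pass_threshold == 3.0  # midpoint of [1.0, 5.0]

# An inconsistent threshold fails fast: pass_threshold=6.0 with range=[1.0, 5.0]
# raises ValueError("pass_threshold 6.0 must be within range [1.0, 5.0]").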

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py

Lines changed: 2 additions & 0 deletions
@@ -317,13 +317,15 @@ def _get_grader_class(model_id: str) -> Type[AzureOpenAIGrader]:
         AzureOpenAILabelGrader,
         AzureOpenAIStringCheckGrader,
         AzureOpenAITextSimilarityGrader,
+        AzureOpenAIScoreModelGrader,
     )

     id_map = {
         AzureOpenAIGrader.id: AzureOpenAIGrader,
         AzureOpenAILabelGrader.id: AzureOpenAILabelGrader,
         AzureOpenAIStringCheckGrader.id: AzureOpenAIStringCheckGrader,
         AzureOpenAITextSimilarityGrader.id: AzureOpenAITextSimilarityGrader,
+        AzureOpenAIScoreModelGrader.id: AzureOpenAIScoreModelGrader,
     }

     for key in id_map.keys():
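
The lookup added here keys off the grader's class-level `id` string. A small check of that mapping, using only the public export (the internal `_get_grader_class` helper resolves graders against the same key):

from azure.ai.evaluation import AzureOpenAIScoreModelGrader

# The registry key registered above is the grader's `id` class attribute.
assert AzureOpenAIScoreModelGrader.id == "aoai://score_model"
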
Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
# ---------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# ---------------------------------------------------------

"""
Sample demonstrating the use of AzureOpenAIScoreModelGrader for continuous
scoring evaluation.

This sample shows how to:
1. Configure an Azure OpenAI model for grading
2. Create a score model grader with custom prompts
3. Run evaluation using the evaluate() method
4. Interpret continuous scoring results

Prerequisites:
- Azure OpenAI resource with API key and endpoint
- Model deployment (e.g., gpt-4, gpt-4o-mini)
- Sample conversation data in JSONL format
- Environment variables configured in .env file
"""

import json
import os
from dotenv import load_dotenv
import pandas as pd
from azure.ai.evaluation import evaluate, AzureOpenAIScoreModelGrader
from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Load environment variables
load_dotenv()


def create_sample_data() -> str:
    """Create sample conversation data for testing."""
    sample_conversations = [
        {
            "conversation": {
                "messages": [
                    {"content": "How can I improve my Python coding skills?", "role": "user"},
                    {
                        "content": (
                            "Here are some effective ways to improve your "
                            "Python skills: 1) Practice coding daily with "
                            "platforms like LeetCode or HackerRank, 2) Work "
                            "on real projects that interest you, 3) Read "
                            "other people's code on GitHub, 4) Join Python "
                            "communities and forums, 5) Take online courses "
                            "or tutorials. The key is consistent practice and "
                            "building projects that challenge you."
                        ),
                        "role": "assistant",
                    },
                ]
            },
            "expected_quality": "high",
            "context": "Programming advice request",
        },
        {
            "conversation": {
                "messages": [
                    {"content": "What is Python?", "role": "user"},
                    {"content": "Python is a programming language.", "role": "assistant"},
                ]
            },
            "expected_quality": "low",
            "context": "Basic definition request",
        },
        {
            "conversation": {
                "messages": [
                    {"content": "Can you explain machine learning concepts for a beginner?", "role": "user"},
                    {
                        "content": (
                            "Machine learning is a subset of artificial "
                            "intelligence where computers learn patterns from "
                            "data without being explicitly programmed. Think "
                            "of it like teaching a child to recognize animals "
                            "- you show them many pictures of cats and dogs, "
                            "and eventually they learn to distinguish between "
                            "them. In ML, we feed algorithms lots of data so "
                            "they can make predictions or decisions on new, "
                            "unseen data. Common types include supervised "
                            "learning (learning from labeled examples), "
                            "unsupervised learning (finding hidden patterns), "
                            "and reinforcement learning (learning through "
                            "trial and error)."
                        ),
                        "role": "assistant",
                    },
                ]
            },
            "expected_quality": "high",
            "context": "Educational explanation request",
        },
        {
            "conversation": {
                "messages": [
                    {"content": "Help me debug this code", "role": "user"},
                    {"content": "Sure, please share your code.", "role": "assistant"},
                ]
            },
            "expected_quality": "medium",
            "context": "Debug assistance request",
        },
        {
            "conversation": {
                "messages": [
                    {"content": "What are the best practices for writing clean Python code?", "role": "user"},
                    {
                        "content": (
                            "Here are key best practices for writing clean "
                            "Python code: 1) Follow PEP 8 style guidelines "
                            "for consistent formatting, 2) Use descriptive "
                            "variable and function names that explain their "
                            "purpose, 3) Write docstrings for functions and "
                            "classes, 4) Keep functions small and focused on "
                            "a single task, 5) Use type hints to improve code "
                            "clarity, 6) Handle exceptions appropriately, "
                            "7) Write unit tests for your code, 8) Use "
                            "virtual environments for dependency management, "
                            "9) Comment complex logic but avoid obvious "
                            "comments, 10) Refactor code regularly to improve "
                            "readability and maintainability."
                        ),
                        "role": "assistant",
                    },
                ]
            },
            "expected_quality": "high",
            "context": "Best practices inquiry",
        },
    ]

    # Create JSONL file
    filename = "sample_conversations.jsonl"
    with open(filename, "w") as f:
        for conv in sample_conversations:
            f.write(json.dumps(conv) + "\n")

    print(f"Created sample data file: {filename}")
    return filename


def demonstrate_score_model_grader():
    """Demonstrate the AzureOpenAIScoreModelGrader usage with real credentials."""

    # Create sample data
    data_file = create_sample_data()

    print("=== Azure OpenAI Score Model Grader Demo ===\n")

    try:
        # 1. Configure Azure OpenAI model using environment variables
        model_config = AzureOpenAIModelConfiguration(
            azure_endpoint=os.environ.get("endpoint"),
            api_key=os.environ.get("key"),
            azure_deployment=os.environ.get("deployment_name"),
            api_version="2024-12-01-preview",
        )

        print("✅ Model configuration loaded successfully")

        # 2. Create conversation quality grader
        conversation_quality_grader = AzureOpenAIScoreModelGrader(
            model_config=model_config,
            name="Conversation Quality Assessment",
            model="gpt-4o-mini",
            input=[
                {
                    "role": "system",
                    "content": (
                        "You are an expert conversation quality evaluator. "
                        "Assess the quality of AI assistant responses based on "
                        "helpfulness, completeness, accuracy, and "
                        "appropriateness. Return a score between 0.0 (very "
                        "poor) and 1.0 (excellent)."
                    ),
                },
                {
                    "role": "user",
                    "content": (
                        "Evaluate this conversation:\n"
                        "Context: {{ item.context }}\n"
                        "Messages: {{ item.conversation }}\n\n"
                        "Provide a quality score from 0.0 to 1.0."
                    ),
                },
            ],
            range=[0.0, 1.0],
            sampling_params={"temperature": 0.0},
        )

        print("✅ Conversation quality grader created successfully")

        # 3. Run evaluation with the score model grader
        print("\n🚀 Running evaluation with score model grader...")

        result = evaluate(
            data=data_file,
            evaluators={"conversation_quality": conversation_quality_grader},
            azure_ai_project=os.environ.get("AZURE_AI_PROJECT_ENDPOINT"),
        )

        # 4. Display results
        print("\n=== Evaluation Results ===")
        print(f"Total samples evaluated: {len(result['rows'])}")

        # Show metrics
        print("\n=== Metrics Summary ===")
        for metric_name, metric_value in result["metrics"].items():
            print(f"{metric_name}: {metric_value:.3f}")

        # Show detailed results
        print("\n=== Sample Results ===")
        df = pd.DataFrame(result["rows"])

        for i, row in df.head(3).iterrows():
            print(f"\nSample {i+1}:")
            print(f"  Context: {row.get('context', 'N/A')}")

            # Show grader results
            for col in df.columns:
                if col.startswith("outputs."):
                    grader_name = col.split(".")[1]
                    if "score" in col:
                        print(f"  {grader_name} Score: {row[col]:.3f}")
                    elif "passed" in col:
                        print(f"  {grader_name} Passed: {row[col]}")

        print("\n✅ Evaluation completed successfully!")

    except Exception as e:
        print(f"\n❌ Error during evaluation: {str(e)}")

    # Clean up
    if os.path.exists(data_file):
        os.remove(data_file)
        print(f"\n🧹 Cleaned up temporary file: {data_file}")


if __name__ == "__main__":
    print("🚀 Starting Azure OpenAI Score Model Grader Demo\n")

    # Check if environment variables are set
    required_vars = ["endpoint", "key", "deployment_name"]

    missing_vars = [var for var in required_vars if not os.environ.get(var)]

    if missing_vars:
        print("⚠️ Missing environment variables:")
        for var in missing_vars:
            print(f"  - {var}")
    else:
        print("✅ All environment variables found")
        demonstrate_score_model_grader()

    print("\n🎉 Demo completed!")
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
{"conversation": {"messages": [{"role": "user", "content": "How can I improve my Python coding skills?"}, {"role": "assistant", "content": "Here are some effective ways to improve your Python skills: 1) Practice coding daily with platforms like LeetCode or HackerRank, 2) Work on real projects that interest you, 3) Read other people's code on GitHub, 4) Join Python communities and forums, 5) Take online courses or tutorials."}]}, "context": "Programming advice", "expected_quality": "high"}
{"conversation": {"messages": [{"role": "user", "content": "What is Python?"}, {"role": "assistant", "content": "Python is a programming language."}]}, "context": "Basic definition", "expected_quality": "low"}
{"conversation": {"messages": [{"role": "user", "content": "Can you explain machine learning for a beginner?"}, {"role": "assistant", "content": "Machine learning is a subset of artificial intelligence where computers learn patterns from data without being explicitly programmed. Think of it like teaching a child to recognize animals - you show them many pictures of cats and dogs, and eventually they learn to distinguish between them."}]}, "context": "Educational explanation", "expected_quality": "high"}
