
Commit b01eebb

updated agent eval docs
1 parent d0640d2 commit b01eebb

File tree: 2 files changed, +10 -7 lines changed


articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md

Lines changed: 4 additions & 2 deletions
@@ -244,8 +244,10 @@ AI systems can fabricate content or generate irrelevant responses outside the gi…
 ### Groundedness Pro example
 
 ```python
-import os
 from azure.ai.evaluation import GroundednessProEvaluator
+from azure.identity import DefaultAzureCredential
+
+import os
 from dotenv import load_dotenv
 load_dotenv()
 
@@ -258,7 +260,7 @@ azure_ai_project = {
 ## Using Azure AI Foundry Development Platform, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
 azure_ai_project = os.environ.get("AZURE_AI_PROJECT")
 
-groundedness_pro = GroundednessProEvaluator(azure_ai_project=azure_ai_project),
+groundedness_pro = GroundednessProEvaluator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
 groundedness_pro(
     query="Is Marie Curie is born in Paris?",
     context="Background: 1. Marie Curie is born on November 7, 1867. 2. Marie Curie is born in Warsaw.",

articles/ai-foundry/how-to/develop/agent-evaluate-sdk.md

Lines changed: 6 additions & 5 deletions
@@ -172,8 +172,8 @@ And that's it! `converted_data` contains all inputs required for [these evaluato…
 
 | Evaluators | Reasoning Models as Judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as Judge (example: gpt-4.1, gpt-4o, etc.) | To enable |
 |--|--|--|--|
-| `Intent Resolution`, `Task Adherence`, `Tool Call Accuracy`, `Response Completeness`| Supported | Supported | Set additional parameter `is_reasoning_model=True` in initializing evaluators |
-| Other quality evaluators| Not Supported | Supported | -- |
+| All quality evaluators except for `GroundednessProEvaluator` | Supported | Supported | Set the additional parameter `is_reasoning_model=True` when initializing evaluators |
+| `GroundednessProEvaluator` | User does not need to supply a model | User does not need to supply a model | -- |
 
 For complex tasks that require refined reasoning for the evaluation, we recommend a strong reasoning model like `o3-mini` or the o-series mini models released afterwards with a balance of reasoning performance and cost efficiency.
 
@@ -197,17 +197,18 @@ model_config = {
     "api_version": os.getenv("AZURE_API_VERSION"),
 }
 
+# Example config for a reasoning model
 reasoning_model_config = {
     "azure_deployment": "o3-mini",
     "api_key": os.getenv("AZURE_API_KEY"),
     "azure_endpoint": os.getenv("AZURE_ENDPOINT"),
     "api_version": os.getenv("AZURE_API_VERSION"),
 }
 
-# Evaluators with reasoning model support
+# Evaluators you may want to use reasoning models with
 quality_evaluators = {evaluator.__name__: evaluator(model_config=reasoning_model_config, is_reasoning_model=True) for evaluator in [IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator]}
 
-# Other evaluators do not support reasoning models
+# Other evaluators you may NOT want to use reasoning models with
 quality_evaluators.update({ evaluator.__name__: evaluator(model_config=model_config) for evaluator in [CoherenceEvaluator, FluencyEvaluator, RelevanceEvaluator]})
 
 ## Using Azure AI Foundry (non-Hub) project endpoint, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
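
As an aside on the snippet above: once `quality_evaluators` is built, one way to apply it is to call each evaluator on the converted agent data. The loop below is only a sketch; it assumes `converted_data` (produced earlier in the article) is a dict whose keys line up with each evaluator's expected keyword arguments, and `results` is an illustrative name.

```python
# Sketch: run every configured evaluator over the converted agent run.
# Assumes converted_data maps cleanly onto each evaluator's keyword arguments.
results = {}
for name, evaluator in quality_evaluators.items():
    results[name] = evaluator(**converted_data)

for name, result in results.items():
    print(name, result)
```
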
@@ -233,12 +234,12 @@ AI-assisted quality evaluators provide a result for a query and response pair. T…
 - `{metric_name}`: Provides a numerical score, on a Likert scale (integer 1 to 5) or a float between 0 and 1.
 - `{metric_name}_label`: Provides a binary label (if the metric naturally outputs a binary score).
 - `{metric_name}_reason`: Explains why a certain score or label was given for each data point.
+- `details`: Optional output containing debugging information about the quality of a single agent run.
 
 To further improve intelligibility, all evaluators accept a binary threshold (unless their outputs are already binary) and output two new keys. For the binarization threshold, a default is set, which the user can override. The two new keys are:
 
 - `{metric_name}_result`: A "pass" or "fail" string based on a binarization threshold.
 - `{metric_name}_threshold`: A numerical binarization threshold set by default or by the user.
-- `additional_details`: Contains debugging information about the quality of a single agent run.
 
 See the following example output for some evaluators:
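
To make the key naming scheme above concrete, the following is a hypothetical per-metric output shape. The metric name, score, reason text, and `details` payload are all illustrative and not taken from a real evaluation run.

```python
# Hypothetical output for a single metric, following the key scheme described above.
example_result = {
    "intent_resolution": 4.0,            # {metric_name}: numerical score (Likert 1-5 or 0-1 float)
    "intent_resolution_result": "pass",  # {metric_name}_result: score compared against the threshold
    "intent_resolution_threshold": 3,    # {metric_name}_threshold: default or user-set
    "intent_resolution_reason": "The response fully resolves the user's intent ...",  # {metric_name}_reason
    "details": {},                       # optional debugging information for the agent run
}
```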
