|All quality evaluators except for `GroundednessProEvaluator`| Supported | Supported | Set the additional parameter `is_reasoning_model=True` when initializing the evaluators |
|`GroundednessProEvaluator`| User does not need to provide a model | User does not need to provide a model | -- |

For complex tasks that require refined reasoning during evaluation, we recommend a strong reasoning model such as `o3-mini`, or a later o-series mini model, which balances reasoning performance with cost efficiency.
```python
model_config = {
    # ... other model connection settings ...
    "api_version": os.getenv("AZURE_API_VERSION"),
}

# example config for a reasoning model
reasoning_model_config = {
    "azure_deployment": "o3-mini",
    "api_key": os.getenv("AZURE_API_KEY"),
    "azure_endpoint": os.getenv("AZURE_ENDPOINT"),
    "api_version": os.getenv("AZURE_API_VERSION"),
}

# Evaluators you may want to use reasoning models with
quality_evaluators = {
    evaluator.__name__: evaluator(model_config=reasoning_model_config, is_reasoning_model=True)
    for evaluator in [IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator]
}

# Other evaluators you may NOT want to use reasoning models with
quality_evaluators.update({
    evaluator.__name__: evaluator(model_config=model_config)
    for evaluator in [CoherenceEvaluator, FluencyEvaluator, RelevanceEvaluator]
})

## Using Azure AI Foundry (non-Hub) project endpoint, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
```
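For illustration, the following is a minimal sketch of invoking one of the evaluators configured above on a single query and response pair. The sample strings are made up, and the exact output keys depend on the evaluator.

```python
# Minimal sketch: run one configured evaluator on a single query/response pair.
# The sample query and response below are illustrative only.
relevance_evaluator = quality_evaluators["RelevanceEvaluator"]
result = relevance_evaluator(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)
print(result)  # expect keys such as 'relevance', 'relevance_reason', 'relevance_result'
```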
AI-assisted quality evaluators provide a result for a query and response pair. The result includes:

- `{metric_name}`: Provides a numerical score, on a Likert scale (integer 1 to 5) or a float between 0 and 1.
- `{metric_name}_label`: Provides a binary label (if the metric naturally outputs a binary score).
- `{metric_name}_reason`: Explains why a certain score or label was given for each data point.
- `details`: Optional output containing debugging information about the quality of a single agent run.
To further improve intelligibility, all evaluators accept a binary threshold (unless their outputs are already binary) and output two new keys. A default binarization threshold is set, which the user can override; a brief sketch of overriding it follows the list below. The two new keys are:

- `{metric_name}_result`: A "pass" or "fail" string based on the binarization threshold.
- `{metric_name}_threshold`: The numerical binarization threshold, set by default or overridden by the user.
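For example, overriding the default threshold might look like the sketch below. The `threshold` keyword argument is an assumption based on the description above rather than something shown on this page, so check the evaluator reference for the exact parameter name.

```python
# Sketch: construct an evaluator with a stricter binarization threshold.
# NOTE: the `threshold` parameter name is an assumption; verify against the SDK reference.
strict_relevance = RelevanceEvaluator(model_config=model_config, threshold=4)
result = strict_relevance(
    query="What is the capital of France?",
    response="Paris.",
)
# With a threshold of 4, 'relevance_result' is "pass" only for scores of 4 or higher.
```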
See the following example output for some evaluators: