|`ContentSafetyEvaluator`| Required: String | Required: String | N/A | N/A | Supported for text and image |

- Query: the query sent to the generative AI application
- Response: the response to the query generated by the generative AI application
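As a minimal sketch of the input shape described above, the snippet below passes a query/response pair to an evaluator as keyword arguments. The `toy_evaluator` function is a hypothetical stand-in, not the real `ContentSafetyEvaluator`; only the `query` and `response` field names come from the text.

```python
# Hypothetical stand-in evaluator illustrating the query/response input shape.
# Both fields are required strings, matching the table row above.
def toy_evaluator(*, query: str, response: str) -> dict:
    """Validate the two required string fields and echo a result dict."""
    if not isinstance(query, str) or not isinstance(response, str):
        raise TypeError("query and response are required strings")
    return {"query": query, "response": response, "evaluated": True}

row = {
    "query": "What is the capital of France?",
    "response": "Paris is the capital of France.",
}
result = toy_evaluator(**row)
```

In the real SDK the evaluator instance is called the same way, with `query=` and `response=` keyword arguments per row.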
The result of the AI-assisted quality evaluators for a query and response pair is a dictionary containing:

- `{metric_name}_label` provides a binary label.
- `{metric_name}_reason` explains why a certain score or label was given for each data point.
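To make the key pattern concrete, here is a hedged sketch of unpacking such a result dictionary. The sample metric name and values are invented for illustration; only the `{metric_name}`, `{metric_name}_label`, and `{metric_name}_reason` key pattern comes from the text.

```python
# Hypothetical evaluator result following the key pattern described above.
sample_result = {
    "groundedness": 4.0,           # numeric score (AI-assisted quality evaluator)
    "groundedness_label": "pass",  # binary label
    "groundedness_reason": "All claims in the response are supported.",
}

def unpack(result: dict, metric_name: str) -> tuple:
    """Pull the score, label, and reason fields for one metric, if present."""
    return (
        result.get(metric_name),
        result.get(f"{metric_name}_label"),
        result.get(f"{metric_name}_reason"),
    )

score, label, reason = unpack(sample_result, "groundedness")
```

For an NLP evaluator, only the `{metric_name}` score key would be present, so `unpack` would return `None` for the label and reason.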
#### Comparing quality and custom evaluators
For NLP evaluators, only a score is given in the `{metric_name}` key.
Like 6 other AI-assisted evaluators, `GroundednessEvaluator` is a prompt-based evaluator that outputs a score on a 5-point scale (the higher the score, the more grounded the result is). On the other hand, `GroundednessProEvaluator` (preview) invokes our backend evaluation service powered by Azure AI Content Safety and outputs `True` if all content is grounded, or `False` if any ungrounded content is detected.
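The two evaluators therefore emit different output shapes: a 1-5 score versus a boolean. If you need to compare them side by side, a small helper can map the 5-point score onto a pass/fail label. This is a hypothetical sketch; the threshold of 3 is an assumption for illustration, not an SDK default.

```python
# Hypothetical helper mapping a 5-point groundedness score onto a boolean,
# so results from the two evaluators can be compared on one scale.
# The threshold value is an assumed example, not taken from the SDK.
def score_to_pass(score: float, threshold: float = 3.0) -> bool:
    """Return True if a 1-5 groundedness score meets the pass threshold."""
    if not 1.0 <= score <= 5.0:
        raise ValueError("expected a score on the 1-5 scale")
    return score >= threshold
```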
For transparency, we open-source the prompts of our quality evaluators, except for `GroundednessProEvaluator` (powered by Azure AI Content Safety). These prompts serve as instructions for a language model to perform its evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics (what the 5 levels of quality mean for the metric). We highly recommend that users customize the definitions and grading rubrics to their scenario specifics. See details in [Custom Evaluators](#custom-evaluators).
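To show what customizing a definition and rubric might look like, here is a hedged sketch of a metric definition with a 5-level rubric as a prompt template. The wording of the definition and rubric levels is invented for illustration; the real prompts live in the open-sourced evaluator assets and should be your starting point.

```python
# Hypothetical customized metric definition and 5-level scoring rubric,
# in the spirit of the prompt-based quality evaluators described above.
# All rubric text here is invented example content.
GROUNDEDNESS_RUBRIC = """\
Definition: Groundedness measures whether every claim in the response is
supported by the provided context.
Scoring rubric (1-5):
1: The response is entirely unsupported by the context.
2: Most claims in the response are unsupported.
3: Some claims are supported by the context, others are not.
4: Nearly all claims are supported by the context.
5: Every claim is directly supported by the context.
Context: {context}
Response: {response}
Score:"""

prompt = GROUNDEDNESS_RUBRIC.format(
    context="Paris is the capital of France.",
    response="The capital of France is Paris.",
)
```

Editing the definition line and the five level descriptions is usually enough to adapt such a rubric to a new scenario.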