- Query: the query sent to the generative AI application
- Response: the response to the query generated by the generative AI application
#### Set up
1. For AI-assisted quality evaluators except for `GroundednessProEvaluator`, you must specify a GPT model to act as a judge to score the evaluation data. Choose a deployment with a GPT-3.5, GPT-4, GPT-4o, or GPT-4-mini model for your calculations and set it as your `model_config`. We support both the Azure OpenAI and OpenAI model configuration schemas (see the configuration sketch after this list). We recommend using GPT models that don't have the `(preview)` suffix for the best performance and parseable responses with our evaluators.
> [!NOTE]
> Make sure you have at least the `Cognitive Services OpenAI User` role for the Azure OpenAI resource to make inference calls with an API key. For more information about permissions, see [permissions for the Azure OpenAI resource](../../../ai-services/openai/how-to/role-based-access-control.md#summary).
2. For `GroundednessProEvaluator`, instead of a GPT deployment in `model_config`, you must provide your `azure_ai_project` information. This accesses the backend evaluation service of your Azure AI project.
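
For reference, here's a minimal sketch of both configuration objects, assuming an Azure OpenAI judge deployment; every endpoint, key, deployment, and project value below is a placeholder, not a real value:

```python
import os

# Judge-model configuration for the prompt-based quality evaluators.
# Assumes the Azure OpenAI configuration schema; all values are placeholders.
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": "<your-gpt-deployment-name>",
}

# Project information for GroundednessProEvaluator, which calls the
# evaluation service of your Azure AI project instead of a GPT deployment.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}
```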
#### Performance and quality evaluator usage
```python
model_config = {
    # ... your Azure OpenAI or OpenAI model configuration from the set-up step ...
}

from azure.ai.evaluation import GroundednessProEvaluator, GroundednessEvaluator

# Initializing Groundedness and Groundedness Pro evaluators
# (azure_ai_project and credential are assumed to be set up as described above)
groundedness_eval = GroundednessEvaluator(model_config)
groundedness_pro_eval = GroundednessProEvaluator(azure_ai_project=azure_ai_project, credential=credential)

query_response = dict(
    query="Which tent is the most waterproof?",
    context="The Alpine Explorer Tent is the second most water-proof of all tents available.",
    response="The Alpine Explorer Tent is the most waterproof."
)

# Running Groundedness Evaluator on a query and response pair
groundedness_score = groundedness_eval(
    **query_response
)
print(groundedness_score)

groundedness_pro_score = groundedness_pro_eval(
    **query_response
)
print(groundedness_pro_score)
```
Here's an example of the result for a query and response pair:

```python
# Evaluation Service-based Groundedness Pro score:
{
    'groundedness_pro_label': False,
    'groundedness_pro_reason': '\'The Alpine Explorer Tent is the most waterproof.\' is ungrounded because "The Alpine Explorer Tent is the second most water-proof of all tents available." Thus, the tagged word [ Alpine Explorer Tent ] being the most waterproof is a contradiction.'
}

# Open-source prompt-based Groundedness score:
{
    'groundedness': 3.0,
    'gpt_groundedness': 3.0,
    'groundedness_reason': 'The response attempts to answer the query but contains incorrect information, as it contradicts the context by stating the Alpine Explorer Tent is the most waterproof when the context specifies it is the second most waterproof.'
}
```
The result of the AI-assisted quality evaluators for a query and response pair is a dictionary containing:
- `{metric_name}` provides a numerical score.
- `{metric_name}_label` provides a binary label.
- `{metric_name}_reason` explains why a certain score or label was given for each data point.
For NLP evaluators, only a score is given in the `{metric_name}` key.
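
For example, here's a minimal sketch of consuming these keys programmatically, assuming the `groundedness_score` and `groundedness_pro_score` results from the example above and an illustrative score threshold:

```python
# Gate on the prompt-based score (1-5) and surface its reasoning.
if groundedness_score["groundedness"] < 4:
    print("Low groundedness:", groundedness_score["groundedness_reason"])

# Gate on the service-based binary label and surface its reasoning.
if not groundedness_pro_score["groundedness_pro_label"]:
    print("Ungrounded content detected:", groundedness_pro_score["groundedness_pro_reason"])
```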
Like six other AI-assisted evaluators, `GroundednessEvaluator` is a prompt-based evaluator that outputs a score on a 5-point scale (the higher the score, the more grounded the result). `GroundednessProEvaluator`, on the other hand, invokes our backend evaluation service powered by Azure AI Content Safety and outputs `True` if all content is grounded, or `False` if any ungrounded content is detected.
We open-source the prompts of our quality evaluators, except for `GroundednessProEvaluator` (powered by Azure AI Content Safety), for transparency. These prompts serve as instructions for a language model to perform its evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics (what the five levels of quality mean for the metric). We highly recommend that users customize the definitions and grading rubrics to their scenario specifics.
For conversation mode, here is an example for `GroundednessEvaluator`:
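
A minimal sketch, assuming the SDK's `messages`-based conversation schema; the turns below are illustrative:

```python
# A conversation is a dict with a "messages" list; each assistant turn can
# carry its own "context" for groundedness evaluation.
conversation = {
    "messages": [
        {"role": "user", "content": "Which tent is the most waterproof?"},
        {
            "role": "assistant",
            "content": "The Alpine Explorer Tent is the most waterproof.",
            "context": "The Alpine Explorer Tent is the second most water-proof of all tents available.",
        },
    ]
}

groundedness_conv_score = groundedness_eval(conversation=conversation)
print(groundedness_conv_score)
```

In the SDK, per-turn results are typically aggregated into a mean score for the conversation, with the individual turn results also available in the output.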
Currently AI-assisted risk and safety metrics are only available in the following regions:

| Region | Hate and unfairness, Sexual, Violent, Self-harm, Indirect attack | Protected material |
|---|---|---|
|UK South | Will be deprecated 12/1/24 | N/A |
|East US 2 | Supported | Supported |
|Sweden Central | Supported | N/A |
|France Central | Supported | N/A |
|Switzerland West | Supported | N/A |
After local evaluations of your generative AI applications, you may want to trigger a remote evaluation run. You need the following prerequisites:
- An Azure AI project in the same [regions](#region-support) as the risk and safety evaluators. If you don't have an existing project, follow the guide [How to create Azure AI project](../create-projects.md?tabs=ai-studio) to create one.
> [!NOTE]
> Remote evaluations do not support `GroundednessProEvaluator`, `RetrievalEvaluator`, `ProtectedMaterialEvaluator`, `DirectAttackEvaluator`, and `IndirectAttackEvaluator`.
- An Azure OpenAI deployment with a GPT model supporting `chat completion`, for example `gpt-4`.
- `Connection String` for the Azure AI project to easily create an `AIProjectClient` object. You can get the **Project connection string** under **Project details** from the project's **Overview** page.
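
For instance, here's a minimal sketch of creating that client, assuming the `azure-ai-projects` and `azure-identity` packages and a placeholder connection string:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

# Paste the Project connection string from the project's Overview page.
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<your-project-connection-string>",
)
```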
We provide a list of built-in evaluators registered in the Evaluator library. Import the ones you need, for example:

```python
from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator
```
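
As a sketch of what you can do next (assuming the built-in evaluator classes expose a registry `id` attribute, as in recent `azure-ai-evaluation` versions), you can look up the ids used to reference these registered evaluators in a remote evaluation run:

```python
# Registry ids identify the built-in evaluators when configuring a remote run.
print("F1 score evaluator id:", F1ScoreEvaluator.id)
print("Relevance evaluator id:", RelevanceEvaluator.id)
print("Violence evaluator id:", ViolenceEvaluator.id)
```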