Commit 4a26ea6

updated remote eval service support; g.Pro comparison

committed · 1 parent c125ee4 · commit 4a26ea6

File tree

1 file changed (+53, −43 lines)

articles/ai-studio/how-to/develop/evaluate-sdk.md

Lines changed: 53 additions & 43 deletions
@@ -60,28 +60,28 @@ Built-in evaluators can accept *either* query and response pairs or a list of co
 - Query and response pairs in `.jsonl` format with the required inputs.
 - List of conversations in `.jsonl` format in the following section.
 
-| Evaluator type | Evaluator | `query` | `response` | `context` | `ground_truth` | `conversation` |
-|-----|----------------|---------------|---------------|---------------|---------------|-----------|
-|AI-assisted performance and quality evaluators| `GroundednessEvaluator` | Optional: String | Required: String | Required: String | N/A | Supported |
-|| `GroundednessProEvaluator` | Required: String | Required: String | Required: String | N/A | Supported |
-|| `RetrievalEvaluator` | Required: String | N/A | Required: String | N/A | Supported |
-|| `RelevanceEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
-|| `CoherenceEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
-|| `FluencyEvaluator` | N/A | Required: String | N/A | N/A | Supported |
-|| `SimilarityEvaluator` | Required: String | Required: String | N/A | Required: String | Not supported |
-|Natural language processing (NLP) evaluators| `F1ScoreEvaluator` | N/A | Required: String | N/A | Required: String | Not supported |
-|| `RougeScoreEvaluator` | N/A | Required: String | N/A | Required: String | Not supported |
-|| `GleuScoreEvaluator` | N/A | Required: String | N/A | Required: String | Not supported |
-|| `BleuScoreEvaluator` | N/A | Required: String | N/A | Required: String | Not supported |
-|| `MeteorScoreEvaluator` | N/A | Required: String | N/A | Required: String | Not supported |
-|Risk and safety evaluators| `ViolenceEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
-|| `SexualEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
-|| `SelfHarmEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
-|| `HateUnfairnessEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
-|| `IndirectAttackEvaluator` | Required: String | Required: String | Required: String | N/A | Supported |
-|| `ProtectedMaterialEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
-|Composite Evaluators| `QAEvaluator` | Required: String | Required: String | Required: String | N/A | Not supported |
-|| `ContentSafetyEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
+| Evaluator | `query` | `response` | `context` | `ground_truth` | `conversation` |
+|----------------|---------------|---------------|---------------|---------------|-----------|
+| `GroundednessEvaluator` | Optional: String | Required: String | Required: String | N/A | Supported |
+| `GroundednessProEvaluator` | Required: String | Required: String | Required: String | N/A | Supported |
+| `RetrievalEvaluator` | Required: String | N/A | Required: String | N/A | Supported |
+| `RelevanceEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
+| `CoherenceEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
+| `FluencyEvaluator` | N/A | Required: String | N/A | N/A | Supported |
+| `SimilarityEvaluator` | Required: String | Required: String | N/A | Required: String | Not supported |
+| `F1ScoreEvaluator` | N/A | Required: String | N/A | Required: String | Not supported |
+| `RougeScoreEvaluator` | N/A | Required: String | N/A | Required: String | Not supported |
+| `GleuScoreEvaluator` | N/A | Required: String | N/A | Required: String | Not supported |
+| `BleuScoreEvaluator` | N/A | Required: String | N/A | Required: String | Not supported |
+| `MeteorScoreEvaluator` | N/A | Required: String | N/A | Required: String | Not supported |
+| `ViolenceEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
+| `SexualEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
+| `SelfHarmEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
+| `HateUnfairnessEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
+| `IndirectAttackEvaluator` | Required: String | Required: String | Required: String | N/A | Supported |
+| `ProtectedMaterialEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
+| `QAEvaluator` | Required: String | Required: String | Required: String | N/A | Not supported |
+| `ContentSafetyEvaluator` | Required: String | Required: String | N/A | N/A | Supported |
 
 - Query: the query sent in to the generative AI application
 - Response: the response to the query generated by the generative AI application
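For illustration, a single-turn row in such a `.jsonl` file might look like the following sketch; the file name and field values are placeholders, and only the fields required by your chosen evaluators (per the table above) need to be present:

```python
import json

# One JSON object per line; include only the fields your evaluators require.
row = {
    "query": "Which tent is the most waterproof?",
    "response": "The Alpine Explorer Tent is the most waterproof.",
    "context": "The Alpine Explorer Tent is the most water-proof of all tents available.",
    "ground_truth": "The Alpine Explorer Tent.",
}

with open("data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(row) + "\n")
```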
@@ -133,11 +133,12 @@ When using AI-assisted performance and quality metrics,
 
 #### Set up
 
-1. For AI-assisted performance and quality evaluators, you must specify a GPT model to act as a judge to score the evaluation data. Choose a deployment with either GPT-3.5, GPT-4, GPT-4o or GPT-4-mini model for your calculations and set it as your `model_config`. We support both Azure OpenAI or OpenAI model configuration schema. We recommend using GPT models that do not have the `(preview)` suffix for the best performance and parseable responses with our evaluators.
+1. For AI-assisted quality evaluators except for `GroundednessProEvaluator`, you must specify a GPT model to act as a judge to score the evaluation data. Choose a deployment with a GPT-3.5, GPT-4, GPT-4o, or GPT-4o-mini model for your calculations and set it as your `model_config`. We support both the Azure OpenAI and OpenAI model configuration schemas. We recommend using GPT models that don't have the `(preview)` suffix for the best performance and parseable responses with our evaluators.
 
-Make sure the you have at least `Cognitive Services OpenAI User` role for the Azure OpenAI resource to make inference calls with API key. For more permissions, learn more about [permissioning for Azure OpenAI resource](../../../ai-services/openai/how-to/role-based-access-control.md#summary).
+> [!NOTE]
+> Make sure you have at least the `Cognitive Services OpenAI User` role for the Azure OpenAI resource to make inference calls with an API key. For more permissions, learn more about [permissioning for Azure OpenAI resource](../../../ai-services/openai/how-to/role-based-access-control.md#summary).
 
-2. For `GroundednessProEvaluator`, instead of a GPT deployment in `model_config`, you must provide your `azure_ai_project` information. This accesses the evaluation service of your Azure AI project.
+2. For `GroundednessProEvaluator`, instead of a GPT deployment in `model_config`, you must provide your `azure_ai_project` information. This accesses the backend evaluation service of your Azure AI project.
 
 
 #### Performance and quality evaluator usage
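As a minimal sketch (assuming the Azure OpenAI configuration schema and placeholder project values), the two objects referenced in the setup steps above could be assembled like this:

```python
import os
from azure.identity import DefaultAzureCredential

# Judge-model configuration for the prompt-based quality evaluators.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",     # your judge model deployment name
    "api_version": "2024-06-01",      # example API version
}

# Azure AI project information used by GroundednessProEvaluator
# and the risk and safety evaluators.
azure_ai_project = {
    "subscription_id": "<your-subscription-id>",
    "resource_group_name": "<your-resource-group>",
    "project_name": "<your-project-name>",
}

credential = DefaultAzureCredential()
```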
@@ -164,25 +165,26 @@ model_config = {
 }
 
 
-
 from azure.ai.evaluation import GroundednessProEvaluator, GroundednessEvaluator
 
 # Initializing Groundedness and Groundedness Pro evaluators
 groundedness_eval = GroundednessEvaluator(model_config)
 groundedness_pro_eval = GroundednessProEvaluator(azure_ai_project=azure_ai_project, credential=credential)
 
-# Running Groundedness Evaluator on a query and response pair
-groundedness_score = groundedness_eval(
+query_response = dict(
     query="Which tent is the most waterproof?",
     context="The Alpine Explorer Tent is the most water-proof of all tents available.",
     response="The Alpine Explorer Tent is the most waterproof."
 )
+
+# Running Groundedness Evaluator on a query and response pair
+groundedness_score = groundedness_eval(
+    **query_response
+)
 print(groundedness_score)
 
 groundedness_pro_score = groundedness_pro_eval(
-    query="Which tent is the most waterproof?",
-    context="The Alpine Explorer Tent is the most water-proof of all tents available.",
-    response="The Alpine Explorer Tent is the most waterproof."
+    **query_response
 )
 print(groundedness_pro_score)
 
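The same evaluator instances can also be run over a whole dataset; a minimal sketch, assuming a local `data.jsonl` whose column names match the evaluator inputs, might use the SDK's `evaluate` function:

```python
from azure.ai.evaluation import evaluate

# Batch-run the evaluators defined above over every row of the dataset.
result = evaluate(
    data="data.jsonl",   # placeholder path; one JSON object per line
    evaluators={
        "groundedness": groundedness_eval,
        "groundedness_pro": groundedness_pro_eval,
    },
    output_path="./evaluation_results.json",  # optional local copy of the results
)

print(result["metrics"])  # aggregate metrics across all rows
```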
@@ -193,21 +195,29 @@ Here's an example of the result for a query and response pair:
 For
 ```python
 
-# GroundednessEvaluator result:
-{
-    'groundedness.gpt_groundedness': 5.0,
-    'groundedness.groundedness': 5.0,
-    'groundedness.groundedness_reason': "The response is perfectly relevant to the query, as it directly addresses the aspect the query is seeking."
+# Evaluation Service-based Groundedness Pro score:
+{
+    'groundedness_pro_label': False,
+    'groundedness_pro_reason': '\'The Alpine Explorer Tent is the most waterproof.\' is ungrounded because "The Alpine Explorer Tent is the second most water-proof of all tents available." Thus, the tagged word [ Alpine Explorer Tent ] being the most waterproof is a contradiction.'
 }
-# GroundednessProEvaluator result:
-{
-    'groundedness_pro_label': True,
-    'groundedness_pro_reason': 'All Contents are grounded'
+# Open-source prompt-based Groundedness score:
+{
+    'groundedness': 3.0,
+    'gpt_groundedness': 3.0,
+    'groundedness_reason': 'The response attempts to answer the query but contains incorrect information, as it contradicts the context by stating the Alpine Explorer Tent is the most waterproof when the context specifies it is the second most waterproof.'
 }
 
 ```
+The result of the AI-assisted quality evaluators for a query and response pair is a dictionary containing:
+- `{metric_name}` provides a numerical score.
+- `{metric_name}_label` provides a binary label.
+- `{metric_name}_reason` provides text reasoning for why a certain score or label was given for each data point.
 
-Like 6 other AI-assisted evaluators, `GroundednessEvaluator` is open-source prompt-based evaluator that outputs a score on a 5-point scale (the higher the score, the more grounded the result is). On the other hand, `GroundednessProEvaluator` invokes our backend evaluation service powered by Azure AI Content Safety and outputs `True` or `False` for grounded and ungrounded response.
+For NLP evaluators, only a score is given in the `{metric_name}` key.
+
+Like 6 other AI-assisted evaluators, `GroundednessEvaluator` is a prompt-based evaluator that outputs a score on a 5-point scale (the higher the score, the more grounded the result is). On the other hand, `GroundednessProEvaluator` invokes our backend evaluation service powered by Azure AI Content Safety and outputs `True` if all content is grounded, or `False` if any ungrounded content is detected.
+
+We open-source the prompts of our quality evaluators except for `GroundednessProEvaluator` (powered by Azure AI Content Safety) for transparency. These prompts serve as instructions for a language model to perform their evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics (what the 5 levels of quality mean for the metric). We highly recommend that users customize the definitions and grading rubrics to the specifics of their scenario.
 
 For conversation mode, here is an example for `GroundednessEvaluator`:
 
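The example itself falls outside this hunk; as an illustrative sketch only (the message schema and context placement here are assumptions), a conversation-mode input could look like:

```python
# A conversation is a dict with a "messages" list; grounding context is attached
# to the turns that supply it. Values below are placeholders.
conversation = {
    "messages": [
        {"role": "user", "content": "Which tent is the most waterproof?"},
        {
            "role": "assistant",
            "content": "The Alpine Explorer Tent is the most waterproof.",
            "context": "The Alpine Explorer Tent is the second most water-proof of all tents available.",
        },
    ]
}

# Per-turn results are aggregated into an overall score for the conversation.
groundedness_conv_score = groundedness_eval(conversation=conversation)
print(groundedness_conv_score)
```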
@@ -252,7 +262,6 @@ Currently AI-assisted risk and safety metrics are only available in the followin
 |UK South | Will be deprecated 12/1/24 | N/A |
 |East US 2 | Supported | Supported |
 |Sweden Central | Supported | N/A |
-|US North Central | Supported | N/A |
 |France Central | Supported | N/A |
 |Switzerland West | Supported | N/A |
 
@@ -659,7 +668,7 @@ After local evaluations of your generative AI applications, you may want to trig
 - Azure AI project in the same [regions](#region-support) as risk and safety evaluators. If you do not have an existing project, follow the guide [How to create Azure AI project](../create-projects.md?tabs=ai-studio) to create one.
 
 > [!NOTE]
-> Currently remote evaluations do not yet support `Groundedness-Pro-Evaluator`.
+> Remote evaluations do not support `Groundedness-Pro-Evaluator`, `Retrieval-Evaluator`, `Protected-Material-Evaluator`, `Direct-Attack-Evaluator`, and `Indirect-Attack-Evaluator`.
 
 - Azure OpenAI Deployment with GPT model supporting `chat completion`, for example `gpt-4`.
 - `Connection String` for Azure AI project to easily create `AIProjectClient` object. You can get the **Project connection string** under **Project details** from the project's **Overview** page.
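As a sketch under stated assumptions (the `AIProjectClient` class comes from the `azure-ai-projects` package and the connection string below is a placeholder), creating the client from the project connection string typically looks like:

```python
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

# Build the project client from the Project connection string found on the
# project's Overview page (placeholder shown here).
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<region>.api.azureml.ms;<subscription-id>;<resource-group>;<project-name>",
)
```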
@@ -721,6 +730,7 @@ We provide a list of built-in evaluators registered in the [Evaluator library](.
 from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator
 print("F1 Score evaluator id:", F1ScoreEvaluator.id)
 ```
+
 - **From UI**: Follow these steps to fetch evaluator ids after they are registered to your project:
   - Select the **Evaluation** tab in your Azure AI project;
   - Select Evaluator library;
0 commit comments

Comments
 (0)