Commit 8a4c945

Merge pull request #283919 from lgayhardt/aistudiofloweval0824
Update flow-evaluate-sdk.md
2 parents b18d6d6 + 057e117 commit 8a4c945

1 file changed

+115
-9
lines changed

articles/ai-studio/how-to/develop/flow-evaluate-sdk.md

Lines changed: 115 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -7,12 +7,11 @@ ms.service: azure-ai-studio
77
ms.custom:
88
- build-2024
99
ms.topic: how-to
10-
ms.date: 5/21/2024
10+
ms.date: 08/07/2024
1111
ms.reviewer: dantaylo
1212
ms.author: eur
1313
author: eric-urban
1414
---
15-
1615
# Evaluate with the prompt flow SDK
1716

1817
[!INCLUDE [Feature preview](~/reusable-content/ce-skilling/azure/includes/ai-studio/includes/feature-preview.md)]
@@ -51,7 +50,10 @@ Built-in composite evaluators are composed of individual evaluators.
5150
- `ContentSafetyEvaluator` combines all the safety evaluators for a single output of combined metrics for question and answer pairs
5251
- `ContentSafetyChatEvaluator` combines all the safety evaluators for a single output of combined metrics for chat messages following the OpenAI message protocol that can be found [here](https://platform.openai.com/docs/api-reference/messages/object#messages/object-content).
5352

54-
### Required data input for built-in evaluators
53+
> [!TIP]
54+
> For more information about inputs and outputs, see the [Prompt flow Python reference documentation](https://microsoft.github.io/promptflow/reference/python-library-reference/promptflow-evals/promptflow.evals.evaluators.html).
55+
56+
### Data requirements for built-in evaluators
5557
We require question and answer pairs in `.jsonl` format with the required inputs, and column mapping for evaluating datasets, as follows:
5658

5759
| Evaluator | `question` | `answer` | `context` | `ground_truth` |
@@ -160,9 +162,11 @@ chat_evaluator = ChatEvaluator(
160162
```
161163

162164
## Custom evaluators
165+
163166
Built-in evaluators are great out of the box to start evaluating your application's generations. However, you might want to build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.
164167

165168
### Code-based evaluators
169+
166170
Sometimes a large language model isn't needed for certain evaluation metrics. This is when code-based evaluators can give you the flexibility to define metrics based on functions or callable classes. Given a simple Python class in an example `answer_length.py` that calculates the length of an answer:
167171
```python
168172
class AnswerLengthEvaluator:
@@ -186,7 +190,41 @@ The result:
186190
```JSON
187191
{"answer_length":27}
188192
```
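
As a further illustration of a code-based evaluator, here is a minimal sketch of a callable class. The class name `WordCountEvaluator` and its `word_count` output key are hypothetical and aren't part of the prompt flow SDK. The sketch only assumes that an evaluator is a callable that takes named inputs and returns a dictionary of metrics.

```python
class WordCountEvaluator:
    """Hypothetical code-based evaluator: counts whitespace-separated words."""

    def __call__(self, *, answer: str, **kwargs):
        # Return the metric as a dictionary keyed by the metric name.
        return {"word_count": len(answer.split())}


word_count_eval = WordCountEvaluator()
word_count = word_count_eval(answer="What is the speed of light?")
print(word_count)
```

Accepting `**kwargs` lets the same callable be used with rows that carry extra columns it doesn't need.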
193+
#### Log your custom code-based evaluator to your AI Studio project
194+
```python
195+
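# Imports used by this snippet.
import inspect
import os

from azure.ai.ml.entities import Model
from promptflow.client import PFClient, load_flow

# `ml_client` is assumed to be an azure.ai.ml MLClient already connected to your AI Studio project.
# First, save the evaluator into a separate file in its own directory: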
196+
def answer_len(answer):
197+
    return len(answer)
198+
199+
# Note: we create a temporary directory to store the Python file
200+
target_dir_tmp = "flex_flow_tmp"
201+
os.makedirs(target_dir_tmp, exist_ok=True)
202+
lines = inspect.getsource(answer_len)
203+
with open(os.path.join(target_dir_tmp, "answer.py"), "w") as fp:
204+
    fp.write(lines)
205+
206+
from flex_flow_tmp.answer import answer_len as answer_length
207+
# Then we convert it to flex flow
208+
pf = PFClient()
209+
flex_flow_path = "flex_flow"
210+
pf.flows.save(entry=answer_length, path=flex_flow_path)
211+
# Finally save the evaluator
212+
eval = Model(
213+
path=flex_flow_path,
214+
name="answer_len_uploaded",
215+
description="Evaluator, calculating answer length using Flex flow.",
216+
)
217+
flex_model = ml_client.evaluators.create_or_update(eval)
218+
# This evaluator can be downloaded and used now
219+
retrieved_eval = ml_client.evaluators.get("answer_len_uploaded", version=1)
220+
ml_client.evaluators.download("answer_len_uploaded", version=1, download_path=".")
221+
evaluator = load_flow(os.path.join("answer_len_uploaded", flex_flow_path))
222+
```
223+
224+
After logging your custom evaluator to your AI Studio project, you can view it in your [Evaluator library](../evaluate-generative-ai-app.md#view-and-manage-the-evaluators-in-the-evaluator-library) under the **Evaluation** tab in AI Studio.
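
The first step of the snippet above (saving the evaluator function to its own file and importing it back) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the SDK's mechanism: the helper name `save_and_load` is hypothetical, the source is written from a string rather than with `inspect.getsource`, and no prompt flow or Azure calls are made.

```python
import importlib.util
import os
import tempfile

# Source of the evaluator function, kept as a string for this sketch.
SOURCE = "def answer_len(answer):\n    return len(answer)\n"


def save_and_load(source, directory, module_name="answer"):
    """Hypothetical helper: write `source` to <directory>/<module_name>.py and import it."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, module_name + ".py")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(source)
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    module = save_and_load(SOURCE, os.path.join(tmp, "flex_flow_tmp"))
    length = module.answer_len("What is the speed of light?")

print(length)
```

Keeping the evaluator in its own directory means the saved flow contains only the files that the evaluator needs.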
225+
189226
### Prompt-based evaluators
227+
190228
To build your own prompt-based large language model evaluator, you can create a custom evaluator based on a **Prompty** file. Prompty is a file with the `.prompty` extension for developing prompt templates. The Prompty asset is a markdown file with a modified front matter. The front matter is in YAML format and contains metadata fields that define the model configuration and expected inputs of the Prompty. Given an example `apology.prompty` file that looks like the following:
191229

192230
```markdown
@@ -248,13 +286,33 @@ apology_score = apology_eval(
248286
print(apology_score)
249287
```
250288

251-
Here is the result:
289+
Here's the result:
252290
```JSON
253291
{"apology": 0}
254292
```
293+
#### Log your custom prompt-based evaluator to your AI Studio project
294+
```python
295+
# Define the path to prompty file.
296+
prompty_path = os.path.join("apology-prompty", "apology.prompty")
297+
# Finally, save the evaluator
298+
eval = Model(
299+
path=prompty_path,
300+
name="prompty_uploaded",
301+
description="Apology evaluator defined in a Prompty file.",
302+
)
303+
flex_model = ml_client.evaluators.create_or_update(eval)
304+
# This evaluator can be downloaded and used now
305+
retrieved_eval = ml_client.evaluators.get("prompty_uploaded", version=1)
306+
ml_client.evaluators.download("prompty_uploaded", version=1, download_path=".")
307+
evaluator = load_flow(os.path.join("prompty_uploaded", "apology.prompty"))
308+
```
309+
310+
After logging your custom evaluator to your AI Studio project, you can view it in your [Evaluator library](../evaluate-generative-ai-app.md#view-and-manage-the-evaluators-in-the-evaluator-library) under the **Evaluation** tab in AI Studio.
255311

256312
## Evaluate on test dataset using `evaluate()`
257-
After you spot-check your built-in or custom evaluators on a single row of data, you can combine multiple evaluators with the `evaluate()` API on an entire test dataset. In order to ensure the `evaluate()` can correctly parse the data, you must specify column mapping to map the column from the dataset to key words that are accepted by the evaluators. In this case, we specify the data mapping for `ground_truth`.
313+
314+
After you spot-check your built-in or custom evaluators on a single row of data, you can combine multiple evaluators with the `evaluate()` API on an entire test dataset. To ensure that `evaluate()` can correctly parse the data, you must specify a column mapping that maps each dataset column to the keyword that the evaluators accept. In this case, we specify the data mapping for `ground_truth`.
315+
258316
```python
259317
from promptflow.evals.evaluate import evaluate
260318

@@ -276,9 +334,11 @@ result = evaluate(
276334
output_path="./myevalresults.json"
277335
)
278336
```
337+
279338
> [!TIP]
280339
> Get the contents of the `result.studio_url` property for a link to view your logged evaluation results in Azure AI Studio.
281340
The evaluator outputs results in a dictionary which contains aggregate `metrics` and row-level data and metrics. An example of an output:
341+
282342
```python
283343
{'metrics': {'answer_length.value': 49.333333333333336,
284344
'relevance.gpt_relevance': 5.0},
@@ -311,9 +371,17 @@ The evaluator outputs results in a dictionary which contains aggregate `metrics`
311371
'outputs.answer_length.value': 66,
312372
'outputs.relevance.gpt_relevance': 5}],
313373
'traces': {}}
374+
314375
```
315-
### Supported data formats for `evaluate()`
316-
The `evaluate()` API only accepts data in the JSONLines format. For all built-in evaluators, except for `ChatEvaluator` or `ContentSafetyChatEvaluator`, `evaluate()` requires data in the following format with required input fields. See the [previous section on required data input for built-in evaluators](#required-data-input-for-built-in-evaluators).
376+
377+
### Requirements for `evaluate()`
378+
379+
The `evaluate()` API has a few requirements for the data format that it accepts and how it handles evaluator parameter key names so that the charts in your AI Studio evaluation results show up properly.
380+
381+
#### Data format
382+
383+
The `evaluate()` API only accepts data in the JSONLines format. For all built-in evaluators, except for `ChatEvaluator` or `ContentSafetyChatEvaluator`, `evaluate()` requires data in the following format with required input fields. See the [previous section on data requirements for built-in evaluators](#data-requirements-for-built-in-evaluators).
384+
317385
```json
318386
{
319387
"question":"What is the capital of France?",
@@ -322,7 +390,9 @@ The `evaluate()` API only accepts data in the JSONLines format. For all built-in
322390
"ground_truth": "Paris"
323391
}
324392
```
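
To make the JSONLines requirement concrete, here is a minimal sketch that writes one JSON object per line and reports rows that lack an expected field. The helper names, the sample `answer` and `context` values, and the choice of required fields are assumptions for illustration. Which fields are actually required depends on the evaluator, as listed in the data requirements table.

```python
import json
import os
import tempfile

# Assumed field set for this sketch; adjust it to the evaluators you use.
REQUIRED_FIELDS = ("question", "answer", "context", "ground_truth")


def write_jsonl(rows, path):
    """Write each row as a single JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as fp:
        for row in rows:
            fp.write(json.dumps(row) + "\n")


def missing_fields(path, required=REQUIRED_FIELDS):
    """Return {line_number: [missing field names]} for incomplete rows."""
    problems = {}
    with open(path, encoding="utf-8") as fp:
        for line_number, line in enumerate(fp, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [name for name in required if name not in record]
            if missing:
                problems[line_number] = missing
    return problems


rows = [
    {"question": "What is the capital of France?", "answer": "Paris.",
     "context": "France is a country in Europe.", "ground_truth": "Paris"},
    {"question": "What is 2 + 2?", "answer": "4", "context": "Arithmetic."},
]

with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.jsonl")
    write_jsonl(rows, data_path)
    problems = missing_fields(data_path)

print(problems)
```

Running a check like this before calling `evaluate()` can surface incomplete rows early.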
393+
325394
For the composite evaluator classes, `ChatEvaluator` and `ContentSafetyChatEvaluator`, we require an array of messages that adheres to OpenAI's messages protocol that can be found [here](https://platform.openai.com/docs/api-reference/messages/object#messages/object-content). The messages protocol contains a role-based list of messages with the following:
395+
326396
- `content`: The content of that turn of the interaction between user and application or assistant.
327397
- `role`: Either the user or application/assistant.
328398
- `"citations"` (within `"context"`): Provides the documents and their IDs as key value pairs from the retrieval-augmented generation model.
@@ -360,7 +430,7 @@ To `evaluate()` with either the `ChatEvaluator` or `ContentSafetyChatEvaluator`,
360430
result = evaluate(
361431
data="data.jsonl",
362432
evaluators={
363-
"chatevaluator": chat_evaluator
433+
"chat": chat_evaluator
364434
},
365435
# column mapping for messages
366436
evaluator_config={
@@ -371,6 +441,40 @@ result = evaluate(
371441
)
372442
```
373443

444+
#### Evaluator parameter format
445+
446+
When passing in your built-in evaluators, it's important to specify the right keyword mapping in the `evaluators` parameter list. The following is the keyword mapping required for the results from your built-in evaluators to show up in the UI when logged to Azure AI Studio.
447+
448+
| Evaluator | keyword param |
449+
|------------------------------|-----------------------|
450+
| `RelevanceEvaluator` | "relevance" |
451+
| `CoherenceEvaluator` | "coherence" |
452+
| `GroundednessEvaluator` | "groundedness" |
453+
| `FluencyEvaluator` | "fluency" |
454+
| `SimilarityEvaluator` | "similarity" |
455+
| `F1ScoreEvaluator` | "f1_score" |
456+
| `ViolenceEvaluator` | "violence" |
457+
| `SexualEvaluator` | "sexual" |
458+
| `SelfHarmEvaluator` | "self_harm" |
459+
| `HateUnfairnessEvaluator` | "hate_unfairness" |
460+
| `QAEvaluator` | "qa" |
461+
| `ChatEvaluator` | "chat" |
462+
| `ContentSafetyEvaluator` | "content_safety" |
463+
| `ContentSafetyChatEvaluator` | "content_safety_chat" |
464+
465+
Here's an example of setting the `evaluators` parameters:
466+
```python
467+
result = evaluate(
468+
data="data.jsonl",
469+
evaluators={
470+
"sexual": sexual_evaluator,
471+
"self_harm": self_harm_evaluator,
472+
"hate_unfairness": hate_unfairness_evaluator,
473+
"violence": violence_evaluator,
474+
}
475+
)
476+
```
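
Because a mismatched key silently affects how results appear in the UI, a small pre-flight check can help. The following is a sketch only: the mapping is copied from the table above, while the helper `check_evaluator_keys` and the empty stand-in classes are hypothetical and exist only so the example runs without the SDK installed.

```python
# Keyword mapping copied from the table above.
EXPECTED_KEYWORDS = {
    "RelevanceEvaluator": "relevance",
    "CoherenceEvaluator": "coherence",
    "GroundednessEvaluator": "groundedness",
    "FluencyEvaluator": "fluency",
    "SimilarityEvaluator": "similarity",
    "F1ScoreEvaluator": "f1_score",
    "ViolenceEvaluator": "violence",
    "SexualEvaluator": "sexual",
    "SelfHarmEvaluator": "self_harm",
    "HateUnfairnessEvaluator": "hate_unfairness",
    "QAEvaluator": "qa",
    "ChatEvaluator": "chat",
    "ContentSafetyEvaluator": "content_safety",
    "ContentSafetyChatEvaluator": "content_safety_chat",
}


def check_evaluator_keys(evaluators):
    """Return {used_key: expected_key} for built-in evaluators under the wrong key."""
    mismatches = {}
    for key, evaluator in evaluators.items():
        expected = EXPECTED_KEYWORDS.get(type(evaluator).__name__)
        # Custom evaluators aren't in the table, so they are skipped.
        if expected is not None and key != expected:
            mismatches[key] = expected
    return mismatches


# Stand-in classes (hypothetical) that only mimic the built-in class names.
class ViolenceEvaluator:
    pass


class ChatEvaluator:
    pass


mismatches = check_evaluator_keys(
    {"violence": ViolenceEvaluator(), "chatevaluator": ChatEvaluator()}
)
print(mismatches)
```

With real evaluator instances, you would pass the same dictionary that you give to the `evaluators` parameter of `evaluate()`.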
477+
374478
## Evaluate on a target
375479

376480
If you have a list of queries that you'd like to run and then evaluate, `evaluate()` also supports a `target` parameter, which can send queries to an application to collect answers and then run your evaluators on the resulting questions and answers.
@@ -399,4 +503,6 @@ result = evaluate(
399503
## Related content
400504

401505
- [Get started building a chat app using the prompt flow SDK](../../quickstarts/get-started-code.md)
402-
- [Work with projects in VS Code](vscode.md)
506+
- [Prompt flow Python reference documentation](https://microsoft.github.io/promptflow/reference/python-library-reference/promptflow-evals/promptflow.evals.evaluators.html)
507+
- [Learn more about the evaluation metrics](../../concepts/evaluation-metrics-built-in.md)
508+
- [View your evaluation results in Azure AI Studio](../../how-to/evaluate-flow-results.md)
