
Commit d476dee

committed: updates
1 parent 47d31c0 commit d476dee

2 files changed (+22, -22 lines)

articles/machine-learning/prompt-flow/how-to-develop-an-evaluation-flow.md

Lines changed: 22 additions & 22 deletions
@@ -1,7 +1,7 @@
 ---
-title: Customize evaluation flow and metrics in prompt flow
+title: Evaluation flow and metrics in prompt flow
 titleSuffix: Azure Machine Learning
-description: Learn how to customize or create your own evaluation flow and evaluation metrics tailored to your tasks and objectives, and then use in a batch run as an evaluation method in prompt flow with Azure Machine Learning studio.
+description: Use Azure Machine Learning studio to create or customize evaluation flows and metrics tailored to your tasks and objectives, and use a batch run as a prompt flow evaluation method.
 services: machine-learning
 ms.service: azure-machine-learning
 ms.subservice: prompt-flow
@@ -14,17 +14,17 @@ ms.reviewer: ziqiwang
 ms.date: 10/23/2024
 ---
 
-# Customize evaluation flow and metrics
+# Evaluation flows and metrics
 
-Evaluation flows are a special type of prompt flows that calculate metrics to assess how well the outputs of a run meet specific criteria and goals. You can create or customize evaluation flows and metrics tailored to your tasks and objectives, and use them to evaluate other flows. This article explains evaluation flows, how to develop and customize them, and how to use them in batch runs.
+Evaluation flows are a special type of prompt flows that calculate metrics to assess how well the outputs of a run meet specific criteria and goals. You can create or customize evaluation flows and metrics tailored to your tasks and objectives, and use them to evaluate other prompt flows. This article explains evaluation flows, how to develop and customize them, and how to use them in prompt flow batch runs to evaluate performance.
 
 ## Understand evaluation flows
 
-A prompt flow is a sequence of nodes that process input and generate output. Evaluation flows take required inputs and produce corresponding outputs that are usually scores or metrics. Evaluation flows differ from standard flows in their authoring experience and usage.
+A prompt flow is a sequence of nodes that process input and generate output. Evaluation flows consume required inputs and produce corresponding outputs that are usually scores or metrics. Evaluation flows differ from standard flows in their authoring experience and usage.
 
-Evaluation flows usually run after the run they're testing, by receiving its outputs and using the outputs to calculate the scores and metrics. Evaluation flows log metrics by using the `log_metric()` function.
+Evaluation flows usually run after the run they're testing by receiving its outputs and using the outputs to calculate scores and metrics. Evaluation flows log metrics by using the promptflow SDK `log_metric()` function.
 
-The outputs of the evaluation flow are results that measure the performance of the flow being tested. Evaluation flows may have an aggregation node that calculates the overall performance of the flow being tested over the test dataset.
+The outputs of the evaluation flow are results that measure the performance of the flow being tested. Evaluation flows can have an aggregation node that calculates the overall performance of the flow being tested over the test dataset.
 
 The next sections describe how inputs and outputs are defined in evaluation flows.
 
@@ -34,11 +34,11 @@ Evaluation flows calculate metrics or scores for batch runs by taking in the out
 
 You might need other inputs as ground truth. For example, if you want to calculate the accuracy of a classification flow, you need to provide the `category` column of the dataset as ground truth. If you want to calculate the accuracy of a QnA flow, you need to provide the `answer` column of the dataset as the ground truth. You might need some other inputs to calculate metrics, such as `question` and `context` in QnA or retrieval augmented generation (RAG) scenarios.
 
-You define the inputs of the evaluation flow in the same way as defining the inputs of a standard flow. By default, evaluation uses the same dataset as the run being tested. However, if the corresponding labels or target ground truth values are in a different dataset, you can easily switch to that dataset.
+You define the inputs of the evaluation flow in the same way as you define the inputs of a standard flow. By default, evaluation uses the same dataset as the run being tested. However, if the corresponding labels or target ground truth values are in a different dataset, you can easily switch to that dataset.
 
 #### Input descriptions
 
-To describe the needed inputs for calculating metrics, you can add descriptions. The descriptions appear when you map the sources in batch run submissions.
+To describe the inputs needed for calculating metrics, you can add descriptions. The descriptions appear when you map the input sources in batch run submissions.
 
 To add descriptions for each input, select **Show description** in the input section when developing your evaluation method, and then enter the descriptions.

@@ -54,19 +54,19 @@ The outputs of an evaluation are results that show the performance of the flow b
 
 #### Output scores
 
-Prompt flow processes one row of data at a time and generates an output record. Evaluation flows likewise can calculate scores for each row of data, so you can check how the flow performs on each individual data point.
+Prompt flows process one row of data at a time and generate an output record. Evaluation flows likewise can calculate scores for each row of data, so you can check how a flow performs on each individual data point.
 
-You can record the scores for each data instance as evaluation flow outputs by setting them in the output section of the evaluation flow. The authoring experience is the same as defining a standard flow output.
+You can record the scores for each data instance as evaluation flow outputs by specifying them in the output section of the evaluation flow. The authoring experience is the same as defining a standard flow output.
 
 :::image type="content" source="./media/how-to-develop-an-evaluation-flow/eval-output.png" alt-text="Screenshot of the outputs section showing a name and value. " lightbox = "./media/how-to-develop-an-evaluation-flow/eval-output.png":::
 
-You can view the individual scores in the **Outputs** tab when you select **View outputs**, the same as when you check the outputs of a standard flow batch run. You can append the instance-level scores to the output of the flow being tested.
+You can view the individual scores in the **Outputs** tab when you select **View outputs**, the same as when you check the outputs of a standard flow batch run. You can append these instance-level scores to the output of the flow being tested.
 
 #### Aggregation and metrics logging
 
-The evaluation flow also provides an overall assessment for the run. To distinguish from the individual output scores, values for evaluating overall run performance are called *metrics*.
+The evaluation flow also provides an overall assessment for the run. To distinguish them from individual output scores, values for evaluating overall run performance are called *metrics*.
 
-To calculate an overall assessment value based on individual scores, you can check the **Aggregation** on a Python node in an evaluation flow to turn it into a *reduce* node. The node then takes in the inputs as a list and processes them as a batch.
+To calculate an overall assessment value based on individual scores, select the **Aggregation** checkbox on a Python node in an evaluation flow to turn it into a *reduce* node. The node then takes in the inputs as a list and processes them as a batch.
 
 :::image type="content" source="./media/how-to-develop-an-evaluation-flow/set-as-aggregation.png" alt-text="Screenshot of the Python node heading pointing to an unchecked checked box. " lightbox = "./media/how-to-develop-an-evaluation-flow/set-as-aggregation.png":::
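The body of the aggregation node isn't fully visible in this diff; only its closing lines appear in the next hunk. A minimal sketch consistent with the visible lines might look like the following. It assumes the `tool` decorator and `log_metric()` function exposed by the `promptflow` package that the article references, and the exact function body in the article may differ.

```python
from typing import List

from promptflow import log_metric, tool


@tool
def calculate_accuracy(grades: List[str]):
    # With Aggregation selected, the node receives the per-line grades as a single list.
    accuracy = round(grades.count("Correct") / len(grades), 2) if grades else 0.0
    # Log the run-level metric so it appears on the Metrics tab after a batch run.
    log_metric("accuracy", accuracy)
    return accuracy
```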

@@ -87,7 +87,7 @@ def calculate_accuracy(grades: List[str]): # Receive a list of grades from a pre
     return accuracy
 ```
 
-Because you call this function in the Python node, you don't need to assign it elsewhere, and you can view the metrics later. After you use this evaluation method in a batch run, you can view the metrics showing overall performance by selecting the **Metrics** tab when you **View outputs**.
+Because you call this function in the Python node, you don't need to assign it elsewhere, and you can view the metrics later. After you use this evaluation method in a batch run, you can view the metric showing overall performance by selecting the **Metrics** tab when you **View outputs**.
 
 :::image type="content" source="./media/how-to-develop-an-evaluation-flow/evaluation-metrics-bulk.png" alt-text="Screenshot of the metrics tab that shows the metrics logged by log metrics. " lightbox = "./media/how-to-develop-an-evaluation-flow/evaluation-metrics-bulk.png":::

@@ -99,24 +99,24 @@ To develop your own evaluation flow, select **Create** on the Azure Machine Lear
 
 - Select **Evaluation flow** in the **Explore gallery**, and select from one of the available built-in flows. Select **View details** to get a summary of each flow, and select **Clone** to open and customize the flow. The flow creation wizard helps you modify the flow for your own scenario.
 
-:::image type="content" source="./media/how-to-develop-an-evaluation-flow/create-by-type.png" alt-text="Screenshot of create a new evaluation flow from scratch. " lightbox = "./media/how-to-develop-an-evaluation-flow/create-by-type.png":::
+:::image type="content" source="./media/how-to-develop-an-evaluation-flow/create-by-type.png" alt-text="Screenshot of different ways to create a new evaluation flow. " lightbox = "./media/how-to-develop-an-evaluation-flow/create-by-type.png":::
 
 ### Calculate scores for each data point
 
-Evaluation flows calculate scores and metrics for a flow run on a dataset. The first step in evaluation flows is calculating scores for each individual output.
+Evaluation flows calculate scores and metrics for flows run on datasets. The first step in evaluation flows is calculating scores for each individual output.
 
 For example, in the built-in Classification Accuracy Evaluation flow, the `grade` that measures the accuracy of each flow-generated output to its corresponding ground truth is calculated in the **grade** Python node.
 
-If you create an evaluation flow from template, you calculate this score in the **line_process** Python node. You can also replace the **line_process** python node with an LLM node to use an LLM to calculate the score, or use multiple nodes to perform the calculation.
+If you use the evaluation flow template, you calculate this score in the **line_process** Python node. You can also replace the **line_process** python node with an LLM node to use an LLM to calculate the score, or use multiple nodes to perform the calculation.
 
 :::image type="content" source="./media/how-to-develop-an-evaluation-flow/line-process.png" alt-text="Screenshot of line process node in the template. " lightbox = "./media/how-to-develop-an-evaluation-flow/line-process.png":::
-You need to specify the output of the node as the outputs of the evaluation flow, which indicates that the outputs are the scores calculated for each data sample. You can also output reasoning for more information, and it's the same experience in defining outputs in standard flow.
+You specify the outputs of this node as the outputs of the evaluation flow, which indicates that the outputs are the scores calculated for each data sample. You can also output reasoning for more information, and it's the same experience as defining outputs in standard flow.
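For illustration only, a per-line scoring node of this kind might look like the following sketch. The input names `groundtruth` and `prediction` are placeholders for whichever dataset column and flow output you map in the batch run, and the built-in **grade** and **line_process** nodes may use different logic.

```python
from promptflow import tool


@tool
def line_process(groundtruth: str, prediction: str) -> str:
    # Score a single row: compare the flow output against its ground truth value.
    # Normalization keeps trivial casing or whitespace differences from failing the match.
    return "Correct" if prediction.strip().lower() == groundtruth.strip().lower() else "Incorrect"
```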

 ### Calculate and log metrics
 
 The next step in evaluation is to calculate overall metrics to assess the run. You calculate metrics in a Python node that has the **Aggregation** option selected. This node takes in the scores from the previous calculation node and organizes them into a list, then calculates overall values.
 
-If you use the evaluation template, this score is calculated in the **aggregate** node. The following code snippet is the template of the aggregation node.
+If you use the evaluation template, this score is calculated in the **aggregate** node. The following code snippet shows the template for the aggregation node.
 
 ```python

@@ -139,9 +139,9 @@ def aggregate(processed_results: List[str]):
     return aggregated_results
 
 ```
-You can use your own aggregation logic, such as calculating score mean, median, or standard deviation.
+You can use your own aggregation logic, such as calculating score average, variance, or standard deviation.
 
-Log the metrics by using the `promptflow.logmetric()` function. You can log multiple metrics in a single evaluation flow. Metrics must be numerical (`float`/`int`).
+Log the metrics by using the `promptflow.log_metric()` function. You can log multiple metrics in a single evaluation flow. Metrics must be numerical (`float`/`int`).
 
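As a rough sketch of custom aggregation logic, the example below computes the average and standard deviation of numeric per-line scores and logs each one as a separate metric. The node and metric names are hypothetical, and it assumes the per-line node emits numeric scores rather than the string grades used in the accuracy example.

```python
from statistics import mean, pstdev
from typing import List

from promptflow import log_metric, tool


@tool
def aggregate_scores(scores: List[float]):
    # Aggregate the per-line numeric scores into run-level metrics.
    average = round(mean(scores), 2) if scores else 0.0
    stdev = round(pstdev(scores), 2) if scores else 0.0
    # Each metric must be numeric (float/int); log them individually.
    log_metric("average_score", average)
    log_metric("score_stdev", stdev)
    return {"average_score": average, "score_stdev": stdev}
```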

 ## Use a customized evaluation flow

