articles/machine-learning/prompt-flow/how-to-bulk-test-evaluate-flow.md
+15 −15 (15 additions & 15 deletions)
@@ -17,9 +17,9 @@ ms.date: 10/28/2024
# Submit a batch run to evaluate a flow

-A batch run executes your prompt flow with a large dataset and generates outputs for each data row. To evaluate how well the prompt flow performs with a large dataset, you can submit a batch run and use evaluation methods to generate performance scores and metrics.
+A batch run executes a prompt flow with a large dataset and generates outputs for each data row. To evaluate how well your prompt flow performs with a large dataset, you can submit a batch run and use evaluation methods to generate performance scores and metrics.

-After the batch flow completes, the evaluation methods automatically execute to calculate the scores and metrics. You can use the evaluation metrics to compare the output of your flow with your performance criteria and goals.
+After the batch flow completes, the evaluation methods automatically execute to calculate the scores and metrics. You can use the evaluation metrics to assess the output of your flow against your performance criteria and goals.

This article describes how to submit a batch run and use an evaluation method to measure the quality of your flow output. You learn how to view the evaluation result and metrics, and how to start a new round of evaluation with a different method or subset of variants.
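If you prefer working in code to the studio UI, the open-source promptflow Python SDK exposes the same batch run and evaluation concepts. The following is a minimal sketch, not the studio workflow itself; the flow folders, dataset file, and column names are placeholders, and the `PFClient` import path can differ between SDK versions.

```python
# Minimal sketch: submit a batch run and an evaluation run with the promptflow SDK.
# All paths and column names below are placeholders for your own flow and data.
from promptflow import PFClient  # newer releases may use: from promptflow.client import PFClient

pf = PFClient()

# Execute the flow once per row of a JSON Lines dataset.
base_run = pf.run(
    flow="./web-classification",            # folder containing flow.dag.yaml
    data="./data/urls.jsonl",                # one JSON object per line
    column_mapping={"url": "${data.url}"},   # map dataset columns to flow inputs
)

# Run an evaluation flow against the batch run outputs to calculate metrics.
eval_run = pf.run(
    flow="./eval-classification-accuracy",
    data="./data/urls.jsonl",
    run=base_run,                            # link the evaluation to the batch run
    column_mapping={
        "groundtruth": "${data.answer}",          # label column from the dataset
        "prediction": "${run.outputs.category}",  # output produced by the batch run
    },
)

print(pf.get_metrics(eval_run))
```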
@@ -58,7 +58,7 @@ To submit a batch run, you select the dataset to test your flow with. You can al
:::image type="content" source="./media/how-to-bulk-test-evaluate-flow/batch-run-evaluation-selection.png" alt-text="Screenshot of evaluation settings where you can select built-in evaluation method." lightbox = "./media/how-to-bulk-test-evaluate-flow/batch-run-evaluation-selection.png":::

-1. On the **Configure evaluation** screen, specify the sources of required inputs for the evaluation. For example, the ground truth column might come from a dataset. By default, evaluation uses the same dataset as the overall batch run. However, if the corresponding labels or target ground truth values are in a different dataset, you can use that one.
+1. Next, on the **Configure evaluation** screen, specify the sources of required inputs for the evaluation. For example, the ground truth column might come from a dataset. By default, evaluation uses the same dataset as the overall batch run. However, if the corresponding labels or target ground truth values are in a different dataset, you can use that one.

> [!NOTE]
> If your evaluation method doesn't require data from a dataset, dataset selection is an optional configuration that doesn't affect evaluation results. You don't need to select a dataset, or reference any dataset columns in the input mapping section.
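In SDK terms, this input mapping corresponds to the `column_mapping` of the evaluation run: `${data.<column>}` reads from the dataset you select, and `${run.outputs.<name>}` reads from the batch run output. A hedged sketch that continues the earlier example and assumes the ground truth lives in a separate `labels.jsonl` file:

```python
# Sketch: evaluation inputs mapped from a separate ground-truth dataset.
# File names and column names are placeholders.
eval_run = pf.run(
    flow="./eval-classification-accuracy",
    data="./data/labels.jsonl",                   # different dataset that holds the labels
    run=base_run,                                 # reuse the outputs of the completed batch run
    column_mapping={
        "groundtruth": "${data.answer}",          # column from labels.jsonl
        "prediction": "${run.outputs.category}",  # column from the batch run output
    },
)
```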
@@ -70,7 +70,7 @@ To submit a batch run, you select the dataset to test your flow with. You can al
-1. Some evaluation methods require Large Language Models (LLMs) like GPT-4 or GPT-3 or need other connections to consume credentials or keys. For those methods, you must enter the connection data in the **Connection** section at the bottom of this screen to be able to use the evaluation flow. For more information, see [Set up a connection](get-started-prompt-flow.md#set-up-a-connection).
+1. Some evaluation methods require Large Language Models (LLMs) like GPT-4 or GPT-3, or need other connections to consume credentials or keys. For those methods, you must enter the connection data in the **Connection** section at the bottom of this screen to be able to use the evaluation flow. For more information, see [Set up a connection](get-started-prompt-flow.md#set-up-a-connection).

:::image type="content" source="./media/how-to-bulk-test-evaluate-flow/batch-run-evaluation-connection.png" alt-text="Screenshot of connection where you can configure the connection for evaluation method." lightbox = "./media/how-to-bulk-test-evaluate-flow/batch-run-evaluation-connection.png":::
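For code-first setups, a connection can also be registered with the promptflow SDK before the evaluation is submitted. This is only a sketch under assumptions: the entity and operation names come from the open-source SDK and can vary by version, and all values are placeholders.

```python
# Sketch (assumed API surface): register an Azure OpenAI connection for an
# LLM-based evaluation flow. Replace the placeholder values with your own.
from promptflow import PFClient
from promptflow.entities import AzureOpenAIConnection

pf = PFClient()
connection = AzureOpenAIConnection(
    name="my_azure_open_ai_connection",
    api_key="<api-key>",                                   # placeholder
    api_base="https://<your-resource>.openai.azure.com/",  # placeholder
)
pf.connections.create_or_update(connection)
```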
@@ -88,7 +88,7 @@ You can find the list of submitted batch runs on the **Runs** tab in the Azure M
:::image type="content" source="./media/how-to-bulk-test-evaluate-flow/batch-run-list.png" alt-text="Screenshot of prompt flow run list page where you find batch runs." lightbox = "./media/how-to-bulk-test-evaluate-flow/batch-run-list.png":::

-On the **Visualize outputs** screen, the **Runs & metrics** section shows overall results for the batch run and the evaluation run. The **Outputs** section shows the run inputs and outputs line by line in a results table that also includes line ID, **Run**, **Status**, and **System metrics**.
+On the **Visualize outputs** screen, the **Runs & metrics** section shows overall results for the batch run and the evaluation run. The **Outputs** section shows the run inputs line by line in a results table that also includes line ID, **Run**, **Status**, and **System metrics**.

:::image type="content" source="./media/how-to-bulk-test-evaluate-flow/batch-run-output.png" alt-text="Screenshot of batch run result page on the outputs tab where you check batch run outputs." lightbox = "./media/how-to-bulk-test-evaluate-flow/batch-run-output.png":::
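The same line-by-line outputs and aggregate metrics can also be pulled into code. A short sketch, reusing the `base_run` and `eval_run` objects from the earlier example:

```python
# Inspect a finished batch run and its evaluation run from code.
details = pf.get_details(base_run)   # pandas DataFrame with line-by-line inputs and outputs
print(details.head())

print(pf.get_metrics(eval_run))      # aggregate metrics calculated by the evaluation run

# Generate an HTML visualization comparable to the Visualize outputs screen.
pf.visualize([base_run, eval_run])
```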
@@ -100,13 +100,13 @@ You can find the list of submitted batch runs on the **Runs** tab in the Azure M
:::image type="content" source="./media/how-to-bulk-test-evaluate-flow/batch-run-output-new-evaluation.png" alt-text="Screenshot of the Trace view with expanded steps and details." lightbox = "./media/how-to-bulk-test-evaluate-flow/batch-run-output-new-evaluation.png":::

-You can also view evaluation run results from the prompt flow page you tested. Under **View batch runs**, select **View batch runs** to see the list of batch runs for the flow, or select **View latest batch run outputs** to see the outputs for the latest run.
+You can also view evaluation run results from the prompt flow you tested. Under **View batch runs**, select **View batch runs** to see the list of batch runs for the flow, or select **View latest batch run outputs** to see the outputs for the latest run.

:::image type="content" source="./media/how-to-bulk-test-evaluate-flow/batch-run-history.png" alt-text="Screenshot of Web Classification with the view bulk runs button selected." lightbox = "./media/how-to-bulk-test-evaluate-flow/batch-run-history.png":::

In the batch run list, select a batch run name to open the flow page for that run.

-On the flow page for an evaluation run, select **View outputs** to see details for the flow. You can also **Clone** the flow to create a new flow, or **Deploy** it as an online endpoint.
+On the flow page for an evaluation run, select **View outputs** or **Details** to see details for the flow. You can also **Clone** the flow to create a new flow, or **Deploy** it as an online endpoint.

:::image type="content" source="./media/how-to-bulk-test-evaluate-flow/batch-run-history-list.png" alt-text="Screenshot of batch run runs showing the history." lightbox = "./media/how-to-bulk-test-evaluate-flow/batch-run-history-list.png":::
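If you know a run's name, you can also pick it up from code. A hedged sketch, assuming the open-source SDK's run operations; the run name is a placeholder:

```python
# Sketch (assumed operations): fetch a batch run by name and follow its logs.
run = pf.runs.get("<batch-run-name>")   # placeholder run name
pf.stream(run)                          # stream the run's logs to the console
print(pf.get_details(run).head())       # line-by-line results for that run
```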
@@ -124,7 +124,7 @@ On the **Details** screen:
:::image type="content" source="./media/how-to-bulk-test-evaluate-flow/batch-run-snapshot.png" alt-text="Screenshot of batch run snapshot." lightbox = "./media/how-to-bulk-test-evaluate-flow/batch-run-snapshot.png":::

-### Start a new evaluation round on the same run
+### Start a new evaluation round for the same run

You can run a new evaluation round to calculate metrics for a completed batch run without running the flow again. This process saves the cost of rerunning your flow and is helpful in the following scenarios:
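In code, this corresponds to submitting another evaluation run that points at the existing batch run: only the evaluation flow executes, and the original flow isn't rerun. A sketch that continues the earlier example; the evaluation flow path and mapping are placeholders:

```python
# Sketch: a new evaluation round over the same completed batch run.
second_eval = pf.run(
    flow="./another-evaluation-flow",             # a different evaluation method (placeholder)
    data="./data/urls.jsonl",
    run=base_run,                                 # reuse the existing batch run; the flow isn't rerun
    column_mapping={
        "groundtruth": "${data.answer}",
        "prediction": "${run.outputs.category}",
    },
)
print(pf.get_metrics(second_eval))
```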
@@ -140,15 +140,15 @@ The new run appears in the prompt flow **Run** list, and you can select more tha
If you modify your flow to improve its performance, you can submit multiple batch runs to compare the performance of the different flow versions. You can also compare the metrics calculated by different evaluation methods to see which method is more suitable for your flow.

-To check your flow batch run history, select **View batch runs** at the top of your flow page. You can select each run to check the detail. You can also select multiple runs and select **Visualize outputs** to compare the metrics and the outputs of those runs.
+To check your flow batch run history, select **View batch runs** at the top of your flow page. You can select each run to check the details. You can also select multiple runs and select **Visualize outputs** to compare the metrics and the outputs of those runs.

:::image type="content" source="./media/how-to-bulk-test-evaluate-flow/batch-run-compare.png" alt-text="Screenshot of metrics compare of multiple batch runs." lightbox = "./media/how-to-bulk-test-evaluate-flow/batch-run-compare.png":::
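A code-first equivalent for reviewing run history and comparing several runs side by side might look like the following sketch; the attribute names on the returned run objects are assumptions:

```python
# Sketch: list batch run history and compare several runs in one visualization.
for r in pf.runs.list():
    print(r.name, r.status)           # attribute names assumed from the open-source SDK

# Compare the metrics and outputs of multiple runs, similar to Visualize outputs.
pf.visualize([base_run, eval_run, second_eval])
```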
## Understand built-in evaluation metrics

Azure Machine Learning prompt flow provides several built-in evaluation methods to help you measure the performance of your flow output. Each evaluation method calculates different metrics. The following table describes the available built-in evaluation methods.

| Classification Accuracy Evaluation | Accuracy | Measures the performance of a classification system by comparing its outputs to ground truth | No | prediction, ground truth | In the range [0, 1] |
| QnA Groundedness Evaluation | Groundedness | Measures how grounded the model's predicted answers are in the input source. Even if the LLM responses are accurate, they're ungrounded if they're not verifiable against source. | Yes | question, answer, context (no ground truth) | 1 to 5, with 1 = worst and 5 = best |
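To make the metric definitions concrete, the following sketch shows the general pattern an accuracy-style evaluation flow follows: a per-line grading node plus an aggregation node that logs the run-level metric. It mirrors, but is not, the built-in Classification Accuracy Evaluation flow.

```python
# Sketch of an accuracy-style evaluation flow: grade each line, then aggregate.
from typing import List
from promptflow import tool, log_metric

@tool
def grade(groundtruth: str, prediction: str) -> str:
    # Per-line score: "Correct" when the prediction matches the ground truth.
    return "Correct" if groundtruth.strip().lower() == prediction.strip().lower() else "Incorrect"

@tool
def aggregate(grades: List[str]) -> float:
    # Aggregation node: runs once over all lines and logs the run-level metric.
    accuracy = round(grades.count("Correct") / len(grades), 2) if grades else 0.0
    log_metric("accuracy", accuracy)
    return accuracy
```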
@@ -165,22 +165,22 @@ If your run fails, check the output and log data and debug any flow failure. To
### Prompt engineering

-Prompt construction can be difficult. To learn about prompt construction concepts, see [Introduction to prompt engineering](/azure/cognitive-services/openai/concepts/prompt-engineering). To learn how to construct a prompt that can help achieve your goals, see [Prompt engineering techniques](/azure/cognitive-services/openai/concepts/advanced-prompt-engineering).
+Prompt construction can be difficult. To learn about prompt construction concepts, see [Overview of prompts](/ai-builder/prompts-overview). To learn how to construct a prompt that can help achieve your goals, see [Prompt engineering techniques](/azure/cognitive-services/openai/concepts/prompt-engineering).

### System message

-You can use the system message, sometimes referred to as a metaprompt or [system prompt](/azure/cognitive-services/openai/concepts/advanced-prompt-engineering#meta-prompts), to guide an AI system's behavior and improve system performance. To learn how to improve your flow performance with system messages, see [System message framework and template recommendations for Large Language Models (LLMs)](/azure/cognitive-services/openai/concepts/system-message).
+You can use the system message, sometimes referred to as a metaprompt or [system prompt](/azure/cognitive-services/openai/concepts/advanced-prompt-engineering), to guide an AI system's behavior and improve system performance. To learn how to improve your flow performance with system messages, see [System messages step-by-step authoring](/azure/cognitive-services/openai/concepts/system-message#step-by-step-authoring-best-practices).

### Golden datasets

Creating a copilot that uses LLMs typically involves grounding the model in reality by using source datasets. A *golden dataset* helps ensure that the LLMs provide the most accurate and useful responses to customer queries.

A golden dataset is a collection of realistic customer questions and expertly crafted answers that serve as a quality assurance tool for the LLMs your copilot uses. Golden datasets aren't used to train an LLM or inject context into an LLM prompt, but to assess the quality of the answers the LLM generates.

-If your scenario involves a copilot, or you're building your own copilot, see [Producing Golden Datasets: Guidance for creating Golden Datasets used for Copilot quality assurance](https://aka.ms/copilot-golden-dataset-guide) for detailed guidance and best practices.
+If your scenario involves a copilot, or you're building your own copilot, see [Producing Golden Datasets](https://aka.ms/copilot-golden-dataset-guide) for detailed guidance and best practices.

## Related content

-- [Develop a customized evaluation flow](how-to-develop-an-evaluation-flow.md#use-a-customized-evaluation-flow)
+- [Develop a customized evaluation flow](how-to-develop-an-evaluation-flow.md#develop-an-evaluation-flow)
- [Tune prompts using variants](how-to-tune-prompts-using-variants.md)
-- [Deploy a flow](how-to-deploy-for-real-time-inference.md)
+- [Deploy a flow as a managed online endpoint for real-time inference](how-to-deploy-for-real-time-inference.md)