
Commit ae0fe11

Merge pull request #1312 from lgayhardt/release-ignite-ai-studio
AI Studio Eval: Updates to results and wizard UI
2 parents 800041c + e81886f commit ae0fe11

35 files changed (+148, -67 lines)

articles/ai-studio/how-to/evaluate-generative-ai-app.md

Lines changed: 111 additions & 45 deletions
Large diffs are not rendered by default.

articles/ai-studio/how-to/evaluate-results.md

Lines changed: 35 additions & 20 deletions
@@ -8,16 +8,14 @@ ms.custom:
 - ignite-2023
 - build-2024
 ms.topic: how-to
-ms.date: 9/24/2024
+ms.date: 11/19/2024
 ms.reviewer: wenxwei
 ms.author: lagayhar
 author: lgayhardt
 ---

 # How to view evaluation results in Azure AI Studio

-[!INCLUDE [feature-preview](../includes/feature-preview.md)]
-
 The Azure AI Studio evaluation page is a versatile hub that not only allows you to visualize and assess your results but also serves as a control center for optimizing, troubleshooting, and selecting the ideal AI model for your deployment needs. It's a one-stop solution for data-driven decision-making and performance enhancement in your AI Studio projects. You can seamlessly access and interpret the results from various sources, including your flow, the playground quick test session, evaluation submission UI, and SDK. This flexibility ensures that you can interact with your results in a way that best suits your workflow and preferences.

 Once you've visualized your evaluation results, you can dive into a thorough examination. This includes the ability to not only view individual results but also to compare these results across multiple evaluation runs. By doing so, you can identify trends, patterns, and discrepancies, gaining invaluable insights into the performance of your AI system under various conditions.
@@ -32,18 +30,42 @@ In this article you learn to:

 ## Find your evaluation results

-Upon submitting your evaluation, you can locate the submitted evaluation run within the run list by navigating to the **Evaluation** page.
+Upon submitting your evaluation, you can locate the submitted evaluation run within the run list by navigating to the **Evaluation** page.

 You can monitor and manage your evaluation runs within the run list. With the flexibility to modify the columns using the column editor and implement filters, you can customize and create your own version of the run list. Additionally, you can swiftly review the aggregated evaluation metrics across the runs, enabling you to perform quick comparisons.

 :::image type="content" source="../media/evaluations/view-results/evaluation-run-list.png" alt-text="Screenshot of the evaluation run list." lightbox="../media/evaluations/view-results/evaluation-run-list.png":::

-For a deeper understanding of how the evaluation metrics are derived, you can access a comprehensive explanation by selecting the 'Understand more about metrics' option. This detailed resource provides valuable insights into the calculation and interpretation of the metrics used in the evaluation process.
+> [!TIP]
+> To view evaluations run with any version of the promptflow-evals SDK or azure-ai-evaluation versions 1.0.0b1, 1.0.0b2, 1.0.0b3, enable the "Show all runs" toggle to locate the run.
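
To illustrate how an evaluation submitted from the SDK ends up in this run list, here's a minimal sketch. It assumes the azure-ai-evaluation package, an existing AI Studio project, and a local data.jsonl file; the project values, data file, and run name are placeholders, and the exact evaluate() parameters should be confirmed against the SDK reference for your installed version.

```python
# Minimal sketch: submit an evaluation run that is logged to an AI Studio project,
# so it appears in the Evaluation page run list. Assumes `pip install azure-ai-evaluation`.
from azure.ai.evaluation import evaluate, F1ScoreEvaluator

# Placeholder project coordinates -- replace with your own project's values.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<ai-studio-project>",
}

# data.jsonl is assumed to contain one JSON object per line with
# "response" and "ground_truth" fields, which the F1 score evaluator reads.
result = evaluate(
    data="data.jsonl",
    evaluators={"f1_score": F1ScoreEvaluator()},
    azure_ai_project=azure_ai_project,  # logging to a project is what makes the run appear in the portal
    evaluation_name="my-eval-run",      # display name to look for in the run list
)

# The returned dictionary includes aggregate metrics (the same values the run list summarizes).
print(result["metrics"])
```
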
+
+For a deeper understanding of how the evaluation metrics are derived, you can access a comprehensive explanation by selecting the 'Learn more about metrics' option. This detailed resource provides valuable insights into the calculation and interpretation of the metrics used in the evaluation process.

-:::image type="content" source="../media/evaluations/view-results/understand-metrics-details.png" alt-text="Screenshot of the evaluation metrics details." lightbox="../media/evaluations/view-results/understand-metrics-details.png":::
+:::image type="content" source="../media/evaluations/view-results/learn-more-metrics.png" alt-text="Screenshot of the evaluation metrics details." lightbox="../media/evaluations/view-results/learn-more-metrics.png":::

 You can choose a specific run, which will take you to the run detail page. Here, you can access comprehensive information, including evaluation details such as test dataset, task type, prompt, temperature, and more. Furthermore, you can view the metrics associated with each data sample. The metrics scores charts provide a visual representation of how scores are distributed for each metric throughout your dataset.

+### Metric dashboard charts
+
+We break down the aggregate views of your metrics by type: AI Quality (AI assisted), Risk and safety, AI Quality (NLP), and Custom, when applicable. You can view the distribution of scores across the evaluated dataset and see aggregate scores for each metric. (A minimal sketch of these aggregations follows the list.)
+
+- For AI Quality (AI assisted) metrics, we aggregate by calculating an average across all the scores for each metric. If you calculate Groundedness Pro, the output is binary, so the aggregated score is a passing rate, calculated as (#trues / #instances) × 100.
+  :::image type="content" source="../media/evaluations/view-results/ai-quality-ai-assisted-chart.png" alt-text="Screenshot of AI Quality (AI assisted) metrics dashboard tab." lightbox="../media/evaluations/view-results/ai-quality-ai-assisted-chart.png":::
+- For risk and safety metrics, we aggregate by calculating a defect rate for each metric.
+  - For content harm metrics, the defect rate is defined as the percentage of instances in your test dataset whose severity surpasses a threshold on the severity scale. By default, the threshold is “Medium”.
+  - For protected material and indirect attack, the defect rate is calculated as the percentage of instances where the output is 'true' (Defect Rate = (#trues / #instances) × 100).
+  :::image type="content" source="../media/evaluations/view-results/risk-and-safety-chart.png" alt-text="Screenshot of risk and safety metrics dashboard tab." lightbox="../media/evaluations/view-results/risk-and-safety-chart.png":::
+- For AI Quality (NLP) metrics, we show a histogram of the metric distribution between 0 and 1. We aggregate by calculating an average across all the scores for each metric.
+  :::image type="content" source="../media/evaluations/view-results/ai-quality-nlp-chart.png" alt-text="Screenshot of AI Quality (NLP) dashboard tab." lightbox="../media/evaluations/view-results/ai-quality-nlp-chart.png":::
+- For custom metrics, you can select **Add custom chart** to create a custom chart with your chosen metrics or to view a metric against selected input parameters.
+  :::image type="content" source="../media/evaluations/view-results/custom-chart-pop-up.png" alt-text="Screenshot of the create custom chart pop-up." lightbox="../media/evaluations/view-results/custom-chart-pop-up.png":::
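
The aggregations described in the list above are simple to reproduce. The following is a minimal pure-Python sketch of the average, passing-rate, and defect-rate calculations; the per-instance scores, severity labels, and flags are hypothetical, and the four-level severity ordering and the Medium-and-above cutoff are assumptions of the sketch rather than a statement of the product's exact implementation.

```python
# Hypothetical illustration of the aggregate calculations described above.

def average(scores):
    """AI Quality (AI assisted) and AI Quality (NLP): mean of the per-instance scores."""
    return sum(scores) / len(scores)

def passing_rate(binary_outputs):
    """Groundedness Pro: (#trues / #instances) x 100."""
    return 100 * sum(1 for value in binary_outputs if value) / len(binary_outputs)

def defect_rate_from_severity(severity_labels, threshold="Medium"):
    """Content harm metrics: percentage of instances at or above the severity threshold.
    The four-level scale below is an assumption of this sketch."""
    scale = ["Very low", "Low", "Medium", "High"]
    cutoff = scale.index(threshold)
    defects = sum(1 for label in severity_labels if scale.index(label) >= cutoff)
    return 100 * defects / len(severity_labels)

def defect_rate_from_flags(flags):
    """Protected material and indirect attack: (#trues / #instances) x 100."""
    return 100 * sum(1 for value in flags if value) / len(flags)

# Hypothetical per-instance outputs for a four-row test dataset.
print(average([4, 5, 3, 5]))                                              # 4.25
print(passing_rate([True, True, False, True]))                            # 75.0
print(defect_rate_from_severity(["Low", "Medium", "High", "Very low"]))   # 50.0
print(defect_rate_from_flags([False, False, True, False]))                # 25.0
```
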
+
+You can also customize existing charts for built-in metrics by changing the chart type.
+
+:::image type="content" source="../media/evaluations/view-results/custom-chart-pop-up.png" alt-text="Screenshot of changing chart type." lightbox="../media/evaluations/view-results/custom-chart-pop-up.png":::
+
+### Detailed metrics result table
+
 Within the metrics detail table, you can conduct a comprehensive examination of each individual data sample. Here, you can scrutinize the generated output and its corresponding evaluation metric score. This level of detail enables you to make data-driven decisions and take specific actions to improve your model's performance.

 Some potential action items based on the evaluation metrics could include:
@@ -55,15 +77,6 @@ Some potential action items based on the evaluation metrics could include:

 The metrics detail table offers a wealth of data that can guide your model improvement efforts, from recognizing patterns to customizing your view for efficient analysis and refining your model based on identified issues.

-We break down the aggregate views or your metrics by **Performance and quality** and **Risk and safety metrics**. You can view the distribution of scores across the evaluated dataset and see aggregate scores for each metric.
-
-- For performance and quality metrics, we aggregate by calculating an average across all the scores for each metric.
-  :::image type="content" source="../media/evaluations/view-results/evaluation-details-page.png" alt-text="Screenshot of performance and quality metrics dashboard tab." lightbox="../media/evaluations/view-results/evaluation-details-page.png":::
-- For risk and safety metrics, we aggregate by calculating a defect rate for each metric.
-  - For content harm metrics, the defect rate is defined as the percentage of instances in your test dataset that surpass a threshold on the severity scale over the whole dataset size. By default, the threshold is “Medium”.
-  - For protected material and indirect attack, the defect rate is calculated as the percentage of instances where the output is 'true' (Defect Rate = (#trues / #instances) × 100).
-  :::image type="content" source="../media/evaluations/view-results/evaluation-details-safety-metrics.png" alt-text="Screenshot of risk and safety metrics dashboard tab." lightbox="../media/evaluations/view-results/evaluation-details-safety-metrics.png":::
-
 Here are some examples of the metrics results for the question answering scenario:

 :::image type="content" source="../media/evaluations/view-results/metrics-details-qa.png" alt-text="Screenshot of metrics results for the question answering scenario." lightbox="../media/evaluations/view-results/metrics-details-qa.png":::
@@ -82,25 +95,27 @@ For risk and safety metrics, the evaluation provides a severity score and reason

 :::image type="content" source="../media/evaluations/view-results/risk-safety-metric-example.png" alt-text="Screenshot of risk and safety metrics results for question answering scenario." lightbox="../media/evaluations/view-results/risk-safety-metric-example.png":::

-Evaluation results may have different meanings for different audiences. For example, safety evaluations may generate a label for “Low” severity of violent content that may not align to a human reviewer’s definition of how severe that specific violent content might be. We provide a **human feedback** column with thumbs up and thumbs down when reviewing your evaluation results to surface which instances were approved or flagged as incorrect by a human reviewer.
+Evaluation results might have different meanings for different audiences. For example, safety evaluations might generate a “Low” severity label for violent content that doesn't align with a human reviewer’s definition of how severe that specific violent content is. We provide a **human feedback** column with thumbs up and thumbs down so that, when you review your evaluation results, you can see which instances a human reviewer approved or flagged as incorrect.

 :::image type="content" source="../media/evaluations/view-results/risk-safety-metric-human-feedback.png" alt-text="Screenshot of risk and safety metrics results with human feedback." lightbox="../media/evaluations/view-results/risk-safety-metric-human-feedback.png":::

 When understanding each content risk metric, you can easily view each metric definition and severity scale by selecting on the metric name above the chart to see a detailed explanation in a pop-up.

 :::image type="content" source="../media/evaluations/view-results/risk-safety-metric-definition-popup.png" alt-text="Screenshot of risk and safety metrics detailed explanation pop-up." lightbox="../media/evaluations/view-results/risk-safety-metric-definition-popup.png":::

-If there's something wrong with the run, you can also debug your evaluation run with the log and trace.
+If there's something wrong with the run, you can also debug your evaluation run with the logs.

 Here are some examples of the logs that you can use to debug your evaluation run:

 :::image type="content" source="../media/evaluations/view-results/evaluation-log.png" alt-text="Screenshot of logs that you can use to debug your evaluation run." lightbox="../media/evaluations/view-results/evaluation-log.png":::

-And here's an example of the tracing and debugging view:
+If you're evaluating a prompt flow, you can select the **View in flow** button to navigate to the evaluated flow page and make updates to your flow, for example, adding more meta prompt instructions or changing some parameters, and then re-evaluate.
+
+### Manage and share view with view options

-:::image type="content" source="../media/evaluations/view-results/evaluation-trace.png" alt-text="Screenshot of the trace that you can use to debug your evaluation run." lightbox="../media/evaluations/view-results/evaluation-trace.png":::
+On the Evaluation Details page, you can customize your view by adding custom charts or editing columns. Once customized, you can save the view and share it with others by using the view options. This lets you review evaluation results in a format tailored to your preferences and makes it easier to collaborate with colleagues.

-If you're evaluating a prompt flow, you can select the **View in flow** button to navigate to the evaluated flow page to make update to your flow. For example, adding additional meta prompt instruction, or change some parameters and re-evaluate.
+:::image type="content" source="../media/evaluations/view-results/view-options-evaluation-details.png" alt-text="Screenshot of the view options button dropdown." lightbox="../media/evaluations/view-results/view-options-evaluation-details.png":::

 ## Compare the evaluation results

articles/ai-studio/how-to/monitor-quality-safety.md

Lines changed: 2 additions & 2 deletions
@@ -100,9 +100,9 @@ The parameters that are configured in your data asset dictate what metrics you c
 | Groundedness | Required | Required | Required|
 | Relevance | Required | Required | Required|

-For more information on the specific data mapping requirements for each metric, see [Question answering metric requirements](evaluate-generative-ai-app.md#question-answering-metric-requirements).
+For more information on the specific data mapping requirements for each metric, see [Query and response metric requirements](evaluate-generative-ai-app.md#query-and-response-metric-requirements).

-## Set up monitoring for prompt flow
+## Set up monitoring for prompt flow

 To set up monitoring for your prompt flow application, you first have to deploy your prompt flow application with inferencing data collection, then you can configure monitoring for the deployed application.

The remaining changed files are binary images (sizes: 154 KB, 174 KB, 95.7 KB, 28.5 KB, -203 KB); two binary files aren't shown.
