The Azure AI Studio evaluation page is a versatile hub that not only allows you to visualize and assess your results but also serves as a control center for optimizing, troubleshooting, and selecting the ideal AI model for your deployment needs. It's a one-stop solution for data-driven decision-making and performance enhancement in your AI Studio projects. You can seamlessly access and interpret results from various sources, including your flow, the playground quick test session, the evaluation submission UI, and the SDK. This flexibility ensures that you can interact with your results in a way that best suits your workflow and preferences.
Once you've visualized your evaluation results, you can dive into a thorough examination. This includes the ability to not only view individual results but also to compare these results across multiple evaluation runs. By doing so, you can identify trends, patterns, and discrepancies, gaining invaluable insights into the performance of your AI system under various conditions.
## Find your evaluation results
Upon submitting your evaluation, you can locate the submitted evaluation run within the run list by navigating to the **Evaluation** page.
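For example, a run submitted from the `azure-ai-evaluation` SDK appears in this list when it's logged to your project. Here's a minimal sketch, assuming the GA `azure-ai-evaluation` package, a JSON Lines dataset, and a relevance evaluator; the endpoint, deployment, and project values are placeholders you'd replace with your own:

```python
# Minimal sketch: submit an evaluation from the azure-ai-evaluation SDK and log it to
# your project so the run shows up on the Evaluation page. All angle-bracket values
# are placeholders; your data file and evaluator choice will differ.
from azure.ai.evaluation import evaluate, RelevanceEvaluator

model_config = {
    "azure_endpoint": "https://<your-openai-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-gpt-deployment>",
}

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

result = evaluate(
    data="eval_dataset.jsonl",  # JSON Lines file; each line supplies the fields the evaluators expect (for example, query and response)
    evaluators={"relevance": RelevanceEvaluator(model_config=model_config)},
    azure_ai_project=azure_ai_project,  # logging to the project is what makes the run appear in the run list
    evaluation_name="relevance-baseline",
    output_path="eval_results.json",  # also keep a local copy of the per-row results
)

print(result["studio_url"])  # link to this run on the Evaluation page
```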
You can monitor and manage your evaluation runs within the run list. Using the column editor and filters, you can modify the columns and customize your own version of the run list. You can also swiftly review the aggregated evaluation metrics across runs to perform quick comparisons.
:::image type="content" source="../media/evaluations/view-results/evaluation-run-list.png" alt-text="Screenshot of the evaluation run list." lightbox="../media/evaluations/view-results/evaluation-run-list.png":::
> [!TIP]
> To view evaluations run with any version of the promptflow-evals SDK, or with azure-ai-evaluation versions 1.0.0b1, 1.0.0b2, or 1.0.0b3, enable the **Show all runs** toggle to locate the run.
For a deeper understanding of how the evaluation metrics are derived, you can access a comprehensive explanation by selecting the **Learn more about metrics** option. This detailed resource provides valuable insights into the calculation and interpretation of the metrics used in the evaluation process.
:::image type="content" source="../media/evaluations/view-results/learn-more-metrics.png" alt-text="Screenshot of the evaluation metrics details." lightbox="../media/evaluations/view-results/learn-more-metrics.png":::
You can select a specific run to open the run detail page. Here, you can access comprehensive information, including evaluation details such as the test dataset, task type, prompt, temperature, and more. You can also view the metrics associated with each data sample. The metrics scores charts provide a visual representation of how scores are distributed for each metric throughout your dataset.
### Metric dashboard charts
We break down the aggregate views of your metrics by type: AI Quality (AI assisted), Risk and safety, AI Quality (NLP), and Custom, when applicable. You can view the distribution of scores across the evaluated dataset and see aggregate scores for each metric.
- For AI Quality (AI assisted) metrics, we aggregate by calculating an average across all the scores for each metric. If you calculate Groundedness Pro, the output is binary, so the aggregated score is a passing rate, which is calculated by (#trues / #instances) × 100 (see the sketch after this list).
:::image type="content" source="../media/evaluations/view-results/ai-quality-ai-assisted-chart.png" alt-text="Screenshot of AI Quality (AI assisted) metrics dashboard tab." lightbox="../media/evaluations/view-results/ai-quality-ai-assisted-chart.png":::
- For risk and safety metrics, we aggregate by calculating a defect rate for each metric.
  - For content harm metrics, the defect rate is defined as the percentage of instances in your test dataset whose severity surpasses a threshold on the severity scale, out of the whole dataset size. By default, the threshold is "Medium".
  - For protected material and indirect attack, the defect rate is calculated as the percentage of instances where the output is 'true' (Defect Rate = (#trues / #instances) × 100).
:::image type="content" source="../media/evaluations/view-results/risk-and-safety-chart.png" alt-text="Screenshot of risk and safety metrics dashboard tab." lightbox="../media/evaluations/view-results/risk-and-safety-chart.png":::
- For AI Quality (NLP) metrics, we show a histogram of the metric distribution between 0 and 1. We aggregate by calculating an average across all the scores for each metric.
:::image type="content" source="../media/evaluations/view-results/ai-quality-nlp-chart.png" alt-text="Screenshot of AI Quality (NLP) dashboard tab." lightbox="../media/evaluations/view-results/ai-quality-nlp-chart.png":::
- For custom metrics, you can select **Add custom chart** to create a custom chart with your chosen metrics or to view a metric against selected input parameters.
:::image type="content" source="../media/evaluations/view-results/custom-chart-pop-up.png" alt-text="Screenshot of create custom chart pop up." lightbox="../media/evaluations/view-results/custom-chart-pop-up.png":::
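As a worked illustration of the aggregation formulas in the preceding list, here's a plain-Python sketch with made-up scores (not the service's implementation; treating "surpass" as inclusive of the Medium threshold is an assumption):

```python
# Toy data illustrating the aggregation formulas above; not the service's own code.
relevance_scores = [4, 5, 3, 5, 4]                    # per-row AI Quality (AI assisted) scores
groundedness_pro = [True, True, False, True]          # binary Groundedness Pro output per row
severities = ["Very low", "Low", "Medium", "High", "Very low"]  # content harm severity per row
attack_flags = [False, False, True, False]            # protected material / indirect attack per row

# AI Quality (AI assisted): average of the per-row scores.
average_relevance = sum(relevance_scores) / len(relevance_scores)   # 4.2

# Groundedness Pro (binary output): passing rate = (#trues / #instances) × 100.
passing_rate = 100 * sum(groundedness_pro) / len(groundedness_pro)  # 75.0

# Content harm defect rate: share of rows at or above the "Medium" threshold
# (the inclusive reading of "surpass" is an assumption).
scale = ["Very low", "Low", "Medium", "High"]
defects = sum(scale.index(s) >= scale.index("Medium") for s in severities)
content_harm_defect_rate = 100 * defects / len(severities)          # 40.0

# Protected material / indirect attack: defect rate = (#trues / #instances) × 100.
attack_defect_rate = 100 * sum(attack_flags) / len(attack_flags)    # 25.0
```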
You can also customize existing charts for built-in metrics by changing the chart type.
:::image type="content" source="../media/evaluations/view-results/custom-chart-pop-up.png" alt-text="Screenshot of changing chart type." lightbox="../media/evaluations/view-results/custom-chart-pop-up.png":::
### Detailed metrics result table
Within the metrics detail table, you can conduct a comprehensive examination of each individual data sample. Here, you can scrutinize the generated output and its corresponding evaluation metric score. This level of detail enables you to make data-driven decisions and take specific actions to improve your model's performance.
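If you also kept a local copy of the results (for example, via `output_path` in the earlier SDK sketch), you can do the same kind of per-sample triage offline. This sketch assumes the per-row column follows the `outputs.<evaluator>.<metric>` pattern for the `relevance` evaluator used earlier; your column names may differ:

```python
import json

import pandas as pd

# Assumes evaluate() was called with output_path="eval_results.json"; the result file
# contains aggregate "metrics" plus one record per data sample under "rows". The
# "outputs.relevance.relevance" column name is an assumption tied to the evaluator
# name used earlier and may differ for your evaluators.
with open("eval_results.json") as f:
    result = json.load(f)

rows = pd.DataFrame(result["rows"])
low_scoring = rows[rows["outputs.relevance.relevance"] < 3]  # samples worth reviewing first

print(f"{len(low_scoring)} of {len(rows)} samples scored below 3 for relevance")
print(result["metrics"])  # aggregate scores, matching the dashboard view
```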
Some potential action items based on the evaluation metrics could include:
The metrics detail table offers a wealth of data that can guide your model improvement efforts, from recognizing patterns to customizing your view for efficient analysis and refining your model based on identified issues.
Here are some examples of the metrics results for the question answering scenario:
:::image type="content" source="../media/evaluations/view-results/metrics-details-qa.png" alt-text="Screenshot of metrics results for the question answering scenario." lightbox="../media/evaluations/view-results/metrics-details-qa.png":::
For risk and safety metrics, the evaluation provides a severity score and reasoning for that score:
:::image type="content" source="../media/evaluations/view-results/risk-safety-metric-example.png" alt-text="Screenshot of risk and safety metrics results for question answering scenario." lightbox="../media/evaluations/view-results/risk-safety-metric-example.png":::
Evaluation results might have different meanings for different audiences. For example, safety evaluations might generate a "Low" severity label for violent content that doesn't align with a human reviewer's definition of how severe that specific content is. We provide a **human feedback** column with thumbs up and thumbs down when reviewing your evaluation results to surface which instances were approved or flagged as incorrect by a human reviewer.
:::image type="content" source="../media/evaluations/view-results/risk-safety-metric-human-feedback.png" alt-text="Screenshot of risk and safety metrics results with human feedback." lightbox="../media/evaluations/view-results/risk-safety-metric-human-feedback.png":::
To understand each content risk metric, you can view its definition and severity scale by selecting the metric name above the chart, which opens a detailed explanation in a pop-up.
:::image type="content" source="../media/evaluations/view-results/risk-safety-metric-definition-popup.png" alt-text="Screenshot of risk and safety metrics detailed explanation pop-up." lightbox="../media/evaluations/view-results/risk-safety-metric-definition-popup.png":::
If there's something wrong with the run, you can also debug your evaluation run with the logs.
Here are some examples of the logs that you can use to debug your evaluation run:
:::image type="content" source="../media/evaluations/view-results/evaluation-log.png" alt-text="Screenshot of logs that you can use to debug your evaluation run." lightbox="../media/evaluations/view-results/evaluation-log.png":::
If you're evaluating a prompt flow, you can select the **View in flow** button to navigate to the evaluated flow page and update your flow. For example, you might add more meta prompt instructions or change some parameters, and then re-evaluate.
### Manage and share views with view options
On the Evaluation Details page, you can customize your view by adding custom charts or editing columns. Once customized, you can save the view or share it with others using the view options. This lets you review evaluation results in a format tailored to your preferences and facilitates collaboration with colleagues.
:::image type="content" source="../media/evaluations/view-results/view-options-evaluation-details.png" alt-text="Screenshot of the view options button dropdown." lightbox="../media/evaluations/view-results/view-options-evaluation-details.png":::
`articles/ai-studio/how-to/monitor-quality-safety.md`
The parameters that are configured in your data asset dictate what metrics you can produce:
| Groundedness | Required | Required | Required|
| Relevance | Required | Required | Required|
For more information on the specific data mapping requirements for each metric, see [Query and response metric requirements](evaluate-generative-ai-app.md#query-and-response-metric-requirements).
## Set up monitoring for prompt flow
To set up monitoring for your prompt flow application, you first have to deploy your prompt flow application with inferencing data collection, and then you can configure monitoring for the deployed application.