You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/ai-foundry/how-to/evaluate-results.md
+7-9Lines changed: 7 additions & 9 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -29,7 +29,7 @@ In this article, you learn how to:
29
29
30
30
## Find your evaluation results
31
31
32
-
After you submit your evaluation, you can locate the submitted evaluation run within the run list. Navigate to the **Evaluation** page.
32
+
After you submit your evaluation, you can locate the submitted evaluation run within the run list. Go to the **Evaluation** page.
33
33
34
34
You can monitor and manage your evaluation runs within the run list. You have the flexibility to modify the columns by using the column editor and implement filters, and you can customize and create your own version of the run list. Additionally, you can swiftly review the aggregated evaluation metrics across the runs and perform quick comparisons.
35
35
@@ -104,26 +104,24 @@ Evaluation results might have different meanings for different audiences. For ex
104
104
105
105
:::image type="content" source="../media/evaluations/view-results/risk-safety-metric-human-feedback.png" alt-text="Screenshot that shows risk and safety metrics results with human feedback." lightbox="../media/evaluations/view-results/risk-safety-metric-human-feedback.png":::
106
106
107
-
To understand each content risk metric, you can view metric definitions by navigating back to the **Report** section, or you can review the test in the **Metric dashboard** section.
107
+
To understand each content risk metric, you can view metric definitions by going back to the **Report** section, or you can review the test in the **Metric dashboard** section.
108
108
109
-
If there's something wrong with the run, you can also use the logs to debug your evaluation run.
110
-
111
-
Here are some examples of logs that you can use to debug your evaluation run:
109
+
If there's something wrong with the run, you can also use the logs to debug your evaluation run. Here are some examples of logs that you can use to debug your evaluation run:
112
110
113
111
:::image type="content" source="../media/evaluations/view-results/evaluation-log.png" alt-text="Screenshot that shows logs that you can use to debug your evaluation run." lightbox="../media/evaluations/view-results/evaluation-log.png":::
114
112
115
113
If you're evaluating a prompt flow, you can select the **View in flow** button to go to the evaluated flow page and update your flow. For example, you can add extra meta prompt instructions, or change some parameters and reevaluate.
116
114
117
115
## Compare the evaluation results
118
116
119
-
To facilitate a comprehensive comparison between two or more runs, you can select the desired runs and initiate the process. Select either the **Compare** button or, for a general detailed dashboard view, the **Switch to dashboard view** button. You're empowered to analyze and contrast the performance and outcomes of multiple runs, allowing for more informed decisionmaking and targeted improvements.
117
+
To facilitate a comprehensive comparison between two or more runs, you can select the desired runs and initiate the process. Select either the **Compare** button or (for a general detailed dashboard view) the **Switch to dashboard view** button. You're empowered to analyze and contrast the performance and outcomes of multiple runs, allowing for more informed decision-making and targeted improvements.
120
118
121
119
:::image type="content" source="../media/evaluations/view-results/evaluation-list-compare.png" alt-text="Screenshot that shows the option to compare evaluations." lightbox="../media/evaluations/view-results/evaluation-list-compare.png":::
122
120
123
121
In the dashboard view, you have access to two valuable components: the metric distribution comparison **Chart** and the comparison **Table**. You can use these tools to perform a side-by-side analysis of the selected evaluation runs. You can compare various aspects of each data sample with ease and precision.
124
122
125
123
> [!NOTE]
126
-
> Older evaluation runs will, by default, have matching rows between columns. However, newly run evaluations have to be intentionally configured to have matching columns during evaluation creation. Ensure the same name is used as the **Criteria Name** across all evaluations you want to compare.
124
+
> By default, older evaluation runs have matching rows between columns. However, newly run evaluations have to be intentionally configured to have matching columns during evaluation creation. Ensure that the same name is used as the **Criteria Name**value across all evaluations that you want to compare.
127
125
128
126
The following screenshot shows the experience when the fields are the same:
129
127
@@ -133,7 +131,7 @@ When a user doesn't use the same **Criteria Name** in creating the evaluation, f
133
131
134
132
:::image type="content" source="../media/evaluations/view-results/evaluation-criteria-name-mismatch.png" alt-text="Screenshot that shows automated evaluations when the fields aren't the same." lightbox="../media/evaluations/view-results/evaluation-criteria-name-mismatch.png":::
135
133
136
-
Within the comparison **Table**, you can establish a baseline for your comparison by hovering over the specific run you want to use as the reference point and set as baseline. You can also activate the **Show delta** toggle to readily visualize the differences between the baseline run and the other runs for numerical values. Additionally, you can select the **Show only difference** toggle so that the table displays only the rows that differ among the selected runs, aiding in the identification of distinct variations.
134
+
Within the comparison table, you can establish a baseline for your comparison by hovering over the specific run that you want to use as the reference point and set as baseline. You can also activate the **Show delta** toggle to readily visualize the differences between the baseline run and the other runs for numerical values. Additionally, you can select the **Show only difference** toggle so that the table displays only the rows that differ among the selected runs, aiding in the identification of distinct variations.
137
135
138
136
By using these comparison features, you can make an informed decision to select the best version:
139
137
@@ -147,7 +145,7 @@ By using these comparison tools effectively, you can identify which version of y
147
145
148
146
## Measure jailbreak vulnerability
149
147
150
-
Evaluating jailbreak vulnerability is a comparative measurement, not an AI-assisted metric. Run evaluations on two different, red-teamed datasets: a baseline adversarial test dataset versus the same adversarial test dataset with jailbreak injections in the first turn. You can use the adversarial data simulator to generate the dataset with or without jailbreak injections. Ensure the **Criteria Name** is the same for each evaluation metric when you configure the runs.
148
+
Evaluating jailbreak vulnerability is a comparative measurement, not an AI-assisted metric. Run evaluations on two different, red-teamed datasets: a baseline adversarial test dataset versus the same adversarial test dataset with jailbreak injections in the first turn. You can use the adversarial data simulator to generate the dataset with or without jailbreak injections. Ensure that the **Criteria Name** is the same for each evaluation metric when you configure the runs.
151
149
152
150
To understand if your application is vulnerable to jailbreak, you can specify the baseline and then turn on the **Jailbreak defect rates** toggle in the comparison table. The jailbreak defect rate is the percentage of instances in your test dataset where a jailbreak injection generated a higher severity score for *any* content risk metric with respect to a baseline over the whole dataset size. You can select multiple evaluations in your **Compare** dashboard to view the difference in defect rates.
0 commit comments