You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/ai-studio/concepts/evaluation-metrics-built-in.md
+7-5Lines changed: 7 additions & 5 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -30,6 +30,7 @@ In the development and deployment of generative AI models and applications, the
30
30
Another consideration for evaluators is whether they're AI-assisted (using models as judge like GPT-4 to assess AI-generated output, especially when no defined ground truth is available) or NLP metrics, like F1 score, which measures similarity between AI-generated responses and ground truths.
31
31
32
32
- Risk and safety evaluators
33
+
33
34
These evaluators focus on identifying potential content and security risks and on ensuring the safety of the generated content.
34
35
35
36
> [!WARNING]
@@ -46,6 +47,7 @@ Another consideration for evaluators is whether they're AI-assisted (using model
46
47
| Indirect attack jailbreak (XPIA, Cross-domain Prompt Injected Attack) | Indirect attacks, also known as cross-domain prompt injected attacks (XPIA), occur when jailbreak attacks are injected into the context of a document or source that may result in altered, unexpected behavior on the part of the LLM. |
47
48
48
49
- Generation quality evaluators
50
+
49
51
These evaluators focus on various scenarios for quality measurement.
50
52
51
53
| Recommended scenario | Evaluator Type | Why use this evaluator? | Evaluators |
@@ -77,9 +79,9 @@ The risk and safety evaluators draw on insights gained from our previous Large L
You can measure these risk and safety evaluators on your own data or test dataset through red-teaming or on a synthetic test dataset generated by [our adversarial simulator](../how-to/develop/simulator-interaction-data.md#generate-adversarial-simulations-for-safety-evaluation). This outputs an annotated test dataset with content risk severity levels (very low, low, medium, or high) and [show your results in Azure AI ](../how-to/evaluate-results.md), which provide you with overall defect rate across whole test dataset and instance view of each content risk label and reasoning.
84
+
You can measure these risk and safety evaluators on your own data or test dataset through red-teaming or on a synthetic test dataset generated by [our adversarial simulator](../how-to/develop/simulator-interaction-data.md#generate-adversarial-simulations-for-safety-evaluation). This outputs an annotated test dataset with content risk severity levels (very low, low, medium, or high) and [shows your results in Azure AI ](../how-to/evaluate-results.md), which provides you with overall defect rate across whole test dataset and instance view of each content risk label and reasoning.
83
85
84
86
> [!NOTE]
85
87
> AI-assisted risk and safety evaluators are hosted by Azure AI Foundry safety evaluations back-end service and are only available in the following regions: East US 2, France Central, Sweden Central, Switzerland West. Protected Material evaluation is only available in East US 2.
@@ -218,7 +220,7 @@ Indirect attacks, also known as cross-domain prompt injected attacks (XPIA), are
218
220
219
221
Generation quality metrics are used to assess the overall quality of the content produced by generative AI applications. All metrics or evaluators will output a score and an explanation for the score (except for SimilarityEvaluator which currently outputs a score only). Here's a breakdown of what these metrics entail:
220
222
221
-
:::image type="content" source="../media/evaluations/quality-evaluation-diagram.png" alt-text="Diagram showing how AI-assisted data generator or customer's test dataset uses test prompts to go to your endpoint then the app responses and goes to the AI-assisted quality evaluator and then the evaluations results display in the portal." lightbox="../media/evaluations/quality-evaluation-diagram.png":::
223
+
:::image type="content" source="../media/evaluations/quality-evaluation-diagram.png" alt-text="Diagram of generation quality metric workflow." lightbox="../media/evaluations/quality-evaluation-diagram.png":::
222
224
223
225
### AI-assisted: Groundedness
224
226
@@ -393,7 +395,7 @@ Fluency refers to the effectiveness and clarity of written communication, focusi
393
395
| When to use it? | The recommended scenario is NLP tasks with a user query. Use it when you want an objective evaluation of an AI model's performance, particularly in text generation tasks where you have access to ground truth responses. Similarity enables you to assess the generated text's semantic alignment with the desired content, helping to gauge the model's quality and accuracy. |
394
396
| What does it need as input? | Query, Response, Ground Truth |
395
397
396
-
Our definition and grading rubrics to be used by the Large Language Model judge to score this metric
398
+
Our definition and grading rubrics to be used by the Large Language Model judge to score this metric:
397
399
398
400
```
399
401
GPT-Similarity, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of Equivalence metric using the following rating scale:
@@ -435,7 +437,7 @@ This rating value should always be an integer between 1 and 5. So the rating pro
435
437
| Score characteristics | Score details |
436
438
| ----- | --- |
437
439
| Score range | Float [0-1] (higher means better quality) |
438
-
| What is this metric? | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score comprises of precision, recall, and F1 score. |
440
+
| What is this metric? | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score is composed of precision, recall, and F1 score. |
439
441
| When to use it? | The recommended scenario is Natural Language Processing (NLP) tasks. Text summarization and document comparison are among the recommended use cases for ROUGE, particularly in scenarios where text coherence and relevance are critical.
440
442
| What does it need as input? | Response, Ground Truth |
0 commit comments