Commit 9879fe6

fixes
1 parent 3d4653c commit 9879fe6

1 file changed: +7 -5 lines changed


articles/ai-studio/concepts/evaluation-metrics-built-in.md

Lines changed: 7 additions & 5 deletions
@@ -30,6 +30,7 @@ In the development and deployment of generative AI models and applications, the
Another consideration for evaluators is whether they're AI-assisted (using models as judge like GPT-4 to assess AI-generated output, especially when no defined ground truth is available) or NLP metrics, like F1 score, which measures similarity between AI-generated responses and ground truths.

- Risk and safety evaluators
+
These evaluators focus on identifying potential content and security risks and on ensuring the safety of the generated content.

> [!WARNING]
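The F1 score mentioned in the context lines above is typically a token-overlap measure between the generated response and the ground truth: precision is the share of response tokens that appear in the ground truth, recall is the share of ground-truth tokens that appear in the response, and F1 is their harmonic mean. A minimal sketch of that idea (whitespace tokenization only; an illustrative simplification, not the built-in evaluator's implementation):

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Token-overlap F1 between a generated response and a ground-truth answer.

    Illustrative only: real evaluators typically also strip punctuation and
    may tokenize more carefully than str.split().
    """
    resp_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    overlap = sum((Counter(resp_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

# Word order doesn't matter for token overlap, so this scores 1.0.
print(token_f1("Paris is the capital of France", "The capital of France is Paris"))
```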
@@ -46,6 +47,7 @@ Another consideration for evaluators is whether they're AI-assisted (using model
| Indirect attack jailbreak (XPIA, Cross-domain Prompt Injected Attack) | Indirect attacks, also known as cross-domain prompt injected attacks (XPIA), occur when jailbreak attacks are injected into the context of a document or source that may result in altered, unexpected behavior on the part of the LLM. |

- Generation quality evaluators
+
These evaluators focus on various scenarios for quality measurement.

| Recommended scenario | Evaluator Type | Why use this evaluator? | Evaluators |
@@ -77,9 +79,9 @@ The risk and safety evaluators draw on insights gained from our previous Large L
- Direct attack jailbreak
- Protected material content

-:::image type="content" source="../media/evaluations/automated-safety-evaluation-steps.png" alt-text="Diagram of automated safety evaluation steps; targeted prompts, AI-assisted simulation, AI-generated data, AI-assisted evaluation." lightbox="../media/evaluations/automated-safety-evaluation-steps.png":::
+:::image type="content" source="../media/evaluations/automated-safety-evaluation-steps.png" alt-text="Diagram of automated safety evaluation steps: targeted prompts, AI-assisted simulation, AI-generated data, AI-assisted evaluation." lightbox="../media/evaluations/automated-safety-evaluation-steps.png":::

-You can measure these risk and safety evaluators on your own data or test dataset through red-teaming or on a synthetic test dataset generated by [our adversarial simulator](../how-to/develop/simulator-interaction-data.md#generate-adversarial-simulations-for-safety-evaluation). This outputs an annotated test dataset with content risk severity levels (very low, low, medium, or high) and [show your results in Azure AI ](../how-to/evaluate-results.md), which provide you with overall defect rate across whole test dataset and instance view of each content risk label and reasoning.
+You can measure these risk and safety evaluators on your own data or test dataset through red-teaming or on a synthetic test dataset generated by [our adversarial simulator](../how-to/develop/simulator-interaction-data.md#generate-adversarial-simulations-for-safety-evaluation). This outputs an annotated test dataset with content risk severity levels (very low, low, medium, or high) and [shows your results in Azure AI ](../how-to/evaluate-results.md), which provides you with overall defect rate across whole test dataset and instance view of each content risk label and reasoning.

> [!NOTE]
> AI-assisted risk and safety evaluators are hosted by Azure AI Foundry safety evaluations back-end service and are only available in the following regions: East US 2, France Central, Sweden Central, Switzerland West. Protected Material evaluation is only available in East US 2.
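On the defect rate mentioned in the hunk above: it can be read as the share of instances whose content-risk severity falls at or above a chosen threshold. A minimal sketch of that calculation over severity-annotated rows (hypothetical field names, not the service's actual output schema):

```python
# Hypothetical, simplified illustration of a defect rate over severity-annotated results.
SEVERITY_ORDER = ["very low", "low", "medium", "high"]

def defect_rate(rows: list[dict], threshold: str = "medium") -> float:
    """Fraction of rows whose 'severity' is at or above the threshold."""
    cutoff = SEVERITY_ORDER.index(threshold)
    defects = sum(1 for row in rows if SEVERITY_ORDER.index(row["severity"]) >= cutoff)
    return defects / len(rows) if rows else 0.0

results = [
    {"query": "...", "severity": "very low"},
    {"query": "...", "severity": "high"},
    {"query": "...", "severity": "low"},
    {"query": "...", "severity": "medium"},
]
print(defect_rate(results))  # 0.5: two of four rows are at or above "medium"
```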
@@ -218,7 +220,7 @@ Indirect attacks, also known as cross-domain prompt injected attacks (XPIA), are
Generation quality metrics are used to assess the overall quality of the content produced by generative AI applications. All metrics or evaluators will output a score and an explanation for the score (except for SimilarityEvaluator which currently outputs a score only). Here's a breakdown of what these metrics entail:

-:::image type="content" source="../media/evaluations/quality-evaluation-diagram.png" alt-text="Diagram showing how AI-assisted data generator or customer's test dataset uses test prompts to go to your endpoint then the app responses and goes to the AI-assisted quality evaluator and then the evaluations results display in the portal." lightbox="../media/evaluations/quality-evaluation-diagram.png":::
+:::image type="content" source="../media/evaluations/quality-evaluation-diagram.png" alt-text="Diagram of generation quality metric workflow." lightbox="../media/evaluations/quality-evaluation-diagram.png":::

### AI-assisted: Groundedness
@@ -393,7 +395,7 @@ Fluency refers to the effectiveness and clarity of written communication, focusi
| When to use it? | The recommended scenario is NLP tasks with a user query. Use it when you want an objective evaluation of an AI model's performance, particularly in text generation tasks where you have access to ground truth responses. Similarity enables you to assess the generated text's semantic alignment with the desired content, helping to gauge the model's quality and accuracy. |
| What does it need as input? | Query, Response, Ground Truth |

-Our definition and grading rubrics to be used by the Large Language Model judge to score this metric
+Our definition and grading rubrics to be used by the Large Language Model judge to score this metric:

```
GPT-Similarity, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of Equivalence metric using the following rating scale:
@@ -435,7 +437,7 @@ This rating value should always be an integer between 1 and 5. So the rating pro
| Score characteristics | Score details |
| ----- | --- |
| Score range | Float [0-1] (higher means better quality) |
-| What is this metric? | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score comprises of precision, recall, and F1 score. |
+| What is this metric? | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score is composed of precision, recall, and F1 score. |
| When to use it? | The recommended scenario is Natural Language Processing (NLP) tasks. Text summarization and document comparison are among the recommended use cases for ROUGE, particularly in scenarios where text coherence and relevance are critical.
| What does it need as input? | Response, Ground Truth |
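To make the ROUGE table above concrete, here is a small sketch using the open-source `rouge-score` package (an assumed choice for illustration; not necessarily what the built-in evaluator uses) to surface the precision, recall, and F1 components the table describes:

```python
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."        # ground truth
candidate = "A cat was sitting on the mat."  # generated response

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # score(target, prediction)

for name, score in scores.items():
    # Each entry exposes the precision, recall, and F1 components described in the table.
    print(f"{name}: precision={score.precision:.2f} recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```

ROUGE-1 counts unigram overlap, while ROUGE-L is based on the longest common subsequence; both report the same three components.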