fixes

lgayhardt · lgayhardt · commit 9879fe6daf25 · 2024-11-11T13:26:00.000-08:00
diff --git a/articles/ai-studio/concepts/evaluation-metrics-built-in.md b/articles/ai-studio/concepts/evaluation-metrics-built-in.md
@@ -30,6 +30,7 @@ In the development and deployment of generative AI models and applications, the
 Another consideration for evaluators is whether they're AI-assisted (using models as judge like GPT-4 to assess AI-generated output, especially when no defined ground truth is available) or NLP metrics, like F1 score, which measures similarity between AI-generated responses and ground truths.
 
 - Risk and safety evaluators
+
     These evaluators focus on identifying potential content and security risks and on ensuring the safety of the generated content.
 
     > [!WARNING]
@@ -46,6 +47,7 @@ Another consideration for evaluators is whether they're AI-assisted (using model
     | Indirect attack jailbreak (XPIA, Cross-domain Prompt Injected Attack) | Indirect attacks, also known as cross-domain prompt injected attacks (XPIA), occur when jailbreak attacks are injected into the context of a document or source that may result in altered, unexpected behavior on the part of the LLM. |
 
 - Generation quality evaluators
+
     These evaluators focus on various scenarios for quality measurement.
 
     | Recommended scenario | Evaluator Type | Why use this evaluator? | Evaluators |
@@ -77,9 +79,9 @@ The risk and safety evaluators draw on insights gained from our previous Large L
 - Direct attack jailbreak
 - Protected material content
 
-:::image type="content" source="../media/evaluations/automated-safety-evaluation-steps.png" alt-text="Diagram of automated safety evaluation steps; targeted prompts, AI-assisted simulation, AI-generated data, AI-assisted evaluation." lightbox="../media/evaluations/automated-safety-evaluation-steps.png":::
+:::image type="content" source="../media/evaluations/automated-safety-evaluation-steps.png" alt-text="Diagram of automated safety evaluation steps: targeted prompts, AI-assisted simulation, AI-generated data, AI-assisted evaluation." lightbox="../media/evaluations/automated-safety-evaluation-steps.png":::
 
-You can measure these risk and safety evaluators on your own data or test dataset through red-teaming or on a synthetic test dataset generated by [our adversarial simulator](../how-to/develop/simulator-interaction-data.md#generate-adversarial-simulations-for-safety-evaluation). This outputs an annotated test dataset with content risk severity levels (very low, low, medium, or high) and  [show your results in Azure AI ](../how-to/evaluate-results.md), which provide you with overall defect rate across whole test dataset and instance view of each content risk label and reasoning.
+You can measure these risk and safety evaluators on your own data or test dataset through red-teaming or on a synthetic test dataset generated by [our adversarial simulator](../how-to/develop/simulator-interaction-data.md#generate-adversarial-simulations-for-safety-evaluation). This outputs an annotated test dataset with content risk severity levels (very low, low, medium, or high) and [shows your results in Azure AI ](../how-to/evaluate-results.md), which provides you with overall defect rate across whole test dataset and instance view of each content risk label and reasoning.
 
 > [!NOTE]
 > AI-assisted risk and safety evaluators are hosted by Azure AI Foundry safety evaluations back-end service and are only available in the following regions: East US 2, France Central, Sweden Central, Switzerland West. Protected Material evaluation is only available in East US 2.
@@ -218,7 +220,7 @@ Indirect attacks, also known as cross-domain prompt injected attacks (XPIA), are
 
 Generation quality metrics are used to assess the overall quality of the content produced by generative AI applications. All metrics or evaluators will output a score and an explanation for the score (except for SimilarityEvaluator which currently outputs a score only). Here's a breakdown of what these metrics entail:
 
-:::image type="content" source="../media/evaluations/quality-evaluation-diagram.png" alt-text="Diagram showing how AI-assisted data generator or customer's test dataset uses test prompts to go to your endpoint then the app responses and goes to the AI-assisted quality evaluator and then the evaluations results display in the portal." lightbox="../media/evaluations/quality-evaluation-diagram.png":::
+:::image type="content" source="../media/evaluations/quality-evaluation-diagram.png" alt-text="Diagram of generation quality metric workflow." lightbox="../media/evaluations/quality-evaluation-diagram.png":::
 
 ### AI-assisted: Groundedness
 
@@ -393,7 +395,7 @@ Fluency refers to the effectiveness and clarity of written communication, focusi
 | When to use it?   | The recommended scenario is NLP tasks with a user query. Use it when you want an objective evaluation of an AI model's performance, particularly in text generation tasks where you have access to ground truth responses. Similarity enables you to assess the generated text's semantic alignment with the desired content, helping to gauge the model's quality and accuracy. |
 | What does it need as input?  | Query, Response, Ground Truth  | 
 
-Our definition and grading rubrics to be used by the Large Language Model judge to score this metric
+Our definition and grading rubrics to be used by the Large Language Model judge to score this metric:
 
 ```
 GPT-Similarity, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of Equivalence metric using the following rating scale: 
@@ -435,7 +437,7 @@ This rating value should always be an integer between 1 and 5. So the rating pro
 | Score characteristics | Score details  | 
 | ----- | --- | 
 | Score range | Float [0-1] (higher means better quality)   | 
-|  What is this metric? | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score comprises of precision, recall, and F1 score.  |
+|  What is this metric? | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. The ROUGE score is composed of precision, recall, and F1 score.  |
 | When to use it?   |  The recommended scenario is Natural Language Processing (NLP) tasks. Text summarization and document comparison are among the recommended use cases for ROUGE, particularly in scenarios where text coherence and relevance are critical. 
 | What does it need as input?  | Response, Ground Truth   |