learn-pr/wwl-data-ai/evaluate-language-models-azure-databricks/includes/2-compare-evaluations.md
4 additions & 12 deletions
@@ -6,19 +6,11 @@ These LLM-specific evaluation challenges affect how you choose evaluation method
LLM evaluation focuses on language quality dimensions that present unique measurement challenges. When you evaluate text generation, you're assessing not just correctness but also coherence, creativity, and contextual appropriateness.
-The complexity comes from the subjective nature of language itself. What makes good writing varies based on audience, purpose, and context. A technical explanation needs precision and clarity, while creative content might prioritize originality and emotional impact. LLMs must handle this variability, which means your evaluation approach must account for multiple quality dimensions simultaneously.
+The complexity comes from the subjective nature of language. Good writing varies based on audience, purpose, and context. A technical explanation needs precision and clarity, while creative content might prioritize originality and emotional impact. Since LLMs must handle this variability, your evaluation approach must account for multiple quality dimensions simultaneously.
Unlike simple classification tasks where you can measure accuracy against known labels, language generation produces open-ended outputs where multiple responses could be equally valid. This reality shapes how you design evaluation frameworks and interpret results.
-## Understand how to interpret evaluation metrics
-LLM evaluation metrics are designed to capture the nuanced aspects of language generation. These metrics often require careful interpretation and contextual understanding.
-LLM evaluation combines multiple approaches:
-**Automated metrics** for specific tasks
-**Human evaluation** for subjective quality assessment
-**Task-specific measures** that align with your application's goals
-**Contextual relevance** scoring for real-world applicability
+To address these challenges, LLM evaluation typically combines several complementary approaches. The following sections explore key evaluation strategies and considerations for implementing effective LLM assessment.
## Include human evaluations
@@ -40,9 +32,9 @@ This interpretability gap affects building user trust and debugging unexpected o
LLM evaluation must assess how well models generalize to new, unseen inputs while maintaining consistent performance across different contexts.
-Generalization challenges specific to LLMs present themselves in several ways. **Domain adaptation** becomes critical when you need your model to perform well across different subject areas - a model trained primarily on technical documentation might struggle with creative writing tasks. **Context length handling** affects how well the model maintains quality when working with varying input lengths, from short prompts to lengthy documents.
+Generalization challenges specific to LLMs present themselves in several ways. **Domain adaptation** becomes critical when you need your model to perform well across different subject areas - a model trained primarily on technical documentation might struggle with creative writing tasks. **Instruction following** tests whether your LLM consistently follows different types of prompts and maintains performance across various task formats.
-**Instruction following** tests whether your LLM consistently follows different types of prompts and maintains performance across various task formats. Finally, **robustness** measures the model's stability when facing adversarial or unexpected inputs that might try to confuse or mislead it. These challenges require careful evaluation strategies that go beyond simple accuracy measurements.
+**Context length handling** affects how well the model maintains quality when working with varying input lengths, from short prompts to lengthy documents. Finally, **robustness** measures the model's stability when facing adversarial or unexpected inputs that might try to confuse or mislead it. These challenges require careful evaluation strategies that go beyond simple accuracy measurements.
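For illustration, a minimal robustness probe can compare a model's answers across paraphrases of the same request. The sketch below is not part of the module content; the prompts, canned outputs, and string-overlap measure are placeholder assumptions, and a semantic similarity measure would usually be a better choice in practice.

```python
# Minimal robustness probe: compare answers to paraphrases of the same request.
# The prompts and outputs below are invented placeholders; in practice you would
# collect the outputs from your own model or serving endpoint.
from difflib import SequenceMatcher
from itertools import combinations

outputs = {
    "Summarize the risks of skipping LLM evaluation.":
        "Skipping evaluation risks shipping inaccurate, biased, or unsafe responses.",
    "What can go wrong if an LLM is deployed without evaluation?":
        "You may deploy a model that produces wrong, biased, or harmful answers.",
    "List the dangers of releasing an unevaluated language model.":
        "Unchecked hallucinations, bias, and safety failures can reach users.",
}

# Pairwise string similarity as a rough stability signal: large drops between
# paraphrases suggest the model is sensitive to prompt wording.
for (p1, o1), (p2, o2) in combinations(outputs.items(), 2):
    score = SequenceMatcher(None, o1, o2).ratio()
    print(f"{score:.2f}  {p1[:40]!r} vs {p2[:40]!r}")
```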
learn-pr/wwl-data-ai/evaluate-language-models-azure-databricks/includes/4-standard-metrics.md
8 additions & 0 deletions
@@ -44,6 +44,14 @@ Effective human evaluation requires:
Combining automated metrics with human evaluation provides a more comprehensive assessment of model performance. Automated metrics offer efficiency and consistency, while human evaluation provides nuanced insights into subjective quality aspects.
+## Interpret evaluation metrics effectively
+LLM evaluation metrics require interpretation because they sometimes don't tell the complete story about model performance. Understanding what each metric actually measures helps you make better decisions about your model.
+When interpreting LLM metrics, consider the context and limitations of each measurement. A high BLEU score indicates good overlap with reference text, but it doesn't guarantee that the generated text is coherent or appropriate for the situation. Similarly, low perplexity suggests the model is confident in its predictions, but this doesn't mean the content is factually correct or useful.
+Multiple metrics together provide a more complete picture than any single score. For example, a model might have excellent fluency scores but poor accuracy on factual questions, or high similarity to reference texts but low creativity ratings. Always evaluate metrics in combination and consider what aspects of performance matter most for your specific use case.
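As a concrete illustration of why overlap metrics need context, the following sketch (not part of the module) uses the Hugging Face evaluate package with invented sentences: a prediction that reverses the meaning of the reference can still score about as well as a faithful paraphrase.

```python
# Sketch: a contradictory prediction (it drops the word "not") shares most n-grams
# with the reference, so its BLEU score can land close to that of a faithful
# paraphrase. Requires the Hugging Face `evaluate` package (pip install evaluate).
import evaluate

bleu = evaluate.load("bleu")

references = [["The update does not change the default timeout for API calls."]]
faithful = ["The update does not modify the default timeout for API calls."]
contradictory = ["The update does change the default timeout for API calls."]

print("faithful:     ", bleu.compute(predictions=faithful, references=references)["bleu"])
print("contradictory:", bleu.compute(predictions=contradictory, references=references)["bleu"])
# Similar scores despite opposite meanings, which is why overlap metrics should be
# paired with factuality checks or human review.
```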
## Track evaluation metrics with MLflow
Once you start running evaluations, you'll want to keep track of all your results and experiments. This is where MLflow can help. It's supported in Azure Databricks and helps you organize your evaluation data, experiment results, and model versions.
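For illustration, a minimal sketch of logging evaluation results with MLflow follows; the experiment path, run name, parameters, and metric values are placeholder assumptions rather than values from the module.

```python
# Minimal MLflow tracking sketch for LLM evaluation results.
# Experiment path, run name, parameters, and metric values are placeholders.
import mlflow

mlflow.set_experiment("/Shared/llm-evaluation")  # assumed workspace experiment path

with mlflow.start_run(run_name="baseline-eval"):
    mlflow.log_param("model_name", "my-llm-v1")         # hypothetical model identifier
    mlflow.log_param("eval_dataset", "support-faq-v2")  # hypothetical dataset tag
    mlflow.log_metric("bleu", 0.42)
    mlflow.log_metric("perplexity", 13.7)
    mlflow.log_metric("toxicity", 0.01)
```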
learn-pr/wwl-data-ai/evaluate-language-models-azure-databricks/index.yml
4 additions & 4 deletions
@@ -14,10 +14,10 @@ title: Evaluate language models with Azure Databricks
summary: Explore Large Language Model (LLM) evaluation, understand its relationship with AI system evaluation, and explore various LLM evaluation metrics and specific task-related evaluations.
abstract: |
In this module, you learn how to:
-  - Explore LLM evaluation challenges and approaches.
-  - Describe the relationship between LLM evaluation and evaluation of entire AI systems.
-  - Describe generic LLM evaluation metrics like accuracy, perplexity, and toxicity.
-  - Describe LLM-as-a-judge for evaluation.
+  - Evaluate LLM evaluation models
+  - Describe the relationship between LLM evaluation and AI system evaluation
+  - Describe standard LLM evaluation metrics like accuracy, perplexity, and toxicity
+  - Describe LLM-as-a-judge for evaluation
prerequisites: |
Before starting this module, you should be familiar with Azure Databricks. Consider completing [Explore Azure Databricks](/training/modules/explore-azure-databricks?azure-portal=true) before starting this module.