Commit 339c681

Updates for clarity
1 parent 14804fb

5 files changed, +16 -27 lines changed


learn-pr/wwl-data-ai/evaluate-language-models-azure-databricks/includes/1-introduction.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-Large Language Models (LLMs) have transformed how we build applications, powering everything from chatbots to content generation systems. As you deploy these models to production, a critical question emerges: how do you know if your LLM is working well?
+Large Language Models (LLMs) have transformed how we build applications, powering everything from chatbots to content generation systems. As you deploy these models to production, you need to determine if your LLM is working well.

 Evaluation is essential for successfully deploying LLMs to production. You need to understand how well your model performs, whether it produces reliable outputs, and how it behaves across different scenarios.
learn-pr/wwl-data-ai/evaluate-language-models-azure-databricks/includes/2-compare-evaluations.md

Lines changed: 6 additions & 15 deletions
@@ -28,23 +28,14 @@ This black box nature means you can't easily trace how the model arrived at its
 This interpretability gap affects building user trust and debugging unexpected outputs, making it a consideration in your evaluation strategy.

-## Avoid overfitting and generalization
+## Assess generalization across contexts

-LLM evaluation must assess how well models generalize to new, unseen inputs while maintaining consistent performance across different contexts.
+Generalization refers to a model's ability to perform well on data or tasks it hasn't seen during training, rather than just memorizing specific examples. For LLMs, good generalization means the model can handle new topics, writing styles, and use cases beyond what it was specifically trained on.

-Generalization challenges specific to LLMs present themselves in several ways. **Domain adaptation** becomes critical when you need your model to perform well across different subject areas - a model trained primarily on technical documentation might struggle with creative writing tasks. **Instruction following** tests whether your LLM consistently follows different types of prompts and maintains performance across various task formats.
+Consider a customer service LLM trained primarily on technical support conversations. Good generalization means it can adapt when customers ask about billing, use casual language, or need help with different products. Poor generalization would show up as the model giving overly technical responses to simple questions or failing to understand requests outside its training domain.

-**Context length handling** affects how well the model maintains quality when working with varying input lengths, from short prompts to lengthy documents. Finally, **robustness** measures the model's stability when facing adversarial or unexpected inputs that might try to confuse or mislead it. These challenges require careful evaluation strategies that go beyond simple accuracy measurements.
+Evaluating generalization helps ensure your LLM remains useful across the varied scenarios it encounters in real applications.

-## Implement evaluation dynamically
+## Implement evaluation with MLflow

-LLM evaluation requires continuous monitoring and adaptation, especially for deployed applications where language requirements and user expectations evolve.
-
-Dynamic evaluation refers to ongoing assessment of model performance in real-world conditions, rather than one-time testing with static datasets. This approach recognizes that LLM performance can shift over time due to changing user patterns, new domain requirements, or evolving quality standards. Several key practices support effective dynamic evaluation:
-
-- **Real-time monitoring**: Tracks output quality as your model encounters new user inputs and communication patterns
-- **A/B testing**: Compares different model versions or prompt strategies against live interactions
-- **Feedback integration**: Incorporates actual user responses into your evaluation metrics
-- **Adaptive benchmarking**: Updates evaluation criteria as your application grows and user needs change
-
-MLflow provides comprehensive tools to support these dynamic evaluation needs, offering experiment tracking, model comparison, and automated metric computation tailored for LLM applications. MLflow is integrated with Azure Databricks, providing support for LLM evaluation workflows within the platform.
+Azure Databricks integrates MLflow to support LLM evaluation workflows. You can use MLflow to track experiments, log evaluation metrics, compare model performance, and manage evaluation datasets. The platform integrates evaluation capabilities with other Azure Databricks features, enabling you to iterate and improve your LLM applications systematically.
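
As a concrete illustration of the MLflow workflow the new text describes, here's a minimal sketch of evaluating pre-generated LLM outputs with `mlflow.evaluate` on Azure Databricks. The example data, column names (`inputs`, `predictions`, `ground_truth`), and the chosen `model_type` are illustrative assumptions, not part of this commit.

```python
# A minimal sketch, assuming a recent MLflow 2.x that supports evaluating a
# static dataset (model=None) with a predictions column.
import mlflow
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "How do I reset my password?"],
        "predictions": [
            "MLflow is an open-source platform for managing the ML lifecycle.",
            "Open the sign-in page and select 'Forgot password' to get a reset link.",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for the machine learning lifecycle.",
            "Use the 'Forgot password' link on the sign-in page to request a reset email.",
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_data,
        predictions="predictions",        # column holding the model outputs
        targets="ground_truth",           # column holding the reference answers
        model_type="question-answering",  # enables built-in text metrics
    )
    print(results.metrics)  # e.g. exact_match and related built-in scores
```

Run inside a Databricks notebook, the computed metrics are logged to the active MLflow run, which is what makes comparing model or prompt versions across runs straightforward.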

learn-pr/wwl-data-ai/evaluate-language-models-azure-databricks/includes/4-standard-metrics.md

Lines changed: 7 additions & 9 deletions
@@ -1,34 +1,32 @@
 When you evaluate an LLM, you need to measure how well it performs specific tasks. For text generation tasks like translation and summarization, you use metrics that compare generated text to reference examples. For classification tasks, you measure how often the model makes correct predictions. You also need to assess safety and quality through toxicity metrics and human evaluation.

-## Use BLEU and ROUGE to evaluate text generation
+## Evaluate text generation

-When you need to evaluate text generation tasks like translation or summarization, you compare the generated text to reference examples. Reference text is the ideal or expected output for a given input - for example, a human-written translation or a professionally written summary. Two common metrics for this comparison are BLEU and ROUGE.
+When you need to evaluate text generation tasks like translation or summarization, you compare the generated text to reference examples. Reference text is the ideal or expected output for a given input like a human-written translation or a professionally written summary. Two common metrics for this comparison are BLEU and ROUGE.

 **BLEU (Bilingual Evaluation Understudy)** measures how much of your generated text matches the reference text. It gives you a score between 0 and 1, where higher scores mean better matches.

 **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** is a set of metrics used to evaluate the quality of generated text by comparing it to one or more reference texts. It primarily measures overlap between the generated and reference texts, focusing on recall or how much of the reference content is captured in the generated output.

 ## Measure accuracy for classification tasks

-Classification tasks involve choosing one answer from a set of predefined categories - for example, determining if a review is positive or negative, or selecting the correct answer from multiple choices.
+Classification tasks involve choosing one answer from a set of predefined categories like determining if a review is positive or negative, or selecting the correct answer from multiple choices.

 **Accuracy** measures how often a model makes correct predictions. For classification tasks like sentiment analysis or multiple choice questions, accuracy is calculated as the number of correct answers divided by total questions.

 Accuracy works well when there's one clearly correct answer, but it's not suitable for open-ended text generation where multiple responses could be valid.

-## Evaluate text fluency with perplexity
+## Evaluate text predictability with perplexity

-Text fluency refers to how natural and readable the generated text sounds. For example, does it flow well and sound like something a person would write?
+**Perplexity** measures how predictable your generated text is to the language model. It evaluates how well the model predicts what words should come next in a sentence. Lower perplexity scores indicate the text follows expected language patterns.

-**Perplexity** measures how well a language model predicts what words come next in a sentence. Lower perplexity scores mean the model is better at predicting the right words, which typically leads to more natural-sounding text.
-
-Perplexity measures the model's uncertainty - lower perplexity generally means better, more fluent text. It's helpful when comparing different models to see which one produces more natural language.
+You use perplexity to compare models and see which one produces more predictable text patterns.

 ## Assess content safety with toxicity metrics

 When you deploy an LLM to serve real users, you need to ensure it doesn't generate harmful, offensive, or biased content. This is crucial because LLMs trained on large internet datasets can learn and reproduce toxic language or biases present in the training data.

-**Toxicity** metrics evaluate whether your model generates problematic content. These metrics help identify potential risks before deployment, allowing you to implement safeguards or additional training to reduce harmful outputs.
+**Toxicity** metrics evaluate whether your model generates this type of content. These metrics help identify potential risks before deployment, allowing you to implement safeguards or additional training to reduce harmful outputs.

 Tools like the [Perspective API](https://perspectiveapi.com/?azure-portal=true) can assess text toxicity, providing scores that indicate the likelihood of content being perceived as harmful or offensive.
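
For the BLEU and ROUGE scores discussed in the 4-standard-metrics changes above, here's a small sketch using the Hugging Face `evaluate` library, one of several ways to compute them. The example sentences are made up, and the exact keys in the returned dictionaries can vary by library version.

```python
# Requires: pip install evaluate rouge_score nltk
import evaluate

predictions = ["The cat sat on the mat."]
references = [["The cat is sitting on the mat."]]  # one or more references per prediction

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

bleu_result = bleu.compute(predictions=predictions, references=references)
rouge_result = rouge.compute(predictions=predictions,
                             references=["The cat is sitting on the mat."])

print(bleu_result)   # includes a 'bleu' score between 0 and 1
print(rouge_result)  # includes 'rouge1', 'rouge2', and 'rougeL' overlap scores
```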
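
The accuracy calculation described for classification tasks is simple enough to show directly. A toy sentiment example, with predictions and labels invented for illustration:

```python
# Accuracy = number of correct predictions / total number of predictions.
predictions = ["positive", "negative", "positive", "negative"]
labels      = ["positive", "negative", "negative", "negative"]

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(f"Accuracy: {accuracy:.2f}")  # 3 of 4 correct -> 0.75
```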
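
Perplexity, as revised in the diff, is the exponential of the model's average next-token cross-entropy loss. A minimal sketch with the `transformers` library, using GPT-2 purely as an example scoring model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Large Language Models have transformed how we build applications."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    # over its next-token predictions for this text.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")  # lower means more predictable to the model
```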
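
Since the toxicity section links to the Perspective API, here's a hypothetical sketch of scoring a single model output with it over REST using `requests`. The endpoint, request shape, and the `PERSPECTIVE_API_KEY` environment variable are assumptions based on the public API documentation, so verify them against the current docs before relying on this.

```python
import os
import requests

API_KEY = os.environ["PERSPECTIVE_API_KEY"]  # assumed environment variable
URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

payload = {
    "comment": {"text": "Model output to check goes here."},
    "requestedAttributes": {"TOXICITY": {}},
}

response = requests.post(URL, params={"key": API_KEY}, json=payload, timeout=30)
response.raise_for_status()

score = response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(f"Toxicity score: {score:.3f}")  # closer to 1.0 means more likely perceived as toxic
```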

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 Large Language Models need specialized evaluation methods because they generate text and handle complex language tasks. LLMs require assessment across multiple dimensions including content quality, safety, and task performance.

-This module taught you the fundamentals of LLM evaluation in Azure Databricks. You learned about the unique challenges of evaluating text-generating models and how LLM evaluation fits within broader AI system assessment. You explored key evaluation metrics for measuring model performance and discovered how to use one LLM to evaluate another when human review isn't feasible.
+In this module, you learned about the fundamentals of LLM evaluation. You learned about the unique challenges of evaluating text-generating models and how LLM evaluation fits within broader AI system assessment. You also explored key evaluation metrics for measuring model performance and discovered how to use one LLM to evaluate another when human review isn't feasible.

learn-pr/wwl-data-ai/evaluate-language-models-azure-databricks/index.yml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ metadata:
 ai-usage: ai-assisted
 ms.collection: wwl-ai-copilot
 title: Evaluate language models with Azure Databricks
-summary: Explore Large Language Model (LLM) evaluation, understand their relationship with AI system evaluation, and explore various LLM evaluation metrics and specific task-related evaluations.
+summary: In this module, you explore Large Language Model evaluation using various metrics and approaches, learn about evaluation challenges and best practices, and discover automated evaluation techniques including LLM-as-a-judge methods.
 abstract: |
   In this module, you learn how to:
   - Evaluate LLM evaluation models
