Commit 339c681

Updates for clarity
1 parent 14804fb

5 files changed, +16 -27 lines changed


learn-pr/wwl-data-ai/evaluate-language-models-azure-databricks/includes/1-introduction.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-Large Language Models (LLMs) have transformed how we build applications, powering everything from chatbots to content generation systems. As you deploy these models to production, a critical question emerges: how do you know if your LLM is working well?
+Large Language Models (LLMs) have transformed how we build applications, powering everything from chatbots to content generation systems. As you deploy these models to production, you need to determine if your LLM is working well.

 Evaluation is essential for successfully deploying LLMs to production. You need to understand how well your model performs, whether it produces reliable outputs, and how it behaves across different scenarios.
learn-pr/wwl-data-ai/evaluate-language-models-azure-databricks/includes/2-compare-evaluations.md

Lines changed: 6 additions & 15 deletions
@@ -28,23 +28,14 @@ This black box nature means you can't easily trace how the model arrived at its
 This interpretability gap affects building user trust and debugging unexpected outputs, making it a consideration in your evaluation strategy.

-## Avoid overfitting and generalization
+## Assess generalization across contexts

-LLM evaluation must assess how well models generalize to new, unseen inputs while maintaining consistent performance across different contexts.
+Generalization refers to a model's ability to perform well on data or tasks it hasn't seen during training, rather than just memorizing specific examples. For LLMs, good generalization means the model can handle new topics, writing styles, and use cases beyond what it was specifically trained on.

-Generalization challenges specific to LLMs present themselves in several ways. **Domain adaptation** becomes critical when you need your model to perform well across different subject areas - a model trained primarily on technical documentation might struggle with creative writing tasks. **Instruction following** tests whether your LLM consistently follows different types of prompts and maintains performance across various task formats.
+Consider a customer service LLM trained primarily on technical support conversations. Good generalization means it can adapt when customers ask about billing, use casual language, or need help with different products. Poor generalization would show up as the model giving overly technical responses to simple questions or failing to understand requests outside its training domain.

-**Context length handling** affects how well the model maintains quality when working with varying input lengths, from short prompts to lengthy documents. Finally, **robustness** measures the model's stability when facing adversarial or unexpected inputs that might try to confuse or mislead it. These challenges require careful evaluation strategies that go beyond simple accuracy measurements.
+Evaluating generalization helps ensure your LLM remains useful across the varied scenarios it encounters in real applications.

-## Implement evaluation dynamically
+## Implement evaluation with MLflow

-LLM evaluation requires continuous monitoring and adaptation, especially for deployed applications where language requirements and user expectations evolve.
-
-Dynamic evaluation refers to ongoing assessment of model performance in real-world conditions, rather than one-time testing with static datasets. This approach recognizes that LLM performance can shift over time due to changing user patterns, new domain requirements, or evolving quality standards. Several key practices support effective dynamic evaluation:
-
-- **Real-time monitoring**: Tracks output quality as your model encounters new user inputs and communication patterns
-- **A/B testing**: Compares different model versions or prompt strategies against live interactions
-- **Feedback integration**: Incorporates actual user responses into your evaluation metrics
-- **Adaptive benchmarking**: Updates evaluation criteria as your application grows and user needs change
-
-MLflow provides comprehensive tools to support these dynamic evaluation needs, offering experiment tracking, model comparison, and automated metric computation tailored for LLM applications. MLflow is integrated with Azure Databricks, providing support for LLM evaluation workflows within the platform.
+Azure Databricks integrates MLflow to support LLM evaluation workflows. You can use MLflow to track experiments, log evaluation metrics, compare model performance, and manage evaluation datasets. The platform integrates evaluation capabilities with other Azure Databricks features, enabling you to iterate and improve your LLM applications systematically.
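
As a concrete illustration of the MLflow workflow the new text describes, here's a minimal sketch of evaluating pre-generated LLM outputs with `mlflow.evaluate` on Azure Databricks. The example data, column names (`inputs`, `predictions`, `ground_truth`), and the chosen `model_type` are illustrative assumptions, not part of this commit.

```python
# A minimal sketch, assuming a recent MLflow 2.x that supports evaluating a
# static dataset (model=None) with a predictions column.
import mlflow
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "How do I reset my password?"],
        "predictions": [
            "MLflow is an open-source platform for managing the ML lifecycle.",
            "Open the sign-in page and select 'Forgot password' to get a reset link.",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for the machine learning lifecycle.",
            "Use the 'Forgot password' link on the sign-in page to request a reset email.",
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_data,
        predictions="predictions",        # column holding the model outputs
        targets="ground_truth",           # column holding the reference answers
        model_type="question-answering",  # enables built-in text metrics
    )
    print(results.metrics)  # e.g. exact_match and related built-in scores
```

Run inside a Databricks notebook, the computed metrics are logged to the active MLflow run, which is what makes comparing model or prompt versions across runs straightforward.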

learn-pr/wwl-data-ai/evaluate-language-models-azure-databricks/includes/4-standard-metrics.md

Lines changed: 7 additions & 9 deletions
@@ -1,34 +1,32 @@
 When you evaluate an LLM, you need to measure how well it performs specific tasks. For text generation tasks like translation and summarization, you use metrics that compare generated text to reference examples. For classification tasks, you measure how often the model makes correct predictions. You also need to assess safety and quality through toxicity metrics and human evaluation.

-## Use BLEU and ROUGE to evaluate text generation
+## Evaluate text generation

-When you need to evaluate text generation tasks like translation or summarization, you compare the generated text to reference examples. Reference text is the ideal or expected output for a given input - for example, a human-written translation or a professionally written summary. Two common metrics for this comparison are BLEU and ROUGE.
+When you need to evaluate text generation tasks like translation or summarization, you compare the generated text to reference examples. Reference text is the ideal or expected output for a given input like a human-written translation or a professionally written summary. Two common metrics for this comparison are BLEU and ROUGE.

 **BLEU (Bilingual Evaluation Understudy)** measures how much of your generated text matches the reference text. It gives you a score between 0 and 1, where higher scores mean better matches.

 **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** is a set of metrics used to evaluate the quality of generated text by comparing it to one or more reference texts. It primarily measures overlap between the generated and reference texts, focusing on recall or how much of the reference content is captured in the generated output.

 ## Measure accuracy for classification tasks

-Classification tasks involve choosing one answer from a set of predefined categories - for example, determining if a review is positive or negative, or selecting the correct answer from multiple choices.
+Classification tasks involve choosing one answer from a set of predefined categories like determining if a review is positive or negative, or selecting the correct answer from multiple choices.

 **Accuracy** measures how often a model makes correct predictions. For classification tasks like sentiment analysis or multiple choice questions, accuracy is calculated as the number of correct answers divided by total questions.

 Accuracy works well when there's one clearly correct answer, but it's not suitable for open-ended text generation where multiple responses could be valid.

-## Evaluate text fluency with perplexity
+## Evaluate text predictability with perplexity

-Text fluency refers to how natural and readable the generated text sounds. For example, does it flow well and sound like something a person would write?
+**Perplexity** measures how predictable your generated text is to the language model. It evaluates how well the model predicts what words should come next in a sentence. Lower perplexity scores indicate the text follows expected language patterns.

-**Perplexity** measures how well a language model predicts what words come next in a sentence. Lower perplexity scores mean the model is better at predicting the right words, which typically leads to more natural-sounding text.
-
-Perplexity measures the model's uncertainty - lower perplexity generally means better, more fluent text. It's helpful when comparing different models to see which one produces more natural language.
+You use perplexity to compare models and see which one produces more predictable text patterns.

 ## Assess content safety with toxicity metrics

 When you deploy an LLM to serve real users, you need to ensure it doesn't generate harmful, offensive, or biased content. This is crucial because LLMs trained on large internet datasets can learn and reproduce toxic language or biases present in the training data.

-**Toxicity** metrics evaluate whether your model generates problematic content. These metrics help identify potential risks before deployment, allowing you to implement safeguards or additional training to reduce harmful outputs.
+**Toxicity** metrics evaluate whether your model generates this type of content. These metrics help identify potential risks before deployment, allowing you to implement safeguards or additional training to reduce harmful outputs.

 Tools like the [Perspective API](https://perspectiveapi.com/?azure-portal=true) can assess text toxicity, providing scores that indicate the likelihood of content being perceived as harmful or offensive.
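
For the BLEU and ROUGE scores discussed in the 4-standard-metrics changes above, here's a small sketch using the Hugging Face `evaluate` library, one of several ways to compute them. The example sentences are made up, and the exact keys in the returned dictionaries can vary by library version.

```python
# Requires: pip install evaluate rouge_score nltk
import evaluate

predictions = ["The cat sat on the mat."]
references = [["The cat is sitting on the mat."]]  # one or more references per prediction

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

bleu_result = bleu.compute(predictions=predictions, references=references)
rouge_result = rouge.compute(predictions=predictions,
                             references=["The cat is sitting on the mat."])

print(bleu_result)   # includes a 'bleu' score between 0 and 1
print(rouge_result)  # includes 'rouge1', 'rouge2', and 'rougeL' overlap scores
```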
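
The accuracy calculation described for classification tasks is simple enough to show directly. A toy sentiment example, with predictions and labels invented for illustration:

```python
# Accuracy = number of correct predictions / total number of predictions.
predictions = ["positive", "negative", "positive", "negative"]
labels      = ["positive", "negative", "negative", "negative"]

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(f"Accuracy: {accuracy:.2f}")  # 3 of 4 correct -> 0.75
```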
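
Perplexity, as revised in the diff, is the exponential of the model's average next-token cross-entropy loss. A minimal sketch with the `transformers` library, using GPT-2 purely as an example scoring model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Large Language Models have transformed how we build applications."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    # over its next-token predictions for this text.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")  # lower means more predictable to the model
```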
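
Since the toxicity section links to the Perspective API, here's a hypothetical sketch of scoring a single model output with it over REST using `requests`. The endpoint, request shape, and the `PERSPECTIVE_API_KEY` environment variable are assumptions based on the public API documentation, so verify them against the current docs before relying on this.

```python
import os
import requests

API_KEY = os.environ["PERSPECTIVE_API_KEY"]  # assumed environment variable
URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

payload = {
    "comment": {"text": "Model output to check goes here."},
    "requestedAttributes": {"TOXICITY": {}},
}

response = requests.post(URL, params={"key": API_KEY}, json=payload, timeout=30)
response.raise_for_status()

score = response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(f"Toxicity score: {score:.3f}")  # closer to 1.0 means more likely perceived as toxic
```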

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 Large Language Models need specialized evaluation methods because they generate text and handle complex language tasks. LLMs require assessment across multiple dimensions including content quality, safety, and task performance.

-This module taught you the fundamentals of LLM evaluation in Azure Databricks. You learned about the unique challenges of evaluating text-generating models and how LLM evaluation fits within broader AI system assessment. You explored key evaluation metrics for measuring model performance and discovered how to use one LLM to evaluate another when human review isn't feasible.
+In this module, you learned about the fundamentals of LLM evaluation. You learned about the unique challenges of evaluating text-generating models and how LLM evaluation fits within broader AI system assessment. You also explored key evaluation metrics for measuring model performance and discovered how to use one LLM to evaluate another when human review isn't feasible.

learn-pr/wwl-data-ai/evaluate-language-models-azure-databricks/index.yml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ metadata:
 ai-usage: ai-assisted
 ms.collection: wwl-ai-copilot
 title: Evaluate language models with Azure Databricks
-summary: Explore Large Language Model (LLM) evaluation, understand their relationship with AI system evaluation, and explore various LLM evaluation metrics and specific task-related evaluations.
+summary: In this module, you explore Large Language Model evaluation using various metrics and approaches, learn about evaluation challenges and best practices, and discover automated evaluation techniques including LLM-as-a-judge methods.
 abstract: |
   In this module, you learn how to:
   - Evaluate LLM evaluation models
