
Commit 4be04c6

Merge pull request #51340 from theresa-i/evaluate-language-models
Updated module for clarity
2 parents: 4169e65 + b19a529

16 files changed, +253 -284 lines changed
Lines changed: 16 additions & 16 deletions
@@ -1,16 +1,16 @@
-### YamlMime:ModuleUnit
-uid: learn.wwl.evaluate-language-models-azure-databricks.introduction
-title: Introduction
-metadata:
-  title: Introduction
-  description: "Introduction"
-  ms.date: 03/20/2025
-  author: wwlpublish
-  ms.author: theresai
-  ms.topic: unit
-azureSandbox: false
-labModal: false
-durationInMinutes: 2
-content: |
-  [!include[](includes/1-introduction.md)]
-
+### YamlMime:ModuleUnit
+uid: learn.wwl.evaluate-language-models-azure-databricks.introduction
+title: Introduction
+metadata:
+  title: Introduction
+  description: "Introduction"
+  ms.date: 07/10/2025
+  author: theresa-i
+  ms.author: theresai
+  ms.topic: unit
+azureSandbox: false
+labModal: false
+durationInMinutes: 2
+content: |
+  [!include[](includes/1-introduction.md)]
+
Lines changed: 16 additions & 16 deletions
@@ -1,16 +1,16 @@
-### YamlMime:ModuleUnit
-uid: learn.wwl.evaluate-language-models-azure-databricks.compare-evaluations
-title: Compare LLM and traditional ML evaluations
-metadata:
-  title: Compare LLM and traditional ML evaluations
-  description: "Compare Large Language Model and traditional Machine Learning evaluations"
-  ms.date: 03/20/2025
-  author: wwlpublish
-  ms.author: theresai
-  ms.topic: unit
-azureSandbox: false
-labModal: false
-durationInMinutes: 7
-content: |
-  [!include[](includes/2-compare-evaluations.md)]
-
+### YamlMime:ModuleUnit
+uid: learn.wwl.evaluate-language-models-azure-databricks.compare-evaluations
+title: Explore LLM evaluation
+metadata:
+  title: Explore LLM evaluation
+  description: "Explore Large Language Model evaluation"
+  ms.date: 07/10/2025
+  author: theresa-i
+  ms.author: theresai
+  ms.topic: unit
+azureSandbox: false
+labModal: false
+durationInMinutes: 7
+content: |
+  [!include[](includes/2-compare-evaluations.md)]
+
Lines changed: 16 additions & 16 deletions
@@ -1,16 +1,16 @@
-### YamlMime:ModuleUnit
-uid: learn.wwl.evaluate-language-models-azure-databricks.ai-systems
-title: Evaluate LLMs and AI systems
-metadata:
-  title: Evaluate LLMs and AI systems
-  description: "Describe the relationship between LLM evaluation and evaluation of entire AI systems"
-  ms.date: 03/20/2025
-  author: wwlpublish
-  ms.author: theresai
-  ms.topic: unit
-azureSandbox: false
-labModal: false
-durationInMinutes: 5
-content: |
-  [!include[](includes/3-ai-systems.md)]
-
+### YamlMime:ModuleUnit
+uid: learn.wwl.evaluate-language-models-azure-databricks.ai-systems
+title: Evaluate LLMs and AI systems
+metadata:
+  title: Evaluate LLMs and AI systems
+  description: "Describe the relationship between LLM evaluation and evaluation of entire AI systems"
+  ms.date: 07/10/2025
+  author: theresa-i
+  ms.author: theresai
+  ms.topic: unit
+azureSandbox: false
+labModal: false
+durationInMinutes: 5
+content: |
+  [!include[](includes/3-ai-systems.md)]
+
Lines changed: 16 additions & 16 deletions
@@ -1,16 +1,16 @@
-### YamlMime:ModuleUnit
-uid: learn.wwl.evaluate-language-models-azure-databricks.standard-metrics
-title: Evaluate LLMs with standard metrics
-metadata:
-  title: Evaluate LLMs with standard metrics
-  description: "Evaluate LLMs with standard metrics"
-  ms.date: 03/20/2025
-  author: wwlpublish
-  ms.author: theresai
-  ms.topic: unit
-azureSandbox: false
-labModal: false
-durationInMinutes: 7
-content: |
-  [!include[](includes/4-standard-metrics.md)]
-
+### YamlMime:ModuleUnit
+uid: learn.wwl.evaluate-language-models-azure-databricks.standard-metrics
+title: Evaluate LLMs with standard metrics
+metadata:
+  title: Evaluate LLMs with standard metrics
+  description: "Evaluate LLMs with standard metrics"
+  ms.date: 07/10/2025
+  author: theresa-i
+  ms.author: theresai
+  ms.topic: unit
+azureSandbox: false
+labModal: false
+durationInMinutes: 7
+content: |
+  [!include[](includes/4-standard-metrics.md)]
+
Lines changed: 16 additions & 16 deletions
@@ -1,16 +1,16 @@
-### YamlMime:ModuleUnit
-uid: learn.wwl.evaluate-language-models-azure-databricks.language-model-judge
-title: Describe LLM-as-a-judge for evaluation
-metadata:
-  title: Describe LLM-as-a-judge for evaluation
-  description: "Describe LLM-as-a-judge for evaluation"
-  ms.date: 03/20/2025
-  author: wwlpublish
-  ms.author: theresai
-  ms.topic: unit
-azureSandbox: false
-labModal: false
-durationInMinutes: 7
-content: |
-  [!include[](includes/5-language-model-judge.md)]
-
+### YamlMime:ModuleUnit
+uid: learn.wwl.evaluate-language-models-azure-databricks.language-model-judge
+title: Describe LLM-as-a-judge for evaluation
+metadata:
+  title: Describe LLM-as-a-judge for evaluation
+  description: "Describe LLM-as-a-judge for evaluation"
+  ms.date: 07/10/2025
+  author: theresa-i
+  ms.author: theresai
+  ms.topic: unit
+azureSandbox: false
+labModal: false
+durationInMinutes: 7
+content: |
+  [!include[](includes/5-language-model-judge.md)]
+
Lines changed: 15 additions & 15 deletions
@@ -1,15 +1,15 @@
-### YamlMime:ModuleUnit
-uid: learn.wwl.evaluate-language-models-azure-databricks.exercise
-title: Exercise - Evaluate an Azure OpenAI model
-metadata:
-  title: Exercise - Evaluate an Azure OpenAI model
-  description: "Exercise - Evaluate an Azure OpenAI model"
-  ms.date: 03/20/2025
-  author: wwlpublish
-  ms.author: theresai
-  ms.topic: unit
-azureSandbox: false
-labModal: false
-durationInMinutes: 30
-content: |
-  [!include[](includes/6-exercise.md)]
+### YamlMime:ModuleUnit
+uid: learn.wwl.evaluate-language-models-azure-databricks.exercise
+title: Exercise - Evaluate an Azure OpenAI model
+metadata:
+  title: Exercise - Evaluate an Azure OpenAI model
+  description: "Exercise - Evaluate an Azure OpenAI model"
+  ms.date: 07/10/2025
+  author: theresa-i
+  ms.author: theresai
+  ms.topic: unit
+azureSandbox: false
+labModal: false
+durationInMinutes: 30
+content: |
+  [!include[](includes/6-exercise.md)]
Lines changed: 49 additions & 49 deletions
@@ -1,49 +1,49 @@
-### YamlMime:ModuleUnit
-uid: learn.wwl.evaluate-language-models-azure-databricks.knowledge-check
-title: Module assessment
-metadata:
-  title: Module assessment
-  description: "Knowledge check"
-  ms.date: 03/20/2025
-  author: wwlpublish
-  ms.author: theresai
-  ms.topic: unit
-  module_assessment: true
-azureSandbox: false
-labModal: false
-durationInMinutes: 3
-quiz:
-  questions:
-  - content: "What is the primary purpose of evaluating a Large Language Model (LLM)?"
-    choices:
-    - content: "To improve its computational efficiency."
-      isCorrect: false
-      explanation: "Incorrect. Evaluating an LLM doesn't improve its computational efficiency."
-    - content: "To assess its accuracy and performance on specific tasks."
-      isCorrect: true
-      explanation: "Correct. The primary purpose of evaluating an LLM is to determine its effectiveness and accuracy."
-    - content: "To increase its training data size."
-      isCorrect: false
-      explanation: "Incorrect. Evaluating an LLM doesn't increase the training data size."
-  - content: "In the context of evaluating language models, what does perplexity measure?"
-    choices:
-    - content: "The size of the training dataset."
-      isCorrect: false
-      explanation: "Incorrect. Perplexity doesn't measure the size of the training dataset."
-    - content: "The diversity of generated text."
-      isCorrect: false
-      explanation: "Incorrect. Perplexity doesn't measure the diversity of generated text."
-    - content: "The uncertainty of the model in predicting the next word."
-      isCorrect: true
-      explanation: "Correct. Perplexity is a measure of how uncertain a language model is when predicting the next word in a sequence. Lower perplexity indicates a better-performing model."
-  - content: "When you evaluate a large language model (LLM) for bias, what is a common approach?"
-    choices:
-    - content: "Measuring the model's training time"
-      isCorrect: false
-      explanation: "Incorrect. Measuring the model's training time doesn't evaluate an LLM for bias."
-    - content: "Analyzing the model's outputs for harmful stereotypes"
-      isCorrect: true
-      explanation: "Correct. Evaluating a model for bias typically involves analyzing its outputs to identify and mitigate harmful stereotypes or biased predictions, ensuring the model is fair and ethical in its responses."
-    - content: "Counting the number of model parameters"
-      isCorrect: false
-      explanation: "Incorrect. Counting the number of model parameters doesn't evaluate an LLM for bias."
+### YamlMime:ModuleUnit
+uid: learn.wwl.evaluate-language-models-azure-databricks.knowledge-check
+title: Module assessment
+metadata:
+  title: Module assessment
+  description: "Knowledge check"
+  ms.date: 07/10/2025
+  author: theresa-i
+  ms.author: theresai
+  ms.topic: unit
+  module_assessment: true
+azureSandbox: false
+labModal: false
+durationInMinutes: 3
+quiz:
+  questions:
+  - content: "What is the primary purpose of evaluating a Large Language Model (LLM)?"
+    choices:
+    - content: "To improve its computational efficiency."
+      isCorrect: false
+      explanation: "Incorrect. Evaluating an LLM doesn't improve its computational efficiency."
+    - content: "To assess its accuracy and performance on specific tasks."
+      isCorrect: true
+      explanation: "Correct. The primary purpose of evaluating an LLM is to determine its effectiveness and accuracy."
+    - content: "To increase its training data size."
+      isCorrect: false
+      explanation: "Incorrect. Evaluating an LLM doesn't increase the training data size."
+  - content: "In the context of evaluating language models, what does perplexity measure?"
+    choices:
+    - content: "The size of the training dataset."
+      isCorrect: false
+      explanation: "Incorrect. Perplexity doesn't measure the size of the training dataset."
+    - content: "The diversity of generated text."
+      isCorrect: false
+      explanation: "Incorrect. Perplexity doesn't measure the diversity of generated text."
+    - content: "The uncertainty of the model in predicting the next word."
+      isCorrect: true
+      explanation: "Correct. Perplexity is a measure of how uncertain a language model is when predicting the next word in a sequence. Lower perplexity indicates a better-performing model."
+  - content: "When you evaluate a large language model (LLM) for bias, what is a common approach?"
+    choices:
+    - content: "Measuring the model's training time"
+      isCorrect: false
+      explanation: "Incorrect. Measuring the model's training time doesn't evaluate an LLM for bias."
+    - content: "Analyzing the model's outputs for harmful stereotypes"
+      isCorrect: true
+      explanation: "Correct. Evaluating a model for bias typically involves analyzing its outputs to identify and mitigate harmful stereotypes or biased predictions, ensuring the model is fair and ethical in its responses."
+    - content: "Counting the number of model parameters"
+      isCorrect: false
+      explanation: "Incorrect. Counting the number of model parameters doesn't evaluate an LLM for bias."
Lines changed: 16 additions & 16 deletions
@@ -1,16 +1,16 @@
-### YamlMime:ModuleUnit
-uid: learn.wwl.evaluate-language-models-azure-databricks.summary
-title: Summary
-metadata:
-  title: Summary
-  description: "Summary"
-  ms.date: 03/20/2025
-  author: wwlpublish
-  ms.author: theresai
-  ms.topic: unit
-azureSandbox: false
-labModal: false
-durationInMinutes: 1
-content: |
-  [!include[](includes/8-summary.md)]
-
+### YamlMime:ModuleUnit
+uid: learn.wwl.evaluate-language-models-azure-databricks.summary
+title: Summary
+metadata:
+  title: Summary
+  description: "Summary"
+  ms.date: 07/10/2025
+  author: theresa-i
+  ms.author: theresai
+  ms.topic: unit
+azureSandbox: false
+labModal: false
+durationInMinutes: 1
+content: |
+  [!include[](includes/8-summary.md)]
+
16+
Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
-Evaluating Large Language Models (LLMs) is crucial in artificial intelligence because they're central to many applications, from natural language processing to automated decision-making systems.
+Large Language Models (LLMs) have transformed how we build applications, powering everything from chatbots to content generation systems. As you deploy these models to production, you need to determine if your LLM is working well.
 
-By assessing their performance, interpretability, and ethical implications, you gain insights into their strengths and limitations, enabling more effective deployment in real-world scenarios.
+Evaluation is essential for successfully deploying LLMs to production. You need to understand how well your model performs, whether it produces reliable outputs, and how it behaves across different scenarios.
 
-This evaluation includes traditional metrics like accuracy and efficiency, as well as broader aspects such as fairness, bias, and generalization across diverse tasks, ensuring that LLMs are reliable, transparent, and aligned with human values.
+In this module, you'll learn to evaluate LLMs by comparing evaluation approaches, and understanding how individual model evaluation fits into broader AI system assessment. You'll also learn about standard metrics like accuracy and perplexity, and implementing LLM-as-a-judge techniques for scalable evaluation.
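
The LLM-as-a-judge technique named in the new introduction reduces to a short loop: prompt a stronger model with a rubric, the question, a reference, and the answer under test, then parse a score from its reply. Below is a minimal sketch of that idea; `query_judge_model` is a hypothetical placeholder for whatever chat-completion client you use, and the prompt template and score parsing are illustrative rather than the module's exact implementation.

```python
import re

# Illustrative rubric: constrain the judge to a bare 1-5 rating so parsing stays trivial.
JUDGE_PROMPT = """You are an impartial evaluator.
Rate the ASSISTANT ANSWER for factual accuracy against the REFERENCE
on a scale of 1 (wrong) to 5 (fully correct). Reply with the number only.

QUESTION: {question}
REFERENCE: {reference}
ASSISTANT ANSWER: {answer}
"""

def query_judge_model(prompt: str) -> str:
    """Hypothetical stand-in: call your judge LLM's chat endpoint here."""
    raise NotImplementedError

def judge_answer(question: str, reference: str, answer: str) -> int:
    """Score another model's answer from 1 to 5 using a 'judge' LLM."""
    reply = query_judge_model(
        JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    )
    match = re.search(r"[1-5]", reply)  # pull the first 1-5 digit out of the reply
    if match is None:
        raise ValueError(f"Judge gave no usable score: {reply!r}")
    return int(match.group())
```

Production rubrics usually also ask the judge for a short justification alongside the rating, which costs a slightly more involved parser but makes the scores auditable.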
