Commit ee8929a

author
gitName
committed
AB#1056267: Introduction to data for machine learning
1 parent 5b6e820 commit ee8929a

18 files changed: +419 -278 lines changed
Lines changed: 14 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -1,14 +1,14 @@
1-
### YamlMime:ModuleUnit
2-
uid: learn.machinelearning.introduction-to-data-for-machine-learning.introduction
3-
title: Introduction
4-
metadata:
5-
title: Introduction
6-
description: Introduction to data for machine learning module.
7-
ms.date: 10/10/2024
8-
author: fbsolo-ms1
9-
ms.author: franksolomon
10-
ms.reviewer: franksolomon
11-
ms.topic: unit
12-
durationInMinutes: 2
13-
content: |
14-
[!include[](includes/1-introduction.md)]
1+
### YamlMime:ModuleUnit
2+
uid: learn.machinelearning.introduction-to-data-for-machine-learning.introduction
3+
title: Introduction
4+
metadata:
5+
title: Introduction
6+
description: Introduction to data for machine learning module.
7+
ms.date: 05/21/2025
8+
author: fbsolo-ms1
9+
ms.author: franksolomon
10+
ms.reviewer: franksolomon
11+
ms.topic: unit
12+
durationInMinutes: 2
13+
content: |
14+
[!include[](includes/1-introduction.md)]
Lines changed: 15 additions & 15 deletions
@@ -1,15 +1,15 @@
1-
### YamlMime:ModuleUnit
2-
uid: learn.machinelearning.introduction-to-data-for-machine-learning.detect-correct-data
3-
title: Good, bad, and missing data
4-
metadata:
5-
title: Good, bad, and missing data
6-
description: Conceptual unit introducing types of data in machine learning
7-
ms.date: 10/10/2024
8-
author: fbsolo-ms1
9-
ms.author: franksolomon
10-
ms.reviewer: franksolomon
11-
ms.topic: unit
12-
durationInMinutes: 3
13-
content: |
14-
[!include[](includes/2-detect-correct-data.md)]
15-
1+
### YamlMime:ModuleUnit
2+
uid: learn.machinelearning.introduction-to-data-for-machine-learning.detect-correct-data
3+
title: Good, bad, and missing data
4+
metadata:
5+
title: Good, bad, and missing data
6+
description: Conceptual unit introducing types of data in machine learning
7+
ms.date: 05/21/2025
8+
author: fbsolo-ms1
9+
ms.author: franksolomon
10+
ms.reviewer: franksolomon
11+
ms.topic: unit
12+
durationInMinutes: 3
13+
content: |
14+
[!include[](includes/2-detect-correct-data.md)]
15+
Lines changed: 14 additions & 14 deletions
@@ -1,14 +1,14 @@
1-
### YamlMime:ModuleUnit
2-
uid: learn.machinelearning.introduction-to-data-for-machine-learning.exercise-detect-visualize-missing-data
3-
title: Exercise - Visualize missing data
4-
metadata:
5-
title: Exercise - Visualize missing data
6-
description: Learn how to detect and visualize missing data.
7-
ms.date: 10/10/2024
8-
author: fbsolo-ms1
9-
ms.author: franksolomon
10-
ms.topic: unit
11-
durationInMinutes: 8
12-
sandbox: true
13-
notebook: notebooks/3-3-exercise-detect-visualize-missing-data.ipynb
14-
1+
### YamlMime:ModuleUnit
2+
uid: learn.machinelearning.introduction-to-data-for-machine-learning.exercise-detect-visualize-missing-data
3+
title: Exercise - Visualize missing data
4+
metadata:
5+
title: Exercise - Visualize missing data
6+
description: Learn how to detect and visualize missing data.
7+
ms.date: 05/21/2025
8+
author: fbsolo-ms1
9+
ms.author: franksolomon
10+
ms.topic: unit
11+
durationInMinutes: 8
12+
sandbox: true
13+
notebook: notebooks/3-3-exercise-detect-visualize-missing-data.ipynb
14+
Lines changed: 15 additions & 15 deletions
@@ -1,15 +1,15 @@
1-
### YamlMime:ModuleUnit
2-
uid: learn.machinelearning.introduction-to-data-for-machine-learning.examine-data-types
3-
title: Examine different types of data
4-
metadata:
5-
title: Examine different types of data
6-
description: Conceptual unit about examining different types of data in machine learning
7-
ms.date: 10/10/2024
8-
author: fbsolo-ms1
9-
ms.author: franksolomon
10-
ms.topic: unit
11-
ms.reviewer: franksolomon
12-
durationInMinutes: 4
13-
content: |
14-
[!include[](includes/4-examine-data-types.md)]
15-
1+
### YamlMime:ModuleUnit
2+
uid: learn.machinelearning.introduction-to-data-for-machine-learning.examine-data-types
3+
title: Examine different types of data
4+
metadata:
5+
title: Examine different types of data
6+
description: Conceptual unit about examining different types of data in machine learning
7+
ms.date: 05/21/2025
8+
author: fbsolo-ms1
9+
ms.author: franksolomon
10+
ms.topic: unit
11+
ms.reviewer: franksolomon
12+
durationInMinutes: 4
13+
content: |
14+
[!include[](includes/4-examine-data-types.md)]
15+
Lines changed: 15 additions & 15 deletions
@@ -1,15 +1,15 @@
1-
### YamlMime:ModuleUnit
2-
uid: learn.machinelearning.introduction-to-data-for-machine-learning.exercise-normalize-data-predict-missing-values
3-
title: Exercise - Work with data to predict missing values
4-
metadata:
5-
title: Exercise - Work with data to predict missing values
6-
description: Exercise unit about predicting missing values in machine learning
7-
ms.date: 10/10/2024
8-
author: fbsolo-ms1
9-
ms.author: franksolomon
10-
ms.topic: unit
11-
ms.reviewer: franksolomon
12-
durationInMinutes: 8
13-
sandbox: true
14-
notebook: notebooks/3-5-exercise-normalize-data-predict-missing-values.ipynb
15-
1+
### YamlMime:ModuleUnit
2+
uid: learn.machinelearning.introduction-to-data-for-machine-learning.exercise-normalize-data-predict-missing-values
3+
title: Exercise - Work with data to predict missing values
4+
metadata:
5+
title: Exercise - Work with data to predict missing values
6+
description: Exercise unit about predicting missing values in machine learning
7+
ms.date: 05/21/2025
8+
author: fbsolo-ms1
9+
ms.author: franksolomon
10+
ms.topic: unit
11+
ms.reviewer: franksolomon
12+
durationInMinutes: 8
13+
sandbox: true
14+
notebook: notebooks/3-5-exercise-normalize-data-predict-missing-values.ipynb
15+
Lines changed: 14 additions & 14 deletions
@@ -1,14 +1,14 @@
1-
### YamlMime:ModuleUnit
2-
uid: learn.machinelearning.introduction-to-data-for-machine-learning.evaluate-image-language-data
3-
title: One-hot vectors
4-
metadata:
5-
title: One-hot vectors
6-
description: Conceptual unit about one-hot vectors in machine learning
7-
ms.date: 10/10/2024
8-
author: fbsolo-ms1
9-
ms.author: franksolomon
10-
ms.topic: unit
11-
ms.reviewer: franksolomon
12-
durationInMinutes: 5
13-
content: |
14-
[!include[](includes/6-evaluate-image-language-data.md)]
1+
### YamlMime:ModuleUnit
2+
uid: learn.machinelearning.introduction-to-data-for-machine-learning.evaluate-image-language-data
3+
title: One-hot vectors
4+
metadata:
5+
title: One-hot vectors
6+
description: Conceptual unit about one-hot vectors in machine learning
7+
ms.date: 05/21/2025
8+
author: fbsolo-ms1
9+
ms.author: franksolomon
10+
ms.topic: unit
11+
ms.reviewer: franksolomon
12+
durationInMinutes: 5
13+
content: |
14+
[!include[](includes/6-evaluate-image-language-data.md)]
Lines changed: 15 additions & 15 deletions
@@ -1,15 +1,15 @@
1-
### YamlMime:ModuleUnit
2-
uid: learn.machinelearning.introduction-to-data-for-machine-learning.exercise-one-hot-vectors
3-
title: Exercise - Predict unknown values using one-hot vectors
4-
metadata:
5-
title: Exercise - Predict unknown values using one-hot vectors
6-
description: Exercise unit using one-hot vectors to predict unknown values in machine learning
7-
ms.date: 10/10/2024
8-
author: fbsolo-ms1
9-
ms.author: franksolomon
10-
ms.topic: unit
11-
ms.reviewer: franksolomon
12-
durationInMinutes: 10
13-
notebook: notebooks/3-7-exercise-one-hot-vectors.ipynb
14-
sandbox: true
15-
1+
### YamlMime:ModuleUnit
2+
uid: learn.machinelearning.introduction-to-data-for-machine-learning.exercise-one-hot-vectors
3+
title: Exercise - Predict unknown values using one-hot vectors
4+
metadata:
5+
title: Exercise - Predict unknown values using one-hot vectors
6+
description: Exercise unit using one-hot vectors to predict unknown values in machine learning
7+
ms.date: 05/21/2025
8+
author: fbsolo-ms1
9+
ms.author: franksolomon
10+
ms.topic: unit
11+
ms.reviewer: franksolomon
12+
durationInMinutes: 10
13+
notebook: notebooks/3-7-exercise-one-hot-vectors.ipynb
14+
sandbox: true
15+
Lines changed: 60 additions & 60 deletions
@@ -1,60 +1,60 @@
1-
### YamlMime:ModuleUnit
2-
uid: learn.machinelearning.introduction-to-data-for-machine-learning.knowledge-check
3-
title: Module assessment
4-
metadata:
5-
title: Module assessment
6-
description: Multiple-choice questions
7-
ms.date: 10/10/2024
8-
author: fbsolo-ms1
9-
ms.author: franksolomon
10-
ms.reviewer: franksolomon
11-
ms.topic: unit
12-
durationInMinutes: 3
13-
quiz:
14-
title: Check your knowledge
15-
questions:
16-
- content: 'Why do we clean our data before training?'
17-
choices:
18-
- content: "Removing rows of data makes our model more powerful."
19-
isCorrect: false
20-
explanation: "Incorrect. While removing bad data makes models perform better, only removing data doesn't make models more powerful."
21-
- content: "Cleaning data helps us select features that help the performance of the model."
22-
isCorrect: false
23-
explanation: "Incorrect. Cleaning data might help us select features, but cleaning data is used to fix problems with the data."
24-
- content: "Removing rows that have errors prevents these rows from misleading the training process."
25-
isCorrect: true
26-
explanation: "Correct. Cleaning data helps prevent errors from incomplete or error-prone data points."
27-
- content: 'What kind of data are best encoded with one-hot vectors?'
28-
choices:
29-
- content: "Ordinal data"
30-
isCorrect: false
31-
explanation: "Incorrect. One-hot vectors are best used in other areas where we have clear classes."
32-
- content: "Categorical data with two possible values"
33-
isCorrect: false
34-
explanation: "Incorrect. This kind of data can be encoded in a single column as a 0 and a 1."
35-
- content: "Categorical data with three or more values"
36-
isCorrect: true
37-
explanation: "Correct. One-hot vectors are best used with multiple classes or categories so that models can better interpret them."
38-
- content: 'What is a data sample? What is a population?'
39-
choices:
40-
- content: "A sample is all possible data we care about. A population is the subset of that data which we actually have on hand."
41-
isCorrect: false
42-
explanation: "Incorrect. A sample is a portion, or subset, of the data we care about. A population is all the available data."
43-
- content: "Both population and sample refer to data we use to train our model."
44-
isCorrect: false
45-
explanation: "Incorrect. Although we can train models with population and sample data, they mean different things."
46-
- content: "A population is all possible data we care about. A sample is the subset of that data which we actually have on hand."
47-
isCorrect: true
48-
explanation: "Correct. A population is all the possible data we could collect for a data set, and a sample is a portion of the data which we already have."
49-
- content: "You have a model that doesn't perform well. Which of these options definitely do **not** help improve its performance?"
50-
choices:
51-
- content: "Adding more samples (rows)."
52-
isCorrect: false
53-
explanation: "Incorrect. Adding rows of data likely helps your dataset become more representative, and so helps your model train."
54-
- content: "Adding a few features (columns) that you know relate to what the model is trying to predict."
55-
isCorrect: false
56-
explanation: "Incorrect. So long as you have enough rows of data, adding relevant features is likely to help your model train."
57-
- content: "Adding a large number of features that you know have no relation to what the model is trying to predict."
58-
isCorrect: true
59-
explanation: "Correct. Adding more features that aren't relevant probably harms its performance"
60-
1+
### YamlMime:ModuleUnit
2+
uid: learn.machinelearning.introduction-to-data-for-machine-learning.knowledge-check
3+
title: Module assessment
4+
metadata:
5+
title: Module assessment
6+
description: Multiple-choice questions
7+
ms.date: 05/21/2025
8+
author: fbsolo-ms1
9+
ms.author: franksolomon
10+
ms.reviewer: franksolomon
11+
ms.topic: unit
12+
durationInMinutes: 3
13+
quiz:
14+
title: Check your knowledge
15+
questions:
16+
- content: 'Why do we clean our data before training?'
17+
choices:
18+
- content: "Removing rows of data makes our model more powerful."
19+
isCorrect: false
20+
explanation: "Incorrect. While removing bad data makes models perform better, only removing data doesn't make models more powerful."
21+
- content: "Cleaning data helps us select features that help the performance of the model."
22+
isCorrect: false
23+
explanation: "Incorrect. Cleaning data might help us select features, but cleaning data is used to fix problems with the data."
24+
- content: "Removing rows that have errors prevents these rows from misleading the training process."
25+
isCorrect: true
26+
explanation: "Correct. Cleaning data helps prevent errors from incomplete or error-prone data points."
27+
- content: 'What kind of data are best encoded with one-hot vectors?'
28+
choices:
29+
- content: "Ordinal data"
30+
isCorrect: false
31+
explanation: "Incorrect. One-hot vectors are best used in other areas where we have clear classes."
32+
- content: "Categorical data with two possible values"
33+
isCorrect: false
34+
explanation: "Incorrect. This kind of data can be encoded in a single column as a 0 and a 1."
35+
- content: "Categorical data with three or more values"
36+
isCorrect: true
37+
explanation: "Correct. One-hot vectors are best used with multiple classes or categories so that models can better interpret them."
38+
- content: 'What is a data sample? What is a population?'
39+
choices:
40+
- content: "A sample is all possible data we care about. A population is the subset of that data which we actually have on hand."
41+
isCorrect: false
42+
explanation: "Incorrect. A sample is a portion, or subset, of the data we care about. A population is all the available data."
43+
- content: "Both population and sample refer to data we use to train our model."
44+
isCorrect: false
45+
explanation: "Incorrect. Although we can train models with population and sample data, they mean different things."
46+
- content: "A population is all possible data we care about. A sample is the subset of that data which we actually have on hand."
47+
isCorrect: true
48+
explanation: "Correct. A population is all the possible data we could collect for a data set, and a sample is a portion of the data which we already have."
49+
- content: "You have a model that doesn't perform well. Which of these options definitely do **not** help improve its performance?"
50+
choices:
51+
- content: "Adding more samples (rows)."
52+
isCorrect: false
53+
explanation: "Incorrect. Adding rows of data likely helps your dataset become more representative, and so helps your model train."
54+
- content: "Adding a few features (columns) that you know relate to what the model is trying to predict."
55+
isCorrect: false
56+
explanation: "Incorrect. So long as you have enough rows of data, adding relevant features is likely to help your model train."
57+
- content: "Adding a large number of features that you know have no relation to what the model is trying to predict."
58+
isCorrect: true
59+
explanation: "Correct. Adding more features that aren't relevant probably harms its performance"
60+
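Reviewer note on the one-hot question in the quiz above: the explanations say that a two-valued category fits in a single 0/1 column, while three or more values are better served by one-hot vectors. The sketch below illustrates that distinction under stated assumptions. The helper names `one_hot_encode` and `binary_encode` and the sample port labels are hypothetical, chosen only for illustration; they are not taken from the module's notebooks.

```python
# Hypothetical helpers for illustration only; not part of the module's notebooks.

def one_hot_encode(values):
    """Return (categories, rows), where each row is a one-hot vector."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)   # one column per category
        row[index[value]] = 1         # exactly one column is set to 1
        rows.append(row)
    return categories, rows


def binary_encode(values, positive):
    """Encode a two-valued category as a single 0/1 column."""
    return [1 if value == positive else 0 for value in values]


# Made-up sample labels with three categories (assumed for this example).
ports = ["Southampton", "Cherbourg", "Queenstown", "Southampton"]
categories, encoded = one_hot_encode(ports)
print(categories)
for row in encoded:
    print(row)
```

Sorting the categories keeps the column order stable between runs, which matters if the same encoding has to be applied to new rows later.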
Lines changed: 14 additions & 14 deletions
@@ -1,14 +1,14 @@
1-
### YamlMime:ModuleUnit
2-
uid: learn.machinelearning.introduction-to-data-for-machine-learning.summary
3-
title: Summary
4-
metadata:
5-
title: Summary
6-
description: An overview of the content covered in the module.
7-
ms.date: 10/10/2024
8-
author: fbsolo-ms1
9-
ms.author: franksolomon
10-
ms.reviewer: franksolomon
11-
ms.topic: unit
12-
durationInMinutes: 2
13-
content: |
14-
[!include[](includes/9-summary.md)]
1+
### YamlMime:ModuleUnit
2+
uid: learn.machinelearning.introduction-to-data-for-machine-learning.summary
3+
title: Summary
4+
metadata:
5+
title: Summary
6+
description: An overview of the content covered in the module.
7+
ms.date: 05/21/2025
8+
author: fbsolo-ms1
9+
ms.author: franksolomon
10+
ms.reviewer: franksolomon
11+
ms.topic: unit
12+
durationInMinutes: 2
13+
content: |
14+
[!include[](includes/9-summary.md)]

learn-pr/azure/introduction-to-data-for-machine-learning/includes/1-introduction.md

Lines changed: 3 additions & 3 deletions
@@ -1,14 +1,14 @@
1 1
Machine learning gets its predictive power from the data that shapes it. To build effective models, you must understand the data you use.
2 2

3-
Here, we explore how both humans and computers categorize, store, and interpret data. We examine what makes a good dataset, and how to fix issues in our available data. We also practice exploration of new data, and we see how deep thinking about a dataset can help us build better predictive models.
3+
Here, we explore how both humans and computers categorize, store, and interpret data. We examine what makes a good dataset, and how to fix issues in our available data. We also practice exploring new data, and we see how deep thinking about a dataset can help us build better predictive models.
4 4

5 5
## Scenario: the last voyage of the Titanic
6 6

7-
As an eager marine archaeologist, you have an unusually keen interest in maritime disasters. Late one night, while clicking between images of whale bones and ancient scrolls about Atlantis, you find a public dataset that lists known passengers and crew of the first, and last, voyage of the Titanic. Drawn in by the balance between fate and chance, you wonder, what factors determined the survival of a Titanic passenger? Data from this period are somewhat incomplete. Much information for certain passengers is unknown. You must find ways to patch up this data before you can fully analyze it.
7+
As an eager marine archaeologist, you have an unusually keen interest in maritime disasters. Late one night, while clicking between images of whale bones and ancient scrolls about Atlantis, you find a public dataset that lists known passengers and crew of the first (and last) voyage of the Titanic. Drawn in by the balance between fate and chance, you wonder, what factors determined the survival of a Titanic passenger? Data from this period are somewhat incomplete. Information for certain passengers is unknown. You must find ways to patch up this data before you can fully analyze it.
8 8

9 9
## Prerequisites
10 10

11-
- Some familiarity with machine learning concepts (such as models and cost) helps, but it's not required.
11+
- Some familiarity with machine-learning concepts (such as models and cost) helps, but it's not required.
12 12

13 13
## Learning objectives
14 14
