-- content: 'Why do we clean our data before training?'
-choices:
-- content: "Removing rows of data makes our model more powerful."
-isCorrect: false
-explanation: "Incorrect. While removing bad data makes models perform better, only removing data doesn't make models more powerful."
-- content: "Cleaning data helps us select features that help the performance of the model."
-isCorrect: false
-explanation: "Incorrect. Cleaning data might help us select features, but cleaning data is used to fix problems with the data."
-- content: "Removing rows that have errors prevents these rows from misleading the training process."
-isCorrect: true
-explanation: "Correct. Cleaning data helps prevent errors from incomplete or error-prone data points."
-- content: 'What kind of data are best encoded with one-hot vectors?'
-choices:
-- content: "Ordinal data"
-isCorrect: false
-explanation: "Incorrect. One-hot vectors are best used in other areas where we have clear classes."
-- content: "Categorical data with two possible values"
-isCorrect: false
-explanation: "Incorrect. This kind of data can be encoded in a single column as a 0 and a 1."
-- content: "Categorical data with three or more values"
-isCorrect: true
-explanation: "Correct. One-hot vectors are best used with multiple classes or categories so that models can better interpret them."
-- content: 'What is a data sample? What is a population?'
-choices:
-- content: "A sample is all possible data we care about. A population is the subset of that data which we actually have on hand."
-isCorrect: false
-explanation: "Incorrect. A sample is a portion, or subset, of the data we care about. A population is all the available data."
-- content: "Both population and sample refer to data we use to train our model."
-isCorrect: false
-explanation: "Incorrect. Although we can train models with population and sample data, they mean different things."
-- content: "A population is all possible data we care about. A sample is the subset of that data which we actually have on hand."
-isCorrect: true
-explanation: "Correct. A population is all the possible data we could collect for a data set, and a sample is a portion of the data which we already have."
-- content: "You have a model that doesn't perform well. Which of these options definitely do **not** help improve its performance?"
-choices:
-- content: "Adding more samples (rows)."
-isCorrect: false
-explanation: "Incorrect. Adding rows of data likely helps your dataset become more representative, and so helps your model train."
-- content: "Adding a few features (columns) that you know relate to what the model is trying to predict."
-isCorrect: false
-explanation: "Incorrect. So long as you have enough rows of data, adding relevant features is likely to help your model train."
-- content: "Adding a large number of features that you know have no relation to what the model is trying to predict."
-isCorrect: true
-explanation: "Correct. Adding more features that aren't relevant probably harms its performance"
+- content: 'Why do we clean our data before training?'
+choices:
+- content: "Removing rows of data makes our model more powerful."
+isCorrect: false
+explanation: "Incorrect. While removing bad data makes models perform better, only removing data doesn't make models more powerful."
+- content: "Cleaning data helps us select features that help the performance of the model."
+isCorrect: false
+explanation: "Incorrect. Cleaning data might help us select features, but cleaning data is used to fix problems with the data."
+- content: "Removing rows that have errors prevents these rows from misleading the training process."
+isCorrect: true
+explanation: "Correct. Cleaning data helps prevent errors from incomplete or error-prone data points."
+- content: 'What kind of data are best encoded with one-hot vectors?'
+choices:
+- content: "Ordinal data"
+isCorrect: false
+explanation: "Incorrect. One-hot vectors are best used in other areas where we have clear classes."
+- content: "Categorical data with two possible values"
+isCorrect: false
+explanation: "Incorrect. This kind of data can be encoded in a single column as a 0 and a 1."
+- content: "Categorical data with three or more values"
+isCorrect: true
+explanation: "Correct. One-hot vectors are best used with multiple classes or categories so that models can better interpret them."
+- content: 'What is a data sample? What is a population?'
+choices:
+- content: "A sample is all possible data we care about. A population is the subset of that data which we actually have on hand."
+isCorrect: false
+explanation: "Incorrect. A sample is a portion, or subset, of the data we care about. A population is all the available data."
+- content: "Both population and sample refer to data we use to train our model."
+isCorrect: false
+explanation: "Incorrect. Although we can train models with population and sample data, they mean different things."
+- content: "A population is all possible data we care about. A sample is the subset of that data which we actually have on hand."
+isCorrect: true
+explanation: "Correct. A population is all the possible data we could collect for a data set, and a sample is a portion of the data which we already have."
+- content: "You have a model that doesn't perform well. Which of these options definitely do **not** help improve its performance?"
+choices:
+- content: "Adding more samples (rows)."
+isCorrect: false
+explanation: "Incorrect. Adding rows of data likely helps your dataset become more representative, and so helps your model train."
+- content: "Adding a few features (columns) that you know relate to what the model is trying to predict."
+isCorrect: false
+explanation: "Incorrect. So long as you have enough rows of data, adding relevant features is likely to help your model train."
+- content: "Adding a large number of features that you know have no relation to what the model is trying to predict."
+isCorrect: true
+explanation: "Correct. Adding more features that aren't relevant probably harms its performance"
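
Reviewer note: the one-hot question in the quiz above can be illustrated with a minimal Python sketch. It is not part of the changed file. The helper names `one_hot` and `binary_column` and the sample values are hypothetical, chosen only to show why a category with three or more values gets one column per value, while a two-valued category fits in a single 0/1 column.

```python
def one_hot(values):
    """Encode each value as a vector with a single 1 (hypothetical helper).

    Returns the sorted category names and one vector per input value.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is "hot"
        rows.append(row)
    return categories, rows


def binary_column(values, positive):
    """Encode a two-valued category as a single 0/1 column (hypothetical helper)."""
    return [1 if value == positive else 0 for value in values]


# Illustrative values only: a category with three possible values.
ports = ["Southampton", "Cherbourg", "Queenstown", "Southampton"]
categories, encoded = one_hot(ports)

print(categories)  # ['Cherbourg', 'Queenstown', 'Southampton']
for port, vector in zip(ports, encoded):
    print(port, vector)

# A two-valued category needs only one column.
print(binary_column(["yes", "no", "yes"], positive="yes"))  # [1, 0, 1]
```

One-hot vectors avoid implying an order between the categories, which is why the quiz treats ordinal data as a poor fit for this encoding.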
learn-pr/azure/introduction-to-data-for-machine-learning/includes/1-introduction.md
Lines changed: 3 additions & 3 deletions
@@ -1,14 +1,14 @@
 Machine learning gets its predictive power from the data that shapes it. To build effective models, you must understand the data you use.
 
-Here, we explore how both humans and computers categorize, store, and interpret data. We examine what makes a good dataset, and how to fix issues in our available data. We also practice exploration of new data, and we see how deep thinking about a dataset can help us build better predictive models.
+Here, we explore how both humans and computers categorize, store, and interpret data. We examine what makes a good dataset, and how to fix issues in our available data. We also practice exploring new data, and we see how deep thinking about a dataset can help us build better predictive models.
 
 ## Scenario: the last voyage of the Titanic
 
-As an eager marine archaeologist, you have an unusually keen interest in maritime disasters. Late one night, while clicking between images of whale bones and ancient scrolls about Atlantis, you find a public dataset that lists known passengers and crew of the first, and last, voyage of the Titanic. Drawn in by the balance between fate and chance, you wonder, what factors determined the survival of a Titanic passenger? Data from this period are somewhat incomplete. Much information for certain passengers is unknown. You must find ways to patch up this data before you can fully analyze it.
+As an eager marine archaeologist, you have an unusually keen interest in maritime disasters. Late one night, while clicking between images of whale bones and ancient scrolls about Atlantis, you find a public dataset that lists known passengers and crew of the first (and last) voyage of the Titanic. Drawn in by the balance between fate and chance, you wonder, what factors determined the survival of a Titanic passenger? Data from this period are somewhat incomplete. Information for certain passengers is unknown. You must find ways to patch up this data before you can fully analyze it.
 
 ## Prerequisites
 
-- Some familiarity with machinelearning concepts (such as models and cost) helps, but it's not required.
+- Some familiarity with machine-learning concepts (such as models and cost) helps, but it's not required.
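
Reviewer note: the scenario text in this file mentions patching up incomplete passenger data before analysis. A minimal sketch of two common approaches, dropping incomplete rows or filling a gap with the median of the known values, might look like the following. The records, field names, and helper names are invented for illustration; they are not taken from the actual dataset or from the module's exercises.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
passengers = [
    {"name": "A", "age": 22.0, "fare": 7.25},
    {"name": "B", "age": None, "fare": 71.28},
    {"name": "C", "age": 38.0, "fare": None},
    {"name": "D", "age": 30.0, "fare": 8.05},
]


def drop_incomplete(rows, required):
    """Keep only rows where every required field has a value."""
    return [row for row in rows
            if all(row.get(key) is not None for key in required)]


def fill_with_median(rows, key):
    """Return copies of the rows with missing `key` values set to the median."""
    known = [row[key] for row in rows if row[key] is not None]
    fill = median(known)
    return [{**row, key: fill if row[key] is None else row[key]}
            for row in rows]


complete = drop_incomplete(passengers, required=["age", "fare"])
print([row["name"] for row in complete])  # ['A', 'D']

patched = fill_with_median(passengers, "age")
print(patched[1]["age"])  # 30.0
```

Dropping rows is simple but discards information; filling keeps the row at the cost of introducing an estimated value, so which approach suits a given column is a judgment call.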