`learn-pr/azure/machine-learning-architectures-and-hyperparameters/includes/1-introduction.md`
Not all models are simple mathematical equations that can be plotted as a line.
## Scenario: Predicting sports results using machine learning
Throughout this module, we'll refer to the following example scenario as we explain concepts surrounding model architecture and hyperparameters. The scenario is designed to appear complex at first, but as the exercises progress, we'll learn how to tackle it using a little critical thinking and experimentation.
The Games' motto consists of three Latin words: Citius - Altius - Fortius. These words mean Faster - Higher - Stronger. Since this motto was established, the variety of games has grown enormously to include shooting, sailing, and team sports. We'd like to explore the role that basic physical features still play in predicting who wins a medal at one of the most prestigious sporting events on the planet. To this end, we'll explore rhythmic gymnastics: a modern addition to the games that combines dance, gymnastics, and calisthenics. One might expect that basic characteristics of age, height, and weight play only a limited role, given the need for agility, flexibility, dexterity, and coordination. Let's use some more advanced machine learning models to see how critical these basic factors really are.
## Prerequisites
* Familiarity with machine learning models
## Learning objectives
In this module, you will:
* Discover new model types: decision trees and random forests.
* Learn how model architecture can affect performance.
* Practice working with hyperparameters to improve training effectiveness.
`learn-pr/azure/machine-learning-architectures-and-hyperparameters/includes/2-decision-trees.md`
When we talk about architecture, we often think of buildings. Architecture is responsible for how a building is structured: its height, its depth, the number of floors, and how things are connected internally. This architecture also dictates how we use a building: where we enter it and what we can "get out of it," practically speaking.
In machine learning, architecture refers to a similar concept. How many parameters does a model have, and how are they linked together to achieve a calculation? Do we calculate a lot in parallel (width), or do we have serial operations that rely on a previous calculation (depth)? How can we provide inputs to this model, and how can we receive outputs? These architectural decisions typically apply only to more complex models, and they can range from simple to complex. They're usually made before the model is trained, though in some circumstances there's room to make changes after training.
Let’s explore this more concretely with decision trees as an example.
## What's a decision tree?
In essence, a decision tree is a flow chart. Decision trees are categorization models that break decisions down into multiple steps.

The sample is provided at the entry point (the top, in the diagram above), and each exit point has a label (the bottom of the diagram). At each node, a simple "if" statement decides which branch the sample passes to next. Once a branch reaches the end of the tree (the leaves), the sample is assigned a label.
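
To make the flow-chart analogy concrete, here's a minimal sketch in Python of a tree with one entry point, two decision nodes, and three leaves. The feature names and thresholds are hypothetical, chosen only to illustrate the shape; they aren't taken from our scenario's data.

```python
# A decision tree is a flow chart: each node is an "if" statement, and each
# leaf (exit point) carries a label. Feature names and thresholds here are
# hypothetical, chosen only to illustrate the shape of the model.
def classify(sample: dict) -> str:
    if sample["height"] > 160:       # node: first decision point
        if sample["age"] > 20:       # node: second decision point
            return "won a medal"     # leaf: exit point with a label
        return "no medal"            # leaf
    return "no medal"                # leaf

# The sample enters at the top and exits at a labeled leaf.
print(classify({"height": 168, "age": 24}))  # won a medal
```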
### How are decision trees trained?
Decision trees are trained one node, or decision point, at a time. At the first node, the entire training set is assessed. From there, a feature is selected that best separates the set into two subsets with more homogeneous labels. For example, imagine our training set was as follows:
| Weight (Feature) | Age (Feature) | Won a medal (Label) |
|---|---|---|
| … | … | … |
| 85 | 26 | Yes |
| 90 | 25 | Yes |
If we're doing our best to find a rule to split this data, we might split by age at around 24 years old, because most medal winners were over 24. This split would give us two subsets of data.
**Subset 1**

| Weight (Feature) | Age (Feature) | Won a medal (Label) |
|---|---|---|
| … | … | … |

**Subset 2**

| Weight (Feature) | Age (Feature) | Won a medal (Label) |
|---|---|---|
| … | … | … |
| 85 | 26 | Yes |
| 90 | 25 | Yes |
If we stop here, we have a simple model with one node and two leaves. Leaf 1 contains non-medal winners, and is 75% accurate on our training set. Leaf 2 contains medal winners, and is also 75% accurate on the training set.
We don’t need to stop here, though. We can continue this process by splitting the leaves further.
In subset 1, the first new node could split by weight, because the only medal winner weighed less than the people who didn't win a medal. The rule might be set to "weight < 65": people with a weight under 65 are predicted to have won a medal, while anyone with a weight of 65 or more doesn't meet this criterion and might be predicted not to win one.
In subset 2, the second new node might also split by weight, but this time predicts that anyone with a weight over 70 would have won a medal, while those under it wouldn't.
This would provide us with a tree that could achieve 100% accuracy on the training set.
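
To see that architecture in code form, here's a hypothetical Python sketch of the tree we just built by hand, using the splits described above (age at 24, then weight at 65 and 70):

```python
# The hand-trained tree from this walkthrough, as nested "if" statements.
def predict_medal(weight: float, age: float) -> bool:
    if age <= 24:
        # Subset 1: predicted to win only with a weight under 65
        return weight < 65
    # Subset 2: predicted to win only with a weight over 70
    return weight > 70

print(predict_medal(weight=90, age=25))  # True, matching the training set
print(predict_medal(weight=60, age=23))  # True: an overly specific rule, as discussed below
```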
### Strengths and weaknesses of decision trees
Decision trees are considered to have low bias. This means that they're usually good at identifying features that are important in order to label something correctly.
The major weakness of decision trees is overfitting. Consider the example given previously: the model gives an exact way to calculate who is likely to win a medal, and it predicts 100% of the training dataset correctly. This level of accuracy is unusual for machine learning models, which normally make numerous errors on the training dataset. Good training performance isn't a bad thing in itself, but the tree has become so specialized to the training set that it probably won't do well on the test set. This is because the tree has managed to learn relationships in the training set that probably aren't real, such as that having a weight of 60 kg guarantees a medal if you're under 25 years old.
## Model architecture affects overfitting
How we structure our decision tree is key to avoiding its weaknesses. The deeper the tree is, the more likely it is to overfit the training set. For example, in the simple tree above, if we limited the tree to only the first node, it would make errors on the training set, but probably do better on the test set. This is because it would have more general rules about who wins medals, such as "athletes over 24," rather than extremely specific rules that might only apply to the training set.
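
As a sketch of what this structural choice looks like in practice, here's how a depth limit might be set with scikit-learn (an assumed library choice; the data below is invented for illustration and isn't our real training set):

```python
# A sketch of limiting tree depth (scikit-learn assumed; data invented).
from sklearn.tree import DecisionTreeClassifier

# Invented example rows: columns are [weight, age]; labels are "won a medal".
X = [[65, 18], [72, 20], [70, 45], [60, 23], [85, 26], [90, 25]]
y = [False, False, False, True, True, True]

# With no depth limit, the tree keeps splitting until it fits the training
# set perfectly: the overfitting behavior described above.
deep_tree = DecisionTreeClassifier().fit(X, y)

# max_depth=1 restricts the tree to its first decision point, forcing a more
# general rule at the cost of some errors on the training set.
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)

print(deep_tree.score(X, y))  # typically 1.0 on the training data
print(stump.score(X, y))      # lower on training data, but less overfit
```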
Although we're focused on trees here, other complex models often have similar weaknesses that we can mitigate through decisions about how they're structured, or how they're allowed to be manipulated by training.
Experimentation with architectures is often a key focus of building effective modern models. We've done so at a basic level with decision trees, but the only limit to this is our imagination, and perhaps our computer's memory. In fact, thinking more broadly about decision trees resulted in a highly popular model architecture that reduces the decision tree's tendency to overfit data.
## What’s a random forest?
A random forest is a collection of decision trees that are used together to estimate which label a sample should be assigned. For example, if we were to train a random forest to predict medal winners, we might train 100 different decision trees. To make a prediction, we would use all trees independently. These would effectively "vote" for whether the athlete would win a medal, providing a final decision.
### How is a random forest trained?
Random forests are built on the idea that, while a single decision tree is highly biased, or overfit, several decision trees trained together will each be biased in different ways. This requires that each tree is trained independently, each on a slightly different training set.
To train a single decision tree, a certain number of samples (athletes, in our scenario) are extracted from the full training set. Each sample can be selected more than once, and this selection takes place randomly. The tree is then trained in the standard way. This process is repeated for each tree. As each tree gets a different combination of training examples, each tree ends up trained, and biased, differently from the others.
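
The following is a minimal sketch of that procedure, assuming Python with scikit-learn's single-tree class for the individual trees:

```python
# A sketch of random-forest training: every tree gets its own bootstrap
# sample (drawn with replacement), and the trained trees vote on new samples.
import random
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees=100):
    trees = []
    n = len(X)
    for _ in range(n_trees):
        # Draw n rows with replacement: some athletes appear more than once,
        # others not at all, so each tree ends up biased differently.
        idx = [random.randrange(n) for _ in range(n)]
        tree = DecisionTreeClassifier()
        tree.fit([X[i] for i in idx], [y[i] for i in idx])
        trees.append(tree)
    return trees

def predict(trees, sample):
    # Each tree "votes"; the majority decides whether a medal is predicted.
    votes = sum(bool(tree.predict([sample])[0]) for tree in trees)
    return votes > len(trees) / 2
```

A real implementation would use a library's ready-made random-forest class rather than this hand-rolled loop; the sketch only illustrates the bootstrap-and-vote mechanism.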
### Advantages of random forest
The performance of random forests is often impressive, so comparisons are often best made against neural networks, another popular, high-performance model type. Unlike neural networks, random-forest models are easy to train: modern frameworks provide helpful methods that let you do so in only a few lines of code. Random forests are also fast to train and don't need large datasets to perform well. This separates them from neural networks, which can take minutes or days to train, require substantial experience, and often need very large datasets. The architectural decisions for random forests, while more complex than those for models such as linear regression, are much simpler than those for neural networks.
### Disadvantages of random forest
The major disadvantage of random forests is that they're difficult to understand. Specifically, while these models are fully transparent (each tree can be inspected and understood), they often contain so many trees that doing so is virtually impossible.
## How can I customize these architectures?
Like many models, random forests have various architectural options. The easiest to consider is the size of the forest: how many trees are involved, along with the size of those trees. For example, we could request a forest of 100 trees to predict medal winners, each tree with a maximum depth of six nodes. This means that the final decision as to whether an athlete will win a medal must be made with no more than six "if" statements.
As we've already learned, increasing the size of a tree (in terms of depth or number of leaves) makes it more likely to overfit the data on which it's trained. This limitation also applies to random forests. However, with random forests we can counter this by increasing the number of trees, assuming that each tree will be biased in a different way. We can also restrict each tree to only a certain number of features, or disallow leaves from being created when they would make only a marginal difference to the training performance. The ability of a random forest to make good predictions isn't infinite: at some point, increasing the size and number of trees gives no further improvement because of the limited variety of training data we have.
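
In scikit-learn terms (again an assumption about tooling), the options described here map onto constructor arguments. For example, the hypothetical forest of 100 trees limited to six-deep decisions might be requested like this:

```python
# The architectural options above, expressed as scikit-learn arguments
# (an assumed library; the values are illustrative, not recommendations).
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,            # size of the forest: how many trees vote
    max_depth=6,                 # no more than six "if" statements per decision
    max_features=1,              # consider only a subset of features at each split
    min_impurity_decrease=0.01,  # skip splits that barely improve training performance
)
# forest.fit(X, y) would then train all 100 trees on bootstrap samples.
```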