We can assess our classification models in terms of the kinds of mistakes that they make, such as false negatives and false positives. This gives insight into the errors a model makes, but doesn't necessarily provide deep information on how the model could perform if slight adjustments were made to its decision criteria. Here, we'll discuss receiver operator characteristic (ROC) curves, which build on the idea of a confusion matrix but provide deeper information that lets us improve our models to a greater degree.
## Scenario
Throughout this module, we'll be using the following example scenario to explain and practice working with ROC curves.
Your avalanche-rescue charity has successfully built a machine learning model that can estimate whether an object detected by lightweight sensors is a hiker or a natural object, such as a tree or a rock. This lets you keep track of how many people are on the mountain, so you know whether a rescue team is needed when an avalanche strikes. The model does reasonably well, though you wonder if there's room for improvement. Internally, the model must make a binary decision as to whether an object is a hiker or not, but this is based on probabilities. Can this decision-making process be tweaked to improve its performance?
## Prerequisites
In this module, you will:
* Understand how to create ROC curves.
* Explore how to assess and compare models using these curves.
* Practice fine-tuning a model using characteristics plotted on ROC curves.
Classification models must assign a sample to a category. For example, a model must use features such as size, color, and motion to determine whether an object is a hiker or a tree.
We can improve classification models in many ways. For example, we can ensure our data are balanced, clean, and scaled. We can also alter our model architecture and use hyperparameters to squeeze as much performance as we possibly can out of our data and architecture. Eventually, we find no better way to improve performance on our test (or hold-out) set and declare our model ready.
Model tuning to this point can be complex, but we can use a final simple step to further improve how well our model works. To understand this, though, we need to go back to basics.
## Probabilities and categories
Many models have multiple decision-making stages, and the final one is often simply a binarization step. During binarization, probabilities are converted into a hard label. For example, let's say that the model is provided with features and calculates that there's a 75% chance that it was shown a hiker and a 25% chance it was shown a tree. An object can't be 75% hiker and 25% tree; it's one or the other! As such, the model applies a threshold, which is normally 50%. As the hiker class is larger than 50%, the object is declared to be a hiker.
The 50% threshold is logical; it means that the most likely label according to the model is always chosen. If the model is biased, however, this 50% threshold might not be appropriate. For example, if the model has a slight tendency to pick trees more than hikers, picking trees 10% more frequently than it should, we could adjust our decision threshold to account for this.
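As a rough sketch of the idea (not code from this module's exercises), the snippet below shows how predicted probabilities could be binarized with an adjustable threshold; the probability values, the `binarize` helper, and the thresholds are made up for illustration.

```python
# A minimal sketch of binarization: converting predicted probabilities into hard
# labels with an adjustable decision threshold. All values here are made up.

def binarize(probabilities, threshold=0.5):
    """Label a detection 'hiker' when P(hiker) meets the threshold, else 'tree'."""
    return ["hiker" if p >= threshold else "tree" for p in probabilities]

p_hiker = [0.75, 0.40, 0.55, 0.30]        # model's estimated P(hiker) for four objects

print(binarize(p_hiker))                  # 50% threshold: ['hiker', 'tree', 'hiker', 'tree']
print(binarize(p_hiker, threshold=0.40))  # lower threshold: more objects labeled 'hiker'
```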
## Refresher on decision matrices
Decision matrices are a great way to assess the kinds of mistakes a model is making. They give us the rates of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
We can calculate some handy characteristics from the confusion matrix. Two popular characteristics are:
* **True Positive Rate (sensitivity)**: how often "True" labels are correctly identified as "True." For example, how often the model predicts "hiker" when the sample it's shown is in fact a hiker.
* **False Positive Rate (false alarm rate)**: how often "False" labels are incorrectly identified as "True." For example, how often the model predicts "hiker" when it's shown a tree.
Looking at true positive and false positive rates can help us understand a model's performance.
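For illustration (this isn't from the module's exercise notebooks), here's a minimal sketch of how the two rates could be computed from confusion-matrix counts; the TP/FP/TN/FN numbers are invented.

```python
# A minimal sketch: deriving the true positive rate and false positive rate
# from confusion-matrix counts. The counts below are invented example values.

def rates(tp, fp, tn, fn):
    tpr = tp / (tp + fn)   # sensitivity: fraction of real hikers labeled 'hiker'
    fpr = fp / (fp + tn)   # false alarm rate: fraction of trees labeled 'hiker'
    return tpr, fpr

tpr, fpr = rates(tp=80, fp=10, tn=90, fn=20)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")   # TPR = 0.80, FPR = 0.10
```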
Consider our hiker example. Ideally, the true positive rate is very high and the false positive rate is very low, because this means that the model identifies hikers well and doesn't identify trees as hikers very often. Yet, if the true positive rate is very high but the false positive rate is also very high, then the model is biased; it's identifying almost everything it encounters as a hiker. Similarly, we don't want a model with a low true positive rate, because then when the model encounters a hiker, it'll label them as a tree.
## ROC curves
Receiver operator characteristic (ROC) curves are graphs on which we plot the true positive rate against the false positive rate.
ROC curves can be confusing for beginners for two main reasons. The first reason is that beginners know that a trained model only has one value for its true positive and false positive rates, so an ROC plot must look like this:

If you're also thinking this, you're right. A trained model only produces one point. However, remember that our models have a threshold (normally 50%) that's used to decide whether the true (hiker) or false (tree) label should be used. If we change this threshold to 30% and recalculate the true positive and false positive rates, we get another point:

If we do this for thresholds between 0% and 100%, we might get a graph like this:

The second reason these graphs can be confusing is the jargon involved.
## Good ROC, bad ROC
Understanding good and bad ROC curves is something best done in an interactive environment. When you're ready, jump into the next exercise to explore this topic.
Receiver operator characteristic (ROC) curves let us compare models to one another and tune our selected model. Let's discuss how and why these are done.
## Tuning a model
The most obvious use for an ROC curve is to choose a decision threshold that gives the best performance. Recall that our models provide us with probabilities, such as a 65% chance that the sample is a hiker. The decision threshold is the point above which a sample is assigned "true" (hiker) or below which it's assigned "false" (tree). If our decision threshold was 50%, then 65% would be assigned to "true" (hiker). If our decision threshold was 70%, however, a probability of 65% would be too small, and the sample would be assigned to "false" (tree).
We've seen in the previous exercise that when we construct an ROC curve, we're just changing the decision threshold and assessing how well the model works. When we do this, we can find the threshold that gives the optimal results.
Usually there isn't a single threshold that gives both the best true positive rate (TPR) and the lowest false positive rate (FPR). This means that the optimal threshold depends on what you're trying to achieve. For example, in our scenario, it's very important to have a high true positive rate, because if a hiker isn't identified and an avalanche occurs, the team won't know to rescue them. There's a trade-off, though: if the false positive rate is too high, then the rescue team may repeatedly be sent out to rescue people who simply don't exist. In other situations, the false positive rate is considered more important. For example, science has a low tolerance for false-positive results. If the false-positive rate of scientific experiments was higher, there would be an endless flurry of contradictory claims, and it would be impossible to make sense of what's real.
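One way to formalize this trade-off (a suggestion, not the module's prescribed method) is to pick, from an ROC sweep, the threshold with the highest TPR whose FPR stays under a rate you can tolerate. The (threshold, TPR, FPR) points and the cap below are invented for illustration.

```python
# A rough sketch: choose the decision threshold that maximizes the true positive
# rate while keeping the false positive rate under an acceptable cap.
# The (threshold, tpr, fpr) points and the cap are invented example values.
roc_points = [
    (0.9, 0.40, 0.02),
    (0.7, 0.70, 0.10),
    (0.5, 0.85, 0.20),
    (0.3, 0.95, 0.45),
    (0.1, 1.00, 0.80),
]

max_acceptable_fpr = 0.25   # how many false alarms the rescue team can tolerate
candidates = [(t, tpr, fpr) for t, tpr, fpr in roc_points if fpr <= max_acceptable_fpr]
best_threshold, best_tpr, best_fpr = max(candidates, key=lambda point: point[1])
print(f"Chosen threshold: {best_threshold} (TPR={best_tpr}, FPR={best_fpr})")
```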
## Comparing models with AUC
You can use ROC curves to compare models to each other, just like you can with cost functions. An ROC curve for a model shows how well it will work for a variety of decision thresholds. At the end of the day, what's most important in a model is how it will perform in the real world, where there's only one decision threshold. Why then would we want to compare models using thresholds we'll never use? There are two answers for this.
Firstly, comparing ROC curves in particular ways is like performing a statistical test that tells us not just that one model did better on this particular test set, but whether it's likely to continue to perform better in the future. This is beyond the scope of this learning material, but it's worth keeping in mind.
Secondly, the ROC curve shows, to some degree, how reliant the model is on having the perfect threshold. For example, if our model only works well with a decision threshold of 0.9, but performs terribly above or below this value, it's not a good design. We'd probably prefer to work with a model that works reasonably well for various thresholds, knowing that if the real-world data we come across is slightly different from our test set, our model's performance won't necessarily collapse.
### How to compare ROCs?
The easiest way to compare ROCs numerically is to use the area under the curve (AUC). Literally, this is the area of the graph below the curve. For example, our perfect model from the last exercise has an AUC of 1:

While our model that did no better than chance has an area of about 0.5:

We've covered receiver operator characteristic (ROC) curves in some depth. We learned that they graph how often we mistakenly assign a true label against how often we correctly assign a true label. Each point on the graph represents one threshold that was applied.
We learned how we can use ROC curves to tune our decision threshold in the final model. We also saw how the area under the curve (AUC) can give us an idea of how reliant our model is on having the perfect decision threshold. It's also a handy measure for comparing two models to one another.

Congratulations on getting this far! As always, now that you have a new technique under your belt, the best thing you can do for your learning is to practice using it on data you care about. By doing so, you'll gain experience and understand nuances that we haven't had time or space to cover here. Good luck!