---
title: Prevent overfitting and imbalanced data with Automated ML
titleSuffix: Azure Machine Learning
description: Identify and manage common pitfalls of machine learning models by using Automated ML solutions in Azure Machine Learning.
services: machine-learning
ms.service: machine-learning
ms.subservice: automl
ms.topic: concept-article
author: ssalgadodev
ms.author: ssalgado
ms.reviewer: manashg
ms.date: 07/11/2024

#customer intent: As a developer, I want to use Automated ML solutions in Azure Machine Learning, so I can find and address common issues like overfitting and imbalanced data.
---

# Prevent overfitting and imbalanced data with Automated ML

Overfitting and imbalanced data are common pitfalls when you build machine learning models. By default, the Automated ML feature in Azure Machine Learning provides charts and metrics to help you identify these risks. This article describes how you can implement best practices in Automated ML to help mitigate common issues.

## Identify overfitting

Overfitting in machine learning occurs when a model fits the training data too well. As a result, the model can't make accurate predictions on unseen test data. The model memorized specific patterns and noise in the training data, and it's not flexible enough to make predictions on real data.

Consider the following trained models and their corresponding train and test accuracies:

| Model | Train accuracy | Test accuracy |
| :---: | :---: | :---: |
| A | 99.9% | 95% |
| B | 87% | 87% |
| C | 99.9% | 45% |

- Model **A**: The test for this model produces slightly lower accuracy than the model training. There's a common misconception that if test accuracy on unseen data is lower than training accuracy, the model is overfitted. However, test accuracy should always be less than training accuracy. The distinction between overfitting and appropriately fitting data comes down to *how much* less accurate the model is.

- Model **A** versus model **B**: Model **A** is a better model because it has higher test accuracy. Although the test accuracy is slightly lower at 95%, it's not a significant difference that suggests overfitting is present. You wouldn't choose model **B** just because the train and test accuracies are closer together.

- Model **C**: This model represents a clear case of overfitting. The training accuracy is high, but the test accuracy isn't anywhere near as high. This distinction is subjective, but comes from knowledge of your problem and data, and what magnitudes of error are acceptable.

## Prevent overfitting

In the most egregious cases, an overfitted model assumes that the feature value combinations seen during training always result in the exact same output for the target. To avoid overfitting your data, the recommendation is to follow machine learning best practices. There are several methods you can configure in your model implementation. Automated ML also provides other options by default to help prevent overfitting.

The following table summarizes common best practices:

| Best practice | Implementation | Automated ML |
| --- | :---: | :---: |
| Use more training data, and eliminate statistical bias | X | |
| Prevent target leakage | X | |
| Incorporate fewer features | X | |
| Support regularization and hyperparameter optimization | | X |
| Apply model complexity limitations | | X |
| Use cross-validation | | X |

## Apply best practices to prevent overfitting

The following sections describe best practices you can use in your machine learning model implementation to prevent overfitting.

### Use more data

Using more data is the simplest and best possible way to prevent overfitting, and this approach typically increases accuracy. When you use more data, it becomes harder for the model to memorize exact patterns. The model is forced to reach solutions that are more flexible to accommodate more conditions. It's also important to recognize statistical bias, to ensure your training data doesn't include isolated patterns that don't exist in live-prediction data. This scenario can be difficult to solve because there can be overfitting present when compared to live test data.

### Prevent target leakage

Target leakage is a similar issue. You might not see overfitting between the train and test sets, but the issue appears at prediction-time. Target leakage occurs when your model "cheats" during training by accessing data that it shouldn't normally have at prediction-time. For example, suppose the model predicts on Monday what a commodity price will be on Friday. If your features accidentally include data from Thursdays, the model has access to data it can't have at prediction-time because it can't see into the future. Target leakage is an easy mistake to miss. It's often visible as abnormally high accuracy for your problem. If you're attempting to predict stock price and trained a model at 95% accuracy, there's likely target leakage somewhere in your features.
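
The effect of leakage is easiest to see with a toy example. The following sketch is illustrative only: it uses scikit-learn and synthetic data (not Automated ML) to show how a feature that encodes Thursday's price makes Friday's price trivially "predictable", while the legitimate Monday feature carries no real signal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

# Synthetic commodity prices: Friday's price is Thursday's price plus small noise.
thursday_price = rng.normal(loc=100, scale=10, size=500)
monday_signal = rng.normal(loc=0, scale=1, size=500)  # weak, legitimate feature
friday_price = thursday_price + rng.normal(scale=0.5, size=500)

# The leaky feature set accidentally includes Thursday's price,
# which isn't available when predicting on Monday.
X_leaky = np.column_stack([monday_signal, thursday_price])
X_clean = monday_signal.reshape(-1, 1)

def fit_and_score(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

r2_leaky = fit_and_score(X_leaky, friday_price)
r2_clean = fit_and_score(X_clean, friday_price)

print(f"R^2 with leaked feature: {r2_leaky:.3f}")  # suspiciously close to 1.0
print(f"R^2 without leakage:     {r2_clean:.3f}")  # near 0: no real signal on Monday
```

An R-squared value close to 1.0 on a genuinely hard prediction problem is the kind of "too good to be true" result that should prompt you to audit your features for leakage.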

### Incorporate fewer features

Removing features can also help with overfitting by preventing the model from having too many fields available to memorize specific patterns, which makes it more flexible. It can be difficult to measure quantitatively, but if you can remove features and retain the same accuracy, you've likely made the model more flexible and reduced the risk of overfitting.
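
One way to apply this idea, sketched here with scikit-learn's `SelectKBest` on synthetic data (this is a general illustration, not an Automated ML feature): pad a small dataset with pure-noise columns, then confirm that keeping only the most informative features retains roughly the same accuracy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Pad the 4 real features with 20 pure-noise columns to simulate a wide dataset.
rng = np.random.default_rng(seed=0)
X_wide = np.hstack([X, rng.normal(size=(X.shape[0], 20))])

# Baseline: train on all 24 features.
score_all = cross_val_score(LogisticRegression(max_iter=1000), X_wide, y, cv=5).mean()

# Keep only the 4 most informative features (univariate F-test).
selected = make_pipeline(SelectKBest(f_classif, k=4), LogisticRegression(max_iter=1000))
score_selected = cross_val_score(selected, X_wide, y, cv=5).mean()

print(f"Accuracy, all 24 features: {score_all:.3f}")
print(f"Accuracy, best 4 features: {score_selected:.3f}")
```

If the trimmed model scores about the same as the full model, the removed features were contributing little beyond memorization opportunities.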

## Review Automated ML features to prevent overfitting

The following sections describe best practices provided by default in Automated ML to help prevent overfitting.

### Support regularization and hyperparameter tuning

**Regularization** is the process of minimizing a cost function to penalize complex and overfitted models. There are different types of regularization functions. In general, all functions penalize model coefficient size, variance, and complexity. Automated ML uses L1 (Lasso), L2 (Ridge), and ElasticNet (L1 and L2 simultaneously) in different combinations with different model hyperparameter settings that control overfitting. Automated ML varies how much a model is regularized and chooses the best result.
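
As a rough illustration of how these penalties behave, the following scikit-learn sketch (a conceptual example with synthetic data, not the Automated ML internals) shows that a stronger L2 penalty shrinks coefficients toward zero, and that L1 can zero out irrelevant coefficients entirely.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 10))
# Only the first two features matter; the other eight are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=100.0).fit(X, y)

# Stronger L2 (Ridge) regularization shrinks coefficient magnitudes.
print("L2 norm of coefficients, weak penalty:  ", round(np.linalg.norm(weak.coef_), 3))
print("L2 norm of coefficients, strong penalty:", round(np.linalg.norm(strong.coef_), 3))

# L1 (Lasso) drives the irrelevant coefficients exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
print("Nonzero Lasso coefficients:", int(np.count_nonzero(lasso.coef_)))

# ElasticNet blends L1 and L2; l1_ratio controls the mix.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
```

Automated ML explores combinations like these penalty strengths automatically and keeps whichever setting generalizes best.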

### Apply model complexity limitations

Automated ML also implements explicit model complexity limitations to prevent overfitting. In most cases, this implementation is specifically for decision tree or forest algorithms. Individual tree max-depth is limited, and the total number of trees used in forest or ensemble techniques is limited.
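
The following scikit-learn sketch (an illustration of the principle, not the exact limits Automated ML applies) shows why max-depth limits matter: an unconstrained tree memorizes noisy training data, while a depth-limited tree trades some training accuracy for a simpler model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 injects 20% label noise, so memorizing the training set is overfitting.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can grow until it memorizes the training set.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limiting max_depth constrains complexity.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unconstrained", deep), ("max_depth=3", shallow)]:
    print(f"{name}: depth={model.get_depth()}, "
          f"train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")
```

The unconstrained tree reaches near-perfect training accuracy with a much lower test score, which is exactly the overfitting gap described for model **C** earlier in this article.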

### Use cross-validation

Cross-validation (CV) is the process of taking many subsets of your full training data and training a model on each subset. The idea is that a model might get "lucky" and have great accuracy with one subset, but by using many subsets, the model won't achieve high accuracy every time. When you do CV, you provide a validation holdout dataset, specify your CV folds (number of subsets), and Automated ML trains your model and tunes hyperparameters to minimize error on your validation set. One CV fold might be overfitted, but by using many of them, the process reduces the probability that your final model is overfitted. The tradeoff is that CV results in longer training times and greater cost, because you train a model one time for each *n* in the CV subsets.

> [!NOTE]
> Cross-validation isn't enabled by default. This feature must be configured in Automated ML settings. However, after cross-validation is configured and a validation data set is provided, the process is automated for you.
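
The mechanics of CV folds can be sketched with scikit-learn (a minimal conceptual example; in Automated ML the equivalent behavior is configured through the job settings rather than coded by hand):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# 5 folds: train on 4/5 of the data and validate on the held-out 1/5, five times.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# One fold can get "lucky"; the mean over folds is a steadier estimate.
print("Per-fold accuracy:", [round(s, 2) for s in scores])
print("Mean accuracy:    ", round(scores.mean(), 3))
```

The five training runs are the cost side of the tradeoff the paragraph above describes: with *n* folds you pay for *n* model trainings in exchange for a more trustworthy accuracy estimate.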

## Identify models with imbalanced data

Imbalanced data is commonly found in data for machine learning classification scenarios, and refers to data that contains a disproportionate ratio of observations in each class. This imbalance can lead to a falsely perceived positive effect of a model's accuracy, because the input data has bias towards one class, which causes the trained model to mimic that bias.

In addition, Automated ML jobs generate the following charts automatically. These charts help you understand the correctness of the classifications of your model, and identify models potentially impacted by imbalanced data.

| Chart | Description |
| --- | --- |
| [Confusion matrix](how-to-understand-automated-ml.md#confusion-matrix) | Evaluates the correctly classified labels against the actual labels of the data. |
| [Precision-recall](how-to-understand-automated-ml.md#precision-recall-curve) | Evaluates the ratio of correct labels against the ratio of found label instances of the data. |
| [ROC curves](how-to-understand-automated-ml.md#roc-curve) | Evaluates the ratio of correct labels against the ratio of false-positive labels. |
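
A small deterministic example shows why these charts matter. The following sketch (scikit-learn, synthetic labels; not produced by Automated ML) builds a 95:5 imbalanced dataset and a "model" that always predicts the majority class. Accuracy looks excellent, but the confusion matrix and minority-class recall expose the failure.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# 95:5 imbalanced ground truth, and a degenerate model that always predicts class 0.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred)

print("Accuracy:", acc)                       # 0.95 looks great...
print("Confusion matrix:\n", cm)              # ...but every class-1 sample is missed
print("Minority-class recall:", minority_recall)  # 0.0
```

This is exactly the "falsely perceived positive effect" described above: a 95% accurate classifier that never detects the class you probably care about.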

## Handle imbalanced data

As part of the goal to simplify the machine learning workflow, Automated ML offers built-in capabilities to help deal with imbalanced data:

- Automated ML creates a **column of weights** as input to cause rows in the data to be weighted up or down, which can be used to make a class more or less "important."

- The algorithms used by Automated ML detect imbalance when the number of samples in the minority class is equal to or fewer than 20% of the number of samples in the majority class. The minority class refers to the one with the fewest samples and the majority class refers to the one with the most samples. Subsequently, Automated ML runs an experiment with subsampled data to check whether using class weights can remedy this problem and improve performance. If the experiment ascertains better performance, it applies the remedy.

- Use a performance metric that deals better with imbalanced data. For example, AUC_weighted is a primary metric that calculates the contribution of every class based on the relative number of samples representing that class. This metric is more robust against imbalance.
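
The idea behind class weighting can be sketched with scikit-learn's `class_weight="balanced"` option, which reweights rows inversely to class frequency. This is similar in spirit to the weight column Automated ML can apply, though it isn't the same implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Roughly 9:1 class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# class_weight="balanced" up-weights minority-class rows during training.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"Minority recall, unweighted: {recall_plain:.2f}")
print(f"Minority recall, balanced:   {recall_weighted:.2f}")

# AUC is computed from predicted scores; Automated ML's AUC_weighted additionally
# weights each class's contribution by its sample count.
auc = roc_auc_score(y_test, weighted.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.3f}")
```

Weighting typically raises minority-class recall, which is usually the behavior you want when the minority class is the one that matters.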

The following techniques are other options to handle imbalanced data outside of Automated ML:

- Resample to even the class imbalance. You can up-sample the smaller classes or down-sample the larger classes. These methods require expertise to process and analyze.

- Review performance metrics for imbalanced data. For example, the F1 score is the harmonic mean of precision and recall. Precision measures a classifier's exactness, where higher precision indicates fewer false positives. Recall measures a classifier's completeness, where higher recall indicates fewer false negatives.
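
Both techniques can be sketched briefly with scikit-learn (an illustration on synthetic data, not an Automated ML feature): up-sample the minority class with `resample`, and compute F1 as the harmonic mean of precision and recall.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.utils import resample

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(110, 3))
y = np.array([0] * 100 + [1] * 10)  # 100 majority vs. 10 minority samples

# Up-sample the minority class (sampling with replacement) to match the majority.
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True, n_samples=100, random_state=0)

X_balanced = np.vstack([X[y == 0], X_min_up])
y_balanced = np.concatenate([y[y == 0], y_min_up])
print("Class counts after up-sampling:", np.bincount(y_balanced))  # [100 100]

# F1 is the harmonic mean of precision (exactness) and recall (completeness).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Note that naive up-sampling duplicates minority rows, which can itself encourage overfitting; this is why the resampling bullet above calls for expertise in processing and analysis.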

## Next step

> [!div class="nextstepaction"]
> [Train an object detection model with automated machine learning and Python](tutorial-auto-train-image-models.md)