You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/machine-learning/how-to-use-automated-ml-for-ml-models.md
+24-16Lines changed: 24 additions & 16 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -151,7 +151,6 @@ Variance| Measure of how far spread out this column's data is from its average v
151
151
Skewness| Measure of how different this column's data is from a normal distribution.
152
152
Kurtosis| Measure of how heavily tailed this column's data is compared to a normal distribution.
153
153
154
-
155
154
<aname="featurization"></a>
156
155
157
156
## Advanced featurization options
@@ -160,12 +159,15 @@ Automated machine learning offers preprocessing and data guardrails automaticall
160
159
161
160
### Preprocessing
162
161
162
+
> [!NOTE]
163
+
> If you plan to export your auto ML created models to an [ONNX model](concept-onnx.md), only the featurization options indicated with an * are supported in the ONNX format. Learn more about [converting models to ONNX](concept-automated-ml.md#use-with-onnx).
164
+
163
165
|Preprocessing steps| Description |
164
166
| ------------- | ------------- |
165
-
|Drop high cardinality or no variance features|Drop these from training and validation sets, including features with all values missing, same value across all rows or with extremely high cardinality (for example, hashes, IDs, or GUIDs).|
166
-
|Impute missing values|For numerical features, impute with average of values in the column.<br/><br/>For categorical features, impute with most frequent value.|
167
-
|Generate additional features|For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.<br/><br/>For Text features: Term frequency based on unigrams, bi-grams, and tri-character-grams.|
168
-
|Transform and encode |Numeric features with few unique values are transformed into categorical features.<br/><br/>One-hot encoding is performed for low cardinality categorical; for high cardinality, one-hot-hash encoding.|
167
+
|Drop high cardinality or no variance features*|Drop these from training and validation sets, including features with all values missing, same value across all rows or with extremely high cardinality (for example, hashes, IDs, or GUIDs).|
168
+
|Impute missing values*|For numerical features, impute with average of values in the column.<br/><br/>For categorical features, impute with most frequent value.|
169
+
|Generate additional features*|For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.<br/><br/>For Text features: Term frequency based on unigrams, bi-grams, and tri-character-grams.|
170
+
|Transform and encode *|Numeric features with few unique values are transformed into categorical features.<br/><br/>One-hot encoding is performed for low cardinality categorical; for high cardinality, one-hot-hash encoding.|
169
171
|Word embeddings|Text featurizer that converts vectors of text tokens into sentence vectors using a pre-trained model. Each word's embedding vector in a document is aggregated together to produce a document feature vector.|
170
172
|Target encodings|For categorical features, maps each category with averaged target value for regression problems, and to the class probability for each class for classification problems. Frequency-based weighting and k-fold cross validation is applied to reduce over fitting of the mapping and noise caused by sparse data categories.|
171
173
|Text target encoding|For text input, a stacked linear model with bag-of-words is used to generate the probability of each class.|
@@ -174,27 +176,33 @@ Automated machine learning offers preprocessing and data guardrails automaticall
174
176
175
177
### Data guardrails
176
178
177
-
Data guardrails are applied when automatic featurization is enabled or validation is set to auto. Data guardrails help you identify potential issues with your data (e.g., missing values, class imbalance) and help take corrective actions for improved results. There are many best practices that are available and can be applied to achieve reliable results. Users can review data guardrails in the studio within the **Data guardrails** tab of an Automated ML run or by setting ```show_output=True``` when submitting an experiment using the Python SDK. The following table describes the data guardrails currently supported, and the associated statuses that users may come across when submitting their experiment.
179
+
Data guardrails are applied when automatic featurization is enabled or validation is set to auto. Data guardrails help you identify potential issues with your data (e.g., missing values, class imbalance) and help take corrective actions for improved results.
178
180
179
-
Guardrail|Status|Condition for trigger
180
-
---|---|---
181
-
Missing feature values imputation |**Passed** <br><br><br> **Done**| No missing feature values were detected in your training data. Learn more about [missing value imputation.](https://docs.microsoft.com/azure/machine-learning/how-to-use-automated-ml-for-ml-models#advanced-featurization-options) <br><br> Missing feature values were detected in your training data and imputed.
182
-
High cardinality feature handling |**Passed** <br><br><br> **Done**| Your inputs were analyzed, and no high cardinality features were detected. Learn more about [high cardinality feature detection.](https://docs.microsoft.com/azure/machine-learning/how-to-use-automated-ml-for-ml-models#advanced-featurization-options) <br><br> High cardinality features were detected in your inputs and were handled.
183
-
Validation split handling |**Done**| *The validation configuration was set to 'auto' and the training data contained **less** than 20,000 rows.* <br> Each iteration of the trained model was validated through cross-validation. Learn more about [validation data.](https://docs.microsoft.com/azure/machine-learning/how-to-configure-auto-train#train-and-validation-data) <br><br> *The validation configuration was set to 'auto' and the training data contained **more** than 20,000 rows.* <br> The input data has been split into a training dataset and a validation dataset for validation of the model.
184
-
Class balancing detection |**Passed** <br><br><br><br> **Alerted** | Your inputs were analyzed, and all classes are balanced in your training data. A dataset is considered balanced if each class has good representation in the dataset, as measured by number and ratio of samples. <br><br><br> Imbalanced classes were detected in your inputs. To fix model bias fix the balancing problem. Learn more about [imbalanced data.](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml#imbalance)
185
-
Memory issues detection |**Passed** <br><br><br><br> **Done** |<br> The selected {horizon, lag, rolling window} value(s) were analyzed, and no potential out-of-memory issues were detected. Learn more about time-series [forecasting configurations.](https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-forecast#configure-and-run-experiment) <br><br><br>The selected {horizon, lag, rolling window} values were analyzed and will potentially cause your experiment to run out of memory. The lag or rolling window configurations have been turned off.
186
-
Frequency detection |**Passed** <br><br><br><br> **Done** |<br> The time series was analyzed and all data points are aligned with the detected frequency. <br> <br> The time series was analyzed and data points that do not align with the detected frequency were detected. These data points were removed from the dataset. Learn more about [data preparation for time-series forecasting.](https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-forecast#preparing-data)
181
+
Users can review data guardrails in the studio within the **Data guardrails** tab of an automated ML run or by setting ```show_output=True``` when submitting an experiment using the Python SDK.
187
182
188
183
#### Data Guardrail States
189
-
Data guardrails will display one of three states: 'Passed', 'Done, or 'Alerted'.
184
+
185
+
Data guardrails will display one of three states: **Passed**, **Done**, or **Alerted**.
190
186
191
187
State| Description
192
188
----|----
193
189
Passed| No data problems were detected and no user action is required.
194
190
Done| Changes were applied to your data. We encourage users to review the corrective actions Automated ML took to ensure the changes align with the expected results.
195
191
Alerted| A data issue that could not be remedied was detected. We encourage users to revise and fix the issue.
196
192
197
-
Previous version of the Automated ML displayed a fourth state: 'Fixed'. Newer experiments will not display this state, and all guardrails which displayed the 'Fixed' state will now display 'Done'.
193
+
>[!NOTE]
194
+
> Previous versions of automated ML experiments displayed a fourth state: **Fixed**. Newer experiments will not display this state, and all guardrails which displayed the **Fixed** state will now display **Done**.
195
+
196
+
The following table describes the data guardrails currently supported, and the associated statuses that users may come across when submitting their experiment.
197
+
198
+
Guardrail|Status|Condition for trigger
199
+
---|---|---
200
+
Missing feature values imputation |**Passed** <br><br><br> **Done**| No missing feature values were detected in your training data. Learn more about [missing value imputation.](https://docs.microsoft.com/azure/machine-learning/how-to-use-automated-ml-for-ml-models#advanced-featurization-options) <br><br> Missing feature values were detected in your training data and imputed.
201
+
High cardinality feature handling |**Passed** <br><br><br> **Done**| Your inputs were analyzed, and no high cardinality features were detected. Learn more about [high cardinality feature detection.](https://docs.microsoft.com/azure/machine-learning/how-to-use-automated-ml-for-ml-models#advanced-featurization-options) <br><br> High cardinality features were detected in your inputs and were handled.
202
+
Validation split handling |**Done**| *The validation configuration was set to 'auto' and the training data contained **less** than 20,000 rows.* <br> Each iteration of the trained model was validated through cross-validation. Learn more about [validation data.](https://docs.microsoft.com/azure/machine-learning/how-to-configure-auto-train#train-and-validation-data) <br><br> *The validation configuration was set to 'auto' and the training data contained **more** than 20,000 rows.* <br> The input data has been split into a training dataset and a validation dataset for validation of the model.
203
+
Class balancing detection |**Passed** <br><br><br><br> **Alerted** | Your inputs were analyzed, and all classes are balanced in your training data. A dataset is considered balanced if each class has good representation in the dataset, as measured by number and ratio of samples. <br><br><br> Imbalanced classes were detected in your inputs. To fix model bias fix the balancing problem. Learn more about [imbalanced data.](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml#imbalance)
204
+
Memory issues detection |**Passed** <br><br><br><br> **Done** |<br> The selected {horizon, lag, rolling window} value(s) were analyzed, and no potential out-of-memory issues were detected. Learn more about time-series [forecasting configurations.](https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-forecast#configure-and-run-experiment) <br><br><br>The selected {horizon, lag, rolling window} values were analyzed and will potentially cause your experiment to run out of memory. The lag or rolling window configurations have been turned off.
205
+
Frequency detection |**Passed** <br><br><br><br> **Done** |<br> The time series was analyzed and all data points are aligned with the detected frequency. <br> <br> The time series was analyzed and data points that do not align with the detected frequency were detected. These data points were removed from the dataset. Learn more about [data preparation for time-series forecasting.](https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-forecast#preparing-data)
0 commit comments