---
title: Featurization in autoML experiments
titleSuffix: Azure Machine Learning
description: Learn what featurization settings Azure Machine Learning offers, and how feature engineering is supported in automated ml experiments.
author: nibaccam
ms.author: nibaccam
ms.reviewer: nibaccam
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: conceptual
ms.date: 05/28/2020
ms.custom: seodec18
---

In this guide, learn what featurization settings are offered, and how to customize them for your [automated machine learning experiments](concept-automated-ml.md).

Feature engineering is the process of using domain knowledge of the data to create features that help ML algorithms learn better. In Azure Machine Learning, data scaling and normalization techniques are applied to facilitate feature engineering. Collectively, these techniques and feature engineering are referred to as featurization in automated machine learning (autoML) experiments.

This article assumes you are already familiar with how to configure an autoML experiment. See the following articles for details:

* For a code-first experience: [Configure automated ML experiments with the Python SDK](how-to-configure-auto-train.md).
* For a low-code/no-code experience: [Create, review, and deploy automated machine learning models with the Azure Machine Learning studio](how-to-use-automated-ml-for-ml-models.md).

## Configure featurization

In every automated machine learning experiment, [automatic scaling and normalization techniques](#featurization) are applied to your data by default. These scaling and normalization techniques are types of featurization that help *certain* algorithms that are sensitive to features on different scales. However, you can also enable additional featurization, such as **missing values imputation**, **encoding**, and **transforms**.

> [!NOTE]
> Automated ML featurization steps (feature normalization, handling missing data,
> converting text to numeric, etc.) become part of the underlying model. When using the model for
> predictions, the same featurization steps applied during training are applied to
> your input data automatically.

For experiments configured with the SDK, you can enable or disable the setting `featurization` and further specify the featurization steps that should be used for your experiment. If you are using the Azure Machine Learning studio, see how to enable featurization [with these steps](how-to-use-automated-ml-for-ml-models.md#customize-featurization).

The following table shows the accepted settings for `featurization` in the [AutoMLConfig class](/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig).

Featurization Configuration | Description
------------- | -------------
**`"featurization": 'auto'`**| Indicates that as part of preprocessing, [data guardrails and featurization steps](#featurization) are performed automatically. **Default setting**.
**`"featurization": 'off'`**| Indicates featurization steps shouldn't be done automatically.
**`"featurization": 'FeaturizationConfig'`**| Indicates customized featurization steps should be used. [Learn how to customize featurization](#customize-featurization).
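As an illustration, a minimal `AutoMLConfig` sketch that sets `featurization`. This is not a complete, runnable experiment: `train_data` and the `'label'` column name are hypothetical placeholders, and the Azure ML Python SDK must be installed.

```python
from azureml.train.automl import AutoMLConfig

# Minimal sketch; 'train_data' and 'label' are hypothetical placeholders.
automl_config = AutoMLConfig(
    task='classification',
    training_data=train_data,
    label_column_name='label',
    featurization='auto'  # 'auto' (default), 'off', or a FeaturizationConfig object
)
```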

<a name="featurization"></a>

## Automatic featurization

The following table summarizes the techniques automatically applied to your data in experiments configured through either the SDK or the studio. To disable this behavior, set `"featurization": 'off'` in your `AutoMLConfig` object.

> [!NOTE]
> If you plan to export your autoML-created models to an [ONNX model](concept-onnx.md), only the featurization options indicated with an * are supported in the ONNX format. Learn more about [converting models to ONNX](concept-automated-ml.md#use-with-onnx).
|Featurization steps| Description |
| ------------- | ------------- |
|**Drop high cardinality or no variance features***|Drop these from training and validation sets, including features with all values missing, the same value across all rows, or with high cardinality (for example, hashes, IDs, or GUIDs).|
|**Impute missing values***|For numerical features, impute with the average of values in the column.<br/><br/>For categorical features, impute with the most frequent value.|
|**Generate additional features***|For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.<br/><br/>For Text features: Term frequency based on unigrams, bigrams, and tri-character-grams.|
|**Transform and encode***|Numeric features with few unique values are transformed into categorical features.<br/><br/>One-hot encoding is performed for low-cardinality categorical features; for high cardinality, one-hot-hash encoding.|
|**Word embeddings**|Text featurizer that converts vectors of text tokens into sentence vectors using a pre-trained model. Each word's embedding vector in a document is aggregated to produce a document feature vector.|
|**Target encodings**|For categorical features, maps each category to the averaged target value for regression problems, and to the class probability for each class for classification problems. Frequency-based weighting and k-fold cross-validation are applied to reduce overfitting of the mapping and noise caused by sparse data categories.|
|**Text target encoding**|For text input, a stacked linear model with bag-of-words is used to generate the probability of each class.|
|**Weight of Evidence (WoE)**|Calculates WoE as a measure of correlation of categorical columns to the target column. It is calculated as the log of the ratio of in-class vs. out-of-class probabilities. This step outputs one numerical feature column per class and removes the need for explicit missing value imputation and outlier treatment.|
|**Cluster Distance**|Trains a k-means clustering model on all numerical columns. Outputs k new features, one new numerical feature per cluster, containing the distance of each sample to the centroid of each cluster.|
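To make the imputation and encoding rows above concrete, here is a plain-Python sketch of the techniques. This illustrates the ideas only; it is not autoML's actual implementation, and the function names are made up for this example.

```python
from collections import Counter

def impute_numeric(values):
    """Replace missing numeric values (None) with the column mean."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def impute_categorical(values):
    """Replace missing categorical values (None) with the most frequent value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def one_hot(values):
    """One-hot encode a low-cardinality categorical column.

    Returns one indicator row per value, over the sorted set of categories.
    """
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]
```

For example, `impute_numeric([1.0, None, 3.0])` fills the gap with the mean, `2.0`, and `one_hot(['red', 'blue', 'red'])` yields one indicator column per category.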

## Data guardrails

Data guardrails will display one of three states: **Passed**, **Done**, or **Alerted**.

State| Description
----|----
**Passed**| No data problems were detected and no user action is required.
**Done**| Changes were applied to your data. We encourage users to review the corrective actions automated ML took, to ensure the changes align with the expected results.
**Alerted**| A data issue was detected that could not be remedied. We encourage users to revise and fix the issue.

The following table describes the data guardrails currently supported, and the associated statuses that users may come across when submitting their experiment.

Guardrail|Status|Condition for trigger
---|---|---
**Missing feature values imputation** |*Passed* <br><br><br> *Done*| No missing feature values were detected in your training data. Learn more about [missing value imputation.](https://docs.microsoft.com/azure/machine-learning/how-to-use-automated-ml-for-ml-models#advanced-featurization-options) <br><br> Missing feature values were detected in your training data and imputed.
**High cardinality feature handling** |*Passed* <br><br><br> *Done*| Your inputs were analyzed, and no high-cardinality features were detected. Learn more about [high-cardinality feature detection.](https://docs.microsoft.com/azure/machine-learning/how-to-use-automated-ml-for-ml-models#advanced-featurization-options) <br><br> High-cardinality features were detected in your inputs and were handled.
**Validation split handling** |*Done*| The validation configuration was set to 'auto' and the training data contained **fewer than 20,000 rows**. <br> Each iteration of the trained model was validated through cross-validation. Learn more about [validation data.](https://docs.microsoft.com/azure/machine-learning/how-to-configure-auto-train#train-and-validation-data) <br><br> The validation configuration was set to 'auto' and the training data contained **more than 20,000 rows**. <br> The input data was split into a training dataset and a validation dataset for validation of the model.
**Class balancing detection** |*Passed* <br><br><br><br><br> *Alerted* | Your inputs were analyzed, and all classes are balanced in your training data. A dataset is considered balanced if each class has good representation in the dataset, as measured by number and ratio of samples. <br><br><br> Imbalanced classes were detected in your inputs. To fix model bias, fix the balancing problem. Learn more about [imbalanced data.](https://docs.microsoft.com/azure/machine-learning/concept-manage-ml-pitfalls#identify-models-with-imbalanced-data)
**Memory issues detection** |*Passed* <br><br><br><br> *Done* | The selected {horizon, lag, rolling window} values were analyzed, and no potential out-of-memory issues were detected. Learn more about time-series [forecasting configurations.](https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-forecast#configure-and-run-experiment) <br><br><br> The selected {horizon, lag, rolling window} values were analyzed and will potentially cause your experiment to run out of memory. The lag or rolling window configurations have been turned off.
**Frequency detection** |*Passed* <br><br><br><br> *Done* | The time series was analyzed, and all data points are aligned with the detected frequency. <br><br> The time series was analyzed, and data points that do not align with the detected frequency were detected. These data points were removed from the dataset. Learn more about [data preparation for time-series forecasting.](https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-forecast#preparing-data)
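The validation split handling rule above reduces to a simple threshold check. The following sketch mirrors the documented behavior for illustration; the function name is invented here, and the actual service logic is not shown in this article.

```python
def auto_validation_strategy(n_rows: int) -> str:
    """Mirror the documented 'auto' validation rule: with fewer than
    20,000 training rows, use cross-validation; otherwise split off a
    separate validation dataset."""
    return "cross-validation" if n_rows < 20_000 else "train-validation split"
```

For example, a 5,000-row dataset is validated through cross-validation, while a 50,000-row dataset is split into training and validation sets.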

## Customize featurization

Customization|Definition
--|--
|**Column purpose update**|Override the feature type for the specified column.|
|**Transformer parameter update**|Update parameters for the specified transformer. Currently supports Imputer (mean, most frequent, and median) and HashOneHotEncoder.|
|**Drop columns**|Specify columns to drop from being featurized.|
|**Block transformers**|Block transformers from being used in the featurization process.|

Create the FeaturizationConfig object using API calls:
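A hedged sketch of what those calls might look like, covering the four customizations in the table. The column and transformer names here are hypothetical; check the `FeaturizationConfig` class reference for the exact signatures before use.

```python
from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
# Block a transformer from the featurization process.
featurization_config.blocked_transformers = ['LabelEncoder']
# Drop a hypothetical column from featurization.
featurization_config.drop_columns = ['id_column']
# Override the feature type of a hypothetical column.
featurization_config.add_column_purpose('column1', 'Numeric')
# Update Imputer parameters for a hypothetical column.
featurization_config.add_transformer_params('Imputer', ['column2'], {"strategy": "median"})
```

The resulting object is then passed as the `featurization` argument of `AutoMLConfig`.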