Commit 70eec90 ("Peer + PM feedback")
1 parent: 5873f76

File tree: 3 files changed (+35, -38 lines)

articles/machine-learning/concept-automated-ml.md (1 addition, 1 deletion)

@@ -125,7 +125,7 @@ In every automated machine learning experiment, your data is automatically scale
 
 ### Customize featurization
 
-Additional feature engineering techniques such as, encoding and transforms are also available.
+Additional feature engineering techniques such as, encoding and transforms are also available.
 
 Enable this setting with:

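The code sample that follows "Enable this setting with:" is cut off in this diff view. Based on the `featurization` parameter described in how-to-configure-auto-features.md, a minimal sketch might look like the following; the workspace config, dataset name, and label column are illustrative assumptions, not from the commit:

```python
from azureml.core import Workspace, Dataset
from azureml.train.automl import AutoMLConfig

# Assumes a config.json for your workspace and a registered tabular dataset;
# "my-training-data" and "target" are hypothetical names.
ws = Workspace.from_config()
train_data = Dataset.get_by_name(ws, "my-training-data")

automl_config = AutoMLConfig(
    task="classification",
    training_data=train_data,
    label_column_name="target",
    featurization="auto",  # 'auto' (default), 'off', or a FeaturizationConfig object
)
```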
articles/machine-learning/how-to-configure-auto-features.md (32 additions, 35 deletions)
@@ -1,15 +1,15 @@
 ---
 title: Featurization in autoML experiments
 titleSuffix: Azure Machine Learning
-description: Learn what featurization options Azure Machine Learning offers with automated ml experiments.
+description: Learn what featurization settings Azure Machine Learning offers, and how feature engineering is supported in automated ml experiments.
 author: nibaccam
 ms.author: nibaccam
 ms.reviewer: nibaccam
 services: machine-learning
 ms.service: machine-learning
 ms.subservice: core
 ms.topic: conceptual
-ms.date: 05/25/2020
+ms.date: 05/28/2020
 ms.custom: seodec18
 ---
@@ -19,53 +19,53 @@ ms.custom: seodec18
 
 In this guide, learn what featurization settings are offered, and how to customize them for your [automated machine learning experiments](concept-automated-ml.md).
 
-Feature engineering is the process of using domain knowledge of the data to create features that help ML algorithms learn better. In Azure Machine Learning, data scaling and normalization techniques are applied to facilitate feature engineering. Collectively, these techniques and feature engineering are referred to as featurization in automated machine learning experiments.
+Feature engineering is the process of using domain knowledge of the data to create features that help ML algorithms learn better. In Azure Machine Learning, data scaling and normalization techniques are applied to facilitate feature engineering. Collectively, these techniques and feature engineering are referred to as featurization in automated machine learning, autoML, experiments.
 
-This article assumes you are already familiar with how to configure an automated machine learning experiment. See the following articles for details:
+This article assumes you are already familiar with how to configure an autoML experiment. See the following articles for details:
 
 * For a code first experience: [Configure automated ML experiments with the Python SDK](how-to-configure-auto-train.md).
 * For a low/no code experience: [Create, review, and deploy automated machine learning models with the Azure Machine Learning studio](how-to-use-automated-ml-for-ml-models.md)
 
 ## Configure featurization
 
-In every automated machine learning experiment, [automatic scaling and normalization techniques](#featurization) are applied to your data by default. These scaling and normalization techniques are types of featurization that help *certain* algorithms that are sensitive to features on different scales. However, you can also enable additional featurization, such as missing values imputation, encoding, and transforms.
+In every automated machine learning experiment, [automatic scaling and normalization techniques](#featurization) are applied to your data by default. These scaling and normalization techniques are types of featurization that help *certain* algorithms that are sensitive to features on different scales. However, you can also enable additional featurization, such as **missing values imputation, encoding,** and **transforms**.
 
 > [!NOTE]
 > Automated machine learning featurization steps (feature normalization, handling missing data,
 > converting text to numeric, etc.) become part of the underlying model. When using the model for
 > predictions, the same featurization steps applied during training are applied to
 > your input data automatically.
 
-For experiments configured with the SDK, you can enable/disable the setting `featurization` and further specify the featurization steps that should be used for your experiment. [Learn how to enable featurization via the Azure Machine Learning studio.](how-to-use-automated-ml-for-ml-models.md#customize-featurization)
+For experiments configured with the SDK, you can enable/disable the setting `featurization` and further specify the featurization steps that should be used for your experiment. If you are using the Azure Machine Learning studio, see how to enable featurization [with these steps](how-to-use-automated-ml-for-ml-models.md#customize-featurization).
 
 The following table shows the accepted settings for `featurization` in the [AutoMLConfig class](/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig).
 
 Featurization Configuration | Description
 ------------- | -------------
-`"featurization": 'auto'`| Indicates that as part of preprocessing, [data guardrails and featurization steps](#featurization) are performed automatically. **Default setting**.
-`"featurization": 'off'`| Indicates featurization steps shouldn't be done automatically.
-`"featurization":` `'FeaturizationConfig'`| Indicates customized featurization step should be used. [Learn how to customize featurization](#customize-featurization).|
+**`"featurization": 'auto'`**| Indicates that as part of preprocessing, [data guardrails and featurization steps](#featurization) are performed automatically. **Default setting**.
+**`"featurization": 'off'`**| Indicates featurization steps shouldn't be done automatically.
+**`"featurization":` `'FeaturizationConfig'`**| Indicates customized featurization step should be used. [Learn how to customize featurization](#customize-featurization).|
 
 <a name="featurization"></a>
 
 ## Automatic featurization
 
-Whether you configure your experiment via the SDK or the studio, the following table summarizes the techniques that are automatically applied to your data by default. The same techniques are applied if `"featurization": 'auto'` is specified in your `AutoMLConfig` object.
+The following table summarizes techniques automatically applied to your data. This happens for experiments configured through the SDK or the studio. To disable this behavior, set `"featurization": 'off'` in your `AutoMLConfig` object.
 
 > [!NOTE]
 > If you plan to export your auto ML created models to an [ONNX model](concept-onnx.md), only the featurization options indicated with an * are supported in the ONNX format. Learn more about [converting models to ONNX](concept-automated-ml.md#use-with-onnx).
 
 |Featurization&nbsp;steps| Description |
 | ------------- | ------------- |
-|Drop high cardinality or no variance features* |Drop these from training and validation sets, including features with all values missing, same value across all rows or with high cardinality (for example, hashes, IDs, or GUIDs).|
-|Impute missing values* |For numerical features, impute with average of values in the column.<br/><br/>For categorical features, impute with most frequent value.|
-|Generate additional features* |For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.<br/><br/>For Text features: Term frequency based on unigrams, bi-grams, and tri-character-grams.|
-|Transform and encode *|Numeric features with few unique values are transformed into categorical features.<br/><br/>One-hot encoding is performed for low cardinality categorical; for high cardinality, one-hot-hash encoding.|
-|Word embeddings|Text featurizer that converts vectors of text tokens into sentence vectors using a pre-trained model. Each word's embedding vector in a document is aggregated together to produce a document feature vector.|
-|Target encodings|For categorical features, maps each category with averaged target value for regression problems, and to the class probability for each class for classification problems. Frequency-based weighting and k-fold cross validation is applied to reduce over fitting of the mapping and noise caused by sparse data categories.|
-|Text target encoding|For text input, a stacked linear model with bag-of-words is used to generate the probability of each class.|
-|Weight of Evidence (WoE)|Calculates WoE as a measure of correlation of categorical columns to the target column. It is calculated as the log of the ratio of in-class vs out-of-class probabilities. This step outputs one numerical feature column per class and removes the need to explicitly impute missing values and outlier treatment.|
-|Cluster Distance|Trains a k-means clustering model on all numerical columns. Outputs k new features, one new numerical feature per cluster, containing the distance of each sample to the centroid of each cluster.|
+|**Drop high cardinality or no variance features*** |Drop these from training and validation sets, including features with all values missing, same value across all rows or with high cardinality (for example, hashes, IDs, or GUIDs).|
+|**Impute missing values*** |For numerical features, impute with average of values in the column.<br/><br/>For categorical features, impute with most frequent value.|
+|**Generate additional features*** |For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.<br/><br/>For Text features: Term frequency based on unigrams, bi-grams, and tri-character-grams.|
+|**Transform and encode***|Numeric features with few unique values are transformed into categorical features.<br/><br/>One-hot encoding is performed for low cardinality categorical; for high cardinality, one-hot-hash encoding.|
+|**Word embeddings**|Text featurizer that converts vectors of text tokens into sentence vectors using a pre-trained model. Each word's embedding vector in a document is aggregated together to produce a document feature vector.|
+|**Target encodings**|For categorical features, maps each category with averaged target value for regression problems, and to the class probability for each class for classification problems. Frequency-based weighting and k-fold cross validation is applied to reduce over fitting of the mapping and noise caused by sparse data categories.|
+|**Text target encoding**|For text input, a stacked linear model with bag-of-words is used to generate the probability of each class.|
+|**Weight of Evidence (WoE)**|Calculates WoE as a measure of correlation of categorical columns to the target column. It is calculated as the log of the ratio of in-class vs out-of-class probabilities. This step outputs one numerical feature column per class and removes the need to explicitly impute missing values and outlier treatment.|
+|**Cluster Distance**|Trains a k-means clustering model on all numerical columns. Outputs k new features, one new numerical feature per cluster, containing the distance of each sample to the centroid of each cluster.|
 
 ## Data guardrails
 
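To make the "Impute missing values" row of the table above concrete, here is a small framework-free sketch of the two default strategies (mean for numeric columns, most frequent value for categorical columns). It is illustrative only, not the AutoML implementation, and the function names are ours:

```python
from collections import Counter

def impute_numeric(values):
    """Replace None with the mean of the observed values (the default for numerical features)."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_categorical(values):
    """Replace None with the most frequent observed value (the default for categorical features)."""
    observed = [v for v in values if v is not None]
    most_frequent, _ = Counter(observed).most_common(1)[0]
    return [most_frequent if v is None else v for v in values]

print(impute_numeric([1.0, None, 3.0]))           # [1.0, 2.0, 3.0]
print(impute_categorical(["a", None, "a", "b"]))  # ['a', 'a', 'a', 'b']
```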
@@ -88,23 +88,20 @@ Data guardrails will display one of three states: **Passed**, **Done**, or **Alerted**.
 
 State| Description
 ----|----
-Passed| No data problems were detected and no user action is required.
-Done| Changes were applied to your data. We encourage users to review the corrective actions Automated ML took to ensure the changes align with the expected results.
-Alerted| A data issue that could not be remedied was detected. We encourage users to revise and fix the issue.
-
->[!NOTE]
-> Previous versions of automated ML experiments displayed a fourth state: **Fixed**. Newer experiments will not display this state, and all guardrails which displayed the **Fixed** state will now display **Done**.
+**Passed**| No data problems were detected and no user action is required.
+**Done**| Changes were applied to your data. We encourage users to review the corrective actions Automated ML took to ensure the changes align with the expected results.
+**Alerted**| A data issue that could not be remedied was detected. We encourage users to revise and fix the issue.
 
 The following table describes the data guardrails currently supported, and the associated statuses that users may come across when submitting their experiment.
 
 Guardrail|Status|Condition&nbsp;for&nbsp;trigger
 ---|---|---
-Missing feature values imputation |**Passed** <br><br><br> **Done**| No missing feature values were detected in your training data. Learn more about [missing value imputation.](https://docs.microsoft.com/azure/machine-learning/how-to-use-automated-ml-for-ml-models#advanced-featurization-options) <br><br> Missing feature values were detected in your training data and imputed.
-High cardinality feature handling |**Passed** <br><br><br> **Done**| Your inputs were analyzed, and no high cardinality features were detected. Learn more about [high cardinality feature detection.](https://docs.microsoft.com/azure/machine-learning/how-to-use-automated-ml-for-ml-models#advanced-featurization-options) <br><br> High cardinality features were detected in your inputs and were handled.
-Validation split handling |**Done**| *The validation configuration was set to 'auto' and the training data contained **less** than 20,000 rows.* <br> Each iteration of the trained model was validated through cross-validation. Learn more about [validation data.](https://docs.microsoft.com/azure/machine-learning/how-to-configure-auto-train#train-and-validation-data) <br><br> *The validation configuration was set to 'auto' and the training data contained **more** than 20,000 rows.* <br> The input data has been split into a training dataset and a validation dataset for validation of the model.
-Class balancing detection |**Passed** <br><br><br><br> **Alerted** | Your inputs were analyzed, and all classes are balanced in your training data. A dataset is considered balanced if each class has good representation in the dataset, as measured by number and ratio of samples. <br><br><br> Imbalanced classes were detected in your inputs. To fix model bias, fix the balancing problem. Learn more about [imbalanced data.](https://docs.microsoft.com/azure/machine-learning/concept-manage-ml-pitfalls#identify-models-with-imbalanced-data)
-Memory issues detection |**Passed** <br><br><br><br> **Done** |<br> The selected {horizon, lag, rolling window} value(s) were analyzed, and no potential out-of-memory issues were detected. Learn more about time-series [forecasting configurations.](https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-forecast#configure-and-run-experiment) <br><br><br>The selected {horizon, lag, rolling window} values were analyzed and will potentially cause your experiment to run out of memory. The lag or rolling window configurations have been turned off.
-Frequency detection |**Passed** <br><br><br><br> **Done** |<br> The time series was analyzed and all data points are aligned with the detected frequency. <br> <br> The time series was analyzed and data points that do not align with the detected frequency were detected. These data points were removed from the dataset. Learn more about [data preparation for time-series forecasting.](https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-forecast#preparing-data)
+**Missing feature values imputation** |*Passed* <br><br><br> *Done*| No missing feature values were detected in your training data. Learn more about [missing value imputation.](https://docs.microsoft.com/azure/machine-learning/how-to-use-automated-ml-for-ml-models#advanced-featurization-options) <br><br> Missing feature values were detected in your training data and imputed.
+**High cardinality feature handling** |*Passed* <br><br><br> *Done*| Your inputs were analyzed, and no high cardinality features were detected. Learn more about [high cardinality feature detection.](https://docs.microsoft.com/azure/machine-learning/how-to-use-automated-ml-for-ml-models#advanced-featurization-options) <br><br> High cardinality features were detected in your inputs and were handled.
+**Validation split handling** |*Done*| The validation configuration was set to 'auto' and the training data contained **less than 20,000 rows**. <br> Each iteration of the trained model was validated through cross-validation. Learn more about [validation data.](https://docs.microsoft.com/azure/machine-learning/how-to-configure-auto-train#train-and-validation-data) <br><br> The validation configuration was set to 'auto' and the training data contained **more than 20,000 rows**. <br> The input data has been split into a training dataset and a validation dataset for validation of the model.
+**Class balancing detection** |*Passed* <br><br><br><br><br> *Alerted* | Your inputs were analyzed, and all classes are balanced in your training data. A dataset is considered balanced if each class has good representation in the dataset, as measured by number and ratio of samples. <br><br><br> Imbalanced classes were detected in your inputs. To fix model bias, fix the balancing problem. Learn more about [imbalanced data.](https://docs.microsoft.com/azure/machine-learning/concept-manage-ml-pitfalls#identify-models-with-imbalanced-data)
+**Memory issues detection** |*Passed* <br><br><br><br> *Done* |<br> The selected {horizon, lag, rolling window} value(s) were analyzed, and no potential out-of-memory issues were detected. Learn more about time-series [forecasting configurations.](https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-forecast#configure-and-run-experiment) <br><br><br>The selected {horizon, lag, rolling window} values were analyzed and will potentially cause your experiment to run out of memory. The lag or rolling window configurations have been turned off.
+**Frequency detection** |*Passed* <br><br><br><br> *Done* |<br> The time series was analyzed and all data points are aligned with the detected frequency. <br> <br> The time series was analyzed and data points that do not align with the detected frequency were detected. These data points were removed from the dataset. Learn more about [data preparation for time-series forecasting.](https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-forecast#preparing-data)
 
 ## Customize featurization
 
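The "Validation split handling" guardrail in the table above encodes a simple decision rule; a pure-Python restatement for illustration (the 20,000-row threshold comes from the table, the function name is ours):

```python
def validation_strategy(n_rows, validation="auto", threshold=20_000):
    """Mirror the 'Validation split handling' guardrail: with validation='auto',
    smaller datasets use cross-validation, larger ones a train/validation split."""
    if validation != "auto":
        return validation  # an explicit user configuration is kept as-is
    return "cross-validation" if n_rows < threshold else "train-validation split"

print(validation_strategy(5_000))   # cross-validation
print(validation_strategy(50_000))  # train-validation split
```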
@@ -116,10 +113,10 @@ Supported customization includes:
 
 |Customization|Definition|
 |--|--|
-|Column purpose update|Override feature type for the specified column.|
-|Transformer parameter update |Update parameters for the specified transformer. Currently supports Imputer (mean, most frequent & median) and HashOneHotEncoder.|
-|Drop columns |Columns to drop from being featurized.|
-|Block transformers| Block transformers to be used on featurization process.|
+|**Column purpose update**|Override feature type for the specified column.|
+|**Transformer parameter update** |Update parameters for the specified transformer. Currently supports Imputer (mean, most frequent & median) and HashOneHotEncoder.|
+|**Drop columns** |Columns to drop from being featurized.|
+|**Block transformers**| Block transformers to be used on featurization process.|
 
 Create the FeaturizationConfig object using API calls:
 ```python
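The FeaturizationConfig code block itself is truncated in this diff. Based on the customization table above, a hedged sketch of the four supported customizations might look like this, assuming the azureml-train-automl SDK of this era; the column names are placeholders:

```python
from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
# Column purpose update: override the inferred feature type of a column.
featurization_config.add_column_purpose("column_a", "Numeric")
# Transformer parameter update: e.g. median imputation for a specific column.
featurization_config.add_transformer_params("Imputer", ["column_b"], {"strategy": "median"})
# Drop columns: exclude columns from featurization.
featurization_config.drop_columns = ["column_c"]
# Block transformers: prevent a transformer from being used during featurization.
featurization_config.blocked_transformers = ["LabelEncoder"]
```

The resulting object would then be passed as the `featurization` parameter of `AutoMLConfig`.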

articles/machine-learning/toc.yml (2 additions, 2 deletions)

@@ -316,8 +316,8 @@
 - name: Auto-train a forecast model
   displayName: time series
   href: how-to-auto-train-forecast.md
-- name: Feature engineering in autoML (Python)
-  displayName: featurization, feature importance
+- name: Featurization in autoML (Python)
+  displayName: feature engineering, feature importance
   href: how-to-configure-auto-features.md
 - name: Use automated ML in ML pipelines (Python)
   displayName: machine learning automl
