
Commit 780f8ad

Merge pull request #101923 from likebupt/blanca-update-articles
add 1 new module article, update import data article
2 parents 3f5d865 + 24e552c commit 780f8ad


4 files changed

+121
-6
lines changed

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
1+
---
2+
title: "Convert to Indicator Values"
3+
titleSuffix: Azure Machine Learning
4+
description: Learn how to use the Convert to Indicator Values module in Azure Machine Learning to convert columns that contain categorical values into a series of binary indicator columns.
5+
services: machine-learning
6+
ms.service: machine-learning
7+
ms.subservice: core
8+
ms.topic: reference
9+
10+
author: likebupt
11+
ms.author: keli19
12+
ms.date: 02/11/2020
13+
---
14+
15+
# Convert to Indicator Values
16+
This article describes a module of Azure Machine Learning designer.
17+
18+
Use the **Convert to Indicator Values** module in Azure Machine Learning designer to convert columns that contain categorical values into a series of binary indicator columns.
19+
20+
This module also outputs a definition of the transformation used to convert to indicator values. You can reuse this transformation on other datasets that have the same schema, by using the [Apply Transformation](apply-transformation.md) module.
21+
22+
## How to configure Convert to Indicator Values
23+
24+
1. Find the **Convert to Indicator Values** module and drag it to your pipeline draft. You can find this module under the **Data Transformation** category.
25+
> [!NOTE]
26+
> You can use the [Edit Metadata](edit-metadata.md) module before the **Convert to Indicator Values** module to mark the target column(s) as categorical.
27+
28+
1. Connect the **Convert to Indicator Values** module to the dataset containing the columns you want to convert.
29+
30+
1. Select **Edit column** to choose one or more categorical columns.
31+
32+
1. Select the **Overwrite categorical columns** option if you want to output **only** the new Boolean columns. By default, this option is off.
33+
34+
35+
> [!TIP]
36+
> If you choose the option to overwrite, the source column is not actually deleted or modified. Instead, the new columns are generated and presented in the output dataset, and the source column remains available in the workspace.
37+
> If you need to see the original data, you can use the [Add Columns](add-columns.md) module at any time to add the source column back in.
38+
39+
1. Run the pipeline.
40+
41+
## Results
42+
43+
Suppose you have a column with scores that indicate whether a server has a high, medium, or low probability of failure.
44+
45+
| Server ID | Failure score |
| --------- | ------------- |
| 10301 | Low |
| 10302 | Medium |
| 10303 | High |
50+
51+
When you apply **Convert to Indicator Values**, the designer converts a single column of labels into multiple columns containing Boolean values:
52+
53+
| Server ID | Failure score - Low | Failure score - Medium | Failure score - High |
| --------- | ------------------- | ---------------------- | -------------------- |
| 10301 | 1 | 0 | 0 |
| 10302 | 0 | 1 | 0 |
| 10303 | 0 | 0 | 1 |
58+
59+
Here's how the conversion works:
60+
61+
- In the **Failure score** column that describes risk, there are only three possible values (High, Medium, and Low), and no missing values. So, exactly three new columns are created.
62+
63+
- The new indicator columns are named based on the column headings and values of the source column, using this pattern: *\<source column>- \<data value>*.
64+
65+
- Each row has a 1 in exactly one indicator column and a 0 in all the other indicator columns, because each server can have only one risk rating.
66+
67+
You can now use the three indicator columns as features in a machine learning model.
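
The conversion described above can be sketched in plain Python. This is only an illustration of the idea, not the designer's implementation: the function name `convert_to_indicators`, its `overwrite` parameter (meant to mirror the **Overwrite categorical columns** option), and the ` - ` separator (taken from the example table) are assumptions made for this sketch.

```python
def convert_to_indicators(rows, column, overwrite=False):
    """Add one 0/1 indicator column per distinct value found in `column`.

    Hypothetical helper for illustration; not part of any Azure SDK.
    """
    # Distinct values in first-seen order; each one becomes an indicator column.
    values = list(dict.fromkeys(row[column] for row in rows))
    converted = []
    for row in rows:
        # With overwrite=True, the source column is omitted from the output.
        new_row = {k: v for k, v in row.items() if not (overwrite and k == column)}
        for value in values:
            new_row[f"{column} - {value}"] = int(row[column] == value)
        converted.append(new_row)
    return converted


servers = [
    {"Server ID": 10301, "Failure score": "Low"},
    {"Server ID": 10302, "Failure score": "Medium"},
    {"Server ID": 10303, "Failure score": "High"},
]
indicators = convert_to_indicators(servers, "Failure score", overwrite=True)
for row in indicators:
    print(row)
```

Columns that are not selected, such as `Server ID`, pass through unchanged, and each output row carries exactly one 1 across the indicator columns.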
68+
69+
The module returns two outputs:
70+
71+
- **Results dataset**: A dataset with the converted indicator value columns. Columns not selected for conversion are also "passed through".
72+
- **Indicator values transformation**: A data transformation used for converting to indicator values, that can be saved in your workspace and applied to new data later.
73+
74+
## Apply a saved indicator values operation to new data
75+
76+
If you need to repeat indicator values operations often, you can save your data manipulation steps as a *transform* and reuse it with other datasets that have the same schema. This is useful if you must frequently reimport and then convert data that has the same schema.
77+
78+
1. Add the [Apply Transformation](apply-transformation.md) module to your pipeline.
79+
80+
1. Add the dataset you want to convert, and connect the dataset to the right-hand input port.
81+
82+
1. Expand the **Data Transformation** group in the left-hand pane of designer. Locate the saved transformation and drag it into the pipeline.
83+
84+
1. Connect the saved transformation to the left input port of [Apply Transformation](apply-transformation.md).
85+
86+
When you apply a saved transformation, you cannot select which columns to transform. This is because the transformation has been defined and applies automatically to the data types specified in the original operation.
87+
88+
1. Run the pipeline.
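
As a rough analogy for the steps above, the saved transformation can be thought of as a record of which columns were converted and which values were seen, which is then replayed on new data. The sketch below assumes that model; `fit_indicator_transform` and `apply_indicator_transform` are hypothetical names, and the handling of values not seen during fitting (all zeros here) is a choice made for this sketch, not documented module behavior.

```python
def fit_indicator_transform(rows, columns):
    """Record the distinct values of each selected column, in first-seen order."""
    return {col: list(dict.fromkeys(row[col] for row in rows)) for col in columns}


def apply_indicator_transform(transform, rows):
    """Replay a fitted transform; the columns are fixed by the transform itself."""
    output = []
    for row in rows:
        # Columns covered by the transform are replaced by their indicators.
        new_row = {k: v for k, v in row.items() if k not in transform}
        for col, values in transform.items():
            for value in values:
                new_row[f"{col} - {value}"] = int(row[col] == value)
        output.append(new_row)
    return output


training = [
    {"Server ID": 10301, "Failure score": "Low"},
    {"Server ID": 10302, "Failure score": "High"},
]
transform = fit_indicator_transform(training, ["Failure score"])

new_data = [{"Server ID": 10400, "Failure score": "High"}]
result = apply_indicator_transform(transform, new_data)
print(result)
```

Because the column list lives inside `transform`, the apply step takes no column selection, which matches the restriction noted above.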
89+
90+
## Technical notes
91+
92+
This section contains implementation details, tips, and answers to frequently asked questions.
93+
94+
### Usage tips
95+
96+
- Only columns that are marked as categorical can be converted to indicator columns. If you see the following error, it is likely that one of the columns you selected is not categorical:
97+
98+
Error 0056: Column with name \<column name> is not in an allowed category.
99+
100+
By default, most string columns are handled as string features, so you must explicitly mark them as categorical using [Edit Metadata](edit-metadata.md).
101+
102+
- There is no limit on the number of columns that you can convert to indicator columns. However, because each column of values can yield multiple indicator columns, you may want to convert and review just a few columns at a time.
103+
104+
- If the column contains missing values, a separate indicator column is created for the missing category, with this name: *\<source column>- Missing*
105+
106+
- If the column that you convert to indicator values contains numbers, it must be marked as categorical like any other feature column. After you have done so, the numbers are treated as discrete values. For example, if you have a numeric column with MPG values ranging from 25 to 30, a new indicator column would be created for each discrete value:
107+
108+
| Make | Highway mpg -25 | Highway mpg -26 | Highway mpg -27 | Highway mpg -28 | Highway mpg -29 | Highway mpg -30 |
| ---------- | --------------- | --------------- | --------------- | --------------- | --------------- | --------------- |
| Contoso Cars | 0 | 0 | 0 | 0 | 0 | 1 |
111+
112+
- To avoid adding too many dimensions to your dataset, we recommend that you first check the number of distinct values in the column, and bin or quantize the data appropriately.
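
The tips about missing values, numeric columns, and dimensionality can be checked before conversion with a small helper. This is a sketch under assumptions: `indicator_column_names` is a hypothetical function, `None` stands in for a missing value, the ` - ` separator is illustrative, and `MAX_NEW_COLUMNS` is an arbitrary threshold chosen for the example.

```python
MAX_NEW_COLUMNS = 4  # arbitrary threshold for this example


def indicator_column_names(values, column):
    """List the indicator columns a categorical column would produce."""
    names = []
    for value in values:
        # Missing entries get their own "Missing" category.
        label = "Missing" if value is None else value
        name = f"{column} - {label}"
        if name not in names:
            names.append(name)
    return names


# Numeric values are treated as discrete categories, not as a range.
highway_mpg = [25, 30, None, 27, 30, 26]
names = indicator_column_names(highway_mpg, "Highway mpg")
too_many = len(names) > MAX_NEW_COLUMNS

print(len(names), names)
if too_many:
    print("Consider binning this column before converting it.")
```

Counting the prospective columns first makes it easier to decide whether to bin or quantize a column before converting it.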
113+
114+
115+
## Next steps
116+
117+
See the [set of modules available](module-reference.md) to Azure Machine Learning.

articles/machine-learning/algorithm-module-reference/import-data.md

Lines changed: 1 addition & 5 deletions
@@ -37,7 +37,7 @@ Before using cloud storage, you need to register a datastore in your Azure Machi
3737

3838
After you define the data you want and connect to the source, **[Import Data](./import-data.md)** infers the data type of each column based on the values it contains, and loads the data into your designer pipeline. The output of **Import Data** is a dataset that can be used with any designer pipeline.
3939

40-
If your source data changes, you can refresh the dataset and add new data by rerunning [Import Data](./import-data.md). However, if you don't want to re-read from the source each time you run the pipeline, set the **Use cached results** option to TRUE. When this option is selected, the module checks whether the pipeline has run previously using the same source and same input options. If a previous run is found, the data in the cache is used, instead of reloading the data from the source.
40+
If your source data changes, you can refresh the dataset and add new data by rerunning [Import Data](./import-data.md).
4141

4242
## How to configure Import Data
4343

@@ -56,11 +56,7 @@ If your source data changes, you can refresh the dataset and add new data by rer
5656

5757
![import-data-preview](media/module/import-data.png)
5858

59-
1. Select the **Use cached results** option if you want to cache the dataset for reuse on successive runs.
6059

61-
Assuming there have been no other changes to module parameters, the pipeline loads the data only the first time the module is run, and thereafter uses a cached version of the dataset.
62-
63-
Deselect this option if you need to reload the data each time you run the pipeline.
6460

6561
1. Run the pipeline.
6662

articles/machine-learning/algorithm-module-reference/module-reference.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ For help with choosing algorithms, see
3030
| Functionality | Description | Module |
3131
| --- |--- | --- |
3232
| Data input and output | Move data from cloud sources into your pipeline. Write your results or intermediate data to Azure Storage, a SQL database, or Hive, while running a pipeline, or use cloud storage to exchange data between pipelines. | [Enter Data Manually](enter-data-manually.md) <br/> [Export Data](export-data.md) <br/> [Import Data](import-data.md) |
33-
| Data transformation | Operations on data that are unique to machine learning, such as normalizing or binning data, dimensionality reduction, and converting data among various file formats.| [Add Columns](add-columns.md) <br/> [Add Rows](add-rows.md) <br/> [Apply Math Operation](apply-math-operation.md) <br/> [Apply SQL Transformation](apply-sql-transformation.md) <br/> [Clean Missing Data](clean-missing-data.md) <br/> [Clip Values](clip-values.md) <br/> [Convert to CSV](convert-to-csv.md) <br/> [Convert to Dataset](convert-to-dataset.md) <br/> [Edit Metadata](edit-metadata.md) <br/> [Join Data](join-data.md) <br/> [Normalize Data](normalize-data.md) <br/> [Partition and Sample](partition-and-sample.md) <br/> [Remove Duplicate Rows](remove-duplicate-rows.md) <br/> [SMOTE](smote.md) <br/> [Select Columns Transform](select-columns-transform.md) <br/> [Select Columns in Dataset](select-columns-in-dataset.md) <br/> [Split Data](split-data.md) |
33+
| Data transformation | Operations on data that are unique to machine learning, such as normalizing or binning data, dimensionality reduction, and converting data among various file formats.| [Add Columns](add-columns.md) <br/> [Add Rows](add-rows.md) <br/> [Apply Math Operation](apply-math-operation.md) <br/> [Apply SQL Transformation](apply-sql-transformation.md) <br/> [Clean Missing Data](clean-missing-data.md) <br/> [Clip Values](clip-values.md) <br/> [Convert to CSV](convert-to-csv.md) <br/> [Convert to Dataset](convert-to-dataset.md) <br/> [Convert to Indicator Values](convert-to-indicator-values.md) <br/> [Edit Metadata](edit-metadata.md) <br/> [Join Data](join-data.md) <br/> [Normalize Data](normalize-data.md) <br/> [Partition and Sample](partition-and-sample.md) <br/> [Remove Duplicate Rows](remove-duplicate-rows.md) <br/> [SMOTE](smote.md) <br/> [Select Columns Transform](select-columns-transform.md) <br/> [Select Columns in Dataset](select-columns-in-dataset.md) <br/> [Split Data](split-data.md) |
3434
| Feature Selection | Select a subset of relevant, useful features to use in building an analytical model. | [Filter Based Feature Selection](filter-based-feature-selection.md) <br/> [Permutation Feature Importance](permutation-feature-importance.md) |
3535
| Statistical Functions | Provide a wide variety of statistical methods related to data science. | [Summarize Data](summarize-data.md)|
3636

articles/machine-learning/algorithm-module-reference/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@
2626
href: convert-to-csv.md
2727
- name: Convert to Dataset
2828
href: convert-to-dataset.md
29+
- name: Convert to Indicator Values
30+
href: convert-to-indicator-values.md
2931
- name: Clean Missing Data
3032
href: clean-missing-data.md
3133
- name: Edit Metadata
