Commit 3f6a6ac

committed
Edit pass
1 parent 57ef83d commit 3f6a6ac

1 file changed

+24
-15
lines changed

articles/machine-learning/how-to-designer-transform-data.md

Lines changed: 24 additions & 15 deletions
@@ -15,18 +15,21 @@ ms.date: 05/04/2020
1515
# Transform data in Azure Machine Learning designer (preview)
1616
[!INCLUDE [applies-to-skus](../../includes/aml-applies-to-enterprise-sku.md)]
1717

18-
In this article, you will learn how to transform and save datasets in the designer so that you can prepare your own data for machine learning. Use the provided [Adult Census Income Binary Classification dataset](sample-designer-datasets.md) to prepare two datasets: one dataset that includes adult census information from only the United States and another dataset that only includes census information from non-US adults.
18+
In this article, you learn how to transform and save datasets in Azure Machine Learning designer so that you can prepare your own data for machine learning.
19+
20+
You will use the sample [Adult Census Income Binary Classification](sample-designer-datasets.md) dataset to prepare two datasets: one dataset that includes adult census information from only the United States and another dataset that includes census information from non-US adults.
1921

2022
In this article, you learn how to:
2123

2224
1. Transform a dataset to prepare it for training.
2325
1. Export the resulting datasets to a datastore.
26+
1. View results.
2427

2528
This how-to is a prerequisite for the [how to retrain designer models](how-to-retrain-designer.md) article. In that article, you will learn how to use the transformed datasets to train multiple models.
2629

2730
## Transform a dataset
2831

29-
In this section, you learn how to import the sample dataset and split the data into US and non-US datasets using the **Split Data** module. For this how to, use **Adult Census Income Binary classification** as your starting point. For more information on how to import your own data into the designer, see [how to import data](how-to-designer-import-data.md).
32+
In this section, you learn how to import the sample dataset and split the data into US and non-US datasets. For more information on how to import your own data into the designer, see [how to import data](how-to-designer-import-data.md).
3033

3134
### Import data
3235

@@ -44,27 +47,27 @@ Use the following steps to import the sample dataset.
4447

4548
1. Select the **Adult Census Income** dataset module.
4649

47-
1. In the details pane that appears to the right of the canvas, select **Outputs**. Then select the visualize icon ![visualize icon](media/how-to-designer-transform-data/visualize-icon.png).
50+
1. In the details pane that appears to the right of the canvas, select **Outputs**. Select the visualize icon ![visualize icon](media/how-to-designer-transform-data/visualize-icon.png).
4851

4952
1. Use the data preview window to explore the dataset. Take note of the "native-country" column values.
5053

5154
### Split the data
5255

53-
In this section, you use the [Split Data module](algorithm-module-reference/split-data.md) to identify rows that contain "United-States" in the "native-country" column.
56+
In this section, you use the [Split Data module](algorithm-module-reference/split-data.md) to identify and split rows that contain "United-States" in the "native-country" column.
5457

5558
1. In the module palette to the left of the canvas, expand the **Data Transformation** section and find the **Split Data** module.
5659

5760
1. Drag the **Split Data** module onto the canvas, and drop the module below the dataset module.
5861

59-
1. Connect the **Adult Census Income Binary classification** dataset to the **Split Data** module.
62+
1. Connect the dataset module to the **Split Data** module.
6063

6164
1. Select the **Split Data** module.
6265

6366
1. In the module details pane to the right of the canvas, set **Splitting mode** to **Regular Expression**.
6467

6568
1. Enter the **Regular Expression**: `\"native-country" United-States`.
6669

67-
The **Regular expression** mode tests a single column for a value. For more information on the Split Data module, see the related [algorithm reference page](algorithm-module-reference/split-data.md).
70+
The **Regular expression** mode tests a single column for a value. For more information on the Split Data module, see the related [algorithm module reference page](algorithm-module-reference/split-data.md).
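For readers who want to reason about the split outside the designer, the following is a minimal sketch of the same idea in plain Python. It isn't the module's implementation: the helper name `split_by_regex`, the sample rows, and the use of `re.search` (rather than a full match) are all assumptions made for illustration.

```python
import re


def split_by_regex(rows, column, pattern):
    """Partition rows by whether the value in `column` matches `pattern`.

    Returns (matched, unmatched). This only mirrors the idea of testing
    a single column for a value; it is not the designer's own code.
    """
    regex = re.compile(pattern)
    matched, unmatched = [], []
    for row in rows:
        # Assumption: a substring search is used. The real module's
        # matching rules may differ.
        if regex.search(str(row.get(column, ""))):
            matched.append(row)
        else:
            unmatched.append(row)
    return matched, unmatched


# Hypothetical sample rows, not taken from the real dataset.
sample_rows = [
    {"age": 39, "native-country": "United-States"},
    {"age": 50, "native-country": "Cuba"},
    {"age": 38, "native-country": "United-States"},
    {"age": 28, "native-country": "Mexico"},
]

us_rows, non_us_rows = split_by_regex(
    sample_rows, "native-country", "United-States"
)
print(len(us_rows), len(non_us_rows))
```

Every input row lands in exactly one of the two lists, so the two results together always account for the whole dataset.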
6871

6972
Your pipeline should look like this:
7073

@@ -73,7 +76,7 @@ Your pipeline should look like this:
7376

7477
## Save the datasets
7578

76-
Now that your pipeline is set up to split the data, you need to specify where to persist the datasets to access them later. For this example, use the **Export Data** module to save your dataset to a datastore.
79+
Now that your pipeline is set up to split the data, you need to specify where to persist the datasets. For this example, use the **Export Data** module to save your dataset to a datastore.
7780

7881
1. In the module palette to the left of the canvas, expand the **Data Input and Output** section and find the **Export Data** module.
7982

@@ -85,9 +88,9 @@ Now that your pipeline is set up to split the data, you need to specify where to
8588

8689
![Screenshot showing how to connect the Export Data modules](media/how-to-designer-transform-data/export-data-pipeline.png).
8790

88-
1. Select the **Export Data** module connected to the *left*-most port of the **Split Data** module.
91+
1. Select the **Export Data** module that is connected to the *left*-most port of the **Split Data** module.
8992

90-
The order of the output ports matter. The first output port (left) contains the rows where the Split Data regular expression is true. In this case, the first port contains rows for the US-based income, and the second port contains rows for the non-US based income.
93+
The order of the output ports matters for the **Split Data** module. The first output port contains the rows where the regular expression is true. In this case, the first port contains rows for the US-based income, and the second port contains rows for the non-US based income.
9194

9295
1. In the module details pane to the right of the canvas, set the following options:
9396

@@ -102,15 +105,15 @@ Now that your pipeline is set up to split the data, you need to specify where to
102105
> [!NOTE]
103106
> This article assumes that you have access to a datastore registered to the current Azure Machine Learning workspace. For instructions on how to set up a datastore, see [Connect to Azure storage services](how-to-access-data.md#azure-machine-learning-studio).
104107
105-
If you don't have a datastore, you can create one now. For example purposes, this article will save the datasets to the default blob storage account associated with the workspace. It will save the datasets into a new folder called `data`.
108+
If you don't have a datastore, you can create one now. For example purposes, this article will save the datasets to the default blob storage account associated with the workspace. It will save the datasets into the `azureml` container in a new folder called `data`.
106109

107110
1. Select the **Export Data** module connected to the *right*-most port of the **Split Data** module.
108111

109112
1. In the module details pane to the right of the canvas, set the following options:
110113

111114
**Datastore type**: Azure Blob Storage
112115

113-
**Datastore**: Select an existing datastore or select "New datastore" to create one now.
116+
**Datastore**: Select the same datastore as above
114117

115118
**Path**: `/data/non-us-income`
116119

@@ -124,17 +127,21 @@ Now that your pipeline is set up to split the data, you need to specify where to
124127

125128
![Screenshot showing how to configure the Export Data modules](media/how-to-designer-transform-data/us-income-export-data.png).
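As a rough local analogy for what the two **Export Data** modules do, the sketch below writes two row sets to separate files under a `data` folder. It uses only the Python standard library and a temporary directory in place of a datastore. The helper `export_rows`, the `.csv` extension, and the `/data/us-income` path are assumptions for illustration; only `/data/non-us-income` appears in the settings shown here.

```python
import csv
import tempfile
from pathlib import Path


def export_rows(rows, root, relative_path):
    """Write rows (a list of dicts) as CSV under root/relative_path."""
    target = Path(root) / relative_path.lstrip("/")
    # Create the containing folder (for example, `data`) if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = list(rows[0].keys()) if rows else []
    with target.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return target


# Hypothetical rows standing in for the two outputs of the split.
us_rows = [{"age": 39, "native-country": "United-States"}]
non_us_rows = [{"age": 50, "native-country": "Cuba"}]

# A temporary directory stands in for the datastore root.
root = tempfile.mkdtemp()
us_file = export_rows(us_rows, root, "/data/us-income.csv")
non_us_file = export_rows(non_us_rows, root, "/data/non-us-income.csv")
print(us_file.name, non_us_file.name)
```

Keeping each output in its own path means the two datasets can be read back independently later.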
126129

127-
1. At the top of the canvas, select **Submit** to submit the run.
130+
### Submit the run
131+
132+
Now that your pipeline is set up to split and export the data, submit a pipeline run.
133+
134+
1. At the top of the canvas, select **Submit**.
128135

129136
1. In the **Set up pipeline run** dialog, select **Create new**.
130137

131138
1. Provide a descriptive experiment name like "split-census-data".
132139

133140
1. Select **Submit**.
134141

135-
### View results
142+
## View results
136143

137-
After the pipeline finishes running, you can view your results by navigating to your blob storage in the Azure portal. You can also view the intermediary results for the **Split Data** module to confirm that your data split correctly.
144+
After the pipeline finishes running, you can view your results by navigating to your blob storage in the Azure portal. You can also view the intermediary results of the **Split Data** module to confirm that your data has split correctly.
138145

139146
1. Select the **Split Data** module.
140147

@@ -156,4 +163,6 @@ Skip this section if you want to continue on with part 2 of this how to, [Retrai
156163

157164
## Next steps
158165

159-
In this article, you learned how to transform a dataset and save it to a registered datastore. Continue on with [Retrain models with Azure Machine Learning designer](how-to-retrain-designer.md) to use your transformed datasets and pipeline parameters to train machine learning models.
166+
In this article, you learned how to transform a dataset and save it to a registered datastore.
167+
168+
Continue to the next part of this how-to series with [Retrain models with Azure Machine Learning designer](how-to-retrain-designer.md) to use your transformed datasets and pipeline parameters to train machine learning models.
