Commit dc83eaa

authored
Merge pull request #113869 from PeterCLu/plu-designer-transform-data
Transform data with the designer
2 parents fd14889 + 4f34b55 commit dc83eaa

8 files changed

+176
-7
lines changed

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
1+
---
2+
title: Transform data
3+
titleSuffix: Azure Machine Learning
4+
description: Learn how to transform data in Azure Machine Learning designer to create your own datasets.
5+
services: machine-learning
6+
ms.service: machine-learning
7+
ms.subservice: core
8+
ms.topic: how-to
9+
10+
author: peterclu
11+
ms.author: peterlu
12+
ms.date: 05/04/2020
13+
---
14+
15+
# Transform data in Azure Machine Learning designer (preview)
16+
[!INCLUDE [applies-to-skus](../../includes/aml-applies-to-enterprise-sku.md)]
17+
18+
In this article, you learn how to transform and save datasets in Azure Machine Learning designer so that you can prepare your own data for machine learning.
19+
20+
You will use the sample [Adult Census Income Binary Classification](sample-designer-datasets.md) dataset to prepare two datasets: one dataset that includes adult census information from only the United States and another dataset that includes census information from non-US adults.
21+
22+
In this article, you learn how to:
23+
24+
1. Transform a dataset to prepare it for training.
25+
1. Export the resulting datasets to a datastore.
26+
1. View results.
27+
28+
This how-to is a prerequisite for the [how to retrain designer models](how-to-retrain-designer.md) article. In that article, you will learn how to use the transformed datasets to train multiple models with pipeline parameters.
29+
30+
## Transform a dataset
31+
32+
In this section, you learn how to import the sample dataset and split the data into US and non-US datasets. For more information on how to import your own data into the designer, see [how to import data](how-to-designer-import-data.md).
33+
34+
### Import data
35+
36+
Use the following steps to import the sample dataset.
37+
38+
1. Sign in to <a href="https://ml.azure.com?tabs=jre" target="_blank">ml.azure.com</a>, and select the workspace you want to work with.
39+
40+
1. Go to the designer. Select **Easy-to-use prebuilt modules** to create a new pipeline.
41+
42+
1. Select a default compute target to run the pipeline.
43+
44+
1. To the left of the pipeline canvas is a palette of datasets and modules. Select **Datasets**. Then view the **Samples** section.
45+
46+
1. Drag and drop the **Adult Census Income Binary Classification** dataset onto the canvas.
47+
48+
1. Select the **Adult Census Income** dataset module.
49+
50+
1. In the details pane that appears to the right of the canvas, select **Outputs**.
51+
52+
1. Select the visualize icon ![visualize icon](media/how-to-designer-transform-data/visualize-icon.png).
53+
54+
1. Use the data preview window to explore the dataset. Take special note of the "native-country" column values.
55+
56+
### Split the data
57+
58+
In this section, you use the [Split Data module](algorithm-module-reference/split-data.md) to identify and split rows that contain "United-States" in the "native-country" column.
59+
60+
1. In the module palette to the left of the canvas, expand the **Data Transformation** section and find the **Split Data** module.
61+
62+
1. Drag the **Split Data** module onto the canvas, and drop the module below the dataset module.
63+
64+
1. Connect the dataset module to the **Split Data** module.
65+
66+
1. Select the **Split Data** module.
67+
68+
1. In the module details pane to the right of the canvas, set **Splitting mode** to **Regular Expression**.
69+
70+
1. Enter the **Regular Expression**: `\"native-country" United-States`.
71+
72+
The **Regular Expression** mode tests a single column for a value. For more information on the Split Data module, see the related [algorithm module reference page](algorithm-module-reference/split-data.md).
73+
74+
Your pipeline should look like this:
75+
76+
![Screenshot showing how to configure the pipeline and the Split Data module](media/how-to-designer-transform-data/split-data.png).
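
To make the split concrete, the following is a minimal Python sketch of what a single-column regular expression split does conceptually. It is not the designer's implementation: the `split_by_regex` helper and the sample rows are hypothetical, and the sketch assumes the pattern is searched for within the column value rather than required to match the whole value.

```python
import re


def split_by_regex(rows, column, pattern):
    """Return (matching_rows, remaining_rows) for a regex tested against one column.

    Hypothetical helper for illustration only; it is not part of any Azure SDK.
    """
    regex = re.compile(pattern)
    matched, rest = [], []
    for row in rows:
        # Rows whose column value matches go to the first list, all others to the second.
        if regex.search(row[column]):
            matched.append(row)
        else:
            rest.append(row)
    return matched, rest


# Hypothetical sample rows; the real dataset has many more columns and rows.
rows = [
    {"age": "39", "native-country": "United-States"},
    {"age": "28", "native-country": "Cuba"},
    {"age": "49", "native-country": "Jamaica"},
    {"age": "52", "native-country": "United-States"},
]

us_rows, non_us_rows = split_by_regex(rows, "native-country", r"United-States")
print(len(us_rows), len(non_us_rows))
```

The first returned list plays the role of the module's first output and the second list plays the role of the second output.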
77+
78+
79+
## Save the datasets
80+
81+
Now that your pipeline is set up to split the data, you need to specify where to persist the datasets. For this example, use the **Export Data** module to save your dataset to a datastore. For more information on datastores, see [Connect to Azure storage services](how-to-access-data.md).
82+
83+
1. In the module palette to the left of the canvas, expand the **Data Input and Output** section and find the **Export Data** module.
84+
85+
1. Drag and drop two **Export Data** modules below the **Split Data** module.
86+
87+
1. Connect each output port of the **Split Data** module to a different **Export Data** module.
88+
89+
Your pipeline should look something like this:
90+
91+
![Screenshot showing how to connect the Export Data modules](media/how-to-designer-transform-data/export-data-pipeline.png).
92+
93+
1. Select the **Export Data** module that is connected to the *left*-most port of the **Split Data** module.
94+
95+
The order of the output ports matters for the **Split Data** module. The first output port contains the rows where the regular expression is true. In this case, the first port contains rows for US-based income, and the second port contains rows for non-US-based income.
96+
97+
1. In the module details pane to the right of the canvas, set the following options:
98+
99+
**Datastore type**: Azure Blob Storage
100+
101+
**Datastore**: Select an existing datastore or select "New datastore" to create one now.
102+
103+
**Path**: `/data/us-income`
104+
105+
**File format**: csv
106+
107+
> [!NOTE]
108+
> This article assumes that you have access to a datastore registered to the current Azure Machine Learning workspace. For instructions on how to set up a datastore, see [Connect to Azure storage services](how-to-access-data.md#azure-machine-learning-studio).
109+
110+
If you don't have a datastore, you can create one now. For example purposes, this article will save the datasets to the default blob storage account associated with the workspace. It will save the datasets into the `azureml` container in a new folder called `data`.
111+
112+
1. Select the **Export Data** module connected to the *right*-most port of the **Split Data** module.
113+
114+
1. In the module details pane to the right of the canvas, set the following options:
115+
116+
**Datastore type**: Azure Blob Storage
117+
118+
**Datastore**: Select the same datastore as above
119+
120+
**Path**: `/data/non-us-income`
121+
122+
**File format**: csv
123+
124+
1. Confirm the **Export Data** module connected to the left port of the **Split Data** module has the **Path** `/data/us-income`.
125+
126+
1. Confirm the **Export Data** module connected to the right port has the **Path** `/data/non-us-income`.
127+
128+
Your pipeline and settings should look like this:
129+
130+
![Screenshot showing how to configure the Export Data modules](media/how-to-designer-transform-data/us-income-export-data.png).
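
As a rough local analogy for the two export settings above, the Python sketch below writes two row sets as CSV files to the relative paths `data/us-income` and `data/non-us-income`. It makes several assumptions: a temporary local directory stands in for the blob container, the `export_rows` helper and the sample rows are hypothetical, and nothing here calls Azure storage or reflects how the **Export Data** module works internally.

```python
import csv
import tempfile
from pathlib import Path


def export_rows(rows, container_root, relative_path):
    """Write non-empty rows as CSV under container_root, creating folders as needed.

    Hypothetical helper for illustration; a local folder stands in for the datastore.
    """
    target = Path(container_root) / relative_path.lstrip("/")
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return target


# Hypothetical stand-ins for the two Split Data outputs.
us_rows = [{"age": "39", "native-country": "United-States"}]
non_us_rows = [{"age": "28", "native-country": "Cuba"}]

container_root = tempfile.mkdtemp()  # assumption: local stand-in for the container
us_file = export_rows(us_rows, container_root, "/data/us-income")
non_us_file = export_rows(non_us_rows, container_root, "/data/non-us-income")
print(us_file.exists(), non_us_file.exists())
```

Keeping one path per output mirrors the pipeline, where each **Export Data** module receives exactly one port of the **Split Data** module.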
131+
132+
### Submit the run
133+
134+
Now that your pipeline is set up to split and export the data, submit a pipeline run.
135+
136+
1. At the top of the canvas, select **Submit**.
137+
138+
1. In the **Set up pipeline run** dialog, select **Create new** to create an experiment.
139+
140+
Experiments logically group together related pipeline runs. If you run this pipeline in the future, you should use the same experiment for logging and tracking purposes.
141+
142+
1. Provide a descriptive experiment name like "split-census-data".
143+
144+
1. Select **Submit**.
145+
146+
## View results
147+
148+
After the pipeline finishes running, you can view your results by navigating to your blob storage in the Azure portal. You can also view the intermediate results of the **Split Data** module to confirm that your data has been split correctly.
149+
150+
1. Select the **Split Data** module.
151+
152+
1. In the module details pane to the right of the canvas, select **Outputs + logs**.
153+
154+
1. Select the visualize icon ![visualize icon](media/how-to-designer-transform-data/visualize-icon.png) next to **Results dataset1**.
155+
156+
1. Verify that the "native-country" column only contains the value "United-States".
157+
158+
1. Select the visualize icon ![visualize icon](media/how-to-designer-transform-data/visualize-icon.png) next to **Results dataset2**.
159+
160+
1. Verify that the "native-country" column does not contain the value "United-States".
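
The same two checks can be expressed in code if you load the exported rows yourself. The following is a minimal Python sketch under stated assumptions: the `country_values` helper and the sample rows are hypothetical, and how you load the exported CSV files is left out.

```python
def country_values(rows, column="native-country"):
    """Collect the distinct values found in one column (hypothetical helper)."""
    return {row[column] for row in rows}


# Hypothetical stand-ins for the rows of the two Split Data outputs.
results_dataset1 = [
    {"age": "39", "native-country": "United-States"},
    {"age": "52", "native-country": "United-States"},
]
results_dataset2 = [
    {"age": "28", "native-country": "Cuba"},
    {"age": "49", "native-country": "Jamaica"},
]

# The first output should contain only US rows; the second should contain none.
first_ok = country_values(results_dataset1) == {"United-States"}
second_ok = "United-States" not in country_values(results_dataset2)
print(first_ok, second_ok)
```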
161+
162+
## Clean up resources
163+
164+
Skip this section if you want to continue with part 2 of this how-to series, [Retrain models with Azure Machine Learning designer](how-to-retrain-designer.md).
165+
166+
[!INCLUDE [aml-ui-cleanup](../../includes/aml-ui-cleanup.md)]
167+
168+
## Next steps
169+
170+
In this article, you learned how to transform a dataset and save it to a registered datastore.
171+
172+
Continue to the next part of this how-to series with [Retrain models with Azure Machine Learning designer](how-to-retrain-designer.md) to use your transformed datasets and pipeline parameters to train machine learning models.

articles/machine-learning/how-to-retrain-designer.md

Lines changed: 1 addition & 7 deletions
@@ -27,13 +27,7 @@ In this article, you learn how to:
2727
## Prerequisites
2828

2929
* An Azure Machine Learning workspace with the Enterprise SKU.
30-
* A dataset accessible to the designer. This can be one of the following:
31-
* An Azure Machine Learning registered dataset
32-
33-
**-or-**
34-
* A data file stored in an Azure Machine Learning datastore.
35-
36-
For information on data access using the designer see [How to import data into the designer](how-to-designer-import-data.md).
30+
* Complete part 1 of this how-to series, [Transform data in the designer](how-to-designer-transform-data.md).
3731

3832
This article also assumes that you have basic knowledge of building pipelines in the designer. For a guided introduction, complete the [tutorial](tutorial-designer-automobile-price-train-score.md).
3933


articles/machine-learning/toc.yml

Lines changed: 3 additions & 0 deletions
@@ -384,6 +384,9 @@
384384
- name: 'Azure Pipelines for CI/CD'
385385
displayName: continuous, integration, delivery
386386
href: /azure/devops/pipelines/targets/azure-machine-learning?context=azure/machine-learning/service/context/ml-context
387+
- name: 'Designer transform data'
388+
displayName: pipeline
389+
href: how-to-designer-transform-data.md
387390
- name: 'Designer retrain using published pipelines'
388391
displayName: retrain, designer, published pipeline
389392
href: how-to-retrain-designer.md
