Skip to content

Commit 8dac8f0

Larry O'Brien committed: Final draft
1 parent 89dc119 commit 8dac8f0

2 files changed (+106, -27 lines)


articles/machine-learning/how-to-move-data-in-and-out-of-pipelines.md

Lines changed: 104 additions & 27 deletions
@@ -1,38 +1,59 @@
11
---
22
title: 'Input and Output Data from ML Pipelines'
33
titleSuffix: Azure Machine Learning
4-
description:Prepare, consume, and generate data in Azure Machine Learning pipelines
4+
description: Prepare, consume, and generate data in Azure Machine Learning pipelines
55
services: machine-learning
66
ms.service: machine-learning
77
ms.subservice: core
88
ms.topic: conceptual
99
ms.author: laobri
1010
author: lobrien
11-
ms.date: 11/06/2019
11+
ms.date: 04/01/2020
1212
---
1313

14-
# Input and Output Data from ML Pipelines
14+
# Moving data into and between ML pipeline steps (Python)
1515

1616
[!INCLUDE [applies-to-skus](../../includes/aml-applies-to-basic-enterprise-sku.md)]
1717

18-
tk Do SEO keyword search to refine meta-title, meta-desc, H1 and 1st para tk
19-
Azure Machine Learning pipelines allow you to create flexible, efficient, and modular ML solutions. Making data flow into, out from, and between pipeline steps is central to developing pipelines. For an overview of how data works in Azure Machine Learning, see [Data in Machine Learning](tk). This article will show you how to:
18+
Import, transform, and move data between steps in a machine learning pipeline. Machine learning pipelines allow you to create flexible, efficient, and modular ML solutions. Making data flow into, out from, and between pipeline steps is central to developing pipelines. For an overview of how data works in Azure Machine Learning, see [Access data in Azure storage services](how-to-access-data.md). For the benefits and structure of Azure Machine Learning pipelines, see [What are Azure Machine Learning pipelines?](concept-ml-pipelines.md).
19+
20+
This article will show you how to:
2021

2122
- Use `Dataset` objects for pre-existing data
2223
- Access data within your steps
2324
- Move between `Dataset` representations and Pandas and Apache Spark representations
2425
- Split `Dataset` data into subsets, such as training and validation subsets
25-
- Create a `PipelineData` object to transfer data to the next pipeline step
26+
- Create a `PipelineData` object to transfer data to the next pipeline step
27+
- Use `PipelineData` objects as input to pipeline steps
28+
29+
## Prerequisites
30+
31+
You'll need:
32+
33+
- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree).
34+
35+
- The [Azure Machine Learning SDK for Python](https://docs.microsoft.com/python/api/overview/azure/ml/intro?view=azure-ml-py), or access to [Azure Machine Learning studio](https://ml.azure.com/).
2636

27-
## Prerequisites
37+
- An Azure Machine Learning workspace.
38+
39+
Either [create an Azure Machine Learning workspace](how-to-manage-workspace.md) or use an existing one via the Python SDK. Import the `Workspace` and `Datastore` classes, and load your subscription information from the file `config.json` using the function `from_config()`. By default, this function looks for the JSON file in the current directory, but you can also specify a path parameter to point to the file using `from_config(path="your/file/path")`.
2840

29-
tk
41+
```python
42+
import azureml.core
43+
from azureml.core import Workspace, Datastore
44+
45+
ws = Workspace.from_config()
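# If config.json isn't in the current directory, point to it explicitly, for example:
# ws = Workspace.from_config(path="your/file/path")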
46+
```
47+
48+
- Some pre-existing data. This article briefly shows the use of an [Azure blob container](https://docs.microsoft.com/azure/storage/blobs/storage-blobs-overview).
49+
50+
- Optional: An existing machine learning pipeline, such as the one described in [Create and run machine learning pipelines with Azure Machine Learning SDK](how-to-create-your-first-pipeline.md).
3051

3152
## Use `Dataset` objects for pre-existing data
3253

33-
The preferred way to ingest data into a pipeline is to use a `Dataset` object. `Dataset` objects represent persistent data available throughout a workspace.
54+
The preferred way to ingest data into a pipeline is to use a `Dataset` object. `Dataset` objects represent persistent data available throughout a workspace.
3455

35-
There are many ways to create and register `Dataset` objects. The simplest programmatic way is to use existing blobs in workspace storage:
56+
There are many ways to create and register `Dataset` objects. Tabular datasets are for delimited data available in one or more files. File datasets are for binary data (such as images) or for data that you'll parse. The simplest programmatic ways to create `Dataset` objects are to use existing blobs in workspace storage or public URLs:
3657

3758
```python
3859
datastore = Datastore.get(workspace, 'training_data')
@@ -48,9 +69,9 @@ For more options on creating datasets with different options and from different
4869

4970
### Pass a dataset to your script
5071

51-
To pass the dataset's path to your script, use use `as_named_input(str)`. You can either pass the resulting `DatasetConsumptionConfig` object to your script as an argument or, by using the `inputs` argument to your pipeline script, you can retrieve the dataset using `Run.get_context().input_datasets[str]`.
72+
To pass the dataset's path to your script, use the `Dataset` object's `as_named_input(str)` method. You can either pass the resulting `DatasetConsumptionConfig` object to your script as an argument or, by using the `inputs` argument to your pipeline script, you can retrieve the dataset using `Run.get_context().input_datasets[str]`.
5273

53-
Once you've created a named input, you can choose its access mode: `as_mount()` or `as_download()`. If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode will avoid runtime streaming overhead. If your script accesses a subset of the dataset or it's simply too large for your compute, use the mount access mode. For more information, read [Mount vs. Download](https://docs.microsoft.com/azure/machine-learning/how-to-train-with-datasets#mount-vs-download)
74+
Once you've created a named input, you can choose its access mode: `as_mount()` or `as_download()`. If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode avoids the overhead of streaming the data at runtime. If your script accesses only a subset of the dataset, or the dataset is simply too large for your compute's disk, use the mount access mode. For more information, read [Mount vs. Download](https://docs.microsoft.com/azure/machine-learning/how-to-train-with-datasets#mount-vs-download).
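As a minimal sketch (assuming the `iris_dataset` object used in the snippets below), the two access modes differ only in the final call on the named input:

```python
# Mount: stream files on demand from storage; good for large datasets or partial access
mounted_input = iris_dataset.as_named_input('iris').as_mount()

# Download: copy the whole dataset onto the compute's disk before the run; avoids streaming overhead
downloaded_input = iris_dataset.as_named_input('iris').as_download()
```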
5475

5576
To pass a dataset to your pipeline step:
5677

@@ -63,9 +84,9 @@ The following snippet shows the common pattern of combining these steps within t
6384
```python
6485

6586
train_step = PythonScriptStep(
66-
name="train_data",
67-
script_name="train.py",
68-
compute_target=cluster,
87+
name="train_data",
88+
script_name="train.py",
89+
compute_target=cluster,
6990
inputs=[iris_dataset.as_named_input('iris').as_mount()]
7091
)
7192
```
@@ -74,13 +95,13 @@ In addition, you can use methods such as `random_split()` and `take_sample()` to
7495

7596
```python
7697
seed = 42 # PRNG seed
77-
smaller_dataset = iris_dataset.take_sample(0.1, seed=seed) # 10%
98+
smaller_dataset = iris_dataset.take_sample(0.1, seed=seed) # 10%
7899
train, test = smaller_dataset.random_split(percentage=0.8, seed=seed)
79100

80101
train_step = PythonScriptStep(
81-
name="train_data",
82-
script_name="train.py",
83-
compute_target=cluster,
102+
name="train_data",
103+
script_name="train.py",
104+
compute_target=cluster,
84105
inputs=[train.as_named_input('train').as_download(), test.as_named_input('test').as_download()]
85106
)
86107
```
@@ -91,7 +112,7 @@ Named inputs to your pipeline step script are available as a dictionary within t
91112

92113
```python
93114
# In pipeline definition script:
94-
# Code for demonstration only: No good reason to use both `arguments` and `inputs`
115+
# Code for demonstration only: It would be very confusing to split datasets between `arguments` and `inputs`
95116
train_step = PythonScriptStep(
96117
name="train_data",
97118
script_name="train.py",
@@ -104,12 +125,12 @@ train_step = PythonScriptStep(
104125
parser = argparse.ArgumentParser()
105126
parser.add_argument('--training-folder', type=str, dest='train_folder', help='training data folder mounting point')
106127
args = parser.parse_args()
107-
training_data_folder = args.train_folder
128+
training_data_folder = args.train_folder
108129

109130
testing_data_folder = Run.get_context().input_datasets['test']
110131
```
111132

112-
The passed value will be the path to the dataset file(s).
133+
The passed value will be the path to the dataset file(s).
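As a quick check (a sketch that assumes the file-based inputs above were delivered as folders), you can list what arrived at those paths:

```python
import os

# Inspect the files delivered at the mount point or download folder
print(os.listdir(training_data_folder))
print(os.listdir(testing_data_folder))
```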
113134

114135
It's also possible to access a registered `Dataset` directly. Since registered datasets are persistent and shared across a workspace, you can retrieve them by name:
115136

@@ -121,7 +142,7 @@ ds = Dataset.get_by_name(workspace=ws, name='mnist_opendataset')
121142

122143
## Use `PipelineData` for intermediate data
123144

124-
While `Dataset` objects represent persistent data, `PipelineData` objects are used for temporary data that is output from pipeline steps. Because the lifespan of a `PipelineData` object is longer than a single pipeline step, you define them in the pipeline definition script. When you create a `PipelineData` object, you must provide a name and a datastore to which the data will listen. Pass your `PipelineData` object(s) to your `PythonScriptStep` using _both_ the `arguments` and the `outputs` arguments:
145+
While `Dataset` objects represent persistent data, `PipelineData` objects are used for temporary data that is output from pipeline steps. Because the lifespan of a `PipelineData` object is longer than a single pipeline step, you define them in the pipeline definition script. When you create a `PipelineData` object, you must provide a name and a datastore where the data will reside. Pass your `PipelineData` object(s) to your `PythonScriptStep` using _both_ the `arguments` and the `outputs` parameters:
125146

126147
```python
127148
default_datastore = workspace.get_default_datastore()
@@ -136,12 +157,16 @@ dataprep_step = PythonScriptStep(
136157
outputs=[dataprep_output]
137158
)
138159
```
139-
tk mount vs upload tk
140160

161+
You may choose to create your `PipelineData` object using an access mode that provides an immediate upload. In that case, when you create your `PipelineData`, set `output_mode` to `"upload"` and use the `output_path_on_compute` argument to specify the path to which you'll be writing the data:
141162

142-
### Use `PipelineData` as an output of a training step
163+
```python
164+
PipelineData("clean_data", datastore=def_blob_store, output_mode="upload", output_path_on_compute="clean_data_output/")
165+
```
143166

144-
Within your pipeline's `PythonScriptStep`, you can retrieve the available output paths using the program's arguments. If this is the first step and will initialize the output data, you must create the directory at the specified path. You can then write whatever files you wish to be contained in the `PipelineData`.
167+
### Use `PipelineData` as an output of a training step
168+
169+
Within your pipeline's `PythonScriptStep`, you can retrieve the available output paths using the program's arguments. If this is the first step and will initialize the output data, you must create the directory at the specified path. You can then write whatever files you wish to be contained in the `PipelineData`.
145170

146171
```python
147172
parser = argparse.ArgumentParser()
@@ -150,7 +175,59 @@ args = parser.parse_args()
150175

151176
# Make directory for file
152177
os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
153-
with open(args.output_path, 'w') as f:
178+
with open(args.output_path, 'w') as f:
154179
f.write("Step 1's output")
155180
```
156181

182+
### Read `PipelineData` as an input to non-initial steps
183+
184+
After an initial pipeline step writes some data to the `PipelineData` path, that data becomes an output of the step, and the same `PipelineData` object can be used as an input to a subsequent step:
185+
186+
```python
187+
step1_output_data = PipelineData("processed_data", datastore=def_blob_store, output_mode="upload")
188+
189+
step1 = PythonScriptStep(
190+
name="generate_data",
191+
script_name="step1.py",
192+
runconfig=aml_run_config,
193+
arguments=["--output_path", step1_output_data],
194+
inputs=[],
195+
outputs=[step1_output_data]
196+
)
197+
198+
step2 = PythonScriptStep(
199+
name="read_pipeline_data",
200+
script_name="step2.py",
201+
compute_target=compute,
202+
runconfig=aml_run_config,
203+
arguments=["--pd", step1_output_data],
204+
inputs=[step1_output_data]
205+
)
206+
207+
pipeline = Pipeline(workspace=ws, steps=[step1, step2])
208+
```
209+
210+
The value of a `PipelineData` input is the path to the previous output. If, as shown previously, the first step wrote a single file, consuming it might look like:
211+
212+
```python
213+
parser = argparse.ArgumentParser()
214+
parser.add_argument('--pd', dest='pd', required=True)
215+
args = parser.parse_args()
216+
217+
with open(args.pd) as f:
218+
print(f.read())
219+
```
220+
221+
## Convert a `PipelineData` object into a registered `Dataset` for further processing
222+
223+
If you'd like to make your `PipelineData` available for longer than the duration of a run, use its `as_dataset()` method to convert it to a `Dataset`. You can then register the `Dataset`, making it a first-class citizen in your workspace. Because your `PipelineData` object will have a different path every time the pipeline runs, it's highly recommended that you set `create_new_version` to `True` when registering a `Dataset` created from a `PipelineData` object.
224+
225+
```python
226+
step1_output_ds = step1_output_data.as_dataset()
227+
step1_output_ds.register(name="processed_data", create_new_version=True)
228+
```
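A later run (or another pipeline) can then retrieve the registered data by name; a brief sketch, assuming the `processed_data` registration above:

```python
from azureml.core import Dataset

# Fetch the most recently registered version of the dataset
processed_ds = Dataset.get_by_name(workspace=ws, name="processed_data", version="latest")
```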
229+
230+
## Next steps
231+
232+
* [Create an Azure Machine Learning dataset](how-to-create-register-datasets.md)
233+
* [Create and run machine learning pipelines with Azure Machine Learning SDK](how-to-create-your-first-pipeline.md)

articles/machine-learning/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -335,6 +335,8 @@
335335
items:
336336
- name: Create ML pipelines (Python)
337337
href: how-to-create-your-first-pipeline.md
338+
- name: Moving data into and between ML pipeline steps (Python)
339+
href: how-to-move-data-in-and-out-of-pipelines.md
338340
- name: Schedule a pipeline (Python)
339341
href: how-to-schedule-pipelines.md
340342
- name: Trigger a pipeline
