
Commit 1bc7ee4

Merge pull request #109796 from lobrien/1677813-data-pipelines
Moving data into and between pipeline steps
2 parents f5e86ac + 4504bfd commit 1bc7ee4

File tree

2 files changed (+238, -0 lines changed)
articles/machine-learning/how-to-move-data-in-out-of-pipelines.md

Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,236 @@
---
title: 'Input and output data from ML pipelines'
titleSuffix: Azure Machine Learning
description: Prepare, consume, and generate data in Azure Machine Learning pipelines
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: conceptual
ms.author: laobri
author: lobrien
ms.date: 04/01/2020
# As a data scientist using Python, I want to get data into my pipeline and flowing between steps
---

# Moving data into and between ML pipeline steps (Python)

[!INCLUDE [applies-to-skus](../../includes/aml-applies-to-basic-enterprise-sku.md)]

Data is central to machine learning pipelines. This article provides code for importing, transforming, and moving data between steps in an Azure Machine Learning pipeline. For an overview of how data works in Azure Machine Learning, see [Access data in Azure storage services](how-to-access-data.md). For the benefits and structure of Azure Machine Learning pipelines, see [What are Azure Machine Learning pipelines?](concept-ml-pipelines.md).

This article will show you how to:

- Use `Dataset` objects for pre-existing data
- Access data within your steps
- Split `Dataset` data into subsets, such as training and validation subsets
- Create `PipelineData` objects to transfer data to the next pipeline step
- Use `PipelineData` objects as input to pipeline steps
- Create new `Dataset` objects from `PipelineData` you wish to persist

## Prerequisites

You'll need:

- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree).

- The [Azure Machine Learning SDK for Python](https://docs.microsoft.com/python/api/overview/azure/ml/intro?view=azure-ml-py), or access to [Azure Machine Learning studio](https://ml.azure.com/).

- An Azure Machine Learning workspace.

Either [create an Azure Machine Learning workspace](how-to-manage-workspace.md) or use an existing one via the Python SDK. Import the `Workspace` and `Datastore` classes, and load your subscription information from the file `config.json` using the function `from_config()`. This function looks for the JSON file in the current directory by default, but you can also specify a path parameter to point to the file using `from_config(path="your/file/path")`.

```python
import azureml.core
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()
```

- Some pre-existing data. This article briefly shows the use of an [Azure blob container](https://docs.microsoft.com/azure/storage/blobs/storage-blobs-overview).

- Optional: An existing machine learning pipeline, such as the one described in [Create and run machine learning pipelines with Azure Machine Learning SDK](how-to-create-your-first-pipeline.md).

## Use `Dataset` objects for pre-existing data

The preferred way to ingest data into a pipeline is to use a [Dataset](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset%28class%29?view=azure-ml-py) object. `Dataset` objects represent persistent data available throughout a workspace.

There are many ways to create and register `Dataset` objects. Tabular datasets are for delimited data available in one or more files. File datasets are for binary data (such as images) or for data that you'll parse. The simplest programmatic ways to create `Dataset` objects are to use existing blobs in workspace storage or public URLs:

```python
from azureml.core import Dataset
from azureml.data.datapath import DataPath

datastore = Datastore.get(workspace, 'training_data')
iris_dataset = Dataset.Tabular.from_delimited_files(DataPath(datastore, 'iris.csv'))

cats_dogs_dataset = Dataset.File.from_files(
    path='https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip',
    archive_options=ArchiveOptions(archive_type=ArchiveType.ZIP, entry_glob='**/*.jpg')
)
```

For more information about creating datasets from different sources, registering them and reviewing them in the Azure Machine Learning UI, understanding how data size interacts with compute capacity, and versioning them, see [Create Azure Machine Learning datasets](how-to-create-register-datasets.md).

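For example, the tabular dataset created above can be registered so that other experiments and pipelines in the workspace can retrieve it by name. This is a minimal sketch; the name and description are illustrative:

```python
# Register the dataset under a workspace-visible name (name/description are illustrative)
iris_dataset = iris_dataset.register(workspace=workspace,
                                     name='iris_training',
                                     description='Iris CSV from the training_data datastore',
                                     create_new_version=True)
```
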
### Pass datasets to your script

To pass the dataset's path to your script, use the `Dataset` object's `as_named_input()` method. You can either pass the resulting `DatasetConsumptionConfig` object to your script as an argument or, by using the `inputs` argument of your pipeline step, retrieve the dataset with `Run.get_context().input_datasets[]`.

Once you've created a named input, you can choose its access mode: `as_mount()` or `as_download()`. If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode is the better choice. The download access mode avoids the overhead of streaming the data at runtime. If your script accesses a subset of the dataset or it's too large for your compute, use the mount access mode. For more information, read [Mount vs. Download](https://docs.microsoft.com/azure/machine-learning/how-to-train-with-datasets#mount-vs-download).

To pass a dataset to your pipeline step:

1. Use `TabularDataset.as_named_input()` or `FileDataset.as_named_input()` to create a `DatasetConsumptionConfig` object
1. Use `as_mount()` or `as_download()` to set the access mode
1. Pass the datasets to your pipeline steps using either the `arguments` or the `inputs` argument

The following snippet shows the common pattern of combining these steps within the `PythonScriptStep` constructor:

```python
train_step = PythonScriptStep(
    name="train_data",
    script_name="train.py",
    compute_target=cluster,
    inputs=[iris_dataset.as_named_input('iris').as_mount()]
)
```

You can also use methods such as `random_split()` and `take_sample()` to create multiple inputs or reduce the amount of data passed to your pipeline step:

```python
seed = 42 # PRNG seed
smaller_dataset = iris_dataset.take_sample(0.1, seed=seed) # 10%
train, test = smaller_dataset.random_split(percentage=0.8, seed=seed)

train_step = PythonScriptStep(
    name="train_data",
    script_name="train.py",
    compute_target=cluster,
    inputs=[train.as_named_input('train').as_download(), test.as_named_input('test').as_download()]
)
```

### Access datasets within your script

Named inputs to your pipeline step script are available as a dictionary within the `Run` object. Retrieve the active `Run` object using `Run.get_context()` and then retrieve the dictionary of named inputs using `input_datasets`. If you passed the `DatasetConsumptionConfig` object using the `arguments` argument rather than the `inputs` argument, access the data using `argparse` code. Both techniques are demonstrated in the following snippet.

```python
# In pipeline definition script:
# Code for demonstration only: It would be very confusing to split datasets between `arguments` and `inputs`
train_step = PythonScriptStep(
    name="train_data",
    script_name="train.py",
    compute_target=cluster,
    arguments=['--training-folder', train.as_named_input('train').as_download()],
    inputs=[test.as_named_input('test').as_download()]
)

# In pipeline script
import argparse
from azureml.core import Run

parser = argparse.ArgumentParser()
parser.add_argument('--training-folder', type=str, dest='train_folder', help='training data folder mounting point')
args = parser.parse_args()
training_data_folder = args.train_folder

testing_data_folder = Run.get_context().input_datasets['test']
```

The passed value will be the path to the dataset file(s).
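
For example, a training script might enumerate and read the delimited file(s) found at that path. The following is a minimal sketch; it assumes the data consists of CSV files and that pandas is available in the step's environment:

```python
# train.py (sketch): read the delimited file(s) from the path passed by the pipeline
import glob
import os

import pandas as pd  # assumed to be available in the step's environment

csv_files = glob.glob(os.path.join(training_data_folder, '**', '*.csv'), recursive=True)
train_df = pd.concat(pd.read_csv(path) for path in csv_files)
print(train_df.head())
```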

It's also possible to access a registered `Dataset` directly. Since registered datasets are persistent and shared across a workspace, you can retrieve them by name:

```python
from azureml.core import Dataset, Run

run = Run.get_context()
ws = run.experiment.workspace
ds = Dataset.get_by_name(workspace=ws, name='mnist_opendataset')
```

## Use `PipelineData` for intermediate data

While `Dataset` objects represent persistent data, [PipelineData](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py) objects are used for temporary data that is output from pipeline steps. Because the lifespan of a `PipelineData` object extends beyond a single pipeline step, you define them in the pipeline definition script. When you create a `PipelineData` object, you must provide a name and a datastore at which the data will reside. Pass your `PipelineData` object(s) to your `PythonScriptStep` using _both_ the `arguments` and the `outputs` arguments:

```python
default_datastore = workspace.get_default_datastore()
dataprep_output = PipelineData("clean_data", datastore=default_datastore)

dataprep_step = PythonScriptStep(
    name="prep_data",
    script_name="dataprep.py",
    compute_target=cluster,
    arguments=["--output-path", dataprep_output],
    inputs=[Dataset.get_by_name(workspace, 'raw_data')],
    outputs=[dataprep_output]
)
```

You may choose to create your `PipelineData` object using an access mode that provides an immediate upload. In that case, when you create your `PipelineData`, set the `output_mode` to `"upload"` and use the `output_path_on_compute` argument to specify the path to which you'll be writing the data:

```python
PipelineData("clean_data", datastore=def_blob_store, output_mode="upload", output_path_on_compute="clean_data_output/")
```

### Use `PipelineData` as outputs of a training step

Within your pipeline's `PythonScriptStep`, you can retrieve the available output paths using the program's arguments. If this step is the first and will initialize the output data, you must create the directory at the specified path. You can then write whatever files you wish to be contained in the `PipelineData`.

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--output_path', dest='output_path', required=True)
args = parser.parse_args()

# Make directory for file
os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
with open(args.output_path, 'w') as f:
    f.write("Step 1's output")
```

If you created your `PipelineData` with the `is_directory` argument set to `True`, it would be enough to just perform the `os.makedirs()` call and then you would be free to write whatever files you wished to the path. For more details, see the [PipelineData](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py) reference documentation.
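
A minimal sketch of that directory-style variant, reusing the argument parsing from the previous snippet (the output name and file name are illustrative):

```python
# Pipeline definition script: declare a directory-style intermediate output (name is illustrative)
processed_dir = PipelineData("processed_dir", datastore=default_datastore, is_directory=True)

# Step script (continuing the argument parsing above): create the directory itself,
# then write as many files as you like beneath it
os.makedirs(args.output_path, exist_ok=True)
with open(os.path.join(args.output_path, 'part-00000.csv'), 'w') as f:
    f.write("col_a,col_b\n1,2\n")
```
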
### Read `PipelineData` as inputs to non-initial steps

After the initial pipeline step writes some data to the `PipelineData` path and it becomes an output of that initial step, it can be used as an input to a later step:

```python
step1_output_data = PipelineData("processed_data", datastore=def_blob_store, output_mode="upload")

step1 = PythonScriptStep(
    name="generate_data",
    script_name="step1.py",
    runconfig=aml_run_config,
    arguments=["--output_path", step1_output_data],
    inputs=[],
    outputs=[step1_output_data]
)

step2 = PythonScriptStep(
    name="read_pipeline_data",
    script_name="step2.py",
    compute_target=compute,
    runconfig=aml_run_config,
    arguments=["--pd", step1_output_data],
    inputs=[step1_output_data]
)

pipeline = Pipeline(workspace=ws, steps=[step1, step2])
```

The value of a `PipelineData` input is the path to the previous output. If, as shown previously, the first step wrote a single file, consuming it might look like:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--pd', dest='pd', required=True)
args = parser.parse_args()

with open(args.pd) as f:
    print(f.read())
```

## Convert `PipelineData` objects to `Dataset`s

If you'd like to make your `PipelineData` available for longer than the duration of a run, use its `as_dataset()` function to convert it to a `Dataset`. You may then register the `Dataset`, making it a first-class citizen in your workspace. Since your `PipelineData` object will have a different path every time the pipeline runs, it's highly recommended that you set `create_new_version` to `True` when registering a `Dataset` created from a `PipelineData` object.

```python
step1_output_ds = step1_output_data.as_dataset()
step1_output_ds.register(name="processed_data", create_new_version=True)
```

## Next steps

* [Create Azure Machine Learning datasets](how-to-create-register-datasets.md)
* [Create and run machine learning pipelines with Azure Machine Learning SDK](how-to-create-your-first-pipeline.md)

articles/machine-learning/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -355,6 +355,8 @@
   items:
   - name: Create ML pipelines (Python)
     href: how-to-create-your-first-pipeline.md
+  - name: Moving data into and between ML pipeline steps (Python)
+    href: how-to-move-data-in-out-of-pipelines.md
   - name: Schedule a pipeline (Python)
     href: how-to-schedule-pipelines.md
   - name: Trigger a pipeline
