
Commit ed69ee2

updating per customer feedback
1 parent c819d14 commit ed69ee2

File tree

1 file changed (+23, -12 lines)


articles/machine-learning/v1/how-to-move-data-in-out-of-pipelines.md

Lines changed: 23 additions & 12 deletions
@@ -20,7 +20,7 @@ ms.custom: UpdateFrequency5, contperf-fy20q4, devx-track-python, data4ml, sdkv1,
This article provides code for importing, transforming, and moving data between steps in an Azure Machine Learning pipeline. For an overview of how data works in Azure Machine Learning, see [Access data in Azure storage services](how-to-access-data.md). For the benefits and structure of Azure Machine Learning pipelines, see [What are Azure Machine Learning pipelines?](../concept-ml-pipelines.md)

-This article will show you how to:
+This article shows you how to:

- Use `Dataset` objects for pre-existing data
- Access data within your steps
@@ -31,7 +31,7 @@ This article will show you how to:
## Prerequisites

-You'll need:
+You need:

- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://azure.microsoft.com/free/).
@@ -56,7 +56,7 @@ You'll need:
The preferred way to ingest data into a pipeline is to use a [Dataset](/python/api/azureml-core/azureml.core.dataset%28class%29) object. `Dataset` objects represent persistent data available throughout a workspace.

-There are many ways to create and register `Dataset` objects. Tabular datasets are for delimited data available in one or more files. File datasets are for binary data (such as images) or for data that you'll parse. The simplest programmatic ways to create `Dataset` objects are to use existing blobs in workspace storage or public URLs:
+There are many ways to create and register `Dataset` objects. Tabular datasets are for delimited data available in one or more files. File datasets are for binary data (such as images) or for data that you parse. The simplest programmatic ways to create `Dataset` objects are to use existing blobs in workspace storage or public URLs:

```python
datastore = Datastore.get(workspace, 'training_data')
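# Hedged continuation (not part of this diff hunk): one common way to turn the datastore
# reference above into Dataset objects. Assumes `from azureml.core import Dataset`; the
# file path and URL below are illustrative placeholders, not taken from the article.
iris_tabular_ds = Dataset.Tabular.from_delimited_files([(datastore, 'train-dataset/iris.csv')])
mnist_file_ds = Dataset.File.from_files(
    'https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz')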
@@ -76,7 +76,7 @@ For more options on creating datasets with different options and from different
To pass the dataset's path to your script, use the `Dataset` object's `as_named_input()` method. You can either pass the resulting `DatasetConsumptionConfig` object to your script as an argument or, by using the `inputs` argument to your pipeline script, you can retrieve the dataset using `Run.get_context().input_datasets[]`.

-Once you've created a named input, you can choose its access mode: `as_mount()` or `as_download()`. If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode is the better choice. The download access mode will avoid the overhead of streaming the data at runtime. If your script accesses a subset of the dataset or it's too large for your compute, use the mount access mode. For more information, read [Mount vs. Download](how-to-train-with-datasets.md#mount-vs-download)
+Once you've created a named input, you can choose its access mode: `as_mount()` or `as_download()`. If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode is the better choice. The download access mode avoids the overhead of streaming the data at runtime. If your script accesses a subset of the dataset or it's too large for your compute, use the mount access mode. For more information, read [Mount vs. Download](how-to-train-with-datasets.md#mount-vs-download)
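As a hedged illustration of the two access modes (the `train_ds` dataset and the input name below are placeholders, not taken from this diff):

```python
# Sketch only: 'train_ds' is assumed to be a previously created Dataset object.
downloaded_input = train_ds.as_named_input('train').as_download()  # copy files to the compute's disk
mounted_input = train_ds.as_named_input('train').as_mount()        # stream files on demand at runtime
```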

To pass a dataset to your pipeline step:

@@ -117,29 +117,37 @@ train_step = PythonScriptStep(
### Access datasets within your script

-Named inputs to your pipeline step script are available as a dictionary within the `Run` object. Retrieve the active `Run` object using `Run.get_context()` and then retrieve the dictionary of named inputs using `input_datasets`. If you passed the `DatasetConsumptionConfig` object using the `arguments` argument rather than the `inputs` argument, access the data using `ArgParser` code. Both techniques are demonstrated in the following snippet.
+Named inputs to your pipeline step script are available as a dictionary within the `Run` object. Retrieve the active `Run` object using `Run.get_context()` and then retrieve the dictionary of named inputs using `input_datasets`. If you passed the `DatasetConsumptionConfig` object using the `arguments` argument rather than the `inputs` argument, access the data using `ArgParser` code. Both techniques are demonstrated in the following snippets:
+
+__The pipeline definition script__

```python
-# In pipeline definition script:
# Code for demonstration only: It would be very confusing to split datasets between `arguments` and `inputs`
train_step = PythonScriptStep(
    name="train_data",
    script_name="train.py",
    compute_target=cluster,
+    # datasets passed as arguments
    arguments=['--training-folder', train.as_named_input('train').as_download()],
+    # datasets passed as inputs
    inputs=[test.as_named_input('test').as_download()]
)
+```
+
+__The `train.py` script referenced from the PythonScriptStep__

+```python
# In pipeline script
parser = argparse.ArgumentParser()
+# Retrieve the dataset passed as an argument
parser.add_argument('--training-folder', type=str, dest='train_folder', help='training data folder mounting point')
args = parser.parse_args()
training_data_folder = args.train_folder
-
+# Retrieve the dataset passed as an input
testing_data_folder = Run.get_context().input_datasets['test']
```

-The passed value will be the path to the dataset file(s).
+The passed value is the path to the dataset file(s).

It's also possible to access a registered `Dataset` directly. Since registered datasets are persistent and shared across a workspace, you can retrieve them directly:
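The retrieval call itself falls just outside this excerpt (its first line appears in the next hunk header); a minimal sketch, assuming a local workspace `config.json` and reusing the `mnist_opendataset` name visible below:

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()  # assumes a config.json for the workspace is available
ds = Dataset.get_by_name(workspace=ws, name='mnist_opendataset')
```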

@@ -154,7 +162,7 @@ ds = Dataset.get_by_name(workspace=ws, name='mnist_opendataset')
## Use `OutputFileDatasetConfig` for intermediate data

-While `Dataset` objects represent only persistent data, [`OutputFileDatasetConfig`](/python/api/azureml-core/azureml.data.outputfiledatasetconfig) object(s) can be used for temporary data output from pipeline steps **and** persistent output data. `OutputFileDatasetConfig` supports writing data to blob storage, fileshare, adlsgen1, or adlsgen2. It supports both mount mode and upload mode. In mount mode, files written to the mounted directory are permanently stored when the file is closed. In upload mode, files written to the output directory are uploaded at the end of the job. If the job fails or is canceled, the output directory will not be uploaded.
+While `Dataset` objects represent only persistent data, [`OutputFileDatasetConfig`](/python/api/azureml-core/azureml.data.outputfiledatasetconfig) object(s) can be used for temporary data output from pipeline steps **and** persistent output data. `OutputFileDatasetConfig` supports writing data to blob storage, fileshare, adlsgen1, or adlsgen2. It supports both mount mode and upload mode. In mount mode, files written to the mounted directory are permanently stored when the file is closed. In upload mode, files written to the output directory are uploaded at the end of the job. If the job fails or is canceled, the output directory won't be uploaded.

`OutputFileDatasetConfig` object's default behavior is to write to the default datastore of the workspace. Pass your `OutputFileDatasetConfig` objects to your `PythonScriptStep` with the `arguments` parameter.
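The step definitions that follow this paragraph in the article sit outside the hunk; a minimal sketch of the pattern it describes (the `my_datastore` name, `dataprep.py` script, and `cluster`/`workspace` variables are assumptions, not taken from the diff):

```python
from azureml.core import Datastore
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

# Assumed names: 'my_datastore', 'dataprep.py', `workspace`, and `cluster` are placeholders.
datastore = Datastore.get(workspace, 'my_datastore')
step1_output = OutputFileDatasetConfig(
    name="processed_data",
    destination=(datastore, "outputdataset/{run-id}/{output-name}"))

step1 = PythonScriptStep(
    name="generate_data",
    script_name="dataprep.py",
    compute_target=cluster,
    arguments=["--output_path", step1_output])
```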

@@ -223,6 +231,9 @@ step2 = PythonScriptStep(
pipeline = Pipeline(workspace=ws, steps=[step1, step2])
```

+> [!TIP]
+> Reading the data in the python script `step2.py` is the same as documented earlier in [Access datasets within your script](#access-datasets-within-your-script); use `ArgumentParser` to add an argument of `--pd` in your script to access the data.
+
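As a hedged illustration only (the script body isn't part of this diff; only the `--pd` argument name comes from the tip above), `step2.py` could read the intermediate data like this:

```python
# step2.py sketch; listing the folder just shows the intermediate data arrived
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--pd', dest='pd', help='folder containing the intermediate data written by step1')
args = parser.parse_args()

print(os.listdir(args.pd))
```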
## Register `OutputFileDatasetConfig` objects for reuse

If you'd like to make your `OutputFileDatasetConfig` available for longer than the duration of your experiment, register it to your workspace to share and reuse across experiments.
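The registration call itself sits just outside this excerpt (its first line appears in the next hunk's context); a hedged sketch with an assumed dataset name and description:

```python
# Sketch only: `step1_output_data` is the OutputFileDatasetConfig produced by an earlier step;
# the name and description below are placeholders.
step1_output_ds = step1_output_data.register_on_complete(
    name='processed_data',
    description='intermediate data produced by step1')
```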
@@ -236,12 +247,12 @@ step1_output_ds = step1_output_data.register_on_complete(
## Delete `OutputFileDatasetConfig` contents when no longer needed

-Azure does not automatically delete intermediate data written with `OutputFileDatasetConfig`. To avoid storage charges for large amounts of unneeded data, you should either:
+Azure doesn't automatically delete intermediate data written with `OutputFileDatasetConfig`. To avoid storage charges for large amounts of unneeded data, you should either:

> [!CAUTION]
> Only delete intermediate data after 30 days from the last change date of the data. Deleting the data earlier could cause the pipeline run to fail because the pipeline will assume the intermediate data exists within a 30-day period for reuse.

-* Programmatically delete intermediate data at the end of a pipeline job, when it is no longer needed.
+* Programmatically delete intermediate data at the end of a pipeline job, when it's no longer needed.
* Use blob storage with a short-term storage policy for intermediate data (see [Optimize costs by automating Azure Blob Storage access tiers](../../storage/blobs/lifecycle-management-overview.md)). This policy can only be set to a workspace's non-default datastore. Use `OutputFileDatasetConfig` to export intermediate data to another datastore that isn't the default.
```Python
# Get adls gen 2 datastore already registered with the workspace
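# Hedged continuation (not in the diff hunk): the datastore name and output path are assumptions,
# and the imports from earlier in the article are assumed to be in scope.
datastore = Datastore.get(workspace, 'adls_gen2_datastore')
# Route the intermediate output to this non-default datastore so a lifecycle policy can expire it
step1_output_data = OutputFileDatasetConfig(
    name="processed_data",
    destination=(datastore, "intermediate/{run-id}"))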
@@ -254,5 +265,5 @@ For more information, see [Plan and manage costs for Azure Machine Learning](../
## Next steps

-* [Create an Azure machine learning dataset](how-to-create-register-datasets.md)
+* [Create an Azure Machine Learning dataset](how-to-create-register-datasets.md)
* [Create and run machine learning pipelines with Azure Machine Learning SDK](how-to-create-machine-learning-pipelines.md)
