Commit f20076c

edits

1 parent d1568d4 commit f20076c

File tree: 1 file changed (+28 -30)

articles/machine-learning/v1/how-to-move-data-in-out-of-pipelines.md

Lines changed: 28 additions & 30 deletions
@@ -1,5 +1,5 @@
 ---
-title: Moving data in ML pipelines
+title: Moving Data in ML Pipelines
 titleSuffix: Azure Machine Learning
 description: Learn how Azure Machine Learning pipelines ingest data, and how to manage and move data between pipeline steps.
 services: machine-learning
@@ -11,37 +11,35 @@ ms.reviewer: keli19
 ms.date: 06/24/2025
 ms.topic: how-to
 ms.custom: UpdateFrequency5, devx-track-python, data4ml, sdkv1
-#Customer intent: As a data scientist using Python, I want to get data into my pipeline and flowing between steps.
+#Customer intent: As a data scientist using Python, I want to get data into my pipeline and propagate it between steps.
 ---

-# Moving data into and between ML pipeline steps (Python)
+# Moving data into and between machine learning pipeline steps (Python)

 [!INCLUDE [sdk v1](../includes/machine-learning-sdk-v1.md)]

 [!INCLUDE [v1 deprecation](../includes/sdk-v1-deprecation.md)]

-This article provides code for importing, transforming, and moving data between steps in an Azure Machine Learning pipeline. For an overview of how data works in Azure Machine Learning, see [Access data in Azure storage services](how-to-access-data.md). For the benefits and structure of Azure Machine Learning pipelines, see [What are Azure Machine Learning pipelines?](../concept-ml-pipelines.md)
+This article provides code for importing data, transforming data, and moving data between steps in an Azure Machine Learning pipeline. For an overview of how data works in Azure Machine Learning, see [Access data in Azure storage services](how-to-access-data.md). For information about the benefits and structure of Azure Machine Learning pipelines, see [What are Azure Machine Learning pipelines?](../concept-ml-pipelines.md)

-This article shows you how to:
+This article shows how to:

 - Use `Dataset` objects for pre-existing data
 - Access data within your steps
 - Split `Dataset` data into subsets, such as training and validation subsets
 - Create `OutputFileDatasetConfig` objects to transfer data to the next pipeline step
 - Use `OutputFileDatasetConfig` objects as input to pipeline steps
-- Create new `Dataset` objects from `OutputFileDatasetConfig` you wish to persist
+- Create new `Dataset` objects from `OutputFileDatasetConfig` that you want to persist

 ## Prerequisites

-You need:
-
-- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://azure.microsoft.com/free/).
+- An Azure subscription. If you don't have one, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://azure.microsoft.com/free/).

 - The [Azure Machine Learning SDK for Python](/python/api/overview/azure/ml/intro), or access to [Azure Machine Learning studio](https://ml.azure.com/).

 - An Azure Machine Learning workspace.

-Either [create an Azure Machine Learning workspace](../quickstart-create-resources.md) or use an existing one via the Python SDK. Import the `Workspace` and `Datastore` class, and load your subscription information from the file `config.json` using the function `from_config()`. This function looks for the JSON file in the current directory by default, but you can also specify a path parameter to point to the file using `from_config(path="your/file/path")`.
+Either [create an Azure Machine Learning workspace](../quickstart-create-resources.md) or use an existing one via the Python SDK. Import the `Workspace` and `Datastore` classes, and load your subscription information from the file `config.json` by using the function `from_config()`. This function looks for the JSON file in the current directory by default, but you can also specify a path parameter to point to the file by using `from_config(path="your/file/path")`.

 ```python
 import azureml.core
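# Illustrative aside, not part of the diff above: a minimal sketch of the workspace-loading
# pattern the preceding paragraph describes, assuming a config.json file in the current directory.
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()  # looks for config.json in the current directory by default
# ws = Workspace.from_config(path="your/file/path")  # or point to the file explicitly
datastore = ws.get_default_datastore()  # the workspace's default datastore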
@@ -56,9 +54,9 @@ You need:
 
 ## Use `Dataset` objects for pre-existing data

-The preferred way to ingest data into a pipeline is to use a [Dataset](/python/api/azureml-core/azureml.core.dataset%28class%29) object. `Dataset` objects represent persistent data available throughout a workspace.
+The preferred way to ingest data into a pipeline is to use a [Dataset](/python/api/azureml-core/azureml.core.dataset%28class%29) object. `Dataset` objects represent persistent data that's available throughout a workspace.

-There are many ways to create and register `Dataset` objects. Tabular datasets are for delimited data available in one or more files. File datasets are for binary data (such as images) or for data that you parse. The simplest programmatic ways to create `Dataset` objects are to use existing blobs in workspace storage or public URLs:
+There are many ways to create and register `Dataset` objects. Tabular datasets are for delimited data that's available in one or more files. File datasets are for binary data (such as images) or for data that you parse. The simplest programmatic ways to create `Dataset` objects are to use existing blobs in workspace storage or public URLs:

 ```python
 datastore = Datastore.get(workspace, 'training_data')
@@ -72,21 +70,21 @@ datastore_path = [
 cats_dogs_dataset = Dataset.File.from_files(path=datastore_path)
 ```

-For more options on creating datasets with different options and from different sources, registering them and reviewing them in the Azure Machine Learning UI, understanding how data size interacts with compute capacity, and versioning them, see [Create Azure Machine Learning datasets](how-to-create-register-datasets.md).
+For more information about creating datasets with different options and from different sources, registering them and reviewing them in the Azure Machine Learning UI, understanding how data size interacts with compute capacity, and versioning them, see [Create Azure Machine Learning datasets](how-to-create-register-datasets.md).

 ### Pass datasets to your script

-To pass the dataset's path to your script, use the `Dataset` object's `as_named_input()` method. You can either pass the resulting `DatasetConsumptionConfig` object to your script as an argument or, by using the `inputs` argument to your pipeline script, you can retrieve the dataset using `Run.get_context().input_datasets[]`.
+To pass the dataset's path to your script, use the `Dataset` object's `as_named_input()` method. You can either pass the resulting `DatasetConsumptionConfig` object to your script as an argument or, by using the `inputs` argument to your pipeline script, you can retrieve the dataset by using `Run.get_context().input_datasets[]`.

-Once you've created a named input, you can choose its access mode(for FileDataset only): `as_mount()` or `as_download()`. If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode is the better choice. The download access mode avoids the overhead of streaming the data at runtime. If your script accesses a subset of the dataset or it's too large for your compute, use the mount access mode. For more information, read [Mount vs. Download](how-to-train-with-datasets.md#mount-vs-download)
+After you create a named input, you can choose its access mode (for `FileDataset` only): `as_mount()` or `as_download()`. If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode is a better choice. The download access mode avoids the overhead of streaming the data at runtime. If your script accesses a subset of the dataset or the dataset is too large for your compute, use the mount access mode. For more information, see [Mount vs. download](how-to-train-with-datasets.md#mount-vs-download).

 To pass a dataset to your pipeline step:

-1. Use `TabularDataset.as_named_input()` or `FileDataset.as_named_input()` (no 's' at end) to create a `DatasetConsumptionConfig` object
-1. **For `FileDataset` only:**. Use `as_mount()` or `as_download()` to set the access mode. TabularDataset does not suppport set access mode.
-1. Pass the datasets to your pipeline steps using either the `arguments` or the `inputs` argument
+1. Use `TabularDataset.as_named_input()` or `FileDataset.as_named_input()` (no *s* at the end) to create a `DatasetConsumptionConfig` object.
+1. **For `FileDataset` only:** Use `as_mount()` or `as_download()` to set the access mode. With `TabularDataset`, you can't set the access mode.
+1. Pass the datasets to your pipeline steps by using either `arguments` or `inputs`.

-The following snippet shows the common pattern of combining these steps within the `PythonScriptStep` constructor, using iris_dataset (TabularDataset):
+The following snippet shows the common pattern of combining these steps within the `PythonScriptStep` constructor by using `iris_dataset` (`TabularDataset`):

 ```python

@@ -99,10 +97,10 @@ train_step = PythonScriptStep(
 ```

 > [!NOTE]
-> You would need to replace the values for all these arguments (that is, `"train_data"`, `"train.py"`, `cluster`, and `iris_dataset`) with your own data.
-> The above snippet just shows the form of the call and is not part of a Microsoft sample.
+> You need to replace the values for all of these arguments (that is, `"train_data"`, `"train.py"`, `cluster`, and `iris_dataset`) with your own data.
+> The above snippet just shows the form of the call and isn't part of a Microsoft sample.

-You can also use methods such as `random_split()` and `take_sample()` to create multiple inputs or reduce the amount of data passed to your pipeline step:
+You can also use methods like `random_split()` and `take_sample()` to create multiple inputs or to reduce the amount of data that's passed to your pipeline step:

 ```python
 seed = 42 # PRNG seed
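# Illustrative aside, not part of the diff above: a minimal sketch of the splitting and
# sampling calls the preceding paragraph describes, assuming `iris_dataset` is a
# registered TabularDataset.
train_subset, test_subset = iris_dataset.random_split(percentage=0.8, seed=seed)
sampled_subset = iris_dataset.take_sample(probability=0.1, seed=seed)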
@@ -119,19 +117,19 @@ train_step = PythonScriptStep(
 
 ### Access datasets within your script

-Named inputs to your pipeline step script are available as a dictionary within the `Run` object. Retrieve the active `Run` object using `Run.get_context()` and then retrieve the dictionary of named inputs using `input_datasets`. If you passed the `DatasetConsumptionConfig` object using the `arguments` argument rather than the `inputs` argument, access the data using `ArgParser` code. Both techniques are demonstrated in the following snippets:
+Named inputs to your pipeline step script are available as a dictionary within the `Run` object. Retrieve the active `Run` object by using `Run.get_context()`, and then retrieve the dictionary of named inputs by using `input_datasets`. If you passed the `DatasetConsumptionConfig` object by using the `arguments` argument rather than the `inputs` argument, access the data by using `ArgParser` code. Both techniques are demonstrated in the following snippets:

 __The pipeline definition script__

 ```python
-# Code for demonstration only: It would be very confusing to split datasets between `arguments` and `inputs`
+# Code is for demonstration only: It would be confusing to split datasets between `arguments` and `inputs`
 train_step = PythonScriptStep(
     name="train_data",
     script_name="train.py",
     compute_target=cluster,
-    # datasets passed as arguments
+    # Datasets passed as arguments
     arguments=['--training-folder', train.as_named_input('train').as_download()],
-    # datasets passed as inputs
+    # Datasets passed as inputs
     inputs=[test.as_named_input('test').as_download()]
 )
 ```
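For orientation, here's a minimal sketch (not part of this commit's diff) of the training-script side of the pattern described above. It assumes the step was defined with the `'train'` and `'test'` named inputs from the preceding snippet, and the variable names mirror the fragment shown in the next hunk.

```python
# train.py -- illustrative sketch only
import argparse
from azureml.core import Run

parser = argparse.ArgumentParser()
# A dataset passed through `arguments` arrives as a command-line path
parser.add_argument('--training-folder', type=str, dest='train_folder')
args = parser.parse_args()

training_data_folder = args.train_folder
# A dataset passed through `inputs` is available from the run's named inputs
testing_data_folder = Run.get_context().input_datasets['test']
```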
@@ -149,9 +147,9 @@ training_data_folder = args.train_folder
 testing_data_folder = Run.get_context().input_datasets['test']
 ```

-The passed value is the path to the dataset file(s).
+The passed value is the path to the dataset file or files.

-It's also possible to access a registered `Dataset` directly. Since registered datasets are persistent and shared across a workspace, you can retrieve them directly:
+Because registered datasets are persistent and shared across a workspace, you can retrieve them directly:

 ```python
 run = Run.get_context()
@@ -160,11 +158,11 @@ ds = Dataset.get_by_name(workspace=ws, name='mnist_opendataset')
 ```

 > [!NOTE]
-> The preceding snippets show the form of the calls and are not part of a Microsoft sample. You must replace the various arguments with values from your own project.
+> The preceding snippets show the form of the calls. They aren't part of a Microsoft sample. You need to replace the arguments with values from your own project.

 ## Use `OutputFileDatasetConfig` for intermediate data

-While `Dataset` objects represent only persistent data, [`OutputFileDatasetConfig`](/python/api/azureml-core/azureml.data.outputfiledatasetconfig) object(s) can be used for temporary data output from pipeline steps **and** persistent output data. `OutputFileDatasetConfig` supports writing data to blob storage, fileshare, adlsgen1, or adlsgen2. It supports both mount mode and upload mode. In mount mode, files written to the mounted directory are permanently stored when the file is closed. In upload mode, files written to the output directory are uploaded at the end of the job. If the job fails or is canceled, the output directory won't be uploaded.
+Although `Dataset` objects represent only persistent data, [`OutputFileDatasetConfig`](/python/api/azureml-core/azureml.data.outputfiledatasetconfig) objects can be used for temporary data output from pipeline steps and for persistent output data. `OutputFileDatasetConfig` supports writing data to blob storage, fileshare, Azure Data Lake Storage Gen1, or Azure Data Lake Storage Gen2. It supports both mount mode and upload mode. In mount mode, files written to the mounted directory are permanently stored when the file is closed. In upload mode, files written to the output directory are uploaded at the end of the job. If the job fails or is canceled, the output directory won't be uploaded.

 `OutputFileDatasetConfig` object's default behavior is to write to the default datastore of the workspace. Pass your `OutputFileDatasetConfig` objects to your `PythonScriptStep` with the `arguments` parameter.

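As a closing illustration (not part of this commit's diff), here's a minimal sketch of the `OutputFileDatasetConfig` usage the final hunk describes: writing intermediate output in upload mode and passing the object to a `PythonScriptStep` through `arguments`. The datastore, script, and compute-target names are hypothetical placeholders.

```python
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()
cluster = ws.compute_targets['cpu-cluster']  # hypothetical compute target name

# Upload mode: files written under the output path are uploaded when the step completes;
# if the job fails or is canceled, nothing is uploaded.
prepared_data = OutputFileDatasetConfig(
    name='prepared_data',
    destination=(datastore, 'outputdataset/{run-id}')
).as_upload()

prep_step = PythonScriptStep(
    name="prep_data",
    script_name="prep.py",  # hypothetical preparation script
    compute_target=cluster,
    arguments=["--output-path", prepared_data]
)
```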