This article provides code for importing data, transforming data, and moving data between steps in an Azure Machine Learning pipeline. For an overview of how data works in Azure Machine Learning, see [Access data in Azure storage services](how-to-access-data.md). For information about the benefits and structure of Azure Machine Learning pipelines, see [What are Azure Machine Learning pipelines?](../concept-ml-pipelines.md)

This article shows how to:

- Use `Dataset` objects for pre-existing data
- Access data within your steps
- Split `Dataset` data into subsets, such as training and validation subsets
- Create `OutputFileDatasetConfig` objects to transfer data to the next pipeline step
- Use `OutputFileDatasetConfig` objects as input to pipeline steps
- Create new `Dataset` objects from `OutputFileDatasetConfig` that you want to persist
## Prerequisites

You need:

- An Azure subscription. If you don't have one, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://azure.microsoft.com/free/).
- The [Azure Machine Learning SDK for Python](/python/api/overview/azure/ml/intro), or access to [Azure Machine Learning studio](https://ml.azure.com/).
- An Azure Machine Learning workspace.

Either [create an Azure Machine Learning workspace](../quickstart-create-resources.md) or use an existing one via the Python SDK. Import the `Workspace` and `Datastore` classes, and load your subscription information from the file `config.json` by using the function `from_config()`. This function looks for the JSON file in the current directory by default, but you can also specify a path parameter to point to the file by using `from_config(path="your/file/path")`.

```python
import azureml.core
from azureml.core import Workspace, Datastore

# Loads subscription information from config.json in the current directory
ws = Workspace.from_config()
```
## Use `Dataset` objects for pre-existing data

The preferred way to ingest data into a pipeline is to use a [Dataset](/python/api/azureml-core/azureml.core.dataset%28class%29) object. `Dataset` objects represent persistent data that's available throughout a workspace.

There are many ways to create and register `Dataset` objects. Tabular datasets are for delimited data that's available in one or more files. File datasets are for binary data (such as images) or for data that you parse. The simplest programmatic ways to create `Dataset` objects are to use existing blobs in workspace storage or public URLs:
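For example, here's a minimal sketch of both approaches. The datastore name `training_data`, the file paths, and the public URL are placeholders, and `ws` is the workspace object from the preceding snippet.

```python
from azureml.core import Dataset, Datastore

# Tabular dataset from a delimited file in an existing workspace datastore
datastore = Datastore.get(ws, 'training_data')
iris_dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'iris.csv'))

# File dataset from binary files (such as images) in the same datastore
image_dataset = Dataset.File.from_files(path=(datastore, 'images/**/*.jpg'))

# Tabular dataset from a public URL (replace with a real URL)
web_dataset = Dataset.Tabular.from_delimited_files(path='https://<public-url>/data.csv')
```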
For more information about creating datasets with different options and from different sources, registering them and reviewing them in the Azure Machine Learning UI, understanding how data size interacts with compute capacity, and versioning them, see [Create Azure Machine Learning datasets](how-to-create-register-datasets.md).

### Pass datasets to your script
To pass the dataset's path to your script, use the `Dataset` object's `as_named_input()` method. You can either pass the resulting `DatasetConsumptionConfig` object to your script as an argument or, by using the `inputs` argument to your pipeline script, retrieve the dataset by using `Run.get_context().input_datasets[]`.

After you create a named input, you can choose its access mode (for `FileDataset` only): `as_mount()` or `as_download()`. If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode is the better choice. The download access mode avoids the overhead of streaming the data at runtime. If your script accesses a subset of the dataset or the dataset is too large for your compute, use the mount access mode. For more information, see [Mount vs. download](how-to-train-with-datasets.md#mount-vs-download).

To pass a dataset to your pipeline step:

1. Use `TabularDataset.as_named_input()` or `FileDataset.as_named_input()` (no *s* at the end) to create a `DatasetConsumptionConfig` object.
1. **For `FileDataset` only:** Use `as_mount()` or `as_download()` to set the access mode. With `TabularDataset`, you can't set the access mode.
1. Pass the datasets to your pipeline steps by using either `arguments` or `inputs`.

The following snippet shows the common pattern of combining these steps within the `PythonScriptStep` constructor by using `iris_dataset` (`TabularDataset`):
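A minimal sketch of that pattern, assuming `cluster` is an existing compute target and `iris_dataset` is the tabular dataset created earlier:

```python
from azureml.pipeline.steps import PythonScriptStep

train_step = PythonScriptStep(
    name="train_data",
    script_name="train.py",
    compute_target=cluster,
    # the training script can retrieve this dataset via Run.get_context().input_datasets['iris']
    inputs=[iris_dataset.as_named_input('iris')]
)
```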
> You need to replace the values for all of these arguments (that is, `"train_data"`, `"train.py"`, `cluster`, and `iris_dataset`) with your own data.
> The above snippet just shows the form of the call and isn't part of a Microsoft sample.

You can also use methods like `random_split()` and `take_sample()` to create multiple inputs or to reduce the amount of data that's passed to your pipeline step:
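For example, here's a sketch that samples 10% of `iris_dataset` and then splits the sample into training and test subsets (the fractions and the `seed` value are arbitrary):

```python
seed = 42  # PRNG seed, for reproducible sampling

# Work with 10% of the data, then split that sample 80/20
smaller_dataset = iris_dataset.take_sample(probability=0.1, seed=seed)
train, test = smaller_dataset.random_split(percentage=0.8, seed=seed)

train_step = PythonScriptStep(
    name="train_data",
    script_name="train.py",
    compute_target=cluster,
    inputs=[train.as_named_input('train'), test.as_named_input('test')]
)
```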
Named inputs to your pipeline step script are available as a dictionary within the `Run` object. Retrieve the active `Run` object by using `Run.get_context()`, and then retrieve the dictionary of named inputs by using `input_datasets`. If you passed the `DatasetConsumptionConfig` object by using the `arguments` argument rather than the `inputs` argument, access the data by using `ArgParser` code. Both techniques are demonstrated in the following snippets:

__The pipeline definition script__
```python
# Code is for demonstration only: It would be confusing to split datasets between `arguments` and `inputs`
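# The rest of this snippet is a sketch. It assumes the `train`, `test`, and `cluster`
# objects from the earlier examples: `train` is passed through `arguments` and
# `test` is passed through `inputs`.
train_step = PythonScriptStep(
    name="train_data",
    script_name="train.py",
    compute_target=cluster,
    # dataset passed on the command line; the script reads it with argparse
    arguments=['--training-folder', train.as_named_input('train').as_download()],
    # dataset passed as a named input; the script reads it from input_datasets
    inputs=[test.as_named_input('test').as_download()]
)
```

__The training script__

A sketch of the corresponding training script, showing both retrieval techniques (the `--training-folder` argument name matches the hypothetical pipeline definition above):

```python
import argparse

from azureml.core import Run

parser = argparse.ArgumentParser()
parser.add_argument('--training-folder', type=str)
args = parser.parse_args()

# Path for the dataset passed through `arguments`
training_data_folder = args.training_folder

# Path for the dataset passed through `inputs`
testing_data_folder = Run.get_context().input_datasets['test']
```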
The passed value is the path to the dataset file or files.

Because registered datasets are persistent and shared across a workspace, you can retrieve them directly:
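For example, here's a minimal sketch from inside a step's script; the dataset name `my_registered_dataset` is a placeholder:

```python
from azureml.core import Dataset, Run

run = Run.get_context()
ws = run.experiment.workspace
ds = Dataset.get_by_name(workspace=ws, name='my_registered_dataset')
```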
> The preceding snippets show the form of the calls. They aren't part of a Microsoft sample. You need to replace the arguments with values from your own project.

## Use `OutputFileDatasetConfig` for intermediate data
Although `Dataset` objects represent only persistent data, [`OutputFileDatasetConfig`](/python/api/azureml-core/azureml.data.outputfiledatasetconfig) objects can be used for temporary data output from pipeline steps and for persistent output data. `OutputFileDatasetConfig` supports writing data to blob storage, fileshare, Azure Data Lake Storage Gen1, or Azure Data Lake Storage Gen2. It supports both mount mode and upload mode. In mount mode, files written to the mounted directory are permanently stored when the file is closed. In upload mode, files written to the output directory are uploaded at the end of the job. If the job fails or is canceled, the output directory won't be uploaded.

An `OutputFileDatasetConfig` object's default behavior is to write to the default datastore of the workspace. Pass your `OutputFileDatasetConfig` objects to your `PythonScriptStep` with the `arguments` parameter.
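Here's a sketch of a hypothetical data-preparation step; the names `prep_data`, `dataprep.py`, and `raw_data` are placeholders, and `ws` and `cluster` are the workspace and compute target from earlier:

```python
from azureml.core import Dataset
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

# Intermediate output; written to the workspace's default datastore by default
dataprep_output = OutputFileDatasetConfig(name='cleaned_data')

input_dataset = Dataset.get_by_name(ws, name='raw_data')

dataprep_step = PythonScriptStep(
    name="prep_data",
    script_name="dataprep.py",
    compute_target=cluster,
    # pass both the input dataset and the output location as command-line arguments
    arguments=[input_dataset.as_named_input('raw_data').as_mount(), dataprep_output]
)
```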