
Commit 5ac4095

edits

1 parent f20076c

articles/machine-learning/v1/how-to-move-data-in-out-of-pipelines.md

Lines changed: 20 additions & 18 deletions
````diff
@@ -117,7 +117,7 @@ train_step = PythonScriptStep(
 
 ### Access datasets within your script
 
-Named inputs to your pipeline step script are available as a dictionary within the `Run` object. Retrieve the active `Run` object by using `Run.get_context()`, and then retrieve the dictionary of named inputs by using `input_datasets`. If you passed the `DatasetConsumptionConfig` object by using the `arguments` argument rather than the `inputs` argument, access the data by using `ArgParser` code. Both techniques are demonstrated in the following snippets:
+Named inputs to your pipeline step script are available as a dictionary within the `Run` object. Retrieve the active `Run` object by using `Run.get_context()`, and then retrieve the dictionary of named inputs by using `input_datasets`. If you passed the `DatasetConsumptionConfig` object by using the `arguments` argument rather than the `inputs` argument, access the data by using `ArgumentParser` code. Both techniques are demonstrated in the following snippets:
 
 __The pipeline definition script__
 
````
````diff
@@ -162,9 +162,9 @@ ds = Dataset.get_by_name(workspace=ws, name='mnist_opendataset')
 
 ## Use `OutputFileDatasetConfig` for intermediate data
 
-Although `Dataset` objects represent only persistent data, [`OutputFileDatasetConfig`](/python/api/azureml-core/azureml.data.outputfiledatasetconfig) objects can be used for temporary data output from pipeline steps and for persistent output data. `OutputFileDatasetConfig` supports writing data to blob storage, fileshare, Azure Data Lake Storage Gen1, or Azure Data Lake Storage Gen2. It supports both mount mode and upload mode. In mount mode, files written to the mounted directory are permanently stored when the file is closed. In upload mode, files written to the output directory are uploaded at the end of the job. If the job fails or is canceled, the output directory won't be uploaded.
+Although `Dataset` objects represent only persistent data, [`OutputFileDatasetConfig`](/python/api/azureml-core/azureml.data.outputfiledatasetconfig) objects can be used for temporary data output from pipeline steps and for persistent output data. `OutputFileDatasetConfig` supports writing data to blob storage, fileshare, Azure Data Lake Storage Gen1, or Data Lake Storage Gen2. It supports both mount mode and upload mode. In mount mode, files written to the mounted directory are permanently stored when the file is closed. In upload mode, files written to the output directory are uploaded at the end of the job. If the job fails or is canceled, the output directory isn't uploaded.
 
-`OutputFileDatasetConfig` object's default behavior is to write to the default datastore of the workspace. Pass your `OutputFileDatasetConfig` objects to your `PythonScriptStep` with the `arguments` parameter.
+The `OutputFileDatasetConfig` object's default behavior is to write to the default datastore of the workspace. Pass your `OutputFileDatasetConfig` objects to your `PythonScriptStep` by using the `arguments` parameter.
 
 ```python
 from azureml.data import OutputFileDatasetConfig
````
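The hunk truncates the article's snippet after its `import` line. A sketch of how an `OutputFileDatasetConfig` written to the default datastore might be wired into a step, with hypothetical step and script names:

```python
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

# With no destination specified, output goes to the workspace's default
# datastore (the behavior the changed paragraph describes).
output_data = OutputFileDatasetConfig(name="processed_data")

# Hypothetical step; `compute_target` is assumed to be defined elsewhere.
dataprep_step = PythonScriptStep(
    name="prep_data",
    script_name="dataprep.py",
    compute_target=compute_target,
    arguments=["--output_path", output_data],
)
```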
````diff
@@ -180,11 +180,11 @@ dataprep_step = PythonScriptStep(
 ```
 
 > [!NOTE]
-> Concurrent writes to a `OutputFileDatasetConfig` will fail. Do not attempt to use a single `OutputFileDatasetConfig` concurrently. Do not share a single `OutputFileDatasetConfig` in a multiprocessing situation, such as when using [distributed training](../how-to-train-distributed-gpu.md).
+> Concurrent writes to an `OutputFileDatasetConfig` will fail. Don't try to use a single `OutputFileDatasetConfig` concurrently. Don't share a single `OutputFileDatasetConfig` in a multiprocessing situation, like when you use [distributed training](../how-to-train-distributed-gpu.md).
 
 ### Use `OutputFileDatasetConfig` as outputs of a training step
 
-Within your pipeline's `PythonScriptStep`, you can retrieve the available output paths using the program's arguments. If this step is the first and will initialize the output data, you must create the directory at the specified path. You can then write whatever files you wish to be contained in the `OutputFileDatasetConfig`.
+In your pipeline's `PythonScriptStep`, you can retrieve the available output paths by using the program's arguments. If this step is the first and will initialize the output data, you need to create the directory at the specified path. You can then write whatever files you want to be contained in the `OutputFileDatasetConfig`.
 
 ```python
 parser = argparse.ArgumentParser()
````
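The hunk cuts the step script after its first line. A minimal sketch of the pattern the changed paragraph describes, assuming a hypothetical `--output_path` argument name and file contents:

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--output_path', type=str, required=True)
args = parser.parse_args()

# If this is the first step to touch the output, create the directory
# before writing any files into it.
os.makedirs(args.output_path, exist_ok=True)
with open(os.path.join(args.output_path, 'processed.csv'), 'w') as f:
    f.write('col1,col2\n1,2\n')
```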
````diff
@@ -203,12 +203,12 @@ After the initial pipeline step writes some data to the `OutputFileDatasetConfig
 
 In the following code:
 
-* `step1_output_data` indicates that the output of the PythonScriptStep, `step1` is written to the ADLS Gen 2 datastore, `my_adlsgen2` in upload access mode. Learn more about how to [set up role permissions](how-to-access-data.md) in order to write data back to ADLS Gen 2 datastores.
+* `step1_output_data` indicates that the output of the `PythonScriptStep` `step1` is written to the Data Lake Storage Gen2 datastore `my_adlsgen2` in upload access mode. Learn more about how to [set up role permissions](how-to-access-data.md) in order to write data back to Data Lake Storage Gen2 datastores.
 
-* After `step1` completes and the output is written to the destination indicated by `step1_output_data`, then step2 is ready to use `step1_output_data` as an input.
+* After `step1` completes and the output is written to the destination that's indicated by `step1_output_data`, `step2` is ready to use `step1_output_data` as an input.
 
 ```python
-# get adls gen 2 datastore already registered with the workspace
+# Get Data Lake Storage Gen2 datastore that's already registered with the workspace
 datastore = workspace.datastores['my_adlsgen2']
 step1_output_data = OutputFileDatasetConfig(name="processed_data", destination=(datastore, "mypath/{run-id}/{output-name}")).as_upload()
 
````
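The hunk ends before showing how `step2` consumes the output. A hedged sketch of the wiring the two bullets describe: the `--pd` argument name comes from the tip later in this diff, and `as_input()` converts the `OutputFileDatasetConfig` into a consumable input, but the article's exact code may differ.

```python
from azureml.pipeline.steps import PythonScriptStep

# step2 consumes step1's upload-mode output as a named input.
# `compute_target` is assumed to be defined elsewhere.
step2 = PythonScriptStep(
    name="step2",
    script_name="step2.py",
    compute_target=compute_target,
    arguments=["--pd", step1_output_data.as_input()],
)
```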
````diff
@@ -232,11 +232,11 @@ pipeline = Pipeline(workspace=ws, steps=[step1, step2])
 ```
 
 > [!TIP]
-> Reading the data in the python script `step2.py` is the same as documented earlier in [Access datasets within your script](#access-datasets-within-your-script); use `ArgumentParser` to add an argument of `--pd` in your script to access the data.
+> Reading the data in the Python script `step2.py` is the same as the process described earlier in [Access datasets within your script](#access-datasets-within-your-script). Use `ArgumentParser` to add an argument of `--pd` in your script to access the data.
 
 ## Register `OutputFileDatasetConfig` objects for reuse
 
-If you'd like to make your `OutputFileDatasetConfig` available for longer than the duration of your experiment, register it to your workspace to share and reuse across experiments.
+If you want to make the `OutputFileDatasetConfig` object available for longer than the duration of your experiment, register it to your workspace to share and reuse across experiments.
 
 ```python
 step1_output_ds = step1_output_data.register_on_complete(
````
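The hunk truncates the `register_on_complete` call after its opening line. A hypothetical completion; the registration name and description here are illustrative, not the article's:

```python
step1_output_ds = step1_output_data.register_on_complete(
    name='processed_data_ds',
    description='Intermediate output registered for reuse across experiments',
)
```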
````diff
@@ -245,21 +245,23 @@ step1_output_ds = step1_output_data.register_on_complete(
 )
 ```
 
-## Delete `OutputFileDatasetConfig` contents when no longer needed
+## Delete `OutputFileDatasetConfig` content when it's no longer needed
 
-Azure doesn't automatically delete intermediate data written with `OutputFileDatasetConfig`. To avoid storage charges for large amounts of unneeded data, you should either:
-
-> [!CAUTION]
-> Only delete intermediate data after 30 days from the last change date of the data. Deleting the data earlier could cause the pipeline run to fail because the pipeline will assume the intermediate data exists within 30 day period for reuse.
+Azure doesn't automatically delete intermediate data that's written with `OutputFileDatasetConfig`. To avoid storage charges for large amounts of unneeded data, you should take one of the following actions:
 
 * Programmatically delete intermediate data at the end of a pipeline job, when it's no longer needed.
-* Use blob storage with a short-term storage policy for intermediate data (see [Optimize costs by automating Azure Blob Storage access tiers](/azure/storage/blobs/lifecycle-management-overview)). This policy can only be set to a workspace's non-default datastore. Use `OutputFileDatasetConfig` to export intermediate data to another datastore that isn't the default.
+* Use blob storage with a short-term storage policy for intermediate data. (See [Optimize costs by automating Azure Blob Storage access tiers](/azure/storage/blobs/lifecycle-management-overview).) This policy can be set only on a workspace's non-default datastore. Use `OutputFileDatasetConfig` to export intermediate data to another datastore that isn't the default.
+
 ```Python
-# Get adls gen 2 datastore already registered with the workspace
+# Get Data Lake Storage Gen2 datastore that's already registered with the workspace
 datastore = workspace.datastores['my_adlsgen2']
 step1_output_data = OutputFileDatasetConfig(name="processed_data", destination=(datastore, "mypath/{run-id}/{output-name}")).as_upload()
 ```
-* Regularly review and delete no-longer-needed data.
+
+* Regularly review data and delete data that you don't need.
+
+> [!CAUTION]
+> Only delete intermediate data after 30 days from the last change date of the data. Deleting intermediate data earlier could cause the pipeline run to fail because the pipeline assumes the data exists for a 30-day period for reuse.
 
 For more information, see [Plan and manage costs for Azure Machine Learning](../concept-plan-manage-cost.md).
 
````
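The first bullet in this hunk, programmatically deleting intermediate data, has no snippet in the article. A hedged sketch using the separate azure-storage-blob package, with placeholder connection details; the prefix mirrors the `mypath/{run-id}/{output-name}` destination used above:

```python
from azure.storage.blob import ContainerClient

# Placeholder values; in practice, derive these from the registered
# datastore rather than hard-coding them.
container = ContainerClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="<datastore-container>",
)

# Delete everything under the run's intermediate-data prefix -- but only
# after the 30-day reuse window in the caution above has passed.
for blob in container.list_blobs(name_starts_with="mypath/<run-id>/"):
    container.delete_blob(blob.name)
```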