Named inputs to your pipeline step script are available as a dictionary within the `Run` object. Retrieve the active `Run` object by using `Run.get_context()`, and then retrieve the dictionary of named inputs by using `input_datasets`. If you passed the `DatasetConsumptionConfig` object by using the `arguments` argument rather than the `inputs` argument, access the data by using `ArgumentParser` code. Both techniques are demonstrated in the following snippets:
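The following sketch illustrates both techniques. The input name `prepared_data` and the `--data-path` argument are illustrative placeholders, not names from the original snippets:

```python
import argparse

from azureml.core import Run

# Technique 1: the DatasetConsumptionConfig was passed through `inputs`,
# so the dataset is available by name on the run context.
run = Run.get_context()
dataset = run.input_datasets["prepared_data"]  # "prepared_data" is whatever name you gave the input
df = dataset.to_pandas_dataframe()             # for a TabularDataset input

# Technique 2: the DatasetConsumptionConfig was passed through `arguments`,
# so its resolved value (for example, a mount or download path) arrives as
# an ordinary command-line argument.
parser = argparse.ArgumentParser()
parser.add_argument("--data-path", dest="data_path", required=True)
args = parser.parse_args()
print(f"Data is available at {args.data_path}")
```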
## Use `OutputFileDatasetConfig` for intermediate data
Although `Dataset` objects represent only persistent data, [`OutputFileDatasetConfig`](/python/api/azureml-core/azureml.data.outputfiledatasetconfig) objects can be used for temporary data output from pipeline steps and for persistent output data. `OutputFileDatasetConfig` supports writing data to blob storage, fileshare, Azure Data Lake Storage Gen1, or Data Lake Storage Gen2. It supports both mount mode and upload mode. In mount mode, files written to the mounted directory are permanently stored when the file is closed. In upload mode, files written to the output directory are uploaded at the end of the job. If the job fails or is canceled, the output directory isn't uploaded.
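The following sketch illustrates the two modes. The output names are placeholders, not names from this article:

```python
from azureml.data import OutputFileDatasetConfig

# Mount mode: the output directory is mounted, and files are permanently stored as each file is closed.
mounted_output = OutputFileDatasetConfig(name="mounted_output").as_mount()

# Upload mode: files written to the output directory are uploaded when the job finishes successfully.
uploaded_output = OutputFileDatasetConfig(name="uploaded_output").as_upload()
```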
The `OutputFileDatasetConfig` object's default behavior is to write to the default datastore of the workspace. Pass your `OutputFileDatasetConfig` objects to your `PythonScriptStep` by using the `arguments` parameter.
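As a minimal sketch of that wiring (the step name, script name, and `aml_run_config` run configuration are assumptions, not code from this article):

```python
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

# With no destination specified, the output goes to the workspace's default datastore.
step1_output = OutputFileDatasetConfig(name="processed_data")

step1 = PythonScriptStep(
    name="generate_data",        # placeholder step name
    script_name="step1.py",      # placeholder script name
    runconfig=aml_run_config,    # an existing RunConfiguration, assumed to be defined earlier
    arguments=["--output_path", step1_output],
)
```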
> Concurrent writes to an `OutputFileDatasetConfig` will fail. Don't try to use a single `OutputFileDatasetConfig` concurrently. Don't share a single `OutputFileDatasetConfig` in a multiprocessing situation, like when you use [distributed training](../how-to-train-distributed-gpu.md).
### Use `OutputFileDatasetConfig` as outputs of a training step
In your pipeline's `PythonScriptStep`, you can retrieve the available output paths by using the program's arguments. If this step is the first and will initialize the output data, you need to create the directory at the specified path. You can then write whatever files you want to be contained in the `OutputFileDatasetConfig`.
```python
import argparse
import os

parser = argparse.ArgumentParser()
# The argument name is illustrative; it must match the name used in the step's `arguments` list.
parser.add_argument("--output_path", dest="output_path", required=True)
args = parser.parse_args()

# This step initializes the output data, so create the output directory first.
os.makedirs(args.output_path, exist_ok=True)

# Write any files that you want stored in the OutputFileDatasetConfig.
with open(os.path.join(args.output_path, "sample_output.txt"), "w") as f:
    f.write("output written by the training step")
```
After the initial pipeline step writes some data to the `OutputFileDatasetConfig` output, it can be used as an input to a later step.

In the following code:
* `step1_output_data` indicates that the output of the `PythonScriptStep` `step1` is written to the Data Lake Storage Gen2 datastore `my_adlsgen2` in upload access mode. Learn more about how to [set up role permissions](how-to-access-data.md) to write data back to Data Lake Storage Gen2 datastores.
* After `step1` completes and the output is written to the destination that's indicated by `step1_output_data`, `step2` is ready to use `step1_output_data` as an input.
```python
# Get the Data Lake Storage Gen2 datastore that's already registered with the workspace.
# `workspace` and `aml_run_config` are assumed to be defined earlier in the pipeline setup;
# the step names, script names, and destination path are placeholders.
datastore = workspace.datastores["my_adlsgen2"]

# Write step1's output to the Data Lake Storage Gen2 datastore in upload access mode.
step1_output_data = OutputFileDatasetConfig(
    name="processed_data",
    destination=(datastore, "mypath/{run-id}/{output-name}"),
).as_upload()

step1 = PythonScriptStep(
    name="generate_data",
    script_name="step1.py",
    runconfig=aml_run_config,
    arguments=["--output_path", step1_output_data],
)

# step2 consumes step1's output as an input.
step2 = PythonScriptStep(
    name="read_pipeline_data",
    script_name="step2.py",
    runconfig=aml_run_config,
    arguments=["--pd", step1_output_data.as_input()],
)
```
> Reading the data in the Python script `step2.py` is the same as the process described earlier in [Access datasets within your script](#access-datasets-within-your-script). Use `ArgumentParser` to add an argument of `--pd` in your script to access the data.
## Register `OutputFileDatasetConfig` objects for reuse
If you want to make an `OutputFileDatasetConfig` object available for longer than the duration of your experiment, register it to your workspace to share it and reuse it across experiments.
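A minimal sketch of registration by using `register_on_complete`, assuming the `step1_output_data` object from the earlier snippet; the registered dataset name and description are placeholders:

```python
# Register the step's output as a dataset in the workspace when the producing run completes.
step1_output_ds = step1_output_data.register_on_complete(
    name="processed_data_ds",  # placeholder dataset name
    description="Intermediate data produced by step1",
)
```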
## Delete `OutputFileDatasetConfig` content when it's no longer needed
Azure doesn't automatically delete intermediate data that's written with `OutputFileDatasetConfig`. To avoid storage charges for large amounts of unneeded data, you should take one of the following actions:
* Programmatically delete intermediate data at the end of a pipeline job, when it's no longer needed.
* Use blob storage with a short-term storage policy for intermediate data. (See [Optimize costs by automating Azure Blob Storage access tiers](/azure/storage/blobs/lifecycle-management-overview).) This policy can be set only on a workspace's non-default datastore. Use `OutputFileDatasetConfig` to export intermediate data to another datastore that isn't the default.
```Python
# Get the Data Lake Storage Gen2 datastore that's already registered with the workspace.
# `workspace` is assumed to be defined earlier; the output name and destination path are placeholders.
datastore = workspace.datastores["my_adlsgen2"]

# Export the intermediate data to this non-default datastore instead of the workspace default,
# so that a short-term lifecycle policy on that storage account can clean it up.
intermediate_data = OutputFileDatasetConfig(
    name="intermediate_data",
    destination=(datastore, "intermediate/{run-id}/{output-name}"),
).as_upload()
```
* Regularly review the data and delete what you no longer need.
> [!CAUTION]
> Delete intermediate data only after 30 days have passed since the data's last change date. Deleting the data earlier could cause the pipeline run to fail, because the pipeline assumes that the intermediate data exists within the 30-day period for reuse.
For more information, see [Plan and manage costs for Azure Machine Learning](../concept-plan-manage-cost.md).