Commit 6de313a

clarified arguments + mount v download

1 parent 5dfc836

1 file changed: articles/machine-learning/how-to-train-with-datasets.md (+56 −57 lines)

ms.author: sihhu
author: MayMSFT
manager: cgronlun
ms.reviewer: nibaccam
ms.date: 04/20/2020

# Customer intent: As an experienced Python developer, I need to make my data available to my local or remote compute target to train my machine learning models.

---

> [!Note]
> Some Dataset classes have dependencies on the [azureml-dataprep](https://docs.microsoft.com/python/api/azureml-dataprep/?view=azure-ml-py) package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, Fedora, and CentOS.

## Use datasets directly in training scripts

If you have structured data, create a TabularDataset and use it directly in your training script for your local or remote experiment.

In this example, you create a [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) and use it as a direct input to your `estimator` object for training.

### Create a TabularDataset

TabularDataset objects provide the ability to load the data into a pandas or Spark DataFrame, so that you can work with familiar data preparation and training libraries without having to leave your notebook. To use this capability, see [how to access input datasets](#access-input-dataset).

The following code creates an unregistered TabularDataset from a web URL. You can also create datasets from local files or paths in datastores, as shown in the sketch after this code block. Learn more about [how to create datasets](https://aka.ms/azureml/howto/createdatasets).
```Python
from azureml.core import Dataset

# web_path appears in the diff's hunk context; the import above is assumed
web_path = 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path)
```
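
The same factory method also accepts datastore paths. Here's a minimal sketch of that variant; the datastore name `workspaceblobstore` and the relative file path are illustrative assumptions, not part of the original article:

```Python
from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()

# hypothetical datastore name and relative path, for illustration only
datastore = Datastore.get(ws, 'workspaceblobstore')

# (datastore, relative_path) tuples identify files within a datastore
titanic_ds = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'demo/Titanic.csv')])
```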

### Configure the estimator

An [estimator](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.estimator.estimator?view=azure-ml-py) object is used to submit the experiment run. Azure Machine Learning has pre-configured estimators for common machine learning frameworks, as well as a generic estimator.

This code creates a generic estimator object, `est`, that specifies (a sketch follows the list):

* A script directory for your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
* The training script, *train_titanic.py*.
* The input dataset for training, `titanic_ds`. `as_named_input()` is required so that the input dataset can be referenced by the assigned name `titanic` in your training script.
* The compute target for the experiment.
* The environment definition for the experiment.
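
The `est` construction itself falls between this diff's hunks; the following is a minimal sketch consistent with the bullets above, assuming `script_folder`, `compute_target`, and `env` are defined earlier in the notebook:

```Python
from azureml.train.estimator import Estimator

# generic estimator; as_named_input('titanic') exposes the dataset to the
# training script as run.input_datasets['titanic']
est = Estimator(source_directory=script_folder,
                entry_script='train_titanic.py',
                inputs=[titanic_ds.as_named_input('titanic')],
                compute_target=compute_target,
                environment_definition=env)
```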

```Python
# 'experiment' is an azureml.core Experiment defined earlier
experiment_run = experiment.submit(est)
experiment_run.wait_for_completion(show_output=True)
```

### Access input dataset

If you want to get the dataset used in your training run, access it through the [`Run`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.run.run?view=azure-ml-py) object in your training script.

The following code uses the `get_context()` method of the `Run` class to access the input TabularDataset, `titanic`, in the training script, and then uses the [`to_pandas_dataframe()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset#to-pandas-dataframe-on-error--null---out-of-range-datetime--null--) method to load that dataset into a pandas DataFrame.

```Python
%%writefile $script_folder/train_titanic.py

from azureml.core import Dataset, Run

run = Run.get_context()
# get the input dataset by name
dataset = run.input_datasets['titanic']

# load the TabularDataset to pandas DataFrame
df = dataset.to_pandas_dataframe()
```

## Mount files to remote compute targets

If you have unstructured data, create a [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py) and either mount or download your data files to make them available to your remote compute target for training. Learn about when to use [mount vs. download](#mount-vs-download) for your training experiments.

The following example creates a FileDataset and mounts the dataset to the compute target by passing it as an argument in the estimator for training.

### Create a FileDataset
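
The creation code itself sits outside this diff's hunks; judging from the `mnist_ds = Dataset.File.from_files(path = web_paths)` context line, a sketch might look like the following (the MNIST URLs are illustrative assumptions):

```Python
from azureml.core import Dataset

# illustrative web paths; datastore paths or local files work here too
web_paths = [
    'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz'
]
mnist_ds = Dataset.File.from_files(path=web_paths)
```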

### Configure the estimator

We recommend passing the dataset as an argument when mounting. Besides passing the dataset through the `inputs` parameter in the estimator, you can also pass the dataset through `script_params` and get the data path (mount point) in your training script via arguments. This way, you'll be able to use the same training script for local debugging and remote training on any cloud platform.

An [SKLearn](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.sklearn.sklearn?view=azure-ml-py) estimator object is used to submit the run for scikit-learn experiments. After you submit the run, the data files referred to by the `mnist` dataset are mounted to the compute target. Learn more about training with the [SKLearn estimator](how-to-train-scikit-learn.md).

```Python
from azureml.train.sklearn import SKLearn
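
# --- sketch: the est = SKLearn(...) construction falls between this diff's
# hunks. Per the prose above, the dataset is passed through script_params;
# script_folder, compute_target, and env are assumed to be defined earlier. ---
script_params = {
    # as_mount() passes the dataset's mount point to the script as an argument
    '--data-folder': mnist_ds.as_named_input('mnist').as_mount()
}

est = SKLearn(source_directory=script_folder,
              script_params=script_params,
              compute_target=compute_target,
              environment_definition=env,
              entry_script='train_mnist.py')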
run = experiment.submit(est)
run.wait_for_completion(show_output=True)
```
### Retrieve the data in your training script

The following code shows how to retrieve the data in your script.

```Python
%%writefile $script_folder/train_mnist.py
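
# --- sketch: the body of train_mnist.py between this line and the load_data
# calls below falls outside the diff's hunks. Per the prose above, the mount
# point arrives as a script argument; utils.load_data and the glob patterns
# are assumptions for illustration. ---
import argparse
import glob
import os

from utils import load_data

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder',
                    help='data folder mounting point')
args = parser.parse_args()

# locate the mounted label files under the data folder
y_train_path = glob.glob(os.path.join(args.data_folder,
                         '**/train-labels-idx1-ubyte.gz'), recursive=True)[0]
y_test_path = glob.glob(os.path.join(args.data_folder,
                        '**/t10k-labels-idx1-ubyte.gz'), recursive=True)[0]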
y_train = load_data(y_train_path, True).reshape(-1)
y_test = load_data(y_test_path, True).reshape(-1)
```

## Mount vs. download

Mounting or downloading files of any format is supported for datasets created from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL.

When you mount a dataset, you attach the files referenced by the dataset to a directory (mount point) and make them available on the compute target. Mounting is supported for Linux-based computes, including Azure Machine Learning Compute, virtual machines, and HDInsight.

When you download a dataset, all the files referenced by the dataset will be downloaded to the compute target. Downloading is supported for all compute types.

If your script processes all files referenced by the dataset, and your compute disk can fit your full dataset, downloading is recommended to avoid the overhead of streaming data from storage services. If your data size exceeds the compute disk size, downloading is not possible. For this scenario, we recommend mounting, since only the data files used by your script are loaded at the time of processing.

The following code mounts `dataset` to the temp directory at `mounted_path`.

```python
import os
import tempfile

mounted_path = tempfile.mkdtemp()

# mount dataset onto the mounted_path of a Linux-based compute
mount_context = dataset.mount(mounted_path)
mount_context.start()

# inspect the mounted files
print(os.listdir(mounted_path))
print(mounted_path)
```
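
Downloading, by contrast, is a single call. Here's a hedged sketch, where the target directory is an illustrative choice and `dataset` is the FileDataset from above:

```python
import os
import tempfile

# download all files referenced by the dataset to a local directory
download_path = tempfile.mkdtemp()
dataset.download(target_path=download_path, overwrite=False)

# inspect the downloaded files
print(os.listdir(download_path))
```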

## Notebook examples

The [dataset notebooks](https://aka.ms/dataset-tutorial) demonstrate and expand upon concepts in this article.

## Next steps

* [Auto train machine learning models](how-to-auto-train-remote.md) with TabularDatasets.

* [Train image classification models](https://aka.ms/filedataset-samplenotebook) with FileDatasets.

* [Train with datasets using pipelines](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datasets-tutorial/pipeline-with-datasets/pipeline-for-image-classification.ipynb).
