Commit 4e1a556

Merge pull request #112033 from nibaccam/data-train
Data train
2 parents: cae4847 + 9859aee

3 files changed: +69 −59 lines


articles/machine-learning/concept-data.md

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ Datasets can be created from local files, public urls, [Azure Open Datasets](htt
We support 2 types of datasets:
+ A [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files. You can load a TabularDataset into a Pandas or Spark DataFrame for further manipulation and cleansing. For a complete list of data formats you can create TabularDatasets from, see the [TabularDatasetFactory class](https://aka.ms/tabulardataset-api-reference).

- + A [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public URLs. You can [download or mount files](how-to-train-with-datasets.md#option-2--mount-files-to-a-remote-compute-target) referenced by FileDatasets to your compute target.
+ + A [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public URLs. You can [download or mount files](how-to-train-with-datasets.md#mount-files-to-remote-compute-targets) referenced by FileDatasets to your compute target.

Additional datasets capabilities can be found in the following documentation:
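The two dataset types contrasted in this bullet map to different factory methods. A minimal sketch, reusing the public Titanic CSV that the training how-to below also uses (the variable names are illustrative only):

```python
from azureml.core.dataset import Dataset

web_path = 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'

# TabularDataset: parses the delimited file into rows and columns
titanic_tabular_ds = Dataset.Tabular.from_delimited_files(path=web_path)

# FileDataset: references the same file as-is, without parsing it
titanic_file_ds = Dataset.File.from_files(path=web_path)
```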

articles/machine-learning/how-to-configure-auto-train.md

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@ For remote executions, training data must be accessible from the remote compute.
* easily transfer data from static files or URL sources into your workspace
* make your data available to training scripts when running on cloud compute resources

- See the [how-to](how-to-train-with-datasets.md#option-2--mount-files-to-a-remote-compute-target) for an example of using the `Dataset` class to mount data to your compute target.
+ See the [how-to](how-to-train-with-datasets.md#mount-files-to-remote-compute-targets) for an example of using the `Dataset` class to mount data to your compute target.

## Train and validation data

articles/machine-learning/how-to-train-with-datasets.md

Lines changed: 67 additions & 57 deletions
@@ -10,20 +10,16 @@ ms.author: sihhu
author: MayMSFT
manager: cgronlun
ms.reviewer: nibaccam
- ms.date: 03/09/2020
+ ms.date: 04/20/2020

- # Customer intent: As an experienced Python developer, I need to make my data available to my local or remote compute to train my machine learning models.
+ # Customer intent: As an experienced Python developer, I need to make my data available to my local or remote compute target to train my machine learning models.

---

# Train with datasets in Azure Machine Learning
[!INCLUDE [applies-to-skus](../../includes/aml-applies-to-basic-enterprise-sku.md)]

- In this article, you learn the two ways to consume [Azure Machine Learning datasets](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset%28class%29?view=azure-ml-py) in a remote experiment training runs without worrying about connection strings or data paths.
-
- - Option 1: If you have structured data, create a TabularDataset and use it directly in your training script.
-
- - Option 2: If you have unstructured data, create a FileDataset and mount or download files to a remote compute for training.
+ In this article, you learn how to work with [Azure Machine Learning datasets](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset%28class%29?view=azure-ml-py) in your training experiments. You can use datasets in your local or remote compute target without worrying about connection strings or data paths.

Azure Machine Learning datasets provide a seamless integration with Azure Machine Learning training products like [ScriptRun](https://docs.microsoft.com/python/api/azureml-core/azureml.core.scriptrun?view=azure-ml-py), [Estimator](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.estimator?view=azure-ml-py), [HyperDrive](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.hyperdrive?view=azure-ml-py) and [Azure Machine Learning pipelines](how-to-create-your-first-pipeline.md).
@@ -40,26 +36,14 @@ To create and train with datasets, you need:
> [!Note]
> Some Dataset classes have dependencies on the [azureml-dataprep](https://docs.microsoft.com/python/api/azureml-dataprep/?view=azure-ml-py) package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, Fedora, and CentOS.

- ## Option 1: Use datasets directly in training scripts
-
- In this example, you create a [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) and use it as a direct input to your `estimator` object for training.
-
- ### Create a TabularDataset
+ ## Access and explore input datasets

- The following code creates an unregistered TabularDataset from a web url. You can also create datasets from local files or paths in datastores. Learn more about [how to create datasets](https://aka.ms/azureml/howto/createdatasets).
+ You can access an existing TabularDataset from the training script of an experiment on your workspace, and load that dataset into a pandas dataframe for further exploration on your local environment.

- ```Python
- from azureml.core.dataset import Dataset
+ The following code uses the `get_context()` method in the [`Run`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.run.run?view=azure-ml-py) class to access the existing input TabularDataset, `titanic`, in the training script. It then uses the [`to_pandas_dataframe()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset#to-pandas-dataframe-on-error--null---out-of-range-datetime--null--) method to load that dataset into a pandas dataframe for further data exploration and preparation prior to training.

- web_path ='https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
- titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path)
- ```
-
- ### Access the input dataset in your training script
-
- TabularDataset objects provide the ability to load the data into a pandas or spark DataFrame so that you can work with familiar data preparation and training libraries. To leverage this capability, you can pass a TabularDataset as the input in your training configuration, and then retrieve it in your script.
-
- To do so, access the input dataset through the [`Run`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.run.run?view=azure-ml-py) object in your training script and use the [`to_pandas_dataframe()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset#to-pandas-dataframe-on-error--null---out-of-range-datetime--null--) method.
+ > [!Note]
+ > If your original data source contains NaN, empty strings, or blank values, those values are replaced with *Null* when you use `to_pandas_dataframe()`.

```Python
%%writefile $script_folder/train_titanic.py
@@ -69,10 +53,32 @@ from azureml.core import Dataset, Run
run = Run.get_context()
# get the input dataset by name
dataset = run.input_datasets['titanic']
+
# load the TabularDataset to pandas DataFrame
df = dataset.to_pandas_dataframe()
```

+ If you need to load the prepared data into a new dataset from an in-memory pandas dataframe, write the data to a local file, like a parquet file, and create a new dataset from that file. You can also create datasets from local files or paths in datastores. Learn more about [how to create datasets](how-to-create-register-datasets.md).
+
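To illustrate the suggestion in the added line above, a minimal sketch of writing a prepared dataframe to a parquet file and creating a new TabularDataset from it might look like the following; the sample dataframe, the datastore upload, and the `prepared/` target path are assumptions for illustration, not part of this change.

```python
# Illustrative sketch only; assumes a workspace config file is available locally.
import pandas as pd
from azureml.core import Dataset, Workspace

df = pd.DataFrame({'Survived': [0, 1], 'Age': [22.0, 38.0]})  # stand-in for prepared data

# write the prepared data to a local parquet file
df.to_parquet('prepared_titanic.parquet')

# upload the file to the workspace's default datastore
ws = Workspace.from_config()
datastore = ws.get_default_datastore()
datastore.upload_files(files=['prepared_titanic.parquet'],
                       target_path='prepared/',
                       overwrite=True)

# create a new TabularDataset from the uploaded parquet file
prepared_ds = Dataset.Tabular.from_parquet_files(
    path=(datastore, 'prepared/prepared_titanic.parquet'))
```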
+ ## Use datasets directly in training scripts
+
+ If you have structured data not yet registered as a dataset, create a TabularDataset and use it directly in your training script for your local or remote experiment.
+
+ In this example, you create an unregistered [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) and use it as a direct input to your `estimator` object for training. If you want to reuse this TabularDataset with other experiments in your workspace, see [how to register datasets to your workspace](how-to-create-register-datasets.md#register-datasets).
+
+ ### Create a TabularDataset
+
+ The following code creates an unregistered TabularDataset from a web URL.
+
+ ```Python
+ from azureml.core.dataset import Dataset
+
+ web_path = 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
+ titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path)
+ ```
+
+ TabularDataset objects provide the ability to load the data in your TabularDataset into a pandas or Spark DataFrame so that you can work with familiar data preparation and training libraries without having to leave your notebook. To leverage this capability, see [access and explore input datasets](#access-and-explore-input-datasets).
+

### Configure the estimator

An [estimator](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.estimator.estimator?view=azure-ml-py) object is used to submit the experiment run. Azure Machine Learning has pre-configured estimators for common machine learning frameworks, as well as a generic estimator.
@@ -81,7 +87,7 @@ This code creates a generic estimator object, `est`, that specifies

* A script directory for your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
* The training script, *train_titanic.py*.
- * The input dataset for training, `titanic`. `as_named_input()` is required so that the input dataset can be referenced by the assigned name in your training script.
+ * The input dataset for training, `titanic_ds`. `as_named_input()` is required so that the input dataset can be referenced by the assigned name `titanic` in your training script.
* The compute target for the experiment.
* The environment definition for the experiment.
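A minimal sketch of the generic estimator this list describes, assuming `script_folder`, `compute_target`, and an environment object are defined earlier in the notebook (they are not part of this hunk):

```python
# Illustrative sketch only; placeholder names, not the documented snippet.
from azureml.train.estimator import Estimator

est = Estimator(source_directory=script_folder,
                entry_script='train_titanic.py',
                # exposes the dataset as run.input_datasets['titanic'] in the script
                inputs=[titanic_ds.as_named_input('titanic')],
                compute_target=compute_target,
                environment_definition=env)
```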

@@ -98,34 +104,11 @@ experiment_run = experiment.submit(est)
experiment_run.wait_for_completion(show_output=True)
```

+ ## Mount files to remote compute targets

- ## Option 2: Mount files to a remote compute target
-
- If you want to make your data files available on the compute target for training, use [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) to mount or download files referred by it.
-
- ### Mount vs. Download
-
- Mounting or downloading files of any format are supported for datasets created from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL.
-
- When you mount a dataset, you attach the files referenced by the dataset to a directory (mount point) and make it available on the compute target. Mounting is supported for Linux-based computes, including Azure Machine Learning Compute, virtual machines, and HDInsight. When you download a dataset, all the files referenced by the dataset will be downloaded to the compute target. Downloading is supported for all compute types.
-
- If your script processes all files referenced by the dataset, and your compute disk can fit your full dataset, downloading is recommended to avoid the overhead of streaming data from storage services. If your data size exceeds the compute disk size, downloading is not possible. For this scenario, we recommend mounting since only the data files used by your script are loaded at the time of processing.
-
- The following code mounts `dataset` to the temp directory at `mounted_path`
-
- ```python
- import tempfile
- mounted_path = tempfile.mkdtemp()
-
- # mount dataset onto the mounted_path of a Linux-based compute
- mount_context = dataset.mount(mounted_path)
+ If you have unstructured data, create a [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py) and either mount or download your data files to make them available to your remote compute target for training. Learn about when to use [mount vs. download](#mount-vs-download) for your remote training experiments.

- mount_context.start()
-
- import os
- print(os.listdir(mounted_path))
- print (mounted_path)
- ```
+ The following example creates a FileDataset and mounts the dataset to the compute target by passing it as an argument in the estimator for training.

### Create a FileDataset
@@ -145,9 +128,9 @@ mnist_ds = Dataset.File.from_files(path = web_paths)

### Configure the estimator

- Besides passing the dataset through the `inputs` parameter in the estimator, you can also pass the dataset through `script_params` and get the data path (mounting point) in your training script via arguments. This way, you can keep your training script independent of azureml-sdk. In other words, you will be able use the same training script for local debugging and remote training on any cloud platform.
+ We recommend passing the dataset as an argument when mounting. Besides passing the dataset through the `inputs` parameter in the estimator, you can also pass the dataset through `script_params` and get the data path (mounting point) in your training script via arguments. This way, you will be able to use the same training script for local debugging and remote training on any cloud platform.

- An [SKLearn](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.sklearn.sklearn?view=azure-ml-py) estimator object is used to submit the run for scikit-learn experiments. Learn more about training with the [SKlearn estimator](how-to-train-scikit-learn.md).
+ An [SKLearn](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.sklearn.sklearn?view=azure-ml-py) estimator object is used to submit the run for scikit-learn experiments. After you submit the run, data files referred by the `mnist` dataset will be mounted to the compute target. Learn more about training with the [SKlearn estimator](how-to-train-scikit-learn.md).

```Python
from azureml.train.sklearn import SKLearn
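The SKLearn snippet itself is truncated by the diff context above. As a rough sketch of the `script_params` approach the new text recommends (placeholder names, not the documented code):

```python
# Illustrative sketch only; script_folder and compute_target are placeholders.
from azureml.train.sklearn import SKLearn

script_params = {
    # the mount point is passed to train_mnist.py as the --data-folder argument
    '--data-folder': mnist_ds.as_named_input('mnist').as_mount()
}

est = SKLearn(source_directory=script_folder,
              script_params=script_params,
              compute_target=compute_target,
              entry_script='train_mnist.py')
```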
@@ -171,7 +154,7 @@ run.wait_for_completion(show_output=True)

### Retrieve the data in your training script

- After you submit the run, data files referred by the `mnist` dataset will be mounted to the compute target. The following code shows how to retrieve the data in your script.
+ The following code shows how to retrieve the data in your script.

```Python
%%writefile $script_folder/train_mnist.py
@@ -205,14 +188,41 @@ y_train = load_data(y_train_path, True).reshape(-1)
y_test = load_data(y_test, True).reshape(-1)
```

+
+ ## Mount vs download
+
+ Mounting or downloading files of any format is supported for datasets created from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL.
+
+ When you mount a dataset, you attach the files referenced by the dataset to a directory (mount point) and make it available on the compute target. Mounting is supported for Linux-based computes, including Azure Machine Learning Compute, virtual machines, and HDInsight.
+
+ When you download a dataset, all the files referenced by the dataset will be downloaded to the compute target. Downloading is supported for all compute types.
+
+ If your script processes all files referenced by the dataset, and your compute disk can fit your full dataset, downloading is recommended to avoid the overhead of streaming data from storage services. If your data size exceeds the compute disk size, downloading is not possible. For this scenario, we recommend mounting since only the data files used by your script are loaded at the time of processing.
+
+ The following code mounts `dataset` to the temp directory at `mounted_path`.
+
+ ```python
+ import tempfile
+ mounted_path = tempfile.mkdtemp()
+
+ # mount dataset onto the mounted_path of a Linux-based compute
+ mount_context = dataset.mount(mounted_path)
+
+ mount_context.start()
+
+ import os
+ print(os.listdir(mounted_path))
+ print(mounted_path)
+ ```
+
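As a counterpart to the mount snippet added above, downloading the same `dataset` could look roughly like this (an illustrative sketch, not part of the change):

```python
# Illustrative sketch only: download instead of mount.
import os
import tempfile

download_path = tempfile.mkdtemp()

# download every file referenced by the dataset to the compute's local disk
dataset.download(target_path=download_path, overwrite=False)

print(os.listdir(download_path))
```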
## Notebook examples

The [dataset notebooks](https://aka.ms/dataset-tutorial) demonstrate and expand upon concepts in this article.

## Next steps

- * [Auto train machine learning models](how-to-auto-train-remote.md) with TabularDatasets
+ * [Auto train machine learning models](how-to-auto-train-remote.md) with TabularDatasets.

- * [Train image classification models](https://aka.ms/filedataset-samplenotebook) with FileDatasets
+ * [Train image classification models](https://aka.ms/filedataset-samplenotebook) with FileDatasets.

- * [Train with datasets using pipelines](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datasets-tutorial/pipeline-with-datasets/pipeline-for-image-classification.ipynb)
+ * [Train with datasets using pipelines](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datasets-tutorial/pipeline-with-datasets/pipeline-for-image-classification.ipynb).
