#Customer intent: As an experienced Python developer, I need to make my data available to my local or remote compute target to train my machine learning models.
---
# Train models with Azure Machine Learning datasets
In this article, you learn how to work with [Azure Machine Learning datasets](/python/api/azureml-core/azureml.core.dataset%28class%29) to train machine learning models. You can use datasets in your local or remote compute target without worrying about connection strings or data paths.
* For structured data, see [Consume datasets in machine learning training scripts](#consume-datasets-in-machine-learning-training-scripts).
* For unstructured data, see [Mount files to remote compute targets](#mount-files-to-remote-compute-targets).

Azure Machine Learning datasets provide seamless integration with Azure Machine Learning training functionality like [ScriptRunConfig](/python/api/azureml-core/azureml.core.scriptrunconfig), [HyperDrive](/python/api/azureml-train-core/azureml.train.hyperdrive), and [Azure Machine Learning pipelines](./how-to-create-machine-learning-pipelines.md).

If you aren't ready to make your data available for model training, but want to load your data to your notebook for data exploration, see how to [explore the data in your dataset](how-to-create-register-datasets.md).
## Prerequisites
To create and train with datasets, you need:
> [!Note]
> Some Dataset classes have dependencies on the [azureml-dataprep](https://pypi.org/project/azureml-dataprep/) package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, and Fedora.
## Consume datasets in machine learning training scripts
If you have structured data not yet registered as a dataset, create a TabularDataset and use it directly in your training script for your local or remote experiment.

In this example, you create an unregistered [TabularDataset](/python/api/azureml-core/azureml.data.tabulardataset) and specify it as a script argument in the [ScriptRunConfig](/python/api/azureml-core/azureml.core.script_run_config.scriptrunconfig) object for training. If you want to reuse this TabularDataset with other experiments in your workspace, see [how to register datasets to your workspace](how-to-create-register-datasets.md).
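
A minimal sketch of creating such an unregistered TabularDataset from a delimited file on the web might look like the following. The Titanic CSV URL is an assumption for illustration; substitute the path to your own data.

```python
from azureml.core import Dataset

# Assumed example data source; replace with the path to your own delimited file
web_path = 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'

# Create an unregistered TabularDataset directly from the file
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path)
```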
TabularDataset objects offer a way to load the data in your TabularDataset into a pandas or Spark DataFrame so that you can work with familiar data preparation and training libraries without having to leave your notebook.
### Access dataset in training script
The following code configures a script argument `--input-data` that you'll specify when you configure your training run (see next section). When the tabular dataset is passed in as the argument value, Azure Machine Learning resolves it to the dataset ID. You can then use that argument value to access the dataset in your training script (without having to hardcode the name or ID of the dataset in your script). It then uses the [`to_pandas_dataframe()`](/python/api/azureml-core/azureml.data.tabulardataset#to-pandas-dataframe-on-error--null---out-of-range-datetime--null--) method to load that dataset into a pandas dataframe for further data exploration and preparation before training.
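
A minimal sketch of what such a training script (*train_titanic.py*) might look like, assuming the `--input-data` argument described here:

```python
# train_titanic.py
import argparse

from azureml.core import Dataset, Run

parser = argparse.ArgumentParser()
parser.add_argument('--input-data', type=str, help='ID of the input tabular dataset')
args = parser.parse_args()

# Get the workspace from the run context instead of hardcoding it
run = Run.get_context()
ws = run.experiment.workspace

# The script argument resolves to the dataset ID at run time
titanic_ds = Dataset.get_by_id(ws, id=args.input_data)

# Load the TabularDataset into a pandas DataFrame for preparation and training
titanic_df = titanic_ds.to_pandas_dataframe()
print(titanic_df.head())
```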
> [!Note]
> If your original data source contains NaN, empty strings, or blank values, those values are replaced with *Null* when you use `to_pandas_dataframe()`.

The run configuration code, sketched after the following list, creates a ScriptRunConfig object, `src`, that specifies:
* A script directory for your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
* The training script, *train_titanic.py*.
* The input dataset for training, `titanic_ds`, as a script argument. Azure Machine Learning resolves it to the corresponding ID of the dataset when it's passed to your script.
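
A minimal sketch of that configuration, assuming an existing compute cluster named `cpu-cluster`, an environment named `my-training-env`, and a `./src` script folder (all illustrative names):

```python
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()

# Illustrative names; substitute your own compute target and environment
compute_target = ws.compute_targets['cpu-cluster']
env = Environment.get(ws, 'my-training-env')

# titanic_ds is the TabularDataset created earlier
src = ScriptRunConfig(source_directory='./src',       # uploaded to the cluster nodes
                      script='train_titanic.py',      # the training script
                      arguments=['--input-data', titanic_ds.as_named_input('titanic')],
                      compute_target=compute_target,
                      environment=env)

run = Experiment(ws, 'train-with-datasets').submit(src)
run.wait_for_completion(show_output=True)
```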
## Mount files to remote compute targets

If you have unstructured data, create a [FileDataset](/python/api/azureml-core/azureml.data.filedataset) and either mount or download your data files to make them available to your remote compute target for training. Learn about when to use [mount vs. download](#mount-vs-download) for your remote training experiments.

The following example:
* Creates an input FileDataset, `mnist_ds`, for your training data.
* Specifies where to write training results, and promotes those results as a FileDataset.
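
A minimal sketch of creating the input FileDataset might look like this; the default datastore and the `mnist-data/` folder are assumptions for illustration:

```python
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()

# Assumes the MNIST files were already uploaded to the workspace's default
# datastore under a 'mnist-data' folder; substitute your own datastore and path.
datastore = ws.get_default_datastore()
mnist_ds = Dataset.File.from_files(path=(datastore, 'mnist-data/'))
```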
You can specify where to write your training results with an [OutputFileDatasetConfig object](/python/api/azureml-core/azureml.data.output_dataset_config.outputfiledatasetconfig).

OutputFileDatasetConfig objects allow you to:
* Mount or upload the output of a run to cloud storage you specify.
* Save the output as a FileDataset to these supported storage types:
    * Azure Data Lake Storage generations 1 and 2
* Track the data lineage between training runs.

The following code specifies that training results should be saved as a FileDataset in the `outputdataset` folder in the default blob datastore, `def_blob_store`.
```python
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig

ws = Workspace.from_config()

# def_blob_store is the default blob datastore for the workspace
def_blob_store = ws.get_default_datastore()

# Save run output as a FileDataset in the 'outputdataset' folder of the default blob datastore
output = OutputFileDatasetConfig(destination=(def_blob_store, 'outputdataset'))
```
## Mount vs download
Mounting or downloading files of any format is supported for datasets created from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL.

When you **mount** a dataset, you attach the files referenced by the dataset to a directory (mount point) and make it available on the compute target. Mounting is supported for Linux-based computes, including Azure Machine Learning Compute, virtual machines, and HDInsight. If your data size exceeds the compute disk size, downloading isn't possible. For this scenario, we recommend mounting since only the data files used by your script are loaded at the time of processing.

When you **download** a dataset, all the files referenced by the dataset are downloaded to the compute target. Downloading is supported for all compute types. If your script processes all files referenced by the dataset, and your compute disk can fit your full dataset, downloading is recommended to avoid the overhead of streaming data from storage services. For multi-node downloads, see [how to avoid throttling](#troubleshooting).
> [!NOTE]
> The download path name should not be longer than 255 alpha-numeric characters for Windows OS. For Linux OS, the download path name should not be longer than 4,096 alpha-numeric characters. Also, for Linux OS the file name (which is the last segment of the download path `/path/to/file/{filename}`) should not be longer than 255 alpha-numeric characters.
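
As a brief illustration, the same FileDataset can be handed to a run either mounted or downloaded; `mnist_ds` here is the FileDataset from the earlier example:

```python
# Mount: files are streamed from storage on demand at the mount point
mounted_input = mnist_ds.as_named_input('mnist').as_mount()

# Download: every file referenced by the dataset is copied to the compute target first
downloaded_input = mnist_ds.as_named_input('mnist').as_download()

# Pass either object as a script argument in your ScriptRunConfig, for example:
# arguments=['--data-folder', mounted_input]
```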
Azure Blob storage has higher throughput speeds than an Azure file share, and will scale to large numbers of jobs started in parallel. For this reason, we recommend configuring your runs to use Blob storage for transferring source code files.
The following code example specifies in the run configuration which blob datastore to use for source code transfers.
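
A minimal sketch, assuming `src` is the ScriptRunConfig from earlier and that you want the workspace's default blob datastore, `workspaceblobstore`:

```python
# Transfer source code files through the workspace's default blob datastore
src.run_config.source_directory_data_store = "workspaceblobstore"
```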
**Dataset initialization failed: Waiting for mount point to be ready has timed out**:
* If you don't have any outbound [network security group](/azure/virtual-network/network-security-groups-overview) rules and are using `azureml-sdk>=1.12.0`, update `azureml-dataset-runtime` and its dependencies to be the latest for the specific minor version, or if you're using it in a run, recreate your environment so it can have the latest patch with the fix.
* If you're using `azureml-sdk<1.12.0`, upgrade to the latest version.
* If you have outbound NSG rules, make sure there's an outbound rule that allows all traffic for the service tag `AzureResourceMonitor`.

**Dataset initialization failed: StreamAccessException was caused by ThrottlingException**

For multi-node file downloads, all nodes might attempt to download all files in the file dataset from the Azure Storage service, which results in a throttling error. To avoid throttling, initially set the environment variable `AZUREML_DOWNLOAD_CONCURRENCY` to a value of eight times the number of CPU cores divided by the number of nodes. Setting up a value for this environment variable might require some experimentation, so the earlier guidance is a starting point.
The following example assumes 32 cores and 4 nodes.
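
A sketch of that setting, assuming an Environment object named `myenv` that you attach to your run configuration (8 * 32 cores / 4 nodes = 64):

```python
from azureml.core import Environment

myenv = Environment(name="myenv")  # illustrative environment name

# 8 * 32 CPU cores / 4 nodes = 64 concurrent downloads
myenv.environment_variables = {"AZUREML_DOWNLOAD_CONCURRENCY": "64"}
```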
**Unable to upload project files to working directory in AzureFile because the storage is overloaded**:
* If you use the file share for other workloads, such as data transfer, we recommend using blobs so that the file share remains free for submitting runs.
* You can also split the workload between two different workspaces.

**ConfigException: Could not create a connection to the AzureFileService due to missing credentials. Either an Account Key or SAS token needs to be linked the default workspace blob store.**

To ensure your storage access credentials are linked to the workspace and the associated file datastore, complete the following steps:
1. Navigate to your workspace in the [Azure portal](https://portal.azure.com).
1. Select the storage link on the workspace **Overview** page.
1. On the storage page, select **Access keys** on the left side menu.
1. Copy the key.
1. Navigate to the [Azure Machine Learning studio](https://ml.azure.com) for your workspace.
1. In the studio, select the file datastore for which you want to provide authentication credentials.
1. Select **Update authentication**.
1. Paste the key from the previous steps.
1. Select **Save**.
### Passing data as input
**TypeError: FileNotFound: No such file or directory**: This error occurs if the file path you provide isn't where the file is located. You need to make sure the way you refer to the file is consistent with where you mounted your dataset on your compute target. To ensure a deterministic state, we recommend using the abstract path when mounting a dataset to a compute target. For example, in the following code we mount the dataset under the root of the filesystem of the compute target, `/tmp`.
```python
# Note the leading / in '/tmp/dataset'
script_params = {
    # Mount the dataset at /tmp/dataset on the remote compute and pass the mounted
    # path to the training script ('--data-folder' is an assumed argument name)
    '--data-folder': mnist_ds.as_named_input('mnist').as_mount('/tmp/dataset')
}
```
If you don't include the leading forward slash, '/', you must prefix the working directory on the compute target, for example `/mnt/batch/.../tmp/dataset`, to indicate where you want the dataset to be mounted.