
Commit 80a9780

Remove references to CentOS to cover CentOS EOL status . . .
1 parent 5b6547d commit 80a9780

File tree

1 file changed (+28, -32 lines)


articles/machine-learning/v1/how-to-train-with-datasets.md

Lines changed: 28 additions & 32 deletions
@@ -14,22 +14,19 @@ ms.custom: UpdateFrequency5, data4ml, sdkv1
 #Customer intent: As an experienced Python developer, I need to make my data available to my local or remote compute target to train my machine learning models.
 ---

-# Train models with Azure Machine Learning datasets
-
-> [!CAUTION]
-> This article references CentOS, a Linux distribution that is End Of Life (EOL) status. Please consider your use and planning accordingly. For more information, see the [CentOS End Of Life guidance](/azure/virtual-machines/workloads/centos/centos-end-of-life).
+# Train models with Azure Machine Learning datasets

 [!INCLUDE [sdk v1](../includes/machine-learning-sdk-v1.md)]

-In this article, you learn how to work with [Azure Machine Learning datasets](/python/api/azureml-core/azureml.core.dataset%28class%29) to train machine learning models. You can use datasets in your local or remote compute target without worrying about connection strings or data paths.
+In this article, you learn how to work with [Azure Machine Learning datasets](/python/api/azureml-core/azureml.core.dataset%28class%29) to train machine learning models. You can use datasets in your local or remote compute target without worrying about connection strings or data paths.

 * For structured data, see [Consume datasets in machine learning training scripts](#consume-datasets-in-machine-learning-training-scripts).

 * For unstructured data, see [Mount files to remote compute targets](#mount-files-to-remote-compute-targets).

 Azure Machine Learning datasets provide a seamless integration with Azure Machine Learning training functionality like [ScriptRunConfig](/python/api/azureml-core/azureml.core.scriptrunconfig), [HyperDrive](/python/api/azureml-train-core/azureml.train.hyperdrive), and [Azure Machine Learning pipelines](./how-to-create-machine-learning-pipelines.md).

-If you aren't ready to make your data available for model training, but want to load your data to your notebook for data exploration, see how to [explore the data in your dataset](how-to-create-register-datasets.md).
+If you aren't ready to make your data available for model training, but want to load your data to your notebook for data exploration, see how to [explore the data in your dataset](how-to-create-register-datasets.md).

 ## Prerequisites

@@ -43,11 +40,11 @@ To create and train with datasets, you need:


 > [!Note]
-> Some Dataset classes have dependencies on the [azureml-dataprep](https://pypi.org/project/azureml-dataprep/) package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, Fedora, and CentOS.
+> Some Dataset classes have dependencies on the [azureml-dataprep](https://pypi.org/project/azureml-dataprep/) package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, and Fedora.

 ## Consume datasets in machine learning training scripts

-If you have structured data not yet registered as a dataset, create a TabularDataset and use it directly in your training script for your local or remote experiment.
+If you have structured data not yet registered as a dataset, create a TabularDataset and use it directly in your training script for your local or remote experiment.

 In this example, you create an unregistered [TabularDataset](/python/api/azureml-core/azureml.data.tabulardataset) and specify it as a script argument in the [ScriptRunConfig](/python/api/azureml-core/azureml.core.script_run_config.scriptrunconfig) object for training. If you want to reuse this TabularDataset with other experiments in your workspace, see [how to register datasets to your workspace](how-to-create-register-datasets.md).

@@ -62,11 +59,11 @@ web_path ='https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
 titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path)
 ```

-TabularDataset objects provide the ability to load the data in your TabularDataset into a pandas or Spark DataFrame so that you can work with familiar data preparation and training libraries without having to leave your notebook.
+TabularDataset objects offer a way to load the data in your TabularDataset into a pandas or Spark DataFrame so that you can work with familiar data preparation and training libraries without having to leave your notebook.

 ### Access dataset in training script

-The following code configures a script argument `--input-data` that you'll specify when you configure your training run (see next section). When the tabular dataset is passed in as the argument value, Azure Machine Learning will resolve that to ID of the dataset, which you can then use to access the dataset in your training script (without having to hardcode the name or ID of the dataset in your script). It then uses the [`to_pandas_dataframe()`](/python/api/azureml-core/azureml.data.tabulardataset#to-pandas-dataframe-on-error--null---out-of-range-datetime--null--) method to load that dataset into a pandas dataframe for further data exploration and preparation prior to training.
+The following code configures a script argument `--input-data` that you'll specify when you configure your training run (see next section). When the tabular dataset is passed in as the argument value, Azure Machine Learning resolves it to the dataset ID. You can then use that argument value to access the dataset in your training script (without having to hardcode the name or ID of the dataset in your script). It then uses the [`to_pandas_dataframe()`](/python/api/azureml-core/azureml.data.tabulardataset#to-pandas-dataframe-on-error--null---out-of-range-datetime--null--) method to load that dataset into a pandas dataframe for further data exploration and preparation before training.

 > [!Note]
 > If your original data source contains NaN, empty strings or blank values, when you use `to_pandas_dataframe()`, then those values are replaced as a *Null* value.
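
For orientation, here is a minimal sketch of the training-script pattern the hunk above describes: resolve the dataset ID passed through `--input-data`, then load the data with `to_pandas_dataframe()`. The argument parsing and workspace lookup are assumptions based on the surrounding article text, not content from this commit.

```python
# train_titanic.py (illustrative sketch, not part of the commit)
import argparse

from azureml.core import Dataset, Run

parser = argparse.ArgumentParser()
# ScriptRunConfig passes the dataset as an ID, not a path or connection string
parser.add_argument("--input-data", type=str, dest="input_data", help="dataset ID")
args = parser.parse_args()

run = Run.get_context()                              # handle to the submitted run
ws = run.experiment.workspace                        # workspace that owns the dataset
dataset = Dataset.get_by_id(ws, id=args.input_data)  # resolve the ID to a TabularDataset

df = dataset.to_pandas_dataframe()                   # load into pandas for exploration and prep
print(df.head())
```
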
@@ -101,7 +98,7 @@ This code creates a ScriptRunConfig object, `src`, that specifies:

 * A script directory for your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
 * The training script, *train_titanic.py*.
-* The input dataset for training, `titanic_ds`, as a script argument. Azure Machine Learning will resolve this to corresponding ID of the dataset when it's passed to your script.
+* The input dataset for training, `titanic_ds`, as a script argument. Azure Machine Learning resolves it to the corresponding ID of the dataset when it's passed to your script.
 * The compute target for the run.
 * The environment for the run.
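
For context, a hedged sketch of the ScriptRunConfig those bullets describe, following the SDK v1 pattern used in the article. `titanic_ds` and `myenv` come from earlier article code; the folder and compute names are placeholders, not values from this commit.

```python
# Submit the training script with the TabularDataset as a named input (illustrative sketch)
from azureml.core import Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()            # assumes a local config.json for the workspace
experiment = Experiment(ws, "train-titanic")

src = ScriptRunConfig(
    source_directory="script_folder",   # placeholder: directory containing train_titanic.py
    script="train_titanic.py",
    # as_named_input() lets Azure Machine Learning resolve the dataset to its ID for the script
    arguments=["--input-data", titanic_ds.as_named_input("titanic")],
    compute_target="cpu-cluster",       # placeholder compute target name
    environment=myenv,                  # placeholder: an azureml.core.Environment
)

run = experiment.submit(src)
run.wait_for_completion(show_output=True)
```
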

@@ -122,9 +119,9 @@ run.wait_for_completion(show_output=True)

 ## Mount files to remote compute targets

-If you have unstructured data, create a [FileDataset](/python/api/azureml-core/azureml.data.filedataset) and either mount or download your data files to make them available to your remote compute target for training. Learn about when to use [mount vs. download](#mount-vs-download) for your remote training experiments.
+If you have unstructured data, create a [FileDataset](/python/api/azureml-core/azureml.data.filedataset) and either mount or download your data files to make them available to your remote compute target for training. Learn about when to use [mount vs. download](#mount-vs-download) for your remote training experiments.

-The following example,
+The following example:

 * Creates an input FileDataset, `mnist_ds`, for your training data.
 * Specifies where to write training results, and to promote those results as a FileDataset.
@@ -157,9 +154,9 @@ mnist_ds = Dataset.File.from_files(path = web_paths)
 ```
 ### Where to write training output

-You can specify where to write your training results with an [OutputFileDatasetConfig object](/python/api/azureml-core/azureml.data.output_dataset_config.outputfiledatasetconfig).
+You can specify where to write your training results with an [OutputFileDatasetConfig object](/python/api/azureml-core/azureml.data.output_dataset_config.outputfiledatasetconfig).

-OutputFileDatasetConfig objects allow you to:
+OutputFileDatasetConfig objects allow you to:

 * Mount or upload the output of a run to cloud storage you specify.
 * Save the output as a FileDataset to these supported storage types:
@@ -168,7 +165,7 @@ OutputFileDatasetConfig objects allow you to:
     * Azure Data Lake Storage generations 1 and 2
 * Track the data lineage between training runs.

-The following code specifies that training results should be saved as a FileDataset in the `outputdataset` folder in the default blob datastore, `def_blob_store`.
+The following code specifies that training results should be saved as a FileDataset in the `outputdataset` folder in the default blob datastore, `def_blob_store`.

 ```python
 from azureml.core import Workspace
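# Editor's sketch (not part of this commit): one way to configure the output location
# described above, using the default blob datastore and the 'outputdataset' folder named
# in the hunk. The registered output name below is a placeholder.
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig

ws = Workspace.from_config()                  # assumes a local config.json
def_blob_store = ws.get_default_datastore()   # the workspace's default blob datastore

output = OutputFileDatasetConfig(
    destination=(def_blob_store, 'outputdataset')    # folder named in the text above
).register_on_complete(name='prediction_output')     # promote the run output to a FileDataset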
@@ -231,11 +228,11 @@ with open(mounted_input_path, 'r') as f:

 ## Mount vs download

-Mounting or downloading files of any format are supported for datasets created from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL.
+Mounting or downloading files of any format is supported for datasets created from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL.

 When you **mount** a dataset, you attach the files referenced by the dataset to a directory (mount point) and make it available on the compute target. Mounting is supported for Linux-based computes, including Azure Machine Learning Compute, virtual machines, and HDInsight. If your data size exceeds the compute disk size, downloading isn't possible. For this scenario, we recommend mounting since only the data files used by your script are loaded at the time of processing.

-When you **download** a dataset, all the files referenced by the dataset will be downloaded to the compute target. Downloading is supported for all compute types. If your script processes all files referenced by the dataset, and your compute disk can fit your full dataset, downloading is recommended to avoid the overhead of streaming data from storage services. For multi-node downloads, see [how to avoid throttling](#troubleshooting).
+When you **download** a dataset, all the files referenced by the dataset are downloaded to the compute target. Downloading is supported for all compute types. If your script processes all files referenced by the dataset, and your compute disk can fit your full dataset, downloading is recommended to avoid the overhead of streaming data from storage services. For multi-node downloads, see [how to avoid throttling](#troubleshooting).

 > [!NOTE]
 > The download path name should not be longer than 255 alpha-numeric characters for Windows OS. For Linux OS, the download path name should not be longer than 4,096 alpha-numeric characters. Also, for Linux OS the file name (which is the last segment of the download path `/path/to/file/{filename}`) should not be longer than 255 alpha-numeric characters.
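
To make the comparison concrete, here is a hedged sketch of the two modes; the dataset name is a placeholder and the calls follow the SDK v1 pattern used elsewhere in the article.

```python
# Mount vs. download for a FileDataset (illustrative sketch)
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()                       # assumes a local config.json
mnist_ds = Dataset.get_by_name(ws, name="mnist")   # placeholder: a registered FileDataset

mounted = mnist_ds.as_named_input("mnist").as_mount()        # stream only the files the script reads
downloaded = mnist_ds.as_named_input("mnist").as_download()  # copy every referenced file to local disk

# Pass either object as a ScriptRunConfig argument, for example:
# arguments=["--data-folder", mounted]
```
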
@@ -279,11 +276,11 @@ df = titanic_ds.to_pandas_dataframe()

 ## Access source code during training

-Azure Blob storage has higher throughput speeds than an Azure file share and will scale to large numbers of jobs started in parallel. For this reason, we recommend configuring your runs to use Blob storage for transferring source code files.
+Azure Blob storage has higher throughput speeds than an Azure file share, and will scale to large numbers of jobs started in parallel. For this reason, we recommend configuring your runs to use Blob storage for transferring source code files.

 The following code example specifies in the run configuration which blob datastore to use for source code transfers.

-```python
+```python
 # workspaceblobstore is the default blob storage
 src.run_config.source_directory_data_store = "workspaceblobstore"
 ```
@@ -295,14 +292,14 @@ src.run_config.source_directory_data_store = "workspaceblobstore"

 ## Troubleshooting

-**Dataset initialization failed: Waiting for mount point to be ready has timed out**:
+**Dataset initialization failed: Waiting for mount point to be ready has timed out**:
 * If you don't have any outbound [network security group](/azure/virtual-network/network-security-groups-overview) rules and are using `azureml-sdk>=1.12.0`, update `azureml-dataset-runtime` and its dependencies to be the latest for the specific minor version, or if you're using it in a run, recreate your environment so it can have the latest patch with the fix.
 * If you're using `azureml-sdk<1.12.0`, upgrade to the latest version.
 * If you have outbound NSG rules, make sure there's an outbound rule that allows all traffic for the service tag `AzureResourceMonitor`.

 **Dataset initialization failed: StreamAccessException was caused by ThrottlingException**

-For multi-node file downloads, all nodes may attempt to download all files in the file dataset from the Azure Storage service, which results in a throttling error. To avoid throttling, initially set the environment variable `AZUREML_DOWNLOAD_CONCURRENCY` to a value of eight times the number of CPU cores divided by the number of nodes. Setting up a value for this environment variable may require some experimentation, so the aforementioned guidance is a starting point.
+For multi-node file downloads, all nodes might attempt to download all files in the file dataset from the Azure Storage service, which results in a throttling error. To avoid throttling, initially set the environment variable `AZUREML_DOWNLOAD_CONCURRENCY` to a value of eight times the number of CPU cores divided by the number of nodes. Setting a value for this environment variable might require some experimentation, so the earlier guidance is a starting point.

 The following example assumes 32 cores and 4 nodes.

@@ -316,27 +313,27 @@ myenv.environment_variables = {"AZUREML_DOWNLOAD_CONCURRENCY":64}

 **Unable to upload project files to working directory in AzureFile because the storage is overloaded**:

-* If you're using file share for other workloads, such as data transfer, the recommendation is to use blobs so that file share is free to be used for submitting runs.
+* If you use file share for other workloads, such as data transfer, the recommendation is to use blobs so that file share is free to be used for submitting runs.

-* Another option is to split the workload between two different workspaces.
+* You can also split the workload between two different workspaces.

 **ConfigException: Could not create a connection to the AzureFileService due to missing credentials. Either an Account Key or SAS token needs to be linked the default workspace blob store.**

 To ensure your storage access credentials are linked to the workspace and the associated file datastore, complete the following steps:

 1. Navigate to your workspace in the [Azure portal](https://portal.azure.com).
 1. Select the storage link on the workspace **Overview** page.
-1. On the storage page, select **Access keys** on the left side menu.
+1. On the storage page, select **Access keys** on the left side menu.
 1. Copy the key.
 1. Navigate to the [Azure Machine Learning studio](https://ml.azure.com) for your workspace.
-1. In the studio, select the file datastore for which you want to provide authentication credentials.
-1. Select **Update authentication** .
-1. Paste the key from the previous steps.
-1. Select **Save**.
+1. In the studio, select the file datastore for which you want to provide authentication credentials.
+1. Select **Update authentication**.
+1. Paste the key from the previous steps.
+1. Select **Save**.

 ### Passing data as input

-**TypeError: FileNotFound: No such file or directory**: This error occurs if the file path you provide isn't where the file is located. You need to make sure the way you refer to the file is consistent with where you mounted your dataset on your compute target. To ensure a deterministic state, we recommend using the abstract path when mounting a dataset to a compute target. For example, in the following code we mount the dataset under the root of the filesystem of the compute target, `/tmp`.
+**TypeError: FileNotFound: No such file or directory**: This error occurs if the file path you provide isn't where the file is located. You need to make sure the way you refer to the file is consistent with where you mounted your dataset on your compute target. To ensure a deterministic state, we recommend using the abstract path when mounting a dataset to a compute target. For example, in the following code we mount the dataset under the root of the filesystem of the compute target, `/tmp`.

 ```python
 # Note the leading / in '/tmp/dataset'
@@ -345,8 +342,7 @@ script_params = {
 }
 ```

-If you don't include the leading forward slash, '/', you'll need to prefix the working directory for example, `/mnt/batch/.../tmp/dataset` on the compute target to indicate where you want the dataset to be mounted.
-
+If you don't include the leading forward slash, '/', you must prefix the working directory, for example `/mnt/batch/.../tmp/dataset`, on the compute target to indicate where you want the dataset to be mounted.

 ## Next steps
