#Customer intent: As an experienced Python developer, I need to make my data available to my local or remote compute target to train my machine learning models.
---
# Train models with Azure Machine Learning datasets
In this article, you learn how to work with [Azure Machine Learning datasets](/python/api/azureml-core/azureml.core.dataset%28class%29) to train machine learning models. You can use datasets in your local or remote compute target without worrying about connection strings or data paths.
* For structured data, see [Consume datasets in machine learning training scripts](#consume-datasets-in-machine-learning-training-scripts).
* For unstructured data, see [Mount files to remote compute targets](#mount-files-to-remote-compute-targets).

Azure Machine Learning datasets provide seamless integration with Azure Machine Learning training functionality like [ScriptRunConfig](/python/api/azureml-core/azureml.core.scriptrunconfig), [HyperDrive](/python/api/azureml-train-core/azureml.train.hyperdrive), and [Azure Machine Learning pipelines](./how-to-create-machine-learning-pipelines.md).

If you aren't ready to make your data available for model training, but want to load your data to your notebook for data exploration, see how to [explore the data in your dataset](how-to-create-register-datasets.md).
## Prerequisites
To create and train with datasets, you need:
> [!Note]
> Some Dataset classes have dependencies on the [azureml-dataprep](https://pypi.org/project/azureml-dataprep/) package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, and Fedora.
## Consume datasets in machine learning training scripts
If you have structured data not yet registered as a dataset, create a TabularDataset and use it directly in your training script for your local or remote experiment.

In this example, you create an unregistered [TabularDataset](/python/api/azureml-core/azureml.data.tabulardataset) and specify it as a script argument in the [ScriptRunConfig](/python/api/azureml-core/azureml.core.script_run_config.scriptrunconfig) object for training. If you want to reuse this TabularDataset with other experiments in your workspace, see [how to register datasets to your workspace](how-to-create-register-datasets.md).
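
A minimal sketch of creating such an unregistered TabularDataset from a delimited file on the web might look like the following. The Titanic CSV URL is an assumption for illustration; substitute the path to your own data.

```python
from azureml.core import Dataset

# Assumed example data source; replace with the path to your own delimited file
web_path = 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'

# Create an unregistered TabularDataset directly from the file
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path)
```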
TabularDataset objects offer a way to load the data in your TabularDataset into a pandas or Spark DataFrame so that you can work with familiar data preparation and training libraries without having to leave your notebook.
### Access dataset in training script
The following code configures a script argument `--input-data` that you'll specify when you configure your training run (see next section). When the tabular dataset is passed in as the argument value, Azure Machine Learning resolves it to the dataset ID. You can then use that argument value to access the dataset in your training script (without having to hardcode the name or ID of the dataset in your script). It then uses the [`to_pandas_dataframe()`](/python/api/azureml-core/azureml.data.tabulardataset#to-pandas-dataframe-on-error--null---out-of-range-datetime--null--) method to load that dataset into a pandas dataframe for further data exploration and preparation before training.
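
A minimal sketch of what such a training script (*train_titanic.py*) might look like, assuming the `--input-data` argument described here:

```python
# train_titanic.py
import argparse

from azureml.core import Dataset, Run

parser = argparse.ArgumentParser()
parser.add_argument('--input-data', type=str, help='ID of the input tabular dataset')
args = parser.parse_args()

# Get the workspace from the run context instead of hardcoding it
run = Run.get_context()
ws = run.experiment.workspace

# The script argument resolves to the dataset ID at run time
titanic_ds = Dataset.get_by_id(ws, id=args.input_data)

# Load the TabularDataset into a pandas DataFrame for preparation and training
titanic_df = titanic_ds.to_pandas_dataframe()
print(titanic_df.head())
```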
> [!Note]
> If your original data source contains NaN, empty strings, or blank values, those values are replaced with *Null* when you use `to_pandas_dataframe()`.

The run configuration code, sketched after the following list, creates a ScriptRunConfig object, `src`, that specifies:
* A script directory for your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
* The training script, *train_titanic.py*.
* The input dataset for training, `titanic_ds`, as a script argument. Azure Machine Learning resolves it to the corresponding ID of the dataset when it's passed to your script.
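
A minimal sketch of that configuration, assuming an existing compute cluster named `cpu-cluster`, an environment named `my-training-env`, and a `./src` script folder (all illustrative names):

```python
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()

# Illustrative names; substitute your own compute target and environment
compute_target = ws.compute_targets['cpu-cluster']
env = Environment.get(ws, 'my-training-env')

# titanic_ds is the TabularDataset created earlier
src = ScriptRunConfig(source_directory='./src',       # uploaded to the cluster nodes
                      script='train_titanic.py',      # the training script
                      arguments=['--input-data', titanic_ds.as_named_input('titanic')],
                      compute_target=compute_target,
                      environment=env)

run = Experiment(ws, 'train-with-datasets').submit(src)
run.wait_for_completion(show_output=True)
```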
## Mount files to remote compute targets

If you have unstructured data, create a [FileDataset](/python/api/azureml-core/azureml.data.filedataset) and either mount or download your data files to make them available to your remote compute target for training. Learn about when to use [mount vs. download](#mount-vs-download) for your remote training experiments.

The following example:
* Creates an input FileDataset, `mnist_ds`, for your training data.
* Specifies where to write training results, and promotes those results as a FileDataset.
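
A minimal sketch of creating the input FileDataset might look like this; the default datastore and the `mnist-data/` folder are assumptions for illustration:

```python
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()

# Assumes the MNIST files were already uploaded to the workspace's default
# datastore under a 'mnist-data' folder; substitute your own datastore and path.
datastore = ws.get_default_datastore()
mnist_ds = Dataset.File.from_files(path=(datastore, 'mnist-data/'))
```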
You can specify where to write your training results with an [OutputFileDatasetConfig object](/python/api/azureml-core/azureml.data.output_dataset_config.outputfiledatasetconfig).

OutputFileDatasetConfig objects allow you to:
* Mount or upload the output of a run to cloud storage you specify.
* Save the output as a FileDataset to these supported storage types:
    * Azure Data Lake Storage generations 1 and 2
* Track the data lineage between training runs.

The following code specifies that training results should be saved as a FileDataset in the `outputdataset` folder in the default blob datastore, `def_blob_store`.
```python
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig

ws = Workspace.from_config()

# def_blob_store is the default blob datastore for the workspace
def_blob_store = ws.get_default_datastore()

# Save run output as a FileDataset in the 'outputdataset' folder of the default blob datastore
output = OutputFileDatasetConfig(destination=(def_blob_store, 'outputdataset'))
```
## Mount vs download
Mounting or downloading files of any format is supported for datasets created from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL.

When you **mount** a dataset, you attach the files referenced by the dataset to a directory (mount point) and make it available on the compute target. Mounting is supported for Linux-based computes, including Azure Machine Learning Compute, virtual machines, and HDInsight. If your data size exceeds the compute disk size, downloading isn't possible. For this scenario, we recommend mounting since only the data files used by your script are loaded at the time of processing.

When you **download** a dataset, all the files referenced by the dataset are downloaded to the compute target. Downloading is supported for all compute types. If your script processes all files referenced by the dataset, and your compute disk can fit your full dataset, downloading is recommended to avoid the overhead of streaming data from storage services. For multi-node downloads, see [how to avoid throttling](#troubleshooting).
> [!NOTE]
> The download path name should not be longer than 255 alpha-numeric characters for Windows OS. For Linux OS, the download path name should not be longer than 4,096 alpha-numeric characters. Also, for Linux OS the file name (which is the last segment of the download path `/path/to/file/{filename}`) should not be longer than 255 alpha-numeric characters.
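
As a brief illustration, the same FileDataset can be handed to a run either mounted or downloaded; `mnist_ds` here is the FileDataset from the earlier example:

```python
# Mount: files are streamed from storage on demand at the mount point
mounted_input = mnist_ds.as_named_input('mnist').as_mount()

# Download: every file referenced by the dataset is copied to the compute target first
downloaded_input = mnist_ds.as_named_input('mnist').as_download()

# Pass either object as a script argument in your ScriptRunConfig, for example:
# arguments=['--data-folder', mounted_input]
```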
Azure Blob storage has higher throughput speeds than an Azure file share, and will scale to large numbers of jobs started in parallel. For this reason, we recommend configuring your runs to use Blob storage for transferring source code files.
The following code example specifies in the run configuration which blob datastore to use for source code transfers.
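
A minimal sketch, assuming `src` is the ScriptRunConfig from earlier and that you want the workspace's default blob datastore, `workspaceblobstore`:

```python
# Transfer source code files through the workspace's default blob datastore
src.run_config.source_directory_data_store = "workspaceblobstore"
```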
**Dataset initialization failed: Waiting for mount point to be ready has timed out**:
* If you don't have any outbound [network security group](/azure/virtual-network/network-security-groups-overview) rules and are using `azureml-sdk>=1.12.0`, update `azureml-dataset-runtime` and its dependencies to be the latest for the specific minor version, or if you're using it in a run, recreate your environment so it can have the latest patch with the fix.
* If you're using `azureml-sdk<1.12.0`, upgrade to the latest version.
* If you have outbound NSG rules, make sure there's an outbound rule that allows all traffic for the service tag `AzureResourceMonitor`.

**Dataset initialization failed: StreamAccessException was caused by ThrottlingException**

For multi-node file downloads, all nodes might attempt to download all files in the file dataset from the Azure Storage service, which results in a throttling error. To avoid throttling, initially set the environment variable `AZUREML_DOWNLOAD_CONCURRENCY` to a value of eight times the number of CPU cores divided by the number of nodes. Setting up a value for this environment variable might require some experimentation, so the earlier guidance is a starting point.
The following example assumes 32 cores and 4 nodes.
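
A sketch of that setting, assuming an Environment object named `myenv` that you attach to your run configuration (8 * 32 cores / 4 nodes = 64):

```python
from azureml.core import Environment

myenv = Environment(name="myenv")  # illustrative environment name

# 8 * 32 CPU cores / 4 nodes = 64 concurrent downloads
myenv.environment_variables = {"AZUREML_DOWNLOAD_CONCURRENCY": "64"}
```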
**Unable to upload project files to working directory in AzureFile because the storage is overloaded**:
* If you use the file share for other workloads, such as data transfer, we recommend using blobs so that the file share remains free for submitting runs.
* You can also split the workload between two different workspaces.

**ConfigException: Could not create a connection to the AzureFileService due to missing credentials. Either an Account Key or SAS token needs to be linked the default workspace blob store.**

To ensure your storage access credentials are linked to the workspace and the associated file datastore, complete the following steps:
1. Navigate to your workspace in the [Azure portal](https://portal.azure.com).
1. Select the storage link on the workspace **Overview** page.
1. On the storage page, select **Access keys** on the left side menu.
1. Copy the key.
1. Navigate to the [Azure Machine Learning studio](https://ml.azure.com) for your workspace.
1. In the studio, select the file datastore for which you want to provide authentication credentials.
1. Select **Update authentication**.
1. Paste the key from the previous steps.
1. Select **Save**.
### Passing data as input
**TypeError: FileNotFound: No such file or directory**: This error occurs if the file path you provide isn't where the file is located. You need to make sure the way you refer to the file is consistent with where you mounted your dataset on your compute target. To ensure a deterministic state, we recommend using the abstract path when mounting a dataset to a compute target. For example, in the following code we mount the dataset under the root of the filesystem of the compute target, `/tmp`.
```python
# Note the leading / in '/tmp/dataset'
script_params = {
    # Mount the dataset at /tmp/dataset on the remote compute and pass the mounted
    # path to the training script ('--data-folder' is an assumed argument name)
    '--data-folder': mnist_ds.as_named_input('mnist').as_mount('/tmp/dataset')
}
```
If you don't include the leading forward slash, '/', you must prefix the working directory on the compute target, for example `/mnt/batch/.../tmp/dataset`, to indicate where you want the dataset to be mounted.