
Commit 24c6a2c

Merge pull request #3411 from fbsolo-ms1/freshness-updates
Freshness updates for V1 articles . . .
2 parents: 7ac3722 + 2f5430d

7 files changed (+77, -76 lines)

articles/machine-learning/v1/how-to-create-register-datasets.md

Lines changed: 9 additions & 9 deletions
@@ -10,7 +10,7 @@ ms.custom: UpdateFrequency5, data4ml, devx-track-arm-template
ms.author: yogipandey
author: ynpandey
ms.reviewer: franksolomon
-ms.date: 02/28/2024
+ms.date: 03/06/2025
#Customer intent: As an experienced data scientist, I need to package my data into a consumable and reusable object to train my machine learning models.
---

@@ -32,7 +32,7 @@ With Azure Machine Learning datasets, you can:

> [!IMPORTANT]
> Items in this article marked as "preview" are currently in public preview.
-> The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities.
+> The preview version is provided without a service level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.
> For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

## Prerequisites
@@ -55,7 +55,7 @@ To create and work with datasets, you need:
> Some dataset classes have dependencies on the [azureml-dataprep](https://pypi.org/project/azureml-dataprep/) package, which is only compatible with 64-bit Python. If you develop on __Linux__, these classes rely on .NET Core 2.1, and only specific distributions support them. For more information about the supported distros, read the .NET Core 2.1 column in the [Install .NET on Linux](/dotnet/core/install/linux) article.

> [!IMPORTANT]
-> While the package may work on older versions of Linux distros, we do not recommend use of a distro that is out of mainstream support. Distros that are out of mainstream support may have security vulnerabilities, because they do not receive the latest updates. We recommend using the latest supported version of your distro that is compatible with .
+> While the package might work on older versions of Linux distros, we don't recommend use of a distro that is out of mainstream support. Distros that are out of mainstream support might have security vulnerabilities, because they don't receive the latest updates. We recommend using the latest supported version of your distro that is compatible with .

## Compute size guidance

@@ -71,7 +71,7 @@ There are two dataset types, based on how users consume datasets in training: Fi

### FileDataset

-A [FileDataset](/python/api/azureml-core/azureml.data.file_dataset.filedataset) references single or multiple files in your datastores or public URLs. If your data is already cleaned, and ready to use in training experiments, you can [download or mount](how-to-train-with-datasets.md#mount-vs-download) the files to your compute as a FileDataset object.
+A [FileDataset](/python/api/azureml-core/azureml.data.file_dataset.filedataset) references single or multiple files in your datastores or public URLs. If you have cleaned data that is ready for use in training experiments, you can [download or mount](how-to-train-with-datasets.md#mount-vs-download) the files to your compute as a FileDataset object.

We recommend FileDatasets for your machine learning workflows, because the source files can be in any format. This enables a wider range of machine learning scenarios, including deep learning.
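As context for the FileDataset description in this hunk, a minimal SDK v1 sketch of creating, registering, and mounting one; the workspace config, datastore name, and path glob are illustrative assumptions.

```python
from azureml.core import Workspace, Datastore, Dataset

# Assumed workspace and datastore names, for illustration only
ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")

# A FileDataset only references the files; nothing is copied at creation time
file_ds = Dataset.File.from_files(path=[(datastore, "animal-images/**/*.jpg")])

# Register the reference so it can be versioned and reused across experiments
file_ds = file_ds.register(workspace=ws, name="animal-images", create_new_version=True)

# At training time, mount or download the referenced files onto the compute
mount_context = file_ds.mount()   # or: file_ds.download(target_path=".", overwrite=True)
```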

@@ -87,8 +87,8 @@ Create a TabularDataset with [the Python SDK](#create-a-tabulardataset) or [Azur

>[!NOTE]
> [Automated ML](../concept-automated-ml.md) workflows generated via the Azure Machine Learning studio currently only support TabularDatasets.
->
->Also, for TabularDatasets generated from [SQL query results](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#azureml-data-dataset-factory-tabulardatasetfactory-from-sql-query), T-SQL (e.g. 'WITH' sub query) or duplicate column names are not supported. Complex T-SQL queries can cause performance issues. Duplicate column names in a dataset can cause ambiguity issues.
+>
+>Also, for TabularDatasets generated from [SQL query results](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#azureml-data-dataset-factory-tabulardatasetfactory-from-sql-query), T-SQL (e.g. 'WITH' sub query) or duplicate column names aren't supported. Complex T-SQL queries can cause performance issues. Duplicate column names in a dataset can cause ambiguity issues.

## Access datasets in a virtual network
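To illustrate the SQL-query constraint this hunk touches, a hedged SDK v1 sketch that keeps the T-SQL flat (no 'WITH' subquery) and aliases columns so no names collide; the SQL datastore and table names are assumptions.

```python
from azureml.core import Workspace, Datastore, Dataset
from azureml.data.datapath import DataPath

ws = Workspace.from_config()
sql_datastore = Datastore.get(ws, "sales_sql_datastore")  # assumed Azure SQL datastore name

# Flat query with explicit aliases: avoids 'WITH' subqueries and duplicate column names
query = DataPath(sql_datastore, "SELECT id, amount AS sale_amount, region FROM dbo.Sales")
sales_ds = Dataset.Tabular.from_sql_query(query, query_timeout=60)

df = sales_ds.to_pandas_dataframe()
```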

@@ -108,7 +108,7 @@ To create datasets from a datastore with the Python SDK:
1. Create the dataset by referencing paths in the datastore. You can create a dataset from multiple paths in multiple datastores. There's no hard limit on the number of files or data size from which you can create a dataset.

> [!NOTE]
-> For each data path, a few requests will be sent to the storage service to check whether it points to a file or a folder. This overhead may lead to degraded performance or failure. A dataset referencing one folder with 1000 files inside is considered referencing one data path. For optimal performance, we recommend creating datasets that reference less than 100 paths in datastores.
+> For each data path, a few requests are sent to the storage service to check whether it points to a file or a folder. This overhead might lead to degraded performance or failure. A dataset that references one folder with 1,000 files inside is considered referencing one data path. For optimal performance, we recommend creating datasets that reference fewer than 100 paths in datastores.

### Create a FileDataset
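A small sketch of the path guidance in this note: a folder reference counts as one data path regardless of how many files it contains, while enumerating files individually multiplies the per-path checks. The datastore name and paths are assumptions.

```python
from azureml.core import Workspace, Datastore, Dataset

datastore = Datastore.get(Workspace.from_config(), "workspaceblobstore")  # assumed name

# One data path: the folder is a single reference, even if it holds 1,000 files
folder_ds = Dataset.File.from_files(path=[(datastore, "logs/2024/")])

# 1,000 data paths: each entry is checked against storage, which can degrade performance
per_file_paths = [(datastore, f"logs/2024/day-{i}.csv") for i in range(1000)]
slow_ds = Dataset.File.from_files(path=per_file_paths)
```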

@@ -206,7 +206,7 @@ After you create and [register](#register-datasets) your dataset, you can load t

Filtering capabilities depends on the type of dataset you have.
> [!IMPORTANT]
-> Filtering datasets with the [`filter()`](/python/api/azureml-core/azureml.data.tabulardataset#azureml-data-tabulardataset-filter) preview method is an [experimental](/python/api/overview/azure/ml/#stable-vs-experimental) preview feature, and may change at any time.
+> Filtering datasets with the [`filter()`](/python/api/azureml-core/azureml.data.tabulardataset#azureml-data-tabulardataset-filter) preview method is an [experimental](/python/api/overview/azure/ml/#stable-vs-experimental) preview feature, and could change at any time.
>
For **TabularDatasets**, you can keep or remove columns with the [keep_columns()](/python/api/azureml-core/azureml.data.tabulardataset#azureml-data-tabulardataset-keep-columns) and [drop_columns()](/python/api/azureml-core/azureml.data.tabulardataset#azureml-data-tabulardataset-drop-columns) methods.
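For reference, a brief SDK v1 sketch of the column and row filtering calls named in this hunk; the registered dataset name and column names are assumptions, and `filter()` remains an experimental preview method.

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
titanic_ds = Dataset.get_by_name(ws, name="titanic-tabular")  # assumed registered name

# keep_columns()/drop_columns() return new dataset definitions; no data is copied yet
trimmed = titanic_ds.keep_columns(["Age", "Fare", "Survived"]).drop_columns(["Fare"])

# filter() (experimental preview) builds a row filter from a column expression
adults = trimmed.filter(trimmed["Age"] > 21)
df = adults.to_pandas_dataframe()
```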

@@ -338,7 +338,7 @@ dataset = Dataset.Tabular.register_pandas_dataframe(pandas_df, datastore, "datas

```
> [!TIP]
-> Create and register a TabularDataset from an in memory spark dataframe or a dask dataframe with the public preview methods, [`register_spark_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#azureml-data-dataset-factory-tabulardatasetfactory-register-spark-dataframe) and [`register_dask_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#azureml-data-dataset-factory-tabulardatasetfactory-register-dask-dataframe). These methods are [experimental](/python/api/overview/azure/ml/#stable-vs-experimental) preview features, and may change at any time.
+> Create and register a TabularDataset from an in memory spark dataframe or a dask dataframe with the public preview methods, [`register_spark_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#azureml-data-dataset-factory-tabulardatasetfactory-register-spark-dataframe) and [`register_dask_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#azureml-data-dataset-factory-tabulardatasetfactory-register-dask-dataframe). These methods are [experimental](/python/api/overview/azure/ml/#stable-vs-experimental) preview features, and might change at any time.
>
> These methods upload data to your underlying storage, and as a result incur storage costs.
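As a short illustration of the registration methods around this hunk, a hedged SDK v1 sketch; the dataframe contents, datastore, and dataset names are assumptions, and the Spark/Dask variants are preview methods whose behavior may change.

```python
import pandas as pd
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")  # assumed datastore name

# Stable path: upload and register a pandas dataframe as a TabularDataset (incurs storage)
pandas_df = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})
dataset = Dataset.Tabular.register_pandas_dataframe(pandas_df, datastore, "demo_pandas_ds")

# Preview path: the Spark and Dask variants follow the same shape (experimental)
# spark_ds = Dataset.Tabular.register_spark_dataframe(spark_df, datastore, "demo_spark_ds")
# dask_ds = Dataset.Tabular.register_dask_dataframe(dask_df, datastore, "demo_dask_ds")
```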

articles/machine-learning/v1/how-to-data-prep-synapse-spark-pool.md

Lines changed: 4 additions & 4 deletions
@@ -9,7 +9,7 @@ ms.topic: how-to
author: ynpandey
ms.author: franksolomon
ms.reviewer: franksolomon
-ms.date: 02/22/2024
+ms.date: 03/06/2025
ms.custom: UpdateFrequency5, data4ml, synapse-azureml, sdkv1
#Customer intent: As a data scientist, I want to prepare my data at scale, and to train my machine learning models from a single notebook using Azure Machine Learning.
---
@@ -71,7 +71,7 @@ After the session starts, you can check the session's metadata:
You can specify an [Azure Machine Learning environment](../concept-environments.md) to use during your Apache Spark session. Only Conda dependencies specified in the environment will take effect. Docker images aren't supported.

>[!WARNING]
-> Python dependencies specified in environment Conda dependencies are not supported in Apache Spark pools. Currently, only fixed Python versions are supported
+> Python dependencies specified in environment Conda dependencies aren't supported in Apache Spark pools. Currently, only fixed Python versions are supported
> Include `sys.version_info` in your script to check your Python version

This code creates the`myenv` environment variable, to install `azureml-core` version 1.20.0 and `numpy` version 1.17.0 before the session starts. You can then include this environment in your Apache Spark session `start` statement.
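The context line above refers to an environment definition; a minimal SDK v1 sketch of that pattern follows, with the pinned package versions taken from the surrounding description and everything else assumed.

```python
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Only Conda dependencies take effect in the Apache Spark pool; Docker settings are ignored
myenv = Environment(name="myenv")
myenv.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-core==1.20.0", "numpy==1.17.0"]
)
# The environment can then be referenced when the Apache Spark session is started.
```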
@@ -214,7 +214,7 @@ After you complete the data preparation, and you save your prepared data to stor
%synapse stop
```

-## Create a dataset, to represent prepared data
+## Create a dataset to represent prepared data

When you're ready to consume your prepared data for model training, connect to your storage with an [Azure Machine Learning datastore](how-to-access-data.md), and specify the file or file you want to use with an [Azure Machine Learning dataset](how-to-create-register-datasets.md).
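A condensed SDK v1 sketch of the step this heading introduces, ending at the `as_mount()` call that appears in the next hunk's context; the datastore name and output path are assumptions.

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")  # assumed datastore name

# Point a FileDataset at the prepared output the Spark session wrote to storage
train_ds = Dataset.File.from_files(path=[(datastore, "prepared/train/*.parquet")])

# Consume it as a run input, mirroring the article's `input1 = train_ds.as_mount()` line
input1 = train_ds.as_mount()
```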

@@ -238,7 +238,7 @@ input1 = train_ds.as_mount()

## Use a `ScriptRunConfig` to submit an experiment run to a Synapse Spark pool

-If you're ready to automate and productionize your data wrangling tasks, you can submit an experiment run to [an attached Synapse Spark pool](how-to-link-synapse-ml-workspaces.md#attach-a-pool-with-the-python-sdk) with the [ScriptRunConfig](/python/api/azureml-core/azureml.core.scriptrunconfig) object. In a similar way, if you have an Azure Machine Learning pipeline, you can use the [SynapseSparkStep to specify your Synapse Spark pool as the compute target](how-to-use-synapsesparkstep.md) for the data preparation step in your pipeline. Availability of your data to the Synapse Spark pool depends on your dataset type.
+If you're ready to automate and productionize your data wrangling tasks, you can submit an experiment run to [an attached Synapse Spark pool](how-to-link-synapse-ml-workspaces.md#attach-a-pool-with-the-python-sdk) with the [ScriptRunConfig](/python/api/azureml-core/azureml.core.scriptrunconfig) object. In a similar way, if you have an Azure Machine Learning pipeline, you can use the [SynapseSparkStep to specify your Synapse Spark pool as the compute target](how-to-use-synapsesparkstep.md) for your pipeline data preparation step. Availability of your data to the Synapse Spark pool depends on your dataset type.

* For a FileDataset, you can use the [`as_hdfs()`](/python/api/azureml-core/azureml.data.filedataset#as-hdfs--) method. When the run is submitted, the dataset is made available to the Synapse Spark pool as a Hadoop distributed file system (HFDS)
* For a [TabularDataset](how-to-create-register-datasets.md#tabulardataset), you can use the [`as_named_input()`](/python/api/azureml-core/azureml.data.abstract_dataset.abstractdataset#as-named-input-name-) method
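To tie this hunk together, a hedged SDK v1 sketch of submitting a run to an attached Synapse Spark pool; the pool name, script name, and PySpark run configuration reflect the general pattern under stated assumptions, not this specific commit.

```python
from azureml.core import Workspace, Experiment, ScriptRunConfig
from azureml.core.runconfig import RunConfiguration

ws = Workspace.from_config()

# Target the attached Synapse Spark pool (assumed name) with a PySpark run configuration
run_config = RunConfiguration(framework="pyspark")
run_config.target = "synapse-spark-pool"

script_config = ScriptRunConfig(
    source_directory=".",
    script="dataprep.py",
    arguments=[input1],   # the dataset input created earlier, e.g. a FileDataset via as_hdfs()
    run_config=run_config,
)

run = Experiment(ws, "synapse-dataprep").submit(script_config)
```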

articles/machine-learning/v1/how-to-designer-transform-data.md

Lines changed: 10 additions & 10 deletions
@@ -8,7 +8,7 @@ ms.subservice: mldata
ms.reviewer: franksolomon
ms.author: keli19
author: likebupt
-ms.date: 03/27/2024
+ms.date: 03/07/2025
ms.topic: how-to
ms.custom: UpdateFrequency5, designer
---
@@ -17,22 +17,22 @@ ms.custom: UpdateFrequency5, designer

In this article, you learn how to transform and save datasets in the Azure Machine Learning designer, to prepare your own data for machine learning.

-You'll use the sample [Adult Census Income Binary Classification](samples-designer.md) dataset to prepare two datasets: one dataset that includes adult census information from only the United States, and another dataset that includes census information from non-US adults.
+You'll use the sample [Adult Census Income Binary Classification](samples-designer.md) dataset to prepare two datasets. One dataset includes adult census information from only the United States, and another dataset includes census information from non-US adults.

-In this article, you'll learn how to:
+In this article, you learn how to:

1. Transform a dataset to prepare it for training.
1. Export the resulting datasets to a datastore.
1. View the results.

-This how-to is a prerequisite for the [how to retrain designer models](how-to-retrain-designer.md) article. In that article, you'll learn how to use the transformed datasets to train multiple models, with pipeline parameters.
+This how-to is a prerequisite for the [how to retrain designer models](how-to-retrain-designer.md) article. In that article, you learn how to use the transformed datasets to train multiple models with pipeline parameters.

> [!IMPORTANT]
-> If you do not observe graphical elements mentioned in this document, such as buttons in studio or designer, you may not have the correct level of permissions to the workspace. Please contact your Azure subscription administrator to verify that you have been granted the correct level of access. For more information, visit [Manage users and roles](../how-to-assign-roles.md).
+> If you don't observe the graphical elements mentioned in this document - for example, buttons in studio or designer, you might not have the correct level of permissions to the workspace. Contact your Azure subscription administrator to verify that you have the correct level of access. For more information, visit [Manage users and roles](../how-to-assign-roles.md).

## Transform a dataset

-In this section, you'll learn how to import the sample dataset, and split the data into US and non-US datasets. Visit [how to import data](how-to-designer-import-data.md) for more information about how to import your own data into the designer.
+In this section, you learn how to import the sample dataset, and split the data into US and non-US datasets. Visit [how to import data](how-to-designer-import-data.md) for more information about how to import your own data into the designer.

### Import data

@@ -52,7 +52,7 @@ Use these steps to import the sample dataset:

### Split the data

-In this section, you'll use the [Split Data component](../algorithm-module-reference/split-data.md) to identify and split rows that contain "United-States" in the "native-country" column
+In this section, you use the [Split Data component](../algorithm-module-reference/split-data.md) to identify and split rows that contain "United-States" in the "native-country" column

1. To the left of the canvas, in the component tab, expand the **Data Transformation** section, and find the **Split Data** component

@@ -91,7 +91,7 @@ Now that you set up your pipeline to split the data, you must specify where to p
For the **Split Data** component, the output port order is important. The first output port contains the rows where the regular expression is true. In this case, the first port contains rows for US-based income, and the second port contains rows for non-US based income

1. In the component details pane to the right of the canvas, set the following options:
-
+
**Datastore type**: Azure Blob Storage

**Datastore**: Select an existing datastore, or select "New datastore" to create a new one
@@ -101,9 +101,9 @@ Now that you set up your pipeline to split the data, you must specify where to p
**File format**: csv

> [!NOTE]
-> This article assumes that you have access to a datastore registered to the current Azure Machine Learning workspace. Visit [Connect to Azure storage services](how-to-connect-data-ui.md#create-datastores) for datastore setup instructions
+> This article assumes that you have access to a datastore registered to the current Azure Machine Learning workspace. Visit [Connect to Azure storage services](how-to-connect-data-ui.md#create-datastores) for datastore setup instructions.

-You can create a datastore if you don't have one now. For example purposes, this article saves the datasets to the default blob storage account associated with the workspace. It saves the datasets into the `azureml` container, in a new folder named `data`
+You can create a datastore if you don't have one. For example purposes, this article saves the datasets to the default blob storage account associated with the workspace. It saves the datasets into the `azureml` container, in a new folder named `data`.

1. Select the **Export Data** component connected to the *right*-most port of the **Split Data** component, to open the Export Data configuration pane
