
Commit 5a295f2

committed
Merge branch 'main' of https://github.com/MicrosoftDocs/azure-docs-pr into ingress-legacy-2
2 parents f416ff2 + 3c216b9 commit 5a295f2

File tree

3 files changed: +34 -47 lines changed


articles/ai-services/containers/azure-kubernetes-recipe.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ author: aahill
 manager: nitinme
 ms.service: azure-ai-language
 ms.topic: conceptual
-ms.date: 01/10/2022
+ms.date: 02/26/2024
 ms.author: aahi
 ms.custom: devx-track-azurecli
 ms.devlang: azurecli

articles/ai-services/language-service/whats-new.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ author: aahill
 manager: nitinme
 ms.service: azure-ai-language
 ms.topic: whats-new
-ms.date: 01/31/2024
+ms.date: 02/26/2024
 ms.author: aahi
 ---

articles/machine-learning/v1/how-to-version-track-datasets.md

Lines changed: 32 additions & 45 deletions
@@ -6,8 +6,9 @@ services: machine-learning
 ms.service: machine-learning
 ms.subservice: mldata
 ms.author: samkemp
+ms.reviewer: franksolomon
 author: samuel100
-ms.date: 08/17/2022
+ms.date: 02/26/2024
 ms.topic: how-to
 ms.custom: UpdateFrequency5, data4ml, sdkv1
 #Customer intent: As a data scientist, I want to version and track datasets so I can use and share them across multiple machine learning experiments.
@@ -17,38 +18,34 @@ ms.custom: UpdateFrequency5, data4ml, sdkv1
 
 [!INCLUDE [sdk v1](../includes/machine-learning-sdk-v1.md)]
 
-In this article, you'll learn how to version and track Azure Machine Learning datasets for reproducibility. Dataset versioning is a way to bookmark the state of your data so that you can apply a specific version of the dataset for future experiments.
+In this article, you'll learn how to version and track Azure Machine Learning datasets for reproducibility. Dataset versioning bookmarks specific states of your data, so that you can apply a specific version of the dataset for future experiments.
 
-Typical versioning scenarios:
+You might want to version your Azure Machine Learning resources in these typical scenarios:
 
-* When new data is available for retraining
-* When you're applying different data preparation or feature engineering approaches
+* When new data becomes available for retraining
+* When you apply different data preparation or feature engineering approaches
 
 ## Prerequisites
 
-For this tutorial, you need:
+- The [Azure Machine Learning SDK for Python](/python/api/overview/azure/ml/install). This SDK includes the [azureml-datasets](/python/api/azureml-core/azureml.core.dataset) package
 
-- [Azure Machine Learning SDK for Python installed](/python/api/overview/azure/ml/install). This SDK includes the [azureml-datasets](/python/api/azureml-core/azureml.core.dataset) package.
-
-- An [Azure Machine Learning workspace](../concept-workspace.md). Retrieve an existing one by running the following code, or [create a new workspace](../quickstart-create-resources.md).
+- An [Azure Machine Learning workspace](../concept-workspace.md). [Create a new workspace](../quickstart-create-resources.md), or retrieve an existing workspace with this code sample:
 
 ```Python
 import azureml.core
 from azureml.core import Workspace
 
 ws = Workspace.from_config()
 ```
-- An [Azure Machine Learning dataset](how-to-create-register-datasets.md).
-
-<a name="register"></a>
+- An [Azure Machine Learning dataset](how-to-create-register-datasets.md)
 
 ## Register and retrieve dataset versions
 
-By registering a dataset, you can version, reuse, and share it across experiments and with colleagues. You can register multiple datasets under the same name and retrieve a specific version by name and version number.
+You can version, reuse, and share a registered dataset across experiments and with your colleagues. You can register multiple datasets under the same name, and retrieve a specific version by name and version number.
 
 ### Register a dataset version
 
-The following code registers a new version of the `titanic_ds` dataset by setting the `create_new_version` parameter to `True`. If there's no existing `titanic_ds` dataset registered with the workspace, the code creates a new dataset with the name `titanic_ds` and sets its version to 1.
+This code sample sets the `create_new_version` parameter to `True` to register a new version of the `titanic_ds` dataset. If the workspace has no existing `titanic_ds` dataset registered, the code creates a new dataset with the name `titanic_ds`, and sets its version to 1.
 
 ```Python
 titanic_ds = titanic_ds.register(workspace = workspace,
@@ -59,9 +56,9 @@ titanic_ds = titanic_ds.register(workspace = workspace,
 
 ### Retrieve a dataset by name
 
-By default, the [get_by_name()](/python/api/azureml-core/azureml.core.dataset.dataset#get-by-name-workspace--name--version--latest--) method on the `Dataset` class returns the latest version of the dataset registered with the workspace.
+By default, the `Dataset` class [get_by_name()](/python/api/azureml-core/azureml.core.dataset.dataset#azureml-core-dataset-dataset-get-by-name) method returns the latest version of the dataset registered with the workspace.
 
-The following code gets version 1 of the `titanic_ds` dataset.
+This code returns version 1 of the `titanic_ds` dataset.
 
 ```Python
 from azureml.core import Dataset
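[Editor's note between hunks: the register-and-retrieve semantics this hunk documents can be sketched with a plain-Python stand-in. The `MiniRegistry` class below is hypothetical, written only to illustrate the behavior the doc describes (`create_new_version=True` appends a version; `get_by_name()` without a version returns the latest); it is not the azureml SDK.]

```python
# Hypothetical stand-in for the dataset-registry semantics described in the
# doc: register() with create_new_version=True appends a new version, and
# get_by_name() without an explicit version returns the latest one.
class MiniRegistry:
    def __init__(self):
        self._versions = {}  # dataset name -> list of payloads (index i = version i+1)

    def register(self, name, payload, create_new_version=False):
        if name in self._versions and not create_new_version:
            raise ValueError(f"{name} is already registered; pass create_new_version=True")
        self._versions.setdefault(name, []).append(payload)
        return len(self._versions[name])  # the version number just created

    def get_by_name(self, name, version=None):
        versions = self._versions[name]
        if version is None:               # default: latest version
            return versions[-1]
        return versions[version - 1]      # versions are 1-based

registry = MiniRegistry()
v1 = registry.register("titanic_ds", "data/titanic/2022")  # first registration -> version 1
v2 = registry.register("titanic_ds", "data/titanic/2024",
                       create_new_version=True)            # new version -> version 2
```

Retrieving with `version=1` then returns the older payload, while a plain `get_by_name` call returns the newest, mirroring the default described in the rewritten text.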
@@ -71,18 +68,16 @@ titanic_ds = Dataset.get_by_name(workspace = workspace,
                                  version = 1)
 ```
 
-<a name="best-practice"></a>
-
 ## Versioning best practice
 
-When you create a dataset version, you're *not* creating an extra copy of data with the workspace. Because datasets are references to the data in your storage service, you have a single source of truth, managed by your storage service.
+When you create a dataset version, you *don't* create an extra copy of data with the workspace. Since datasets are references to the data in your storage service, you have a single source of truth, managed by your storage service.
 
 >[!IMPORTANT]
-> If the data referenced by your dataset is overwritten or deleted, calling a specific version of the dataset does *not* revert the change.
+> If the data referenced by your dataset is overwritten or deleted, a call to a specific version of the dataset does *not* revert the change.
 
-When you load data from a dataset, the current data content referenced by the dataset is always loaded. If you want to make sure that each dataset version is reproducible, we recommend that you not modify data content referenced by the dataset version. When new data comes in, save new data files into a separate data folder and then create a new dataset version to include data from that new folder.
+When you load data from a dataset, the current data content referenced by the dataset is always loaded. If you want to make sure that each dataset version is reproducible, we recommend that you avoid modification of data content referenced by the dataset version. When new data comes in, save new data files into a separate data folder, and then create a new dataset version to include data from that new folder.
 
-The following image and sample code show the recommended way to structure your data folders and to create dataset versions that reference those folders:
+This image and sample code show the recommended way to both structure your data folders and create dataset versions that reference those folders:
 
 ![Folder structure](./media/how-to-version-track-datasets/folder-image.png)
 
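[Editor's note between hunks: the folder-per-version best practice in this hunk can be demonstrated with plain `pathlib` code. The folder names (`week_27`, `week_28`) and file contents below are invented for illustration; they are not taken from the diff.]

```python
from pathlib import Path
import tempfile

# Sketch of the recommended layout: each dataset version references its own
# immutable folder, so loading an older version still finds unchanged files.
root = Path(tempfile.mkdtemp()) / "titanic"
versions = {}  # version number -> the folder that version references

def add_version(folder_name, filenames):
    folder = root / folder_name
    folder.mkdir(parents=True)
    for fname in filenames:
        (folder / fname).write_text("...")  # new data lands only in the new folder
    versions[len(versions) + 1] = folder
    return folder

add_version("week_27", ["part_1.csv"])                # dataset version 1
add_version("week_28", ["part_1.csv", "part_2.csv"])  # version 2; week_27 stays untouched
```

Because version 2 gets its own folder, nothing the doc warns about (overwriting or deleting referenced data) happens to version 1's files.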
@@ -110,13 +105,11 @@ dataset2.register(workspace = workspace,
 
 ```
 
-<a name="pipeline"></a>
-
 ## Version an ML pipeline output dataset
 
 You can use a dataset as the input and output of each [ML pipeline](../concept-ml-pipelines.md) step. When you rerun pipelines, the output of each pipeline step is registered as a new dataset version.
 
-ML pipelines populate the output of each step into a new folder every time the pipeline reruns. This behavior allows the versioned output datasets to be reproducible. Learn more about [datasets in pipelines](./how-to-create-machine-learning-pipelines.md#steps).
+Machine Learning pipelines populate the output of each step into a new folder every time the pipeline reruns. The versioned output datasets then become reproducible. For more information, visit [datasets in pipelines](./how-to-create-machine-learning-pipelines.md#steps).
 
 ```Python
 from azureml.core import Dataset
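[Editor's note between hunks: the "new folder per rerun" behavior this hunk describes is why versioned pipeline outputs stay reproducible. The sketch below imitates that behavior with invented paths and run IDs; it does not use the azureml pipeline API.]

```python
from pathlib import Path
import tempfile
import uuid

# Each rerun writes its output into a fresh directory, so a dataset version
# created from an earlier run still points at files no later run touches.
outputs_root = Path(tempfile.mkdtemp())

def run_step(data):
    run_dir = outputs_root / f"run_{uuid.uuid4().hex[:8]}"  # fresh folder per rerun
    run_dir.mkdir()
    (run_dir / "prepared.txt").write_text(data)
    return run_dir  # a versioned output dataset would reference this folder

first = run_step("v1 output")
second = run_step("v2 output")  # rerun: new folder, first run's output intact
```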
@@ -148,23 +141,19 @@ prep_step = PythonScriptStep(script_name="prepare.py",
                              source_directory=project_folder)
 ```
 
-<a name="track"></a>
-
 ## Track data in your experiments
 
-Azure Machine Learning tracks your data throughout your experiment as input and output datasets.
-
-The following are scenarios where your data is tracked as an **input dataset**.
+Azure Machine Learning tracks your data throughout your experiment as input and output datasets. In these scenarios, your data is tracked as an **input dataset**:
 
-* As a `DatasetConsumptionConfig` object through either the `inputs` or `arguments` parameter of your `ScriptRunConfig` object when submitting the experiment job.
+* As a `DatasetConsumptionConfig` object, through either the `inputs` or `arguments` parameter of your `ScriptRunConfig` object, when submitting the experiment job
 
-* When methods like, get_by_name() or get_by_id() are called in your script. For this scenario, the name assigned to the dataset when you registered it to the workspace is the name displayed.
+* When your script calls methods such as `get_by_name()` or `get_by_id()`. In this scenario, the name assigned to the dataset when you registered it to the workspace is the displayed name
 
-The following are scenarios where your data is tracked as an **output dataset**.
+In these scenarios, your data is tracked as an **output dataset**:
 
-* Pass an `OutputFileDatasetConfig` object through either the `outputs` or `arguments` parameter when submitting an experiment job. `OutputFileDatasetConfig` objects can also be used to persist data between pipeline steps. See [Move data between ML pipeline steps.](how-to-move-data-in-out-of-pipelines.md)
+* Pass an `OutputFileDatasetConfig` object through either the `outputs` or `arguments` parameter when you submit an experiment job. `OutputFileDatasetConfig` objects can also persist data between pipeline steps. For more information, visit [Move data between ML pipeline steps](how-to-move-data-in-out-of-pipelines.md)
 
-* Register a dataset in your script. For this scenario, the name assigned to the dataset when you registered it to the workspace is the name displayed. In the following example, `training_ds` is the name that would be displayed.
+* Register a dataset in your script. The name assigned to the dataset when you registered it to the workspace is the name displayed. In this code sample, `training_ds` is the displayed name:
 
 ```Python
 training_ds = unregistered_ds.register(workspace = workspace,
@@ -173,13 +162,11 @@ The following are scenarios where your data is tracked as an **output dataset**.
                                           )
 ```
 
-* Submit child job with an unregistered dataset in script. This results in an anonymous saved dataset.
+* Submit a child job with an unregistered dataset in the script. This submission results in an anonymous saved dataset
 
 ### Trace datasets in experiment jobs
 
-For each Machine Learning experiment, you can easily trace the datasets used as input with the experiment `Job` object.
-
-The following code uses the [`get_details()`](/python/api/azureml-core/azureml.core.run.run#get-details--) method to track which input datasets were used with the experiment run:
+For each Machine Learning experiment, you can trace the input datasets for the experiment `Job` object. This code sample uses the [`get_details()`](/python/api/azureml-core/azureml.core.run.run#get-details--) method to track the input datasets used with the experiment run:
 
 ```Python
 # get input datasets
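[Editor's note between hunks: the snippet this hunk documents indexes the run details as `inputs[0]['dataset']`. The mock below mirrors that shape so the extraction logic can be shown without the azureml SDK installed; the payload structure and names (`mnist_train`, `mnist_test`) are assumptions for illustration only.]

```python
# Extract input dataset names from a run-details-like dict. The nested
# structure below is a mock standing in for what get_details() returns.
def input_dataset_names(details):
    return [entry["dataset"]["name"] for entry in details.get("inputDatasets", [])]

mock_details = {
    "runId": "keras-mnist_1",
    "inputDatasets": [
        {"dataset": {"name": "mnist_train", "version": 1}},
        {"dataset": {"name": "mnist_test", "version": 1}},
    ],
}

names = input_dataset_names(mock_details)  # -> ['mnist_train', 'mnist_test']
```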
@@ -190,27 +177,27 @@ input_dataset = inputs[0]['dataset']
 input_dataset.to_path()
 ```
 
-You can also find the `input_datasets` from experiments by using the [Azure Machine Learning studio]().
+You can also find the `input_datasets` from experiments with the [Azure Machine Learning studio](https://ml.azure.com).
 
-The following image shows where to find the input dataset of an experiment on Azure Machine Learning studio. For this example, go to your **Experiments** pane and open the **Properties** tab for a specific run of your experiment, `keras-mnist`.
+This screenshot shows where to find the input dataset of an experiment on Azure Machine Learning studio. For this example, start at your **Experiments** pane, and open the **Properties** tab for a specific run of your experiment, `keras-mnist`.
 
 ![Input datasets](./media/how-to-version-track-datasets/input-datasets.png)
 
-Use the following code to register models with datasets:
+This code registers models with datasets:
 
 ```Python
 model = run.register_model(model_name='keras-mlp-mnist',
                            model_path=model_path,
                            datasets =[('training data',train_dataset)])
 ```
 
-After registration, you can see the list of models registered with the dataset by using Python or go to the [studio](https://ml.azure.com/).
+After registration, you can see the list of models registered with the dataset with either Python or the [studio](https://ml.azure.com/).
 
-The following view is from the **Datasets** pane under **Assets**. Select the dataset and then select the **Models** tab for a list of the models that are registered with the dataset.
+This screenshot is from the **Datasets** pane under **Assets**. Select the dataset, and then select the **Models** tab for a list of the models that are registered with the dataset.
 
 ![Input datasets models](./media/how-to-version-track-datasets/dataset-models.png)
 
 ## Next steps
 
 * [Train with datasets](how-to-train-with-datasets.md)
-* [More sample dataset notebooks](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/work-with-data/)
+* [More sample dataset notebooks](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/work-with-data/)
