In this article, you'll learn how to version and track Azure Machine Learning datasets for reproducibility. Dataset versioning bookmarks specific states of your data, so that you can apply a particular version of the dataset in future experiments.
You might want to version your Azure Machine Learning resources in these typical scenarios:
* When new data becomes available for retraining
* When you apply different data preparation or feature engineering approaches
## Prerequisites
- The [Azure Machine Learning SDK for Python](/python/api/overview/azure/ml/install). This SDK includes the [azureml-datasets](/python/api/azureml-core/azureml.core.dataset) package
- An [Azure Machine Learning workspace](../concept-workspace.md). [Create a new workspace](../quickstart-create-resources.md), or retrieve an existing workspace with this code sample:
```Python
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
```
- An [Azure Machine Learning dataset](how-to-create-register-datasets.md)

<a name="register"></a>

## Register and retrieve dataset versions
You can version, reuse, and share a registered dataset across experiments and with your colleagues. You can register multiple datasets under the same name, and retrieve a specific version by name and version number.
### Register a dataset version
This code sample sets the `create_new_version` parameter of the `titanic_ds` dataset to `True`, to register a new version of that dataset. If the workspace has no existing `titanic_ds` dataset registered, the code creates a new dataset with the name `titanic_ds` and sets its version to 1.
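As a minimal sketch of that registration call, written as a small helper so the call stands out (the dataset and workspace objects are assumed to exist, as in the earlier samples):

```python
def register_new_version(workspace, dataset, name="titanic_ds"):
    """Register `dataset` under `name` in `workspace`.

    With create_new_version=True, azureml-core registers a new version
    if the name already exists, or creates version 1 otherwise.
    """
    return dataset.register(workspace=workspace,
                            name=name,
                            description="titanic training data",
                            create_new_version=True)
```

Here `dataset` is any azureml-core `Dataset` object, for example one built with `Dataset.Tabular.from_delimited_files()`.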
By default, the `Dataset` class [get_by_name()](/python/api/azureml-core/azureml.core.dataset.dataset#azureml-core-dataset-dataset-get-by-name) method returns the latest version of the dataset registered with the workspace.
This code returns version 1 of the `titanic_ds` dataset.
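As a sketch of that call, wrapped in a small helper (`dataset_cls` stands in for `azureml.core.Dataset`; the `version` parameter pins the exact version instead of the default latest):

```python
def get_dataset_version(dataset_cls, workspace, name="titanic_ds", version=1):
    """Fetch a registered dataset pinned to an exact version.

    `dataset_cls` stands in for azureml.core.Dataset, whose
    get_by_name() method accepts workspace, name, and version.
    """
    return dataset_cls.get_by_name(workspace=workspace,
                                   name=name,
                                   version=version)
```

In a real script, you would call `Dataset.get_by_name(workspace=ws, name='titanic_ds', version=1)` directly; passing `version='latest'` (the default) returns the newest registered version instead.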
When you create a dataset version, you *don't* create an extra copy of data with the workspace. Because datasets are references to the data in your storage service, you have a single source of truth, managed by your storage service.
> [!IMPORTANT]
> If the data referenced by your dataset is overwritten or deleted, a call to a specific version of the dataset does *not* revert the change.
When you load data from a dataset, the current data content referenced by the dataset is always loaded. If you want to make sure that each dataset version is reproducible, we recommend that you don't modify the data content referenced by the dataset version. When new data comes in, save new data files into a separate data folder, and then create a new dataset version to include data from that new folder.
This image and sample code show the recommended way to both structure your data folders and create dataset versions that reference those folders:
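The folder convention can be sketched in plain Python (the folder names here are illustrative assumptions): each incoming batch of files lands in its own subfolder, and each dataset version points at one subfolder, so older versions keep referencing untouched data:

```python
from pathlib import Path

def folder_for_new_batch(root, batch_label):
    """Create a fresh subfolder (for example data/week_2/) for a new
    batch of data files, leaving earlier batches' folders untouched so
    existing dataset versions stay reproducible."""
    folder = Path(root) / batch_label
    folder.mkdir(parents=True, exist_ok=False)  # refuse to reuse a folder
    return folder
```

You would then create the next dataset version from that folder alone, for example with `Dataset.File.from_files(path=(datastore, 'data/week_2/'))` followed by `register(..., create_new_version=True)`.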
You can use a dataset as the input and output of each [ML pipeline](../concept-ml-pipelines.md) step. When you rerun pipelines, the output of each pipeline step is registered as a new dataset version.
Machine Learning pipelines populate the output of each step into a new folder every time the pipeline reruns. The versioned output datasets then become reproducible. For more information, visit [datasets in pipelines](./how-to-create-machine-learning-pipelines.md#steps).
Azure Machine Learning tracks your data throughout your experiment as input and output datasets. In these scenarios, your data is tracked as an **input dataset**:
* As a `DatasetConsumptionConfig` object, through either the `inputs` or `arguments` parameter of your `ScriptRunConfig` object, when submitting the experiment job
* When your script calls methods such as `get_by_name()` or `get_by_id()`. In this scenario, the name assigned to the dataset when you registered it to the workspace is the displayed name
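The first scenario can be sketched with a helper that builds the job configuration (the script name, argument flag, and input name here are illustrative assumptions; `script_run_config_cls` stands in for `azureml.core.ScriptRunConfig`):

```python
def configure_training_job(script_run_config_cls, dataset):
    """Pass a dataset through the `arguments` parameter of a
    ScriptRunConfig-style class. as_named_input()/as_mount() yield the
    DatasetConsumptionConfig object that gets tracked as an input dataset."""
    consumption = dataset.as_named_input("training").as_mount()
    return script_run_config_cls(
        source_directory=".",
        script="train.py",
        arguments=["--data-folder", consumption],
    )
```

In a real experiment, the returned configuration would then be submitted with `experiment.submit(config)`.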
In these scenarios, your data is tracked as an **output dataset**:
* Pass an `OutputFileDatasetConfig` object through either the `outputs` or `arguments` parameter when you submit an experiment job. `OutputFileDatasetConfig` objects can also persist data between pipeline steps. For more information, visit [Move data between ML pipeline steps](how-to-move-data-in-out-of-pipelines.md)
* Register a dataset in your script. The name assigned to the dataset when you registered it to the workspace is the name displayed. In this code sample, `training_ds` is the displayed name:
```Python
training_ds = unregistered_ds.register(workspace=workspace,
                                       name='training_ds',
                                       description='training data'
                                       )
```
* Submission of a child job with an unregistered dataset in the script. This submission results in an anonymous saved dataset
### Trace datasets in experiment jobs
For each Machine Learning experiment, you can trace the input datasets for the experiment `Job` object. This code sample uses the [`get_details()`](/python/api/azureml-core/azureml.core.run.run#get-details--) method to track the input datasets used with the experiment run:
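That pattern can be sketched as a small helper (the `'inputDatasets'` key is the field used in azureml-core run details for tracked inputs; the helper shape itself is an assumption):

```python
def list_input_datasets(job):
    """Pull the tracked input datasets out of a run's details.

    `job` is an azureml Run/Job-style object exposing get_details(),
    whose result maps 'inputDatasets' to a list of entries, each
    holding the dataset object under the 'dataset' key."""
    details = job.get_details()
    return [entry["dataset"] for entry in details.get("inputDatasets", [])]
```

For example, `list_input_datasets(run)[0]` would return the first dataset used as input to `run`.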
You can also find the `input_datasets` from experiments with the [Azure Machine Learning studio](https://ml.azure.com).
This screenshot shows where to find the input dataset of an experiment on Azure Machine Learning studio. For this example, start at your **Experiments** pane, and open the **Properties** tab for a specific run of your experiment, `keras-mnist`.
After registration, you can see the list of models registered with the dataset by using either Python or the [studio](https://ml.azure.com/).
This screenshot is from the **Datasets** pane under **Assets**. Select the dataset, and then select the **Models** tab for a list of the models that are registered with the dataset.