---
title: Share data across workspaces with registries (preview)
titleSuffix: Azure Machine Learning
description: Learn how to practice cross-workspace MLOps and collaborate across teams by sharing data through registries.
services: machine-learning
ms.service: machine-learning
ms.subservice: mlops
ms.author: kritifaujdar
author: KritiFaujdar
ms.reviewer: larryfr
ms.date: 03/21/2023
ms.topic: how-to
ms.custom: devx-track-python, devx-track-azurecli
---

# Share data across workspaces with registries (preview)

Azure Machine Learning registry (preview) enables you to collaborate across workspaces within your organization. Using registries, you can share models, components, environments, and data.

## Key scenario addressed by data sharing using Azure Machine Learning registry

You may want to have data shared across multiple teams, projects, or workspaces in a central location. Such data doesn't have sensitive access-control requirements and can be broadly used in the organization.

Examples include:

* A team wants to share a public dataset that is preprocessed and ready to use in experiments.
* Your organization has acquired a particular dataset for a project from an external vendor and wants to make it available to all teams working on the same project.
* Teams want to share data assets across workspaces in different regions.

In these scenarios, you can create a data asset in a registry or share an existing data asset from a workspace to a registry. This data asset can then be used across multiple workspaces.

## Scenarios NOT addressed by data sharing using Azure Machine Learning registry

* Sharing sensitive data that needs fine-grained access control. You can't create a data asset in a registry to share with a small subset of users/workspaces while the registry is accessible to many other users in the organization.

* Sharing data that is available in existing storage that must not be copied, or that is too large or too expensive to copy. Whenever a data asset is created in a registry, a copy of the data is ingested into the registry storage so that it can be replicated.

## Data asset types supported by Azure Machine Learning registry

> [!NOTE]
> Make sure to check out the **canonical scenarios** below when deciding whether to use uri_file, uri_folder, or mltable for your use case.

You can create three data asset types:

| Type | V2 API | Canonical scenarios |
| :------------- |:-------------| :-----|
| **File:** Reference a single file | uri_file | Read/write a single file. The file can have any format. |
| **Folder:** Reference a single folder | uri_folder | Read/write a folder of Parquet/CSV files into Pandas/Spark. Deep learning with image, text, audio, or video files located in a folder. |
| **Table:** Reference a data table | mltable | You have a complex schema subject to frequent changes, or you need a subset of large tabular data. |
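
For example, a `uri_file` asset definition differs from the `uri_folder` definition used later in this article only in its `type` and `path` (the asset name and file name below are illustrative, not part of the sample repository):

```YAML
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: sample-file-data
description: Example single-file data asset (illustrative).
version: 1
type: uri_file
path: ./sample-data.csv
```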

## Paths supported by Azure Machine Learning registry

When you create a data asset, you must specify a **path** parameter that points to the data location. Supported paths include:

| Location | Example |
| :------------- |:-------------|
| A path on your local computer | `./data_transformed/` |
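
Before creating an asset, it can help to confirm that the local path exists and matches the asset type you plan to declare. A minimal sketch (the helper name is ours, not part of the Azure Machine Learning SDK):

```python
from pathlib import Path

def classify_path(path: str) -> str:
    """Suggest a data asset type for a local path (illustrative helper only)."""
    p = Path(path)
    if p.is_file():
        return "uri_file"    # single file, any format
    if p.is_dir():
        return "uri_folder"  # folder of files
    raise FileNotFoundError(f"{path} does not exist")

print(classify_path("."))  # the current directory is a folder → uri_folder
```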

In this article, you'll learn how to:

* Create a data asset in the registry.
* Share an existing data asset from a workspace to the registry.
* Use the data asset from the registry as input to a model training job in a workspace.

[!INCLUDE [machine-learning-preview-generic-disclaimer](../../includes/machine-learning-preview-generic-disclaimer.md)]

## Prerequisites

Before following the steps in this article, make sure you have the following prerequisites:

- Familiarity with [Azure Machine Learning registries](https://learn.microsoft.com/en-us/azure/machine-learning/concept-machine-learning-registries-mlops) and [Data concepts in Azure Machine Learning](https://learn.microsoft.com/en-us/azure/machine-learning/concept-data).

- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://azure.microsoft.com/free/).

- An Azure Machine Learning registry (preview) to share data. To create a registry, see [Learn how to create a registry](how-to-manage-registries.md).

- An Azure Machine Learning workspace. If you don't have one, use the steps in the [Quickstart: Create workspace resources](quickstart-create-resources.md) article to create one.

> [!IMPORTANT]
> The Azure region (location) where you create your workspace must be in the list of supported regions for Azure Machine Learning registry.

- The Azure CLI and the `ml` extension __or__ the Azure Machine Learning Python SDK v2:

# [Azure CLI](#tab/cli)

To install the Azure CLI and extension, see [Install, set up, and use the CLI (v2)](how-to-configure-cli.md).

> [!IMPORTANT]
> * The CLI examples in this article assume that you are using the Bash (or compatible) shell. For example, from a Linux system or [Windows Subsystem for Linux](/windows/wsl/about).
> * The examples also assume that you have configured defaults for the Azure CLI so that you don't have to specify the parameters for your subscription, workspace, resource group, or location. To set default settings, use the following commands. Replace the following parameters with the values for your configuration:
>
> * Replace `<subscription>` with your Azure subscription ID.
> * Replace `<workspace>` with your Azure Machine Learning workspace name.
> * Replace `<resource-group>` with the Azure resource group that contains your workspace.
> * Replace `<location>` with the Azure region that contains your workspace.
>
> ```azurecli
> az account set --subscription <subscription>
> az configure --defaults workspace=<workspace> group=<resource-group> location=<location>
> ```
> You can see what your current defaults are by using the `az configure -l` command.

# [Python SDK](#tab/python)

To install the Python SDK v2, use the following command:

```bash
pip install --pre azure-ai-ml
```

---

### Clone examples repository

The code examples in this article are based on the `nyc_taxi_data_regression` sample in the [examples repository](https://github.com/Azure/azureml-examples). To use these files on your development environment, use the following commands to clone the repository and change directories to the example:

```bash
git clone https://github.com/Azure/azureml-examples
cd azureml-examples
```

# [Azure CLI](#tab/cli)

For the CLI example, change directories to `cli/jobs/pipelines-with-components/nyc_taxi_data_regression` in your local clone of the [examples repository](https://github.com/Azure/azureml-examples).

```bash
cd cli/jobs/pipelines-with-components/nyc_taxi_data_regression
```

# [Python SDK](#tab/python)

For the Python SDK example, use the `nyc_taxi_data_regression` sample from the [examples repository](https://github.com/Azure/azureml-examples). The sample notebook is available in the `sdk/python/assets/assets-in-registry` folder. All the sample YAML files, model training code, and sample data for training and inference are available in `cli/jobs/pipelines-with-components/nyc_taxi_data_regression`. Change to the `sdk/python/assets/assets-in-registry` directory and open the notebook if you'd like to step through the code in this document.

---

### Create SDK connection

> [!TIP]
> This step is only needed when using the Python SDK.

Create a client connection to both the Azure Machine Learning workspace and registry:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

ml_client_workspace = MLClient(credential=credential,
                               subscription_id="<workspace-subscription>",
                               resource_group_name="<workspace-resource-group>",
                               workspace_name="<workspace-name>")
print(ml_client_workspace)

ml_client_registry = MLClient(credential=credential,
                              registry_name="<REGISTRY_NAME>",
                              registry_location="<REGISTRY_REGION>")
print(ml_client_registry)
```

## Create data in registry

We'll create a data asset in this step and use it as an input to the training job later in this article.

# [Azure CLI](#tab/cli)

> [!TIP]
> The same CLI command `az ml data create` can be used to create data in a workspace or registry. Running the command with `--workspace-name` creates the data in a workspace, whereas running the command with `--registry-name` creates the data in the registry.

If you've cloned the examples repo and are in the folder `cli/jobs/pipelines-with-components/nyc_taxi_data_regression`, you should see a folder `data_transformed`. We'll use this folder as the source of the data. Create a YAML file `data-registry.yml` as shown below:

```YAML
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: transformed-nyc-taxt-data
description: Transformed NYC Taxi data created from local folder.
version: 1
type: uri_folder
path: data_transformed/
```

Create the data using the `az ml data create` command as follows:

```azurecli
az ml data create --file data-registry.yml --registry-name <registry-name>
```

If you get an error that data with this name and version already exists in the registry, you can either edit the `version` field in `data-registry.yml` or specify a different version on the CLI that overrides the version value in `data-registry.yml`.

```azurecli
# use shell epoch time as the version
version=$(date +%s)
az ml data create --file data-registry.yml --registry-name <registry-name> --set version=$version
```

> [!TIP]
> `version=$(date +%s)` works only in Bash-compatible shells (for example, on Linux). Replace `$version` with a random number if it doesn't work in your shell.
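
If your shell doesn't support `date +%s`, the same epoch-seconds version string can be generated with a short Python snippet (a sketch; assumes Python 3 is available on your machine):

```python
import time

# Epoch seconds give a monotonically increasing version string,
# so repeated runs won't collide with an existing version.
version = str(int(time.time()))
print(version)
```

Pass the printed value to `--set version=` as in the CLI example above.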

Note down the `name` and `version` of the data from the output of the `az ml data create` command and use them with `az ml data show` as follows:

```azurecli
az ml data show --name transformed-nyc-taxt-data --version 1 --registry-name <registry-name>
```

> [!TIP]
> If you used a different data name or version, replace the `--name` and `--version` parameters accordingly.

You can also use `az ml data list --registry-name <registry-name>` to list all data assets in the registry.

# [Python SDK](#tab/python)

> [!TIP]
> The same `MLClient.data.create_or_update()` method can be used to create data in either a workspace or a registry, depending on the target the client has been initialized with. Since you work with both a workspace and a registry in this document, you initialized `ml_client_workspace` and `ml_client_registry` to work with the workspace and registry respectively.

The source data folder `data_transformed` is available in `cli/jobs/pipelines-with-components/nyc_taxi_data_regression/`. Initialize the data object and create the data:

```python
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_path = "./data_transformed/"
my_data = Data(path=my_path,
               type=AssetTypes.URI_FOLDER,
               description="Transformed NYC Taxi data created from local folder.",
               name="transformed-nyc-taxt-data",
               version="1")
ml_client_registry.data.create_or_update(my_data)
```

> [!TIP]
> If you get an error that a data asset with this name and version already exists in the registry, specify a different version for the `version` parameter.

Note down the `name` and `version` of the data from the output and pass them to the `ml_client_registry.data.get()` method to fetch the data asset from the registry.

You can also use `ml_client_registry.data.list()` to list all data assets in the registry.

---

## Create an environment and component in registry

Follow the steps [in this article](how-to-share-models-pipelines-across-workspaces-with-registries.md) to create an environment and component in the registry. We'll use them in the training job in the next section. Alternatively, you can use an environment and component from the workspace.

## Run a pipeline job in a workspace using component from registry

When running a pipeline job that uses a component and data from a registry, the _compute_ resources are local to the workspace. For more information on running jobs, see the following articles:

* [Running jobs (CLI)](./how-to-train-cli.md)
* [Running jobs (SDK)](./how-to-train-sdk.md)
* [Pipeline jobs with components (CLI)](./how-to-create-component-pipelines-cli.md)
* [Pipeline jobs with components (SDK)](./how-to-create-component-pipeline-python.md)

# [Azure CLI](#tab/cli)

We'll run a pipeline job with the Scikit-learn training component and the data asset created in the previous sections to train a model. Check that you are in the folder `cli/jobs/pipelines-with-components/nyc_taxi_data_regression`. Edit the `component` field under the `train_job` section of the `single-job-pipeline.yml` file to refer to the training component, and the `path` under the `training_data` section to refer to the data asset created in the previous sections. The resulting `single-job-pipeline.yml` is shown below.

```YAML
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: nyc_taxi_data_regression_single_job
description: Single job pipeline to train regression model based on nyc taxi dataset

jobs:
  train_job:
    type: command
    component: azureml://registries/<registry-name>/components/train_linear_regression_model/versions/1
    compute: azureml:cpu-cluster
    inputs:
      training_data:
        type: uri_folder
        path: azureml://registries/<registry-name>/data/transformed-nyc-taxt-data/versions/1
    outputs:
      model_output:
        type: mlflow_model
      test_data:
```

The key aspect is that this pipeline is going to run in a workspace using training data that isn't in that specific workspace. The data is in a registry that can be used with any workspace in your organization. You can run this training job in any workspace you have access to without having to worry about making the training data available in that workspace.

> [!WARNING]
> * Before running the pipeline job, confirm that the workspace in which you will run the job is in an Azure region that is supported by the registry in which you created the data.
> * Confirm that the workspace has a compute cluster with the name `cpu-cluster`, or edit the `compute` field under `jobs.train_job.compute` with the name of your compute.

Run the pipeline job with the `az ml job create` command.

```azurecli
az ml job create --file single-job-pipeline.yml
```

> [!TIP]
> If you have not configured the default workspace and resource group as explained in the prerequisites section, you will need to specify the `--workspace-name` and `--resource-group` parameters for `az ml job create` to work.

# [Python SDK](#tab/python)

You'll run a pipeline job with the Scikit-learn training component and the data asset created in the previous sections to train a model. Construct the pipeline using the component and data created in the previous steps.

The key aspect is that this pipeline is going to run in a workspace using training data that isn't in that specific workspace. The data is in a registry that can be used with any workspace in your organization. You can run this training job in any workspace you have access to without having to worry about making the training data available in that workspace.

```Python
from azure.ai.ml import Input
from azure.ai.ml.dsl import pipeline

# get the data asset created earlier in the registry
data_asset_from_registry = ml_client_registry.data.get(name="transformed-nyc-taxt-data", version="1")

# train_component_from_registry is the training component fetched from the
# registry with ml_client_registry.components.get(), as described in the
# article linked in the previous section.
@pipeline()
def pipeline_with_registered_components(
    training_data
):
    train_job = train_component_from_registry(
        training_data=training_data,
    )

pipeline_job = pipeline_with_registered_components(
    training_data=Input(type="uri_folder", path=data_asset_from_registry.id),
)
pipeline_job.settings.default_compute = "cpu-cluster"
print(pipeline_job)
```
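
The `path` passed to `Input` above is the data asset's full ID. Registry asset IDs follow the same `azureml://` pattern shown in the CLI pipeline YAML; a small illustrative helper (ours, not part of the SDK) assembles one:

```python
def registry_data_uri(registry_name: str, name: str, version: str) -> str:
    """Assemble the azureml:// URI for a data asset in a registry (illustrative)."""
    return f"azureml://registries/{registry_name}/data/{name}/versions/{version}"

print(registry_data_uri("<registry-name>", "transformed-nyc-taxt-data", "1"))
# → azureml://registries/<registry-name>/data/transformed-nyc-taxt-data/versions/1
```

In practice, prefer fetching the asset with `ml_client_registry.data.get()` and using the returned object's `id` rather than hand-building URIs.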

> [!WARNING]
> * Before you run the pipeline job, confirm that the workspace in which you will run the job is in an Azure region that is supported by the registry in which you created the data.
> * Confirm that the workspace has a compute cluster with the name `cpu-cluster`, or update `pipeline_job.settings.default_compute` to the name of your compute cluster.

Run the pipeline job and wait for it to complete.

```python
pipeline_job = ml_client_workspace.jobs.create_or_update(
    pipeline_job, experiment_name="sdk_job_data_from_registry", skip_validation=True
)
ml_client_workspace.jobs.stream(pipeline_job.name)
pipeline_job = ml_client_workspace.jobs.get(pipeline_job.name)
pipeline_job
```

> [!TIP]
> Notice that you use `ml_client_workspace` to run the pipeline job, whereas you used `ml_client_registry` to create the data asset in the registry.

Since the data and component used in the training job are shared through a registry, you can submit the job to any workspace that you have access to in your organization, even across different subscriptions. For example, if you have `dev-workspace`, `test-workspace`, and `prod-workspace`, you can connect to those workspaces and resubmit the job.

---

## Share data from workspace to registry

In this workflow, you'll share an existing data asset from a workspace to a registry.

# [Azure CLI](#tab/cli)

First, create a data asset in the workspace. Make sure that you are in the `cli/assets/data` folder. There is a YAML file `local-folder.yml`, which we'll use to create the data asset in the workspace. The data is available in the `cli/assets/data/sample-data` folder. Here is what it looks like:

```YAML
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: local-folder-example-titanic
description: Dataset created from local folder.
type: uri_folder
path: sample-data/
```

Execute the following command to create the data asset in the workspace:

```azurecli
az ml data create -f local-folder.yml
```

Follow [this article](how-to-create-data-assets.md) to learn more about creating a data asset in a workspace.

The data asset created in a workspace can be shared to a registry, and from there it can be used in multiple workspaces. You can also change the name and version when sharing the data from the workspace to the registry.

```azurecli
az ml data share --name local-folder-example-titanic --version 1 --registry-name <registry-name> --share-with-name <new-name> --share-with-version <new-version>
```

# [Python SDK](#tab/python)

First, create a data asset in the workspace. Make sure that you are in the `sdk/assets/data` folder. The data is available in the `sdk/assets/data/sample-data` folder.

```python
my_path = "./sample-data/"
my_data = Data(path=my_path,
               type=AssetTypes.URI_FOLDER,
               description="",
               name="titanic-dataset",
               version="1")
ml_client_workspace.data.create_or_update(my_data)
```

Follow [this article](how-to-create-data-assets.md) to learn more about creating a data asset in a workspace.

The data asset created in a workspace can be shared to a registry, and from there it can be used in multiple workspaces. You can also change the name and version when sharing the data from the workspace to the registry.

```python
ml_client_workspace.data.share(name="titanic-dataset", version="1", registry_name="<registry-name>", share_with_name="<new-name>", share_with_version="<new-version>")
```

---

## Next steps

* [How to create and manage registries](how-to-manage-registries.md)
* [How to manage environments](how-to-manage-environments-v2.md)
* [How to train models](how-to-train-cli.md)
* [How to create pipelines using components](how-to-create-component-pipeline-python.md)
