
Commit 80acb0f

Update how-to-access-data-batch-endpoints-jobs.md
1 parent: b87994b · commit: 80acb0f

File tree

1 file changed (+99 −99 lines)


articles/machine-learning/how-to-access-data-batch-endpoints-jobs.md

Lines changed: 99 additions & 99 deletions
@@ -25,15 +25,15 @@ Batch endpoints can be used to perform batch scoring on large amounts of data. S

Batch endpoints support reading files located in the following storage options:

-* Azure Machine Learning Data Stores. The following stores are supported:
-    * Azure Blob Storage
-    * Azure Data Lake Storage Gen1
-    * Azure Data Lake Storage Gen2
-* Azure Machine Learning Data Assets. The following types are supported:
+* [Azure Machine Learning Data Assets](#input-data-from-a-data-asset). The following types are supported:
    * Data assets of type Folder (`uri_folder`).
    * Data assets of type File (`uri_file`).
    * Datasets of type `FileDataset` (Deprecated).
-* Azure Storage Accounts. The following storage containers are supported:
+* [Azure Machine Learning Data Stores](#input-data-from-data-stores). The following stores are supported:
+    * Azure Blob Storage
+    * Azure Data Lake Storage Gen1
+    * Azure Data Lake Storage Gen2
+* [Azure Storage Accounts](#input-data-from-azure-storage-accounts). The following storage containers are supported:
    * Azure Data Lake Storage Gen1
    * Azure Data Lake Storage Gen2
    * Azure Blob Storage
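
For orientation, the three storage options in the reordered list above map to three styles of job input in the article's Python tabs. The following is a minimal sketch of that mapping, assuming the azure-ai-ml SDK v2; every angle-bracketed value is a placeholder, and the `azureml:/<data-asset>@latest` form follows the TIP that appears later in this diff.

```python
# Hedged sketch only: maps the three storage options listed above to job
# inputs in the azure-ai-ml SDK v2. All angle-bracketed values are placeholders.
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# 1) A registered data asset, referenced by name and version or label.
asset_input = Input(type=AssetTypes.URI_FOLDER, path="azureml:/<data-asset>@latest")

# 2) A path inside a registered data store.
datastore_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/<data-store>/paths/<data-path>",
)

# 3) A direct cloud URI into an Azure Storage account.
storage_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="https://<account>.blob.core.windows.net/<container>/<path>",
)
```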
@@ -45,54 +45,77 @@ Batch endpoints support reading files located in the following storage options:
> __Deprecation notice__: Datasets of type `FileDataset` (V1) are deprecated and will be retired in the future. Existing batch endpoints relying on this functionality will continue to work, but batch endpoints created with GA CLIv2 (2.4.0 and newer) or GA REST API (2022-05-01 and newer) will not support V1 datasets.

-## Reading data from data stores
+## Input data from a data asset

-Data from Azure Machine Learning registered data stores can be directly referenced by batch deployment jobs. In this example, we're going to first upload some data to the default data store in the Azure Machine Learning workspace and then run a batch deployment on it. Follow these steps to run a batch endpoint job using data stored in a data store:
+Azure Machine Learning data assets (formerly known as datasets) are supported as inputs for jobs. Follow these steps to run a batch endpoint job using data stored in a registered data asset in Azure Machine Learning:

-1. Let's get access to the default data store in the Azure Machine Learning workspace. If your data is in a different store, you can use that store instead. There's no requirement to use the default data store.
+> [!WARNING]
+> Data assets of type Table (`MLTable`) aren't currently supported.

-    # [Azure CLI](#tab/cli)
+1. Let's create the data asset first. This data asset consists of a folder with multiple CSV files that we want to process in parallel using batch endpoints. You can skip this step if your data is already registered as a data asset.

-    ```azurecli
-    DATASTORE_ID=$(az ml datastore show -n workspaceblobstore | jq -r '.id')
+    # [Azure CLI](#tab/cli)
+
+    Create a data asset definition in `YAML`:
+
+    __heart-dataset-unlabeled.yml__
+    ```yaml
+    $schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
+    name: heart-dataset-unlabeled
+    description: An unlabeled dataset for heart classification.
+    type: uri_folder
+    path: heart-classifier-mlflow/data
    ```
-
-    > [!NOTE]
-    > A data store's ID looks like `/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/datastores/<data-store>`.
-
+
+    Then, create the data asset:
+
+    ```bash
+    az ml data create -f heart-dataset-unlabeled.yml
+    ```
+
    # [Python](#tab/sdk)

    ```python
-    default_ds = ml_client.datastores.get_default()
+    data_path = "heart-classifier-mlflow/data"
+    dataset_name = "heart-dataset-unlabeled"
+
+    heart_dataset_unlabeled = Data(
+        path=data_path,
+        type=AssetTypes.URI_FOLDER,
+        description="An unlabeled dataset for heart classification",
+        name=dataset_name,
+    )
+    ```
+
+    Then, create the data asset:
+
+    ```python
+    ml_client.data.create_or_update(heart_dataset_unlabeled)
+    ```
+
+    To get the newly created data asset, use:
+
+    ```python
+    heart_dataset_unlabeled = ml_client.data.get(name=dataset_name, label="latest")
    ```

    # [REST](#tab/rest)

-    Use the Azure ML CLI, Azure ML SDK for Python, or Studio to get the data store information.
-
-    ---
-
-    > [!TIP]
-    > The default blob data store in a workspace is called __workspaceblobstore__. You can skip this step if you already know the resource ID of the default data store in your workspace.
+    Use the Azure ML CLI, Azure ML SDK for Python, or Studio to get the location (region), workspace, and data asset name and version. You will need them later.

-1. We'll need to upload some sample data to it. This example assumes you've uploaded the sample data included in the repo folder `sdk/python/endpoints/batch/heart-classifier/data` to the folder `heart-classifier/data` in the blob storage account. Ensure you have done that before moving forward.

1. Create a data input:

    # [Azure CLI](#tab/cli)

-    Let's place the file path in the following variable:
-
    ```azurecli
-    DATA_PATH="heart-disease-uci-unlabeled"
-    INPUT_PATH="$DATASTORE_ID/paths/$DATA_PATH"
+    DATASET_ID=$(az ml data show -n heart-dataset-unlabeled --label latest --query id)
    ```

    # [Python](#tab/sdk)

    ```python
-    data_path = "heart-classifier/data"
-    input = Input(type=AssetTypes.URI_FOLDER, path=f"{default_ds.id}/paths/{data_path}")
+    input = Input(type=AssetTypes.URI_FOLDER, path=heart_dataset_unlabeled.id)
    ```

    # [REST](#tab/rest)
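
The `+` lines of the Python tab above rely on `Data`, `AssetTypes`, and an `ml_client` that the article sets up elsewhere. Here is a self-contained sketch of the full data-asset flow, assuming the azure-ai-ml SDK v2 and `DefaultAzureCredential`; the subscription, resource group, and workspace names are placeholders, not part of this commit.

```python
# Hedged, self-contained sketch of the data-asset flow shown in this hunk.
# Workspace identifiers are placeholders; the SDK calls are azure-ai-ml v2.
from azure.ai.ml import MLClient, Input
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Register the folder of CSV files as a uri_folder data asset.
heart_dataset_unlabeled = Data(
    path="heart-classifier-mlflow/data",
    type=AssetTypes.URI_FOLDER,
    description="An unlabeled dataset for heart classification",
    name="heart-dataset-unlabeled",
)
ml_client.data.create_or_update(heart_dataset_unlabeled)

# Retrieve the latest version and wire it up as a job input.
heart_dataset_unlabeled = ml_client.data.get(
    name="heart-dataset-unlabeled", label="latest"
)
input = Input(type=AssetTypes.URI_FOLDER, path=heart_dataset_unlabeled.id)
```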
@@ -105,28 +128,29 @@ Data from Azure Machine Learning registered data stores can be directly referenc
        "InputData": {
            "mnistinput": {
                "JobInputType" : "UriFolder",
-                "Uri": "azureml:/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/datastores/<data-store>/paths/<data-path>"
+                "Uri": "azureml://locations/<location>/workspaces/<workspace>/data/<dataset_name>/versions/labels/latest"
            }
        }
    }
}
```
---

> [!NOTE]
-> Notice how the segment `paths` is appended to the resource ID of the data store to indicate that what follows is a path inside of it.
+> A data asset's ID looks like `/subscriptions/<subscription>/resourcegroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/data/<data-asset>/versions/<version>`.

-> [!TIP]
-> You can also use `azureml://datastores/<data-store>/paths/<data-path>` as a way to indicate the input.

1. Run the deployment:

    # [Azure CLI](#tab/cli)

    ```bash
-    INVOKE_RESPONSE=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $INPUT_PATH)
+    INVOKE_RESPONSE=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $DATASET_ID)
    ```

+    > [!TIP]
+    > You can also use `--input azureml:/<dataasset_name>@latest` as a way to indicate the input.
+
    # [Python](#tab/sdk)

    ```python
@@ -136,9 +160,9 @@ Data from Azure Machine Learning registered data stores can be directly referenc
    )
    ```

-   # [REST](#tab/rest)
+    # [REST](#tab/rest)

-   __Request__
+    __Request__

    ```http
    POST jobs HTTP/1.1
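
The Python tab that this hunk truncates (only its closing `)` and fence survive the context window) is the endpoint invocation itself. A hedged sketch of that call follows; the endpoint name is a placeholder, and depending on the azure-ai-ml version the parameter may be `inputs` (a dict) rather than `input`.

```python
# Hedged sketch of the SDK invocation the hunk abbreviates; reuses the
# ml_client and input objects from the earlier sketch. <endpoint-name>
# is a placeholder.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="<endpoint-name>",
    input=input,
)

# The returned job can then be monitored like any other Azure ML job.
ml_client.jobs.get(job.name)
```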
@@ -147,77 +171,54 @@ Data from Azure Machine Learning registered data stores can be directly referenc
    Content-Type: application/json
    ```

-## Reading data from a data asset
+## Input data from data stores

-Azure Machine Learning data assets (formerly known as datasets) are supported as inputs for jobs. Follow these steps to run a batch endpoint job using data stored in a registered data asset in Azure Machine Learning:
-
-> [!WARNING]
-> Data assets of type Table (`MLTable`) aren't currently supported.
+Data from Azure Machine Learning registered data stores can be directly referenced by batch deployment jobs. In this example, we're going to first upload some data to the default data store in the Azure Machine Learning workspace and then run a batch deployment on it. Follow these steps to run a batch endpoint job using data stored in a data store:

-1. Let's create the data asset first. This data asset consists of a folder with multiple CSV files that we want to process in parallel using batch endpoints. You can skip this step if your data is already registered as a data asset.
+1. Let's get access to the default data store in the Azure Machine Learning workspace. If your data is in a different store, you can use that store instead. There's no requirement to use the default data store.

    # [Azure CLI](#tab/cli)
-
-    Create a data asset definition in `YAML`:
-
-    __heart-dataset-unlabeled.yml__
-    ```yaml
-    $schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
-    name: heart-dataset-unlabeled
-    description: An unlabeled dataset for heart classification.
-    type: uri_folder
-    path: heart-classifier-mlflow/data
-    ```
-
-    Then, create the data asset:
-
-    ```bash
-    az ml data create -f heart-dataset-unlabeled.yml
-    ```
-
-    # [Python](#tab/sdk)
-
-    ```python
-    data_path = "heart-classifier-mlflow/data"
-    dataset_name = "heart-dataset-unlabeled"
-
-    heart_dataset_unlabeled = Data(
-        path=data_path,
-        type=AssetTypes.URI_FOLDER,
-        description="An unlabeled dataset for heart classification",
-        name=dataset_name,
-    )
-    ```
-
-    Then, create the data asset:
-
-    ```python
-    ml_client.data.create_or_update(heart_dataset_unlabeled)
+
+    ```azurecli
+    DATASTORE_ID=$(az ml datastore show -n workspaceblobstore | jq -r '.id')
    ```

-    To get the newly created data asset, use:
-
+    > [!NOTE]
+    > A data store's ID looks like `/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/datastores/<data-store>`.
+
+    # [Python](#tab/sdk)
+
    ```python
-    heart_dataset_unlabeled = ml_client.data.get(name=dataset_name, label="latest")
+    default_ds = ml_client.datastores.get_default()
    ```

    # [REST](#tab/rest)

-    Use the Azure ML CLI, Azure ML SDK for Python, or Studio to get the location (region), workspace, and data asset name and version. You will need them later.
+    Use the Azure ML CLI, Azure ML SDK for Python, or Studio to get the data store information.
+
+    ---
+
+    > [!TIP]
+    > The default blob data store in a workspace is called __workspaceblobstore__. You can skip this step if you already know the resource ID of the default data store in your workspace.

+1. We'll need to upload some sample data to it. This example assumes you've uploaded the sample data included in the repo folder `sdk/python/endpoints/batch/heart-classifier/data` to the folder `heart-classifier/data` in the blob storage account. Ensure you have done that before moving forward.

1. Create a data input:

    # [Azure CLI](#tab/cli)

+    Let's place the file path in the following variable:
+
    ```azurecli
-    DATASET_ID=$(az ml data show -n heart-dataset-unlabeled --label latest --query id)
+    DATA_PATH="heart-disease-uci-unlabeled"
+    INPUT_PATH="$DATASTORE_ID/paths/$DATA_PATH"
    ```

    # [Python](#tab/sdk)

    ```python
-    input = Input(type=AssetTypes.URI_FOLDER, path=heart_dataset_unlabeled.id)
+    data_path = "heart-classifier/data"
+    input = Input(type=AssetTypes.URI_FOLDER, path=f"{default_ds.id}/paths/{data_path}")
    ```

    # [REST](#tab/rest)
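
In the `+` lines of the Python tab above, the `/paths/` segment is what joins the data store's resource ID to a location inside the store. A short sketch follows, reusing the `ml_client` assumed in the earlier sketch; the folder must already be uploaded to the blob storage account.

```python
# Hedged sketch: build a job input from a path inside the default data store.
# Assumes the ml_client from the earlier sketch.
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

default_ds = ml_client.datastores.get_default()

# "/paths/" separates the data store's resource ID from the path inside it.
input = Input(
    type=AssetTypes.URI_FOLDER,
    path=f"{default_ds.id}/paths/heart-classifier/data",
)

# Equivalent short form, per the TIP in the next hunk:
# azureml://datastores/workspaceblobstore/paths/heart-classifier/data
```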
@@ -230,29 +231,28 @@ Azure Machine Learning data assets (formerly known as datasets) are supported as
        "InputData": {
            "mnistinput": {
                "JobInputType" : "UriFolder",
-                "Uri": "azureml://locations/<location>/workspaces/<workspace>/data/<dataset_name>/versions/labels/latest"
+                "Uri": "azureml:/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/datastores/<data-store>/paths/<data-path>"
            }
        }
    }
}
```
---

> [!NOTE]
-> A data asset's ID looks like `/subscriptions/<subscription>/resourcegroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/data/<data-asset>/versions/<version>`.
+> Notice how the segment `paths` is appended to the resource ID of the data store to indicate that what follows is a path inside of it.

+> [!TIP]
+> You can also use `azureml://datastores/<data-store>/paths/<data-path>` as a way to indicate the input.

1. Run the deployment:

    # [Azure CLI](#tab/cli)

    ```bash
-    INVOKE_RESPONSE=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $DATASET_ID)
+    INVOKE_RESPONSE=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $INPUT_PATH)
    ```

-    > [!TIP]
-    > You can also use `--input azureml:/<dataasset_name>@latest` as a way to indicate the input.
-
    # [Python](#tab/sdk)

    ```python
@@ -262,9 +262,9 @@ Azure Machine Learning data assets (formerly known as datasets) are supported as
    )
    ```

-   # [REST](#tab/rest)
+    # [REST](#tab/rest)

-   __Request__
+    __Request__

    ```http
    POST jobs HTTP/1.1
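
Both REST tabs in this file issue the same job-creation request, and the diff context keeps only its first lines. As a heavily hedged sketch of what the abbreviated `__Request__` could look like end to end: the scoring URI and token are placeholders the diff never spells out, and the body shape follows the `InputData` JSON shown in the hunks above.

```python
# Heavily hedged sketch of the POST that the __Request__ fragments abbreviate,
# issued with the requests library. Scoring URI, bearer token, and input URI
# are all placeholders; the body mirrors the InputData JSON shown above.
import requests

response = requests.post(
    "<endpoint-scoring-uri>/jobs",  # placeholder; not spelled out in the diff
    headers={
        "Authorization": "Bearer <token>",
        "Content-Type": "application/json",
    },
    json={
        "properties": {
            "InputData": {
                "mnistinput": {
                    "JobInputType": "UriFolder",
                    "Uri": "azureml://datastores/<data-store>/paths/<data-path>",
                }
            }
        }
    },
)
response.raise_for_status()
print(response.json())
```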
@@ -273,7 +273,7 @@ Azure Machine Learning data assets (formerly known as datasets) are supported as
    Content-Type: application/json
    ```

-## Reading data from Azure Storage Accounts
+## Input data from Azure Storage Accounts

Azure Machine Learning batch endpoints can read data from cloud locations in Azure Storage Accounts, both public and private. Use the following steps to run a batch endpoint job using data stored in a storage account:
