
Commit 80acb0f

Update how-to-access-data-batch-endpoints-jobs.md
1 parent: b87994b · commit: 80acb0f

File tree

1 file changed (+99 −99 lines)


articles/machine-learning/how-to-access-data-batch-endpoints-jobs.md

Lines changed: 99 additions & 99 deletions
@@ -25,15 +25,15 @@ Batch endpoints can be used to perform batch scoring on large amounts of data. S

Batch endpoints support reading files located in the following storage options:

-* Azure Machine Learning Data Stores. The following stores are supported:
-    * Azure Blob Storage
-    * Azure Data Lake Storage Gen1
-    * Azure Data Lake Storage Gen2
-* Azure Machine Learning Data Assets. The following types are supported:
+* [Azure Machine Learning Data Assets](#input-data-from-a-data-asset). The following types are supported:
    * Data assets of type Folder (`uri_folder`).
    * Data assets of type File (`uri_file`).
    * Datasets of type `FileDataset` (Deprecated).
-* Azure Storage Accounts. The following storage containers are supported:
+* [Azure Machine Learning Data Stores](#input-data-from-data-stores). The following stores are supported:
+    * Azure Blob Storage
+    * Azure Data Lake Storage Gen1
+    * Azure Data Lake Storage Gen2
+* [Azure Storage Accounts](#input-data-from-azure-storage-accounts). The following storage containers are supported:
    * Azure Data Lake Storage Gen1
    * Azure Data Lake Storage Gen2
    * Azure Blob Storage
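
For orientation, the three storage options in the reordered list above map to three styles of job input in the article's Python tabs. The following is a minimal sketch of that mapping, assuming the azure-ai-ml SDK v2; every angle-bracketed value is a placeholder, and the `azureml:/<data-asset>@latest` form follows the TIP that appears later in this diff.

```python
# Hedged sketch only: maps the three storage options listed above to job
# inputs in the azure-ai-ml SDK v2. All angle-bracketed values are placeholders.
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# 1) A registered data asset, referenced by name and version or label.
asset_input = Input(type=AssetTypes.URI_FOLDER, path="azureml:/<data-asset>@latest")

# 2) A path inside a registered data store.
datastore_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/<data-store>/paths/<data-path>",
)

# 3) A direct cloud URI into an Azure Storage account.
storage_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="https://<account>.blob.core.windows.net/<container>/<path>",
)
```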
@@ -45,54 +45,77 @@ Batch endpoints support reading files located in the following storage options:
> __Deprecation notice__: Datasets of type `FileDataset` (V1) are deprecated and will be retired in the future. Existing batch endpoints relying on this functionality will continue to work, but batch endpoints created with GA CLIv2 (2.4.0 and newer) or GA REST API (2022-05-01 and newer) will not support V1 datasets.

-## Reading data from data stores
+## Input data from a data asset

-Data from Azure Machine Learning registered data stores can be directly referenced by batch deployment jobs. In this example, we're going to first upload some data to the default data store in the Azure Machine Learning workspace and then run a batch deployment on it. Follow these steps to run a batch endpoint job using data stored in a data store:
+Azure Machine Learning data assets (formerly known as datasets) are supported as inputs for jobs. Follow these steps to run a batch endpoint job using data stored in a registered data asset in Azure Machine Learning:

-1. Let's get access to the default data store in the Azure Machine Learning workspace. If your data is in a different store, you can use that store instead. There's no requirement to use the default data store.
+> [!WARNING]
+> Data assets of type Table (`MLTable`) aren't currently supported.

-    # [Azure CLI](#tab/cli)
+1. Let's create the data asset first. This data asset consists of a folder with multiple CSV files that we want to process in parallel using batch endpoints. You can skip this step if your data is already registered as a data asset.

-    ```azurecli
-    DATASTORE_ID=$(az ml datastore show -n workspaceblobstore | jq -r '.id')
+    # [Azure CLI](#tab/cli)
+
+    Create a data asset definition in `YAML`:
+
+    __heart-dataset-unlabeled.yml__
+    ```yaml
+    $schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
+    name: heart-dataset-unlabeled
+    description: An unlabeled dataset for heart classification.
+    type: uri_folder
+    path: heart-classifier-mlflow/data
    ```
-
-    > [!NOTE]
-    > A data store's ID looks like `/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/datastores/<data-store>`.
-
+
+    Then, create the data asset:
+
+    ```bash
+    az ml data create -f heart-dataset-unlabeled.yml
+    ```
+
    # [Python](#tab/sdk)

    ```python
-    default_ds = ml_client.datastores.get_default()
+    data_path = "heart-classifier-mlflow/data"
+    dataset_name = "heart-dataset-unlabeled"
+
+    heart_dataset_unlabeled = Data(
+        path=data_path,
+        type=AssetTypes.URI_FOLDER,
+        description="An unlabeled dataset for heart classification",
+        name=dataset_name,
+    )
+    ```
+
+    Then, create the data asset:
+
+    ```python
+    ml_client.data.create_or_update(heart_dataset_unlabeled)
+    ```
+
+    To get the newly created data asset, use:
+
+    ```python
+    heart_dataset_unlabeled = ml_client.data.get(name=dataset_name, label="latest")
    ```

    # [REST](#tab/rest)

-    Use the Azure ML CLI, Azure ML SDK for Python, or Studio to get the data store information.
-
-    ---
-
-    > [!TIP]
-    > The default blob data store in a workspace is called __workspaceblobstore__. You can skip this step if you already know the resource ID of the default data store in your workspace.
+    Use the Azure ML CLI, Azure ML SDK for Python, or Studio to get the location (region), workspace, and data asset name and version. You will need them later.

-1. We'll need to upload some sample data to it. This example assumes you've uploaded the sample data included in the repo folder `sdk/python/endpoints/batch/heart-classifier/data` to the folder `heart-classifier/data` in the blob storage account. Ensure you have done that before moving forward.

1. Create a data input:

    # [Azure CLI](#tab/cli)

-    Let's place the file path in the following variable:
-
    ```azurecli
-    DATA_PATH="heart-disease-uci-unlabeled"
-    INPUT_PATH="$DATASTORE_ID/paths/$DATA_PATH"
+    DATASET_ID=$(az ml data show -n heart-dataset-unlabeled --label latest --query id)
    ```

    # [Python](#tab/sdk)

    ```python
-    data_path = "heart-classifier/data"
-    input = Input(type=AssetTypes.URI_FOLDER, path=f"{default_ds.id}/paths/{data_path}")
+    input = Input(type=AssetTypes.URI_FOLDER, path=heart_dataset_unlabeled.id)
    ```

    # [REST](#tab/rest)
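
The `+` lines of the Python tab above rely on `Data`, `AssetTypes`, and an `ml_client` that the article sets up elsewhere. Here is a self-contained sketch of the full data-asset flow, assuming the azure-ai-ml SDK v2 and `DefaultAzureCredential`; the subscription, resource group, and workspace names are placeholders, not part of this commit.

```python
# Hedged, self-contained sketch of the data-asset flow shown in this hunk.
# Workspace identifiers are placeholders; the SDK calls are azure-ai-ml v2.
from azure.ai.ml import MLClient, Input
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Register the folder of CSV files as a uri_folder data asset.
heart_dataset_unlabeled = Data(
    path="heart-classifier-mlflow/data",
    type=AssetTypes.URI_FOLDER,
    description="An unlabeled dataset for heart classification",
    name="heart-dataset-unlabeled",
)
ml_client.data.create_or_update(heart_dataset_unlabeled)

# Retrieve the latest version and wire it up as a job input.
heart_dataset_unlabeled = ml_client.data.get(
    name="heart-dataset-unlabeled", label="latest"
)
input = Input(type=AssetTypes.URI_FOLDER, path=heart_dataset_unlabeled.id)
```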
@@ -105,28 +128,29 @@ Data from Azure Machine Learning registered data stores can be directly referenc
        "InputData": {
            "mnistinput": {
                "JobInputType" : "UriFolder",
-                "Uri": "azureml:/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/datastores/<data-store>/paths/<data-path>"
+                "Uri": "azureml://locations/<location>/workspaces/<workspace>/data/<dataset_name>/versions/labels/latest"
            }
        }
    }
}
```
---

> [!NOTE]
-> Notice how the segment `paths` is appended to the resource ID of the data store to indicate that what follows is a path inside of it.
+> A data asset's ID looks like `/subscriptions/<subscription>/resourcegroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/data/<data-asset>/versions/<version>`.

-> [!TIP]
-> You can also use `azureml://datastores/<data-store>/paths/<data-path>` as a way to indicate the input.

1. Run the deployment:

    # [Azure CLI](#tab/cli)

    ```bash
-    INVOKE_RESPONSE=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $INPUT_PATH)
+    INVOKE_RESPONSE=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $DATASET_ID)
    ```

+    > [!TIP]
+    > You can also use `--input azureml:/<dataasset_name>@latest` as a way to indicate the input.
+
    # [Python](#tab/sdk)

    ```python
@@ -136,9 +160,9 @@ Data from Azure Machine Learning registered data stores can be directly referenc
    )
    ```

-   # [REST](#tab/rest)
+    # [REST](#tab/rest)

-   __Request__
+    __Request__

    ```http
    POST jobs HTTP/1.1
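
The Python tab that this hunk truncates (only its closing `)` and fence survive the context window) is the endpoint invocation itself. A hedged sketch of that call follows; the endpoint name is a placeholder, and depending on the azure-ai-ml version the parameter may be `inputs` (a dict) rather than `input`.

```python
# Hedged sketch of the SDK invocation the hunk abbreviates; reuses the
# ml_client and input objects from the earlier sketch. <endpoint-name>
# is a placeholder.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="<endpoint-name>",
    input=input,
)

# The returned job can then be monitored like any other Azure ML job.
ml_client.jobs.get(job.name)
```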
@@ -147,77 +171,54 @@ Data from Azure Machine Learning registered data stores can be directly referenc
    Content-Type: application/json
    ```

-## Reading data from a data asset
+## Input data from data stores

-Azure Machine Learning data assets (formerly known as datasets) are supported as inputs for jobs. Follow these steps to run a batch endpoint job using data stored in a registered data asset in Azure Machine Learning:
-
-> [!WARNING]
-> Data assets of type Table (`MLTable`) aren't currently supported.
+Data from Azure Machine Learning registered data stores can be directly referenced by batch deployment jobs. In this example, we're going to first upload some data to the default data store in the Azure Machine Learning workspace and then run a batch deployment on it. Follow these steps to run a batch endpoint job using data stored in a data store:

-1. Let's create the data asset first. This data asset consists of a folder with multiple CSV files that we want to process in parallel using batch endpoints. You can skip this step if your data is already registered as a data asset.
+1. Let's get access to the default data store in the Azure Machine Learning workspace. If your data is in a different store, you can use that store instead. There's no requirement to use the default data store.

    # [Azure CLI](#tab/cli)
-
-    Create a data asset definition in `YAML`:
-
-    __heart-dataset-unlabeled.yml__
-    ```yaml
-    $schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
-    name: heart-dataset-unlabeled
-    description: An unlabeled dataset for heart classification.
-    type: uri_folder
-    path: heart-classifier-mlflow/data
-    ```
-
-    Then, create the data asset:
-
-    ```bash
-    az ml data create -f heart-dataset-unlabeled.yml
-    ```
-
-    # [Python](#tab/sdk)
-
-    ```python
-    data_path = "heart-classifier-mlflow/data"
-    dataset_name = "heart-dataset-unlabeled"
-
-    heart_dataset_unlabeled = Data(
-        path=data_path,
-        type=AssetTypes.URI_FOLDER,
-        description="An unlabeled dataset for heart classification",
-        name=dataset_name,
-    )
-    ```
-
-    Then, create the data asset:
-
-    ```python
-    ml_client.data.create_or_update(heart_dataset_unlabeled)
+
+    ```azurecli
+    DATASTORE_ID=$(az ml datastore show -n workspaceblobstore | jq -r '.id')
    ```

-    To get the newly created data asset, use:
-
+    > [!NOTE]
+    > A data store's ID looks like `/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/datastores/<data-store>`.
+
+    # [Python](#tab/sdk)
+
    ```python
-    heart_dataset_unlabeled = ml_client.data.get(name=dataset_name, label="latest")
+    default_ds = ml_client.datastores.get_default()
    ```

    # [REST](#tab/rest)

-    Use the Azure ML CLI, Azure ML SDK for Python, or Studio to get the location (region), workspace, and data asset name and version. You will need them later.
+    Use the Azure ML CLI, Azure ML SDK for Python, or Studio to get the data store information.
+
+    ---
+
+    > [!TIP]
+    > The default blob data store in a workspace is called __workspaceblobstore__. You can skip this step if you already know the resource ID of the default data store in your workspace.

+1. We'll need to upload some sample data to it. This example assumes you've uploaded the sample data included in the repo folder `sdk/python/endpoints/batch/heart-classifier/data` to the folder `heart-classifier/data` in the blob storage account. Ensure you have done that before moving forward.

1. Create a data input:

    # [Azure CLI](#tab/cli)

+    Let's place the file path in the following variable:
+
    ```azurecli
-    DATASET_ID=$(az ml data show -n heart-dataset-unlabeled --label latest --query id)
+    DATA_PATH="heart-disease-uci-unlabeled"
+    INPUT_PATH="$DATASTORE_ID/paths/$DATA_PATH"
    ```

    # [Python](#tab/sdk)

    ```python
-    input = Input(type=AssetTypes.URI_FOLDER, path=heart_dataset_unlabeled.id)
+    data_path = "heart-classifier/data"
+    input = Input(type=AssetTypes.URI_FOLDER, path=f"{default_ds.id}/paths/{data_path}")
    ```

    # [REST](#tab/rest)
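
In the `+` lines of the Python tab above, the `/paths/` segment is what joins the data store's resource ID to a location inside the store. A short sketch follows, reusing the `ml_client` assumed in the earlier sketch; the folder must already be uploaded to the blob storage account.

```python
# Hedged sketch: build a job input from a path inside the default data store.
# Assumes the ml_client from the earlier sketch.
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

default_ds = ml_client.datastores.get_default()

# "/paths/" separates the data store's resource ID from the path inside it.
input = Input(
    type=AssetTypes.URI_FOLDER,
    path=f"{default_ds.id}/paths/heart-classifier/data",
)

# Equivalent short form, per the TIP in the next hunk:
# azureml://datastores/workspaceblobstore/paths/heart-classifier/data
```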
@@ -230,29 +231,28 @@ Azure Machine Learning data assets (formerly known as datasets) are supported as
        "InputData": {
            "mnistinput": {
                "JobInputType" : "UriFolder",
-                "Uri": "azureml://locations/<location>/workspaces/<workspace>/data/<dataset_name>/versions/labels/latest"
+                "Uri": "azureml:/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/datastores/<data-store>/paths/<data-path>"
            }
        }
    }
}
```
---

> [!NOTE]
-> A data asset's ID looks like `/subscriptions/<subscription>/resourcegroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/data/<data-asset>/versions/<version>`.
+> Notice how the segment `paths` is appended to the resource ID of the data store to indicate that what follows is a path inside of it.

+> [!TIP]
+> You can also use `azureml://datastores/<data-store>/paths/<data-path>` as a way to indicate the input.

1. Run the deployment:

    # [Azure CLI](#tab/cli)

    ```bash
-    INVOKE_RESPONSE=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $DATASET_ID)
+    INVOKE_RESPONSE=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $INPUT_PATH)
    ```

-    > [!TIP]
-    > You can also use `--input azureml:/<dataasset_name>@latest` as a way to indicate the input.
-
    # [Python](#tab/sdk)

    ```python
@@ -262,9 +262,9 @@ Azure Machine Learning data assets (formerly known as datasets) are supported as
    )
    ```

-   # [REST](#tab/rest)
+    # [REST](#tab/rest)

-   __Request__
+    __Request__

    ```http
    POST jobs HTTP/1.1
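
Both REST tabs in this file issue the same job-creation request, and the diff context keeps only its first lines. As a heavily hedged sketch of what the abbreviated `__Request__` could look like end to end: the scoring URI and token are placeholders the diff never spells out, and the body shape follows the `InputData` JSON shown in the hunks above.

```python
# Heavily hedged sketch of the POST that the __Request__ fragments abbreviate,
# issued with the requests library. Scoring URI, bearer token, and input URI
# are all placeholders; the body mirrors the InputData JSON shown above.
import requests

response = requests.post(
    "<endpoint-scoring-uri>/jobs",  # placeholder; not spelled out in the diff
    headers={
        "Authorization": "Bearer <token>",
        "Content-Type": "application/json",
    },
    json={
        "properties": {
            "InputData": {
                "mnistinput": {
                    "JobInputType": "UriFolder",
                    "Uri": "azureml://datastores/<data-store>/paths/<data-path>",
                }
            }
        }
    },
)
response.raise_for_status()
print(response.json())
```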
@@ -273,7 +273,7 @@ Azure Machine Learning data assets (formerly known as datasets) are supported as
    Content-Type: application/json
    ```

-## Reading data from Azure Storage Accounts
+## Input data from Azure Storage Accounts

Azure Machine Learning batch endpoints can read data from cloud locations in Azure Storage Accounts, both public and private. Use the following steps to run a batch endpoint job using data stored in a storage account:
