
Commit 1f4928d

Merge pull request #204675 from Man-MSFT/mafong/getting-started
Azure ML: Update Python Day 1 articles to use V2 SDK
2 parents 4a070d9 + 920c055 commit 1f4928d

File tree

3 files changed (+266, -257 lines)

articles/machine-learning/tutorial-1st-experiment-bring-data.md

Lines changed: 92 additions & 118 deletions
@@ -9,32 +9,33 @@ ms.topic: tutorial
author: aminsaied
ms.author: amsaied
ms.reviewer: sgilley
-ms.date: 12/21/2021
+ms.date: 07/10/2022
ms.custom: tracking-python, contperf-fy21q3, FY21Q4-aml-seo-hack, contperf-fy21q4, sdkv1, event-tier1-build-2022
---

# Tutorial: Upload data and train a model (part 3 of 3)

-[!INCLUDE [sdk v1](../../includes/machine-learning-sdk-v1.md)]
+[!INCLUDE [sdk v2](../../includes/machine-learning-sdk-v2.md)]

This tutorial shows you how to upload and use your own data to train machine learning models in Azure Machine Learning. This tutorial is *part 3 of a three-part tutorial series*.

In [Part 2: Train a model](tutorial-1st-experiment-sdk-train.md), you trained a model in the cloud, using sample data from `PyTorch`. You also downloaded that data through the `torchvision.datasets.CIFAR10` method in the PyTorch API. In this tutorial, you'll use the downloaded data to learn the workflow for working with your own data in Azure Machine Learning.

In this tutorial, you:

> [!div class="checklist"]
+>
> * Upload data to Azure.
> * Create a control script.
-> * Understand the new Azure Machine Learning concepts (passing parameters, datasets, datastores).
+> * Understand the new Azure Machine Learning concepts (passing parameters, data inputs).
> * Submit and run your training script.
> * View your code output in the cloud.

## Prerequisites

You'll need the data that was downloaded in the previous tutorial. Make sure you have completed these steps:

1. [Create the training script](tutorial-1st-experiment-sdk-train.md#create-training-scripts).
1. [Test locally](tutorial-1st-experiment-sdk-train.md#test-local).

## Adjust the training script
@@ -43,21 +44,21 @@ By now you have your training script (get-started/src/train.py) running in Azure
Our training script is currently set to download the CIFAR10 dataset on each run. The following Python code has been adjusted to read the data from a directory.

->[!NOTE]
+> [!NOTE]
> The use of `argparse` parameterizes the script.

1. Open *train.py* and replace it with this code:

```python
import os
import argparse
import torch
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from model import Net
-from azureml.core import Run
-run = Run.get_context()
+import mlflow

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
@@ -126,13 +127,13 @@ Our training script is currently set to download the CIFAR10 dataset on each run
            running_loss += loss.item()
            if i % 2000 == 1999:
                loss = running_loss / 2000
-               run.log('loss', loss)  # log loss metric to AML
+               mlflow.log_metric('loss', loss)
                print(f'epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}')
                running_loss = 0.0
    print('Finished Training')
```

1. **Save** the file. Close the tab if you wish.

### Understanding the code changes

@@ -158,165 +159,139 @@ optimizer = optim.SGD(
)
```

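Putting those pieces together, the following is a condensed, hypothetical sketch of the pattern the updated *train.py* follows: `argparse` supplies the data path and hyperparameters, and MLflow records the loss metric. Inside an Azure Machine Learning job, MLflow tracking typically points at the job automatically when the environment includes the `mlflow` and `azureml-mlflow` packages; the default values and loss numbers below are placeholders, not the tutorial's.

```python
# Hypothetical, condensed sketch of the updated train.py's pattern (not the
# tutorial's full script): argparse for parameters, MLflow for metric logging.
import argparse
import mlflow

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, help="mounted folder containing the CIFAR10 files")
    parser.add_argument("--learning_rate", type=float, default=0.001)  # placeholder default
    parser.add_argument("--momentum", type=float, default=0.9)         # placeholder default
    args = parser.parse_args()

    print("===== DATA =====")
    print("DATA PATH:", args.data_path)

    # Stand-in for the real training loop; inside an Azure ML job these
    # metrics show up under the job's Metrics tab.
    for loss in [2.2, 1.9, 1.7]:
        mlflow.log_metric("loss", loss)
    print("Finished Training")
```
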
## <a name="upload"></a> Upload the data to Azure

To run this script in Azure Machine Learning, you need to make your training data available in Azure. Your Azure Machine Learning workspace comes equipped with a _default_ datastore. This is an Azure Blob Storage account where you can store your training data.

->[!NOTE]
-> Azure Machine Learning allows you to connect other cloud-based datastores that store your data. For more details, see the [datastores documentation](./concept-data.md).
-
-1. Create a new Python control script in the **get-started** folder (make sure it is in **get-started**, *not* in the **/src** folder). Name the script *upload-data.py* and copy this code into the file:
-
-```python
-# upload-data.py
-from azureml.core import Workspace
-from azureml.core import Dataset
-from azureml.data.datapath import DataPath
-
-ws = Workspace.from_config()
-datastore = ws.get_default_datastore()
-Dataset.File.upload_directory(src_dir='data',
-                              target=DataPath(datastore, "datasets/cifar10")
-                              )
-```
-
-The `target_path` value specifies the path on the datastore where the CIFAR10 data will be uploaded.
-
->[!TIP]
-> While you're using Azure Machine Learning to upload the data, you can use [Azure Storage Explorer](https://azure.microsoft.com/features/storage-explorer/) to upload ad hoc files. If you need an ETL tool, you can use [Azure Data Factory](../data-factory/introduction.md) to ingest your data into Azure.
-
-2. Select **Save and run script in terminal** to run the *upload-data.py* script.
-
-You should see the following standard output:
-
-```txt
-Uploading ./data\cifar-10-batches-py\data_batch_2
-Uploaded ./data\cifar-10-batches-py\data_batch_2, 4 files out of an estimated total of 9
-.
-.
-Uploading ./data\cifar-10-batches-py\data_batch_5
-Uploaded ./data\cifar-10-batches-py\data_batch_5, 9 files out of an estimated total of 9
-Uploaded 9 files
-```
+> [!NOTE]
+> Azure Machine Learning allows you to connect other cloud-based storage services that hold your data. For more details, see the [data documentation](./concept-data.md).
+
+There is no additional step needed to upload the data; the control script will define and upload the CIFAR10 training data.

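If you'd rather keep the data in your workspace as a named, versioned asset (so later jobs can reference it instead of re-uploading the local folder), a sketch of that optional step with the v2 SDK follows. The asset name and workspace placeholders are assumptions; the control script below imports `Data` and `AssetTypes`, which are the classes this uses, but the tutorial itself doesn't require this step.

```python
# Optional sketch, not required for this tutorial: register ./data as a
# versioned data asset. The asset name and workspace details are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

cifar_data = Data(
    name="cifar10-local",    # hypothetical asset name
    type=AssetTypes.URI_FOLDER,
    path="./data",           # same folder the control script uses
    description="CIFAR10 data downloaded in part 2 of the tutorial",
)
ml_client.data.create_or_update(cifar_data)

# A job input could then reference it by name and version, for example:
# Input(type=AssetTypes.URI_FOLDER, path="azureml:cifar10-local:1")
```
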
## <a name="control-script"></a> Create a control script

As you've done previously, create a new Python control script called *run-pytorch-data.py* in the **get-started** folder:

```python
# run-pytorch-data.py
+from azure.ai.ml import MLClient, command, Input
+from azure.identity import DefaultAzureCredential
+from azure.ai.ml.entities import Environment
+from azure.ai.ml import command, Input
+from azure.ai.ml.entities import Data
+from azure.ai.ml.constants import AssetTypes
from azureml.core import Workspace
-from azureml.core import Experiment
-from azureml.core import Environment
-from azureml.core import ScriptRunConfig
-from azureml.core import Dataset

if __name__ == "__main__":
+    # get details of the current Azure ML workspace
    ws = Workspace.from_config()
-    datastore = ws.get_default_datastore()
-    dataset = Dataset.File.from_files(path=(datastore, 'datasets/cifar10'))
-
-    experiment = Experiment(workspace=ws, name='day1-experiment-data')
-
-    config = ScriptRunConfig(
-        source_directory='./src',
-        script='train.py',
-        compute_target='cpu-cluster',
-        arguments=[
-            '--data_path', dataset.as_named_input('input').as_mount(),
-            '--learning_rate', 0.003,
-            '--momentum', 0.92],
-    )

-    # set up pytorch environment
-    env = Environment.from_conda_specification(
-        name='pytorch-env',
-        file_path='pytorch-env.yml'
+    # default authentication flow for Azure applications
+    default_azure_credential = DefaultAzureCredential()
+    subscription_id = ws.subscription_id
+    resource_group = ws.resource_group
+    workspace = ws.name
+
+    # client class to interact with Azure ML services and resources, e.g. workspaces, jobs, models and so on.
+    ml_client = MLClient(
+        default_azure_credential,
+        subscription_id,
+        resource_group,
+        workspace)
+
+    # the key here should match the key passed to the command
+    my_job_inputs = {
+        "data_path": Input(type=AssetTypes.URI_FOLDER, path="./data")
+    }
+
+    env_name = "pytorch-env"
+    env_docker_image = Environment(
+        image="pytorch/pytorch:latest",
+        name=env_name,
+        conda_file="pytorch-env.yml",
+    )
+    ml_client.environments.create_or_update(env_docker_image)
+
+    # target name of compute where job will be executed
+    computeName="cpu-cluster"
+    job = command(
+        code="./src",
+        # the parameter will match the training script argument name
+        # inputs.data_path key should match the dictionary key
+        command="python train.py --data_path ${{inputs.data_path}}",
+        inputs=my_job_inputs,
+        environment=f"{env_name}@latest",
+        compute=computeName,
+        display_name="day1-experiment-data",
    )
-    config.run_config.environment = env

-    run = experiment.submit(config)
-    aml_url = run.get_portal_url()
-    print("Submitted to compute cluster. Click link below")
-    print("")
-    print(aml_url)
+    returned_job = ml_client.create_or_update(job)
+    aml_url = returned_job.studio_url
+    print("Monitor your job at", aml_url)
```

249236
### Understand the code changes
250237

251-
The control script is similar to the one from [part 3 of this series](tutorial-1st-experiment-sdk-train.md), with the following new lines:
238+
The control script is similar to the one from [part 2 of this series](tutorial-1st-experiment-sdk-train.md), with the following new lines:
252239

253240
:::row:::
254241
:::column span="":::
255-
`dataset = Dataset.File.from_files( ... )`
242+
`my_job_inputs = { "data_path": Input(type=AssetTypes.URI_FOLDER, path="./data")}`
256243
:::column-end:::
257244
:::column span="2":::
258-
A [dataset](/python/api/azureml-core/azureml.core.dataset.dataset) is used to reference the data you uploaded to Azure Blob Storage. Datasets are an abstraction layer on top of your data that are designed to improve reliability and trustworthiness.
245+
An [Input](/python/api/azure-ai-ml/azure.ai.ml.input) is used to reference inputs to your job. These can encompass data, either uploaded as part of the job or references to previously registered data assets. URI\*FOLDER tells that the reference points to a folder of data. The data will be mounted by default to the compute for the job.
259246
:::column-end:::
260247
:::row-end:::
261248
:::row:::
262249
:::column span="":::
263-
`config = ScriptRunConfig(...)`
250+
`command="python train.py --data_path ${{inputs.data_path}}"`
264251
:::column-end:::
265252
:::column span="2":::
266-
[ScriptRunConfig](/python/api/azureml-core/azureml.core.scriptrunconfig) is modified to include a list of arguments that will be passed into `train.py`. The `dataset.as_named_input('input').as_mount()` argument means the specified directory will be _mounted_ to the compute target.
253+
`--data_path` matches the argument defined in the updated training script. `${{inputs.data_path}}` passes the input defined by the input dictionary, and the keys must match.
267254
:::column-end:::
268255
:::row-end:::
269256

270-
## <a name="submit-to-cloud"></a> Submit the run to Azure Machine Learning
257+
## <a name="submit-to-cloud"></a> Submit the job to Azure Machine Learning
271258

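To make the key-matching rule concrete, here is a small hypothetical variant (not the tutorial's control script) that adds a second, literal input: every `${{inputs.<key>}}` placeholder in the command string must correspond to a key in the `inputs` dictionary.

```python
# Hypothetical variant for illustration only: a data input plus a literal
# hyperparameter input. Each ${{inputs.<key>}} placeholder must match a key below.
from azure.ai.ml import Input, command
from azure.ai.ml.constants import AssetTypes

job = command(
    code="./src",
    command=(
        "python train.py "
        "--data_path ${{inputs.data_path}} "
        "--learning_rate ${{inputs.learning_rate}}"
    ),
    inputs={
        "data_path": Input(type=AssetTypes.URI_FOLDER, path="./data"),
        "learning_rate": 0.003,  # literal inputs are substituted as plain values
    },
    environment="pytorch-env@latest",
    compute="cpu-cluster",
    display_name="day1-experiment-data",
)
# Submitting is unchanged: ml_client.create_or_update(job)
```
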
272-
Select **Save and run script in terminal** to run the *run-pytorch-data.py* script. This run will train the model on the compute cluster using the data you uploaded.
259+
Select **Save and run script in terminal** to run the *run-pytorch-data.py* script. This job will train the model on the compute cluster using the data you uploaded.
273260

274261
This code will print a URL to the experiment in the Azure Machine Learning studio. If you go to that link, you'll be able to see your code running.
275262

276263
[!INCLUDE [amlinclude-info](../../includes/machine-learning-py38-ignore.md)]
277264

278-
279265
### <a name="inspect-log"></a> Inspect the log file
280266

281267
In the studio, go to the experiment job (by selecting the previous URL output) followed by **Outputs + logs**. Select the `std_log.txt` file. Scroll down through the log file until you see the following output:
282268

283269
```txt
284-
Processing 'input'.
285-
Processing dataset FileDataset
286-
{
287-
"source": [
288-
"('workspaceblobstore', 'datasets/cifar10')"
289-
],
290-
"definition": [
291-
"GetDatastoreFiles"
292-
],
293-
"registration": {
294-
"id": "XXXXX",
295-
"name": null,
296-
"version": null,
297-
"workspace": "Workspace.create(name='XXXX', subscription_id='XXXX', resource_group='X')"
298-
}
299-
}
300-
Mounting input to /tmp/tmp9kituvp3.
301-
Mounted input to /tmp/tmp9kituvp3 as folder.
302-
Exit __enter__ of DatasetContextManager
303-
Entering Job History Context Manager.
304-
Current directory: /mnt/batch/tasks/shared/LS_root/jobs/dsvm-aml/azureml/tutorial-session-3_1600171983_763c5381/mounts/workspaceblobstore/azureml/tutorial-session-3_1600171983_763c5381
305-
Preparing to call script [ train.py ] with arguments: ['--data_path', '$input', '--learning_rate', '0.003', '--momentum', '0.92']
306-
After variable expansion, calling script [ train.py ] with arguments: ['--data_path', '/tmp/tmp9kituvp3', '--learning_rate', '0.003', '--momentum', '0.92']
307-
308-
Script type = None
309270
===== DATA =====
310-
DATA PATH: /tmp/tmp9kituvp3
271+
DATA PATH: /mnt/azureml/cr/j/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/cap/data-capability/wd/INPUT_data_path
311272
LIST FILES IN DATA PATH...
312-
['cifar-10-batches-py', 'cifar-10-python.tar.gz']
273+
['.amlignore', 'cifar-10-batches-py', 'cifar-10-python.tar.gz']
274+
================
275+
epoch=1, batch= 2000: loss 2.20
276+
epoch=1, batch= 4000: loss 1.90
277+
epoch=1, batch= 6000: loss 1.70
278+
epoch=1, batch= 8000: loss 1.58
279+
epoch=1, batch=10000: loss 1.54
280+
epoch=1, batch=12000: loss 1.48
281+
epoch=2, batch= 2000: loss 1.41
282+
epoch=2, batch= 4000: loss 1.38
283+
epoch=2, batch= 6000: loss 1.33
284+
epoch=2, batch= 8000: loss 1.30
285+
epoch=2, batch=10000: loss 1.29
286+
epoch=2, batch=12000: loss 1.25
287+
Finished Training
288+
313289
```
314290

315291
Notice:
316292

317-
- Azure Machine Learning has mounted Blob Storage to the compute cluster automatically for you.
318-
- The ``dataset.as_named_input('input').as_mount()`` used in the control script resolves to the mount point.
319-
293+
- Azure Machine Learning has mounted Blob Storage to the compute cluster automatically for you, passing the mount point into `--data_path`. Compared to the previous job, there is no on the fly data download.
294+
- The `inputs=my_job_inputs` used in the control script resolves to the mount point.
320295

321296
## Clean up resources
322297

@@ -331,7 +306,6 @@ If you're not going to use it now, stop the compute instance:
1. Select the compute instance in the list.
1. On the top toolbar, select **Stop**.

### Delete all resources

[!INCLUDE [aml-delete-resource-group](../../includes/aml-delete-resource-group.md)]
