---
title: 'Migrate data management from SDK v1 to v2'
titleSuffix: Azure Machine Learning
description: Migrate data management from v1 to v2 of Azure Machine Learning SDK
services: machine-learning
ms.service: machine-learning
ms.subservice: mldata
ms.topic: reference
author: SturgeonMi
ms.author: xunwan
ms.date: 09/16/2022
ms.reviewer: sgilley
ms.custom: migration
---

# Migrate data management from SDK v1 to v2

In V1, an Azure Machine Learning dataset can be either a `FileDataset` or a `TabularDataset`.
In V2, an Azure Machine Learning data asset can be a `uri_folder`, `uri_file`, or `mltable`.
Conceptually, `FileDataset` maps to `uri_folder` and `uri_file`, and `TabularDataset` maps to `mltable`.

* URIs (`uri_folder`, `uri_file`) - a Uniform Resource Identifier that references a storage location on your local computer or in the cloud, making it easy to access data in your jobs.
* MLTable - a method to abstract the schema definition for tabular data, so that consumers of the data can more easily materialize the table into a Pandas, Dask, or Spark dataframe.
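To make the URI forms concrete, here's a small illustrative helper (not part of the SDK) that recognizes the path styles used throughout this article:

```python
# Illustrative only: classify the path styles AzureML v2 accepts for data assets.
# The categories mirror this article's examples; the helper is not an SDK API.
from urllib.parse import urlparse

def classify_data_path(path: str) -> str:
    """Return which kind of storage location a v2 data path points to."""
    scheme = urlparse(path).scheme
    if scheme == "https":
        return "blob"        # https://<account_name>.blob.core.windows.net/<container_name>/<path>
    if scheme == "abfss":
        return "adls_gen2"   # abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>
    if scheme == "azureml":
        return "datastore"   # azureml://datastores/<data_store_name>/paths/<path>
    return "local"           # e.g. './data/animals'

print(classify_data_path("abfss://fs@acct.dfs.core.windows.net/animals/"))  # adls_gen2
```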

This article compares common data scenarios in SDK v1 and SDK v2.

## Create a `FileDataset`/URI type data asset

* SDK v1 - Create a `FileDataset`

    ```python
    from azureml.core import Workspace, Datastore, Dataset

    # get an existing workspace and retrieve a datastore by name
    workspace = Workspace.from_config()
    datastore = Datastore.get(workspace, 'your datastore name')

    # create a FileDataset pointing to files in the 'animals' folder and its subfolders recursively
    datastore_paths = [(datastore, 'animals')]
    animal_ds = Dataset.File.from_files(path=datastore_paths)

    # create a FileDataset from image and label files behind public web URLs
    web_paths = ['https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
                 'https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz']
    mnist_ds = Dataset.File.from_files(path=web_paths)
    ```

* SDK v2
    * Create a `URI_FOLDER` type data asset

        ```python
        from azure.ai.ml.entities import Data
        from azure.ai.ml.constants import AssetTypes

        # Supported paths include:
        # local: './<path>'
        # blob: 'https://<account_name>.blob.core.windows.net/<container_name>/<path>'
        # ADLS gen2: 'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/'
        # Datastore: 'azureml://datastores/<data_store_name>/paths/<path>'

        my_path = '<path>'

        my_data = Data(
            path=my_path,
            type=AssetTypes.URI_FOLDER,
            description="<description>",
            name="<name>",
            version='<version>'
        )

        ml_client.data.create_or_update(my_data)
        ```

    * Create a `URI_FILE` type data asset

        ```python
        from azure.ai.ml.entities import Data
        from azure.ai.ml.constants import AssetTypes

        # Supported paths include:
        # local: './<path>/<file>'
        # blob: 'https://<account_name>.blob.core.windows.net/<container_name>/<path>/<file>'
        # ADLS gen2: 'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file>'
        # Datastore: 'azureml://datastores/<data_store_name>/paths/<path>/<file>'

        my_path = '<path>/<file>'

        my_data = Data(
            path=my_path,
            type=AssetTypes.URI_FILE,
            description="<description>",
            name="<name>",
            version="<version>"
        )

        ml_client.data.create_or_update(my_data)
        ```

## Create a `TabularDataset`/`mltable` type data asset

* SDK v1

    ```python
    from azureml.core import Workspace, Datastore, Dataset

    datastore_name = 'your datastore name'

    # get an existing workspace
    workspace = Workspace.from_config()

    # retrieve an existing datastore in the workspace by name
    datastore = Datastore.get(workspace, datastore_name)

    # create a TabularDataset from 3 file paths in the datastore
    datastore_paths = [(datastore, 'weather/2018/11.csv'),
                       (datastore, 'weather/2018/12.csv'),
                       (datastore, 'weather/2019/*.csv')]

    weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
    ```

* SDK v2 - Create an `mltable` data asset via YAML definition

    ```yaml
    type: mltable

    paths:
      - pattern: ./*.txt
    transformations:
      - read_delimited:
          delimiter: ','
          encoding: ascii
          header: all_files_same_headers
    ```

    ```python
    from azure.ai.ml.entities import Data
    from azure.ai.ml.constants import AssetTypes

    # my_path must point to the folder containing the MLTable artifact (MLTable file + data)
    # Supported paths include:
    # local: './<path>'
    # blob: 'https://<account_name>.blob.core.windows.net/<container_name>/<path>'
    # ADLS gen2: 'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/'
    # Datastore: 'azureml://datastores/<data_store_name>/paths/<path>'

    my_path = '<path>'

    my_data = Data(
        path=my_path,
        type=AssetTypes.MLTABLE,
        description="<description>",
        name="<name>",
        version='<version>'
    )

    ml_client.data.create_or_update(my_data)
    ```

## Use data in an experiment/job

* SDK v1

    ```python
    from azureml.core import ScriptRunConfig

    src = ScriptRunConfig(source_directory=script_folder,
                          script='train_titanic.py',
                          # pass dataset as an input with friendly name 'titanic'
                          arguments=['--input-data', titanic_ds.as_named_input('titanic')],
                          compute_target=compute_target,
                          environment=myenv)

    # Submit the run configuration for your training run
    run = experiment.submit(src)
    run.wait_for_completion(show_output=True)
    ```

* SDK v2

    ```python
    from azure.ai.ml import command
    from azure.ai.ml import Input, Output
    from azure.ai.ml.constants import AssetTypes

    # Possible Asset Types for Data:
    # AssetTypes.URI_FILE
    # AssetTypes.URI_FOLDER
    # AssetTypes.MLTABLE

    # Possible Paths for Data:
    # Blob: https://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>
    # Datastore: azureml://datastores/<data_store_name>/paths/<folder>/<file>
    # Data Asset: azureml:<my_data>:<version>

    my_job_inputs = {
        "raw_data": Input(type=AssetTypes.URI_FOLDER, path="<path>")
    }

    my_job_outputs = {
        "prep_data": Output(type=AssetTypes.URI_FOLDER, path="<path>")
    }

    job = command(
        code="./src",  # local path where the code is stored
        command="python process_data.py --raw_data ${{inputs.raw_data}} --prep_data ${{outputs.prep_data}}",
        inputs=my_job_inputs,
        outputs=my_job_outputs,
        environment="<environment_name>:<version>",
        compute="cpu-cluster",
    )

    # submit the command
    returned_job = ml_client.create_or_update(job)
    # get a URL for the status of the job
    returned_job.services["Studio"].endpoint
    ```

## Mapping of key functionality in SDK v1 and SDK v2

|Functionality in SDK v1|Rough mapping in SDK v2|
|-|-|
|[Method/API in SDK v1](/python/api/azureml-core/azureml.data)|[Method/API in SDK v2](/python/api/azure-ai-ml/azure.ai.ml.entities)|

## Next steps

For more information, see:
* [Data in Azure Machine Learning](concept-data.md?tabs=uri-file-example%2Ccli-data-create-example)
* [Create data assets](how-to-create-data-assets.md?tabs=CLI)
* [Read and write data in a job](how-to-read-write-data-v2.md)
* [V2 datastore operations](/python/api/azure-ai-ml/azure.ai.ml.operations.datastoreoperations)