> [!div class="op_single_selector" title1="Select the version of Azure Machine Learning SDK you are using:"]
> *[v2](how-to-import-data-assets.md)
In this article, learn how to import data into the Azure Machine Learning platform from external sources. A successful import automatically creates and registers an Azure Machine Learning data asset with the name provided during the import. An Azure Machine Learning data asset resembles a web browser bookmark (favorites). You don't need to remember long storage paths (URIs) that point to your most frequently used data. Instead, you can create a data asset, and then access that asset with a friendly name.
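For example, once an asset is registered, a job or notebook can look it up by that friendly name. The following is a minimal Python (SDK v2) sketch, assuming a configured workspace `config.json` and a hypothetical data asset named `credit_data`; the names are placeholders, not samples from this article:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Connect to the workspace described by the local config.json file.
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Retrieve a registered data asset by its friendly name and version,
# instead of remembering the full storage URI it points to.
data_asset = ml_client.data.get(name="credit_data", version="1")  # hypothetical asset name
print(data_asset.path)  # resolves to the underlying storage URI
```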
A data import creates a cache of the source data, along with metadata, for faster and more reliable data access in Azure Machine Learning training jobs. The data cache avoids network and connection constraints. The cached data is versioned to support reproducibility, which also provides versioning for data imported from SQL Server sources. Additionally, the cached data provides data lineage for auditability. A data import uses Azure Data Factory (ADF) pipelines behind the scenes, which means that users can avoid complex interactions with ADF. Azure Machine Learning also handles management of the ADF compute resource pool size, compute resource provisioning, and tear-down, to optimize data transfer by determining the proper parallelization.
The transferred data is partitioned and securely stored as parquet files in Azure storage, which enables faster processing during training. ADF compute costs only involve the time used for data transfers. Storage costs only involve the time needed to cache the data, because the cached data is a copy of the data imported from an external source, and Azure storage hosts that copy.
The caching feature involves upfront compute and storage costs. However, it pays for itself, and can save money, because it reduces recurring training compute costs compared to direct connections to external source data during training. It caches data as parquet files, which makes job training faster and more reliable against connection timeouts for larger data sets. This leads to fewer reruns and fewer training failures.
Customers who want "auto-deletion" of unused imported data assets can choose to import data into the "workspacemanageddatastore", also known as "workspacemanagedstore". Microsoft manages this datastore on behalf of the customer, and provides the convenience of automatic data management based on conditions such as last used time or created time. By default, every data asset imported into the workspace-managed datastore has an auto-delete setting of "not used for 30 days": if a data asset isn't used for 30 days, it's automatically deleted. Within that time, you can edit the "auto-delete" settings of the imported data asset: you can increase or decrease the duration (number of days), or you can change the condition. As of now, created time and unused time are the two supported conditions. If you choose to work with a managed datastore, you only need to point the `path` of your data import to `azureml://datastores/workspacemanagedstore`, and Azure Machine Learning creates one for you. The managed datastore costs the same as a regular ADLS Gen2 datastore, which charges by the amount of data stored in it. However, the managed datastore offers the benefit of automatic data management.
> [!NOTE]
> - Only one `workspacemanagedstore` is created per workspace.
> - The managed datastore is created automatically when the first import job that refers to it is submitted.
> - Users cannot create a `workspacemanagedstore` using any datastore APIs or methods.
> - In the import definition, users must refer to the managed datastore in this way: `path: azureml://datastores/workspacemanagedstore`. The system automatically assigns a unique path for storage of the imported data. Unlike customer-owned datastores or the workspace default blobstore, there's no need to provide the entire path where you want to import the data.
> - Currently, the path on the `workspacemanagedstore` can be accessed only by the data import service, and `workspacemanagedstore` can't be given as a destination in any other process or step.
> - The data path in the `workspacemanagedstore` can be accessed only by the Azure Machine Learning service.
> - To access data from the `workspacemanagedstore`, reference the data asset name and version in your jobs, scripts, or other processes submitted to Azure Machine Learning, just as you would for any other data asset. Azure Machine Learning knows how to read data from the managed datastore.
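As a concrete illustration, here's a minimal Python (SDK v2) sketch of a data import that targets the managed datastore. It assumes the `azure-ai-ml` package version 1.5.0 or later; the `DataImport` and `Database` class names and the `import_data` method are taken from that SDK as assumptions, and the connection name, query, and asset name are hypothetical placeholders:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import DataImport
from azure.ai.ml.data_transfer import Database
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Import the result of a query from an external database into the
# workspace-managed datastore. Only the datastore reference is needed as
# the path; the service assigns a unique storage location itself.
data_import = DataImport(
    name="my_snowflake_asset",  # hypothetical data asset name
    source=Database(
        connection="azureml:my_snowflake_connection",  # hypothetical workspace connection
        query="select * from my_table",                # hypothetical query
    ),
    path="azureml://datastores/workspacemanagedstore",
)

ml_client.data.import_data(data_import=data_import)
```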
You can now import data from Snowflake, Amazon S3 and Azure SQL.
> For a successful data import, verify that you have installed the latest `azure-ai-ml` package (version 1.5.0 or later) for the SDK, and the `ml` extension (version 2.15.1 or later) for the CLI.
>
> If you have an older SDK package or CLI extension, please remove the old one and install the new one with the code shown in the tab section. Follow the instructions for SDK and CLI below:
### Code versions
Create a `YAML` file `<file-name>.yml`:

```yaml
$schema: http://azureml/sdk-2-0/DataImport.json
# Supported connections include:
# Connection: azureml:<workspace_connection_name>
# Supported paths include either a regular datastore or the managed datastore, as shown below:
```
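The YAML definition above is truncated in this excerpt. For comparison, here's a minimal Python (SDK v2) sketch of an equivalent import definition for an Amazon S3 source into a regular datastore. The `DataImport` and `FileSystem` class names and the `import_data` method are assumed from the `azure-ai-ml` package (version 1.5.0 or later), and the connection name, S3 path, and destination path are hypothetical placeholders:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import DataImport
from azure.ai.ml.data_transfer import FileSystem
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Import a folder from an Amazon S3 bucket into a regular (customer-owned)
# datastore; ${{name}} is resolved by the service to the asset name.
data_import = DataImport(
    name="my_s3_asset",  # hypothetical data asset name
    source=FileSystem(
        connection="azureml:my_s3_connection",  # hypothetical workspace connection
        path="my-bucket/my-folder",             # hypothetical S3 path
    ),
    path="azureml://datastores/workspaceblobstore/paths/s3-imports/${{name}}",
)

ml_client.data.import_data(data_import=data_import)
```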
## Check the import status of external data sources
The data import action is an asynchronous action, and it can take a long time. After submission of an import data action via the CLI or SDK, the Azure Machine Learning service might need several minutes to connect to the external data source. Then, the service starts the data import and handles data caching and registration. The time needed for a data import also depends on the size of the source data set.
The next example returns the status of the submitted data import activity. The command or method uses the "data asset" name as the input to determine the status of the data materialization.
# [Azure CLI](#tab/cli)
```cli
> az ml data list-materialization-status --name <name>
```