
Commit 23a8f34

Merge pull request #244953 from fbsolo-ms1/release-branch-for-freshness-updates
Update concept-data freshness.
2 parents e308c55 + 977627e commit 23a8f34


articles/machine-learning/concept-data.md

Lines changed: 47 additions & 42 deletions
@@ -9,29 +9,62 @@ ms.topic: conceptual
ms.reviewer: franksolomon
author: samuel100
ms.author: samkemp
-ms.date: 01/23/2023
+ms.date: 07/13/2023
ms.custom: data4ml, event-tier1-build-2022
#Customer intent: As an experienced Python developer, I need secure access to my data in my Azure storage solutions, and I need to use that data to accomplish my machine learning tasks.
---

# Data concepts in Azure Machine Learning

+With Azure Machine Learning, you can import data from a local machine or an existing cloud-based storage resource. This article describes key Azure Machine Learning data concepts.

-With Azure Machine Learning, you can bring data from a local machine or an existing cloud-based storage. In this article, you'll learn the main Azure Machine Learning data concepts.
+## Datastore
+
+An Azure Machine Learning datastore serves as a *reference* to an *existing* Azure storage account. An Azure Machine Learning datastore offers these benefits:
+
+- A common, easy-to-use API that interacts with different storage types (Blob/Files/ADLS).
+- Easier discovery of useful datastores in team operations.
+- For credential-based access (service principal/SAS/key), an Azure Machine Learning datastore secures connection information. This way, you won't need to place that information in your scripts.
+
+When you create a datastore with an existing Azure storage account, you can choose between two different authentication methods:
+
+- **Credential-based** - authenticate data access with a service principal, shared access signature (SAS) token, or account key. Users with *Reader* workspace access can access the credentials.
+- **Identity-based** - use your Azure Active Directory identity or managed identity to authenticate data access.
+
+The following table summarizes the Azure cloud-based storage services that an Azure Machine Learning datastore can reference, and the authentication types that can access those services:
+
+| Supported storage service | Credential-based authentication | Identity-based authentication |
+|---|:----:|:---:|
+| Azure Blob Container | ✓ | ✓ |
+| Azure File Share | ✓ | |
+| Azure Data Lake Gen1 | ✓ | ✓ |
+| Azure Data Lake Gen2 | ✓ | ✓ |
+
+See [Create datastores](how-to-datastore.md) for more information about datastores.
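To make the datastore concept concrete, here's a minimal sketch that registers a credential-based blob datastore with the Python SDK v2 (`azure-ai-ml`). The subscription, workspace, account, container, and key values are all placeholders, and the datastore name is hypothetical.

```python
# A minimal sketch: register a credential-based Azure Blob datastore with the
# Python SDK v2 (azure-ai-ml). All names and secrets below are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AccountKeyConfiguration, AzureBlobDatastore
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

blob_datastore = AzureBlobDatastore(
    name="my_blob_datastore",  # hypothetical datastore name
    description="Reference to an existing blob container",
    account_name="<storage-account>",
    container_name="<container>",
    # Credential-based access: the workspace stores the secret, so scripts
    # can reference the datastore by name without embedding the key.
    credentials=AccountKeyConfiguration(account_key="<account-key>"),
)
ml_client.create_or_update(blob_datastore)
```

Omitting `credentials` would instead register the datastore for identity-based access, authenticating with your Azure Active Directory or managed identity.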
+
+## Data types
+
+A URI (storage location) can reference a file, a folder, or a data table. A machine learning job input and output definition requires one of the following three data types:
+
+| Type | V2 API | V1 API | Canonical scenarios | V2/V1 API difference |
+|---------|---------|---------|---------|---------|
+| **File**<br>Reference a single file | `uri_file` | `FileDataset` | Read/write a single file - the file can have any format. | A type new to V2 APIs. In V1 APIs, files always mapped to a folder on the compute target filesystem; this mapping required an `os.path.join`. In V2 APIs, the single file is mapped. This way, you can refer to that location in your code. |
+| **Folder**<br>Reference a single folder | `uri_folder` | `FileDataset` | You must read/write a folder of parquet/CSV files into Pandas/Spark.<br><br>Deep learning with image, text, audio, or video files located in a folder. | In V1 APIs, `FileDataset` had an associated engine that could take a file sample from a folder. In V2 APIs, a folder is a simple mapping to the compute target filesystem. |
+| **Table**<br>Reference a data table | `mltable` | `TabularDataset` | You have a complex schema subject to frequent changes, or you need a subset of large tabular data.<br><br>AutoML with tables. | In V1 APIs, the Azure Machine Learning back-end stored the data materialization blueprint. As a result, `TabularDataset` only worked if you had an Azure Machine Learning workspace. `mltable` stores the data materialization blueprint in *your* storage, so you can use it *disconnected from AzureML* - for example, locally and on-premises. In V2 APIs, you'll find it easier to transition from local to remote jobs. See [Working with tables in Azure Machine Learning](how-to-mltable.md) for more information. |
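As a sketch of how the three types appear in code (assuming the Python SDK v2 and a datastore named like the earlier placeholder, with illustrative paths), a job input declares exactly one of them:

```python
# A sketch of the three V2 data types as job inputs; all paths are placeholders.
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

file_input = Input(
    type=AssetTypes.URI_FILE,  # a single file
    path="azureml://datastores/my_blob_datastore/paths/data/titanic.csv",
)
folder_input = Input(
    type=AssetTypes.URI_FOLDER,  # a folder of files
    path="azureml://datastores/my_blob_datastore/paths/images/",
)
table_input = Input(
    type=AssetTypes.MLTABLE,  # a folder containing an MLTable definition file
    path="azureml://datastores/my_blob_datastore/paths/table_data/",
)
```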

## URI
A Uniform Resource Identifier (URI) represents a storage location on your local computer, Azure storage, or a publicly available http(s) location. These examples show URIs for different storage options:

| Storage location | URI examples |
|---------|---------|
+| Azure Machine Learning [Datastore](#datastore) | `azureml://datastores/<data_store_name>/paths/<folder1>/<folder2>/<folder3>/<file>.parquet` |
| Local computer | `./home/username/data/my_data` |
| Public http(s) server | `https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv` |
| Blob storage | `wasbs://<containername>@<accountname>.blob.core.windows.net/<folder>/` |
| Azure Data Lake (gen2) | `abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>.csv` |
-| Azure Data Lake (gen1) | `adl://<accountname>.azuredatalakestore.net/<folder1>/<folder2>`
-| Azure Machine Learning [Datastore](#datastore) | `azureml://datastores/<data_store_name>/paths/<folder1>/<folder2>/<folder3>/<file>.parquet` |
+| Azure Data Lake (gen1) | `adl://<accountname>.azuredatalakestore.net/<folder1>/<folder2>` |

-An Azure Machine Learning job maps URIs to the compute target filesystem. This mapping means that in a command that consumes or produces a URI, that URI works like a file or a folder. A URI uses **identity-based authentication** to connect to storage services, with either your Azure Active Directory ID (default), or Managed Identity. Azure Machine Learning [Datastore](#datastore) URIs can apply either identity-based authentication, or **credential-based** (for example, Service Principal, SAS token, account key) without exposure of secrets.
+An Azure Machine Learning job maps URIs to the compute target filesystem. This mapping means that in a command that consumes or produces a URI, that URI works like a file or a folder. A URI uses **identity-based authentication** to connect to storage services, with either your Azure Active Directory ID (default), or Managed Identity. Azure Machine Learning [Datastore](#datastore) URIs can apply either identity-based authentication, or **credential-based** (for example, Service Principal, SAS token, account key), without exposure of secrets.
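For example, a command job can consume the public titanic.csv URI from the table above directly, and the mapping to the compute filesystem happens automatically. This sketch assumes the Python SDK v2, plus a local script folder, a registered environment, and a compute cluster that exist under the placeholder names shown:

```python
# A sketch: consume a public https:// URI as a job input. The code folder,
# environment, and compute names are placeholders for resources that already exist.
from azure.ai.ml import Input, command
from azure.ai.ml.constants import AssetTypes

job = command(
    code="./src",  # hypothetical local folder containing train.py
    command="python train.py --data ${{inputs.titanic}}",
    inputs={
        "titanic": Input(
            type=AssetTypes.URI_FILE,
            path="https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv",
        )
    },
    environment="<registered-environment>:<version>",
    compute="<compute-cluster>",
)
# Inside train.py, the --data argument arrives as an ordinary filesystem path.
```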

A URI can serve as either an *input* or an *output* to an Azure Machine Learning job, and it can map to the compute target filesystem with one of four different *mode* options:

@@ -47,20 +80,16 @@ Job<br>Input or Output | `upload` | `download` | `ro_mount` | `rw_mount` | `dire
Input | | ✓ | ✓ | | ✓ |
Output | ✓ | | | ✓ |

-Read [Access data in a job](how-to-read-write-data-v2.md) for more information.
+See [Access data in a job](how-to-read-write-data-v2.md) for more information.
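As a sketch (again assuming the Python SDK v2 and placeholder paths), the mode is set directly on the `Input` or `Output` definition:

```python
# A sketch: choose how a URI maps to the compute target filesystem via mode.
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes, InputOutputModes

mounted_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/my_blob_datastore/paths/training_data/",
    mode=InputOutputModes.RO_MOUNT,  # read-only mount; DOWNLOAD is the other input option
)
mounted_output = Output(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/my_blob_datastore/paths/model_output/",
    mode=InputOutputModes.RW_MOUNT,  # read-write mount; UPLOAD is the other output option
)
```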

-## Data types
-
-A URI (storage location) can reference a file, a folder, or a data table. A machine learning job input and output definition requires one of the following three data types:
+## Data runtime capability
+Azure Machine Learning uses its own *data runtime* for one of three purposes:

-|Type |V2 API |V1 API |Canonical Scenarios | V2/V1 API Difference
-|---------|---------|---------|---------|---------|
-|**File**<br>Reference a single file | `uri_file` | `FileDataset` | Read/write a single file - the file can have any format. | A type new to V2 APIs. In V1 APIs, files always mapped to a folder on the compute target filesystem; this mapping required an `os.path.join`. In V2 APIs, the single file is mapped. This way, you can refer to that location in your code. |
-|**Folder**<br> Reference a single folder | `uri_folder` | `FileDataset` | You must read/write a folder of parquet/CSV files into Pandas/Spark.<br><br>Deep-learning with images, text, audio, video files located in a folder. | In V1 APIs, `FileDataset` had an associated engine that could take a file sample from a folder. In V2 APIs, a Folder is a simple mapping to the compute target filesystem. |
-|**Table**<br> Reference a data table | `mltable` | `TabularDataset` | You have a complex schema subject to frequent changes, or you need a subset of large tabular data.<br><br>AutoML with Tables. | In V1 APIs, the Azure Machine Learning back-end stored the data materialization blueprint. As a result, `TabularDataset` only worked if you had an Azure Machine Learning workspace. `mltable` stores the data materialization blueprint in *your* storage. This storage location means you can use it *disconnected to AzureML* - for example, local, on-premises. In V2 APIs, you'll find it easier to transition from local to remote jobs. Read [Working with tables in Azure Machine Learning](how-to-mltable.md) for more information. |
+- for mounts/uploads/downloads
+- to map storage URIs to the compute target filesystem
+- to materialize tabular data into pandas/spark with Azure Machine Learning tables (`mltable`)

-## Data runtime capability
-Azure Machine Learning uses its own *data runtime* for mounts/uploads/downloads, to map storage URIs to the compute target filesystem, or to materialize tabular data into pandas/spark with Azure Machine Learning tables (`mltable`). The Azure Machine Learning data runtime is designed for machine learning task *high speed and high efficiency*. Its key benefits include:
+The Azure Machine Learning data runtime is designed for *high speed and high efficiency* of machine learning tasks. It offers these key benefits:

> [!div class="checklist"]
> - [Rust](https://www.rust-lang.org/) language architecture. The Rust language is known for high speed and high memory efficiency.
@@ -69,42 +98,18 @@ Azure Machine Learning uses its own *data runtime* for mounts/uploads/downloads,
> - Data pre-fetches operate as background tasks on the CPU(s), to enhance utilization of the GPU(s) in deep-learning operations.
> - Seamless authentication to cloud storage.

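The third purpose, tabular materialization, is what the `mltable` Python package exposes. A minimal sketch, reusing the public titanic.csv URL from the URI examples above:

```python
# A minimal sketch: materialize delimited files into a pandas DataFrame with
# the mltable package. The URL reuses the public example from earlier; this
# also runs locally, consistent with mltable working disconnected from AzureML.
import mltable

tbl = mltable.from_delimited_files(
    paths=[{"file": "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"}]
)
df = tbl.to_pandas_dataframe()  # materialization happens here
print(df.head())
```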
-## Datastore
-
-An Azure Machine Learning datastore serves as a *reference* to an *existing* Azure storage account. The benefits of Azure Machine Learning datastore creation and use include:
-
-1. A common, easy-to-use API that interacts with different storage types (Blob/Files/ADLS).
-1. Easier discovery of useful datastores in team operations.
-1. For credential-based access (service principal/SAS/key), Azure Machine Learning datastore secures connection information. This way, you won't need to place that information in your scripts.
-
-When you create a datastore with an existing Azure storage account, you can choose between two different authentication methods:
-
-- **Credential-based** - authenticate data access with a service principal, shared access signature (SAS) token, or account key. Users with *Reader* workspace access can access the credentials.
-- **Identity-based** - use your Azure Active Directory identity or managed identity to authenticate data access.
-
-The following table summarizes the Azure cloud-based storage services that an Azure Machine Learning datastore can create. Additionally, the table summarizes the authentication types that can access those services:
-
-Supported storage service | Credential-based authentication | Identity-based authentication
-|---|:----:|:---:|
-Azure Blob Container| ✓ | ✓|
-Azure File Share| ✓ | |
-Azure Data Lake Gen1 | ✓ | ✓|
-Azure Data Lake Gen2| ✓ | ✓|
-
-Read [Create datastores](how-to-datastore.md) for more information about datastores.

## Data asset

An Azure Machine Learning data asset resembles a web browser bookmark (favorites). Instead of remembering long storage paths (URIs) that point to your most frequently used data, you can create a data asset, and then access that asset with a friendly name.

Data asset creation also creates a *reference* to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost, and you don't risk data source integrity. You can create data assets from Azure Machine Learning datastores, Azure Storage, public URLs, or local files.

-Read [Create data assets](how-to-create-data-assets.md) for more information about data assets.
+See [Create data assets](how-to-create-data-assets.md) for more information about data assets.
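As a sketch of the bookmark analogy (assuming an `MLClient` named `ml_client` like the one in the earlier datastore sketch), registering and then retrieving a data asset looks like this; the asset name and version are illustrative:

```python
# A sketch: register a URI as a named data asset, then look it up by friendly
# name instead of the full storage path. Name and version are illustrative.
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data

titanic_asset = Data(
    name="titanic",
    version="1",
    type=AssetTypes.URI_FILE,
    path="https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv",
    description="Titanic CSV registered under a friendly name",
)
ml_client.data.create_or_update(titanic_asset)

# Later: consume the asset by name rather than remembering the URI.
retrieved = ml_client.data.get(name="titanic", version="1")
print(retrieved.path)
```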

## Next steps

+- [Access data in a job](how-to-read-write-data-v2.md)
- [Install and set up the CLI (v2)](how-to-configure-cli.md#install-and-set-up-the-cli-v2)
- [Create datastores](how-to-datastore.md#create-datastores)
- [Create data assets](how-to-create-data-assets.md#create-data-assets)
-- [Access data in a job](how-to-read-write-data-v2.md)
- [Data administration](how-to-administrate-data-authentication.md#data-administration)
