Commit e4bc7ad

Merge pull request #108008 from nibaccam/dsets-concept
Data access | Concept article definitions
2 parents 9fc0be9 + 9192059

File tree

2 files changed: +45 −38 lines changed

Lines changed: 33 additions & 25 deletions
@@ -1,50 +1,57 @@
---
title: Data in Azure Machine Learning
titleSuffix: Azure Machine Learning
description: Learn how Azure Machine Learning securely connects to your data, and uses that data for machine learning tasks.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: conceptual
ms.reviewer: nibaccam
author: nibaccam
ms.author: nibaccam
ms.date: 03/20/2020

# Customer intent: As an experienced Python developer, I need to securely access my data in my Azure storage solutions and use it to accomplish my machine learning tasks.
---

# Data access in Azure Machine Learning

Azure Machine Learning makes it easy to connect to your data in the cloud. It provides an abstraction layer over the underlying storage service, so you can securely access and work with your data without having to write code specific to your storage type. Azure Machine Learning also provides the following data capabilities:

* Versioning and tracking of data lineage
* Data labeling
* Data drift monitoring
* Interoperability with Pandas and Spark DataFrames

## Data workflow

When you're ready to use the data in your cloud-based storage solution, we recommend the following data delivery workflow. This workflow assumes you have an [Azure storage account](https://docs.microsoft.com/azure/storage/common/storage-quickstart-create-account?tabs=azure-portal) and data in a cloud-based storage service in Azure.

1. Create an [Azure Machine Learning datastore](#datastores) to store connection information to your Azure storage.

2. From that datastore, create an [Azure Machine Learning dataset](#datasets) to point to the specific file(s) in your underlying storage.

3. To use that dataset in your machine learning experiment, you can either:

    1. Mount it to your experiment's compute target for model training.

    **OR**

    1. Consume it directly in Azure Machine Learning solutions like automated machine learning (automated ML) experiment runs, machine learning pipelines, or the [Azure Machine Learning designer](concept-designer.md).

4. Create [dataset monitors](#data-drift) for your model output dataset to detect data drift.

5. If data drift is detected, update your input dataset and retrain your model accordingly.

The following diagram provides a visual demonstration of this recommended workflow.

![Data-concept-diagram](./media/concept-data/data-concept-diagram.svg)
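The numbered steps above can be sketched with the azureml-core Python SDK. This is a minimal, hedged illustration rather than the article's own sample: the workspace config, datastore name, container, account, and file path are all hypothetical placeholders, and the calls are wrapped in a function so nothing executes without a real workspace.

```python
def data_access_workflow():
    """Sketch of steps 1-3: datastore -> dataset -> training input."""
    from azureml.core import Workspace, Datastore, Dataset

    ws = Workspace.from_config()  # reads config.json for an existing workspace

    # Step 1: register a datastore that holds the connection information.
    datastore = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name="my_datastore",    # hypothetical datastore name
        container_name="my-container",    # hypothetical blob container
        account_name="mystorageaccount",  # hypothetical storage account
        account_key="<account-key>",
    )

    # Step 2: create a dataset that points to specific file(s) in the datastore.
    dataset = Dataset.File.from_files(path=(datastore, "training-data/"))

    # Step 3, option 1: mount the dataset on the experiment's compute target.
    return dataset.as_named_input("training").as_mount()
```

Option 2 of step 3 would instead pass the registered dataset directly to automated ML or a pipeline step, with no explicit mount.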

## Datastores

Azure Machine Learning datastores securely keep the connection information to your Azure storage, so you don't have to code it in your scripts. [Register and create a datastore](how-to-access-data.md) to easily connect to your storage account, and access the data in your underlying Azure storage service.

Supported cloud-based storage services in Azure that can be registered as datastores:

+ Azure Blob Container
+ Azure File Share
+ Azure Data Lake
@@ -56,9 +63,9 @@ Supported Azure storage services that can be registered as datastores:

## Datasets

Azure Machine Learning datasets are references that point to the data in your storage service. They aren't copies of your data, so no extra storage cost is incurred. To interact with your data in storage, [create a dataset](how-to-create-register-datasets.md) to package your data into a consumable object for machine learning tasks. Register the dataset to your workspace to share and reuse it across different experiments without data ingestion complexities.

Datasets can be created from local files, public URLs, [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/), or specific file(s) in your datastores. To create a dataset from an in-memory pandas DataFrame, write the data to a local file, like a CSV, and create your dataset from that file.

The following diagram shows that if you don't have an Azure storage service, you can create a dataset directly from local files, public URLs, or an Azure Open Dataset. Doing so connects your dataset to the default datastore that was automatically created with your experiment's [Azure Machine Learning workspace](concept-workspace.md).
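As a small illustration of the in-memory DataFrame case: write the frame to a local CSV, then build the dataset from that file. This assumes pandas is installed; the azureml-core part needs a live workspace, so it is kept inside an uncalled helper, and the paths and names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# An in-memory DataFrame can't become a dataset directly,
# so write it to a local file first (a CSV here).
df = pd.DataFrame({"age": [31, 45], "income": [52000, 61000]})
csv_path = os.path.join(tempfile.gettempdir(), "training_data.csv")
df.to_csv(csv_path, index=False)

def dataset_from_local_csv(path):
    """Hedged sketch: requires azureml-core and a workspace; not called here."""
    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()
    datastore = ws.get_default_datastore()  # created with the workspace
    datastore.upload_files([path], target_path="data/", overwrite=True)
    # Create the dataset from the uploaded file.
    return Dataset.Tabular.from_delimited_files(
        path=(datastore, "data/training_data.csv")
    )
```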

@@ -69,9 +76,9 @@ Additional datasets capabilities can be found in the following documentation:
+ [Version and track](how-to-version-track-datasets.md) dataset lineage.
+ [Monitor your dataset](how-to-monitor-datasets.md) to help with data drift detection.
+ See the following documentation on the two types of datasets:
    + A [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files, which lets you materialize the data into a pandas or Spark DataFrame for further manipulation and cleansing. For a complete list of files you can create TabularDatasets from, see the [TabularDatasetFactory class](https://aka.ms/tabulardataset-api-reference).

    + A [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public URLs. With this method, you can [download or mount files](how-to-train-with-datasets.md#option-2--mount-files-to-a-remote-compute-target) of your choosing to your compute target as a FileDataset object.
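The two dataset types can be contrasted in a brief sketch. This assumes azureml-core; the `datastore` argument stands in for a registered datastore object, the glob paths are hypothetical, and the function is not invoked here.

```python
def create_both_dataset_types(datastore):
    """Illustrative only: `datastore` is assumed to be a registered
    Azure Machine Learning datastore object."""
    from azureml.core import Dataset

    # TabularDataset: parses delimited files so the data can be
    # materialized as a pandas or Spark DataFrame.
    tabular = Dataset.Tabular.from_delimited_files(
        path=(datastore, "weather/*.csv")
    )
    # df = tabular.to_pandas_dataframe()  # materialize when needed

    # FileDataset: references raw file(s) to download or mount on compute.
    files = Dataset.File.from_files(
        path=(datastore, "images/**/*.jpg")
    )
    return tabular, files
```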

## Work with your data

@@ -95,17 +102,18 @@ Labeling large amounts of data has often been a headache in machine learning pro

Azure Machine Learning gives you a central location to create, manage, and monitor labeling projects. Labeling projects help coordinate the data, labels, and team members, allowing you to more efficiently manage the labeling tasks. Currently supported tasks are image classification, either multi-label or multi-class, and object identification using bounded boxes.

Create a [data labeling project](how-to-create-labeling-projects.md), and output a dataset for use in machine learning experiments.

<a name="drift"></a>

## Data drift

In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It's one of the top reasons model accuracy degrades over time, so monitoring data drift helps detect model performance issues.

See the [Create a dataset monitor](how-to-monitor-datasets.md) article to learn more about how to detect and alert to data drift on new data in a dataset.
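A dataset monitor can be sketched with the azureml-datadrift package. Treat this as an assumption-laden outline, not the article's own sample: the monitor name, compute target, frequency, and threshold are placeholders, and the function is never executed here.

```python
def create_drift_monitor(ws, baseline_dataset, target_dataset):
    """Illustrative sketch: requires azureml-datadrift and real datasets."""
    from azureml.datadrift import DataDriftDetector

    # Compare a time-series target dataset against the training baseline.
    monitor = DataDriftDetector.create_from_datasets(
        ws,
        name="my-drift-monitor",       # hypothetical monitor name
        baseline=baseline_dataset,     # e.g. the model's training data
        target=target_dataset,         # new input data collected over time
        compute_target="cpu-cluster",  # hypothetical compute target
        frequency="Week",              # run the monitor weekly
        drift_threshold=0.3,           # alert when drift exceeds the threshold
    )
    return monitor
```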

## Next steps

+ Create a dataset in Azure Machine Learning studio or with the Python SDK [using these steps](how-to-create-register-datasets.md).
+ Try out dataset training examples with our [sample notebooks](https://aka.ms/dataset-tutorial).
+ For data drift examples, see this [data drift tutorial](https://aka.ms/datadrift-notebook).

articles/machine-learning/how-to-access-data.md

Lines changed: 12 additions & 13 deletions
@@ -1,5 +1,5 @@
---
title: Access data in Azure storage services
titleSuffix: Azure Machine Learning
description: Learn how to use datastores to securely connect to Azure storage services during training with Azure Machine Learning
services: machine-learning
@@ -12,16 +12,15 @@ ms.reviewer: nibaccam
ms.date: 02/27/2020
ms.custom: seodec18

# Customer intent: As an experienced Python developer, I need to make my data in Azure storage available to my remote compute to train my machine learning models.
---

# Access data in Azure storage services
[!INCLUDE [aml-applies-to-basic-enterprise-sku](../../includes/aml-applies-to-basic-enterprise-sku.md)]

In this article, learn how to easily access your data in Azure storage services via Azure Machine Learning datastores. Datastores store connection information, like your subscription ID and token authorization, so you can access your storage without having to hard-code it in your scripts.

You can create datastores from [these Azure storage solutions](#matrix). For unsupported storage solutions, and to save data egress cost during machine learning experiments, we recommend that you [move your data](#move) to supported Azure storage solutions.

## Prerequisites

@@ -54,10 +53,10 @@ Datastores currently support storing connection information to the storage servi
[Azure&nbsp;File&nbsp;Share](https://docs.microsoft.com/azure/storage/files/storage-files-introduction)| Account key <br> SAS token | ✓ | ✓ | ✓ |✓
[Azure&nbsp;Data Lake&nbsp;Storage Gen&nbsp;1](https://docs.microsoft.com/azure/data-lake-store/)| Service principal| ✓ | ✓ | ✓ |✓
[Azure&nbsp;Data Lake&nbsp;Storage Gen&nbsp;2](https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-introduction)| Service principal| ✓ | ✓ | ✓ |✓
[Azure&nbsp;SQL&nbsp;Database](https://docs.microsoft.com/azure/sql-database/sql-database-technical-overview)| SQL authentication <br>Service principal| ✓ | ✓ | ✓ |✓
[Azure&nbsp;PostgreSQL](https://docs.microsoft.com/azure/postgresql/overview) | SQL authentication| ✓ | ✓ | ✓ |✓
[Azure&nbsp;Database&nbsp;for&nbsp;MySQL](https://docs.microsoft.com/azure/mysql/overview) | SQL authentication| | ✓* | ✓* |✓*
[Databricks&nbsp;File&nbsp;System](https://docs.microsoft.com/azure/databricks/data/databricks-file-system)| No authentication | | ✓** | ✓** |✓**

*MySQL is only supported for pipeline [DataTransferStep](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.datatransferstep?view=azure-ml-py). <br>
**Databricks is only supported for pipeline [DatabricksStep](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py).
@@ -72,7 +71,7 @@ When you create a workspace, an Azure blob container and an Azure file share are

## Create and register datastores

When you register an Azure storage solution as a datastore, you automatically create and register that datastore to a specific workspace. You can create and register datastores to a workspace by using the Python SDK or Azure Machine Learning studio.

>[!IMPORTANT]
> As part of the initial datastore create and register process, Azure Machine Learning validates that the underlying storage service exists and that the user-provided principal (username, service principal, or SAS token) has access to that storage. For Azure Data Lake Storage Gen 1 and 2 datastores, however, this validation happens later, when data access methods like [`from_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.filedatasetfactory?view=azure-ml-py) or [`from_delimited_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#from-parquet-files-path--validate-true--include-path-false--set-column-types-none--partition-format-none-) are called.
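For the SDK path, registration can be sketched as follows. This is a hedged example: azureml-core is assumed, the `ws` argument stands in for an existing `Workspace`, and the datastore name, container, account, and key are placeholders, which is also why the function is left uncalled.

```python
def register_blob_datastore(ws):
    """Illustrative sketch: needs a real workspace and storage credentials."""
    from azureml.core import Datastore

    # Registration stores the connection info in the workspace, so
    # training scripts never need to hard-code the account key.
    datastore = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name="my_blob_datastore",  # hypothetical datastore name
        container_name="my-container",       # hypothetical container
        account_name="mystorageaccount",     # hypothetical account
        account_key="<account-key>",         # or pass sas_token instead
    )
    return datastore
```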
@@ -169,7 +168,7 @@ Create a new datastore in a few steps in Azure Machine Learning studio:
1. Sign in to [Azure Machine Learning studio](https://ml.azure.com/).
1. Select **Datastores** on the left pane under **Manage**.
1. Select **+ New datastore**.
1. Complete the form for a new datastore. The form intelligently updates itself based on your selections for Azure storage type and authentication type.

You can find the information that you need to populate the form on the [Azure portal](https://portal.azure.com). Select **Storage Accounts** on the left pane, and choose the storage account that you want to register. The **Overview** page provides information such as the account name, container, and file share name.

@@ -281,9 +280,9 @@ For situations where the SDK doesn't provide access to datastores, you might be

<a name="move"></a>

## Move data to supported Azure storage solutions

Azure Machine Learning supports accessing data from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL. If you're using unsupported storage, we recommend that you move your data to supported Azure storage solutions by using [Azure Data Factory and these steps](https://docs.microsoft.com/azure/data-factory/quickstart-create-data-factory-copy-data-tool). Moving data to supported storage can help you save data egress costs during machine learning experiments.

Azure Data Factory provides efficient and resilient data transfer with more than 80 prebuilt connectors at no additional cost. These connectors include Azure data services, on-premises data sources, Amazon S3 and Redshift, and Google BigQuery.
