Commit d998880

Merge pull request #96775 from nibaccam/concept-data
Data| New article concept data in Azure Machine Learning
2 parents d8dd41a + 39d4781 commit d998880

4 files changed

+781
-0
lines changed

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
1+
---
2+
title: Data in Azure Machine Learning
3+
titleSuffix: Azure Machine Learning
4+
description: Learn how Azure Machine Learning interacts with your data and how it's utilized across your machine learning experiments.
5+
services: machine-learning
6+
ms.service: machine-learning
7+
ms.subservice: core
8+
ms.topic: conceptual
9+
ms.reviewer: nibaccam
10+
author: nibaccam
11+
ms.author: nibaccam
12+
ms.date: 12/09/2019
13+
14+
---
15+
16+
# Data access in Azure Machine Learning
17+
18+
In this article, you learn about Azure Machine Learning's data management and integration solutions for your machine learning tasks. This article assumes you've already created an [Azure storage account](https://docs.microsoft.com/azure/storage/common/storage-quickstart-create-account?tabs=azure-portal) and [Azure storage service](https://docs.microsoft.com/azure/storage/common/storage-introduction).
19+
20+
When you're ready to use the data in your storage, we recommend that you:
21+
22+
1. Create an Azure Machine Learning datastore.
23+
2. From that datastore, create an Azure Machine Learning dataset.
24+
3. Use that dataset in your machine learning experiment by either
25+
    1. Mounting it to your experiment's compute target for model training
26+
27+
    **OR**
28+
29+
    1. Consuming it directly in Azure Machine Learning solutions like automated machine learning (automated ML) experiment runs, machine learning pipelines, and the [Azure Machine Learning designer](concept-designer.md).
30+
4. Create dataset monitors for your model input and output datasets to detect data drift.
31+
5. If data drift is detected, update your dataset and retrain your model accordingly.
32+
33+
The following diagram provides a visual demonstration of this recommended data access workflow.
34+
35+
![Data-concept-diagram](media/concept-data/data-concept-diagram.svg)
36+
37+
## Access data in storage
38+
39+
To access your data in your storage account, Azure Machine Learning offers datastores and datasets. Datastores provide a layer of abstraction over your storage service. Because connection information is kept in the datastore and not exposed in scripts, access to your storage is both more secure and simpler. Datasets point to the specific file or files in your underlying storage that you want to use for your machine learning experiment. Together, datastores and datasets offer a secure, scalable, and reproducible data delivery workflow for your machine learning tasks.
40+
41+
### Datastores
42+
43+
An Azure Machine Learning datastore is a storage abstraction over your Azure storage services. [Register and create a datastore](how-to-access-data.md) to connect to your Azure storage account and access the data in your underlying Azure storage services.
44+
45+
Supported Azure storage services that can be registered as datastores:
46+
+ Azure Blob Container
47+
+ Azure File Share
48+
+ Azure Data Lake
49+
+ Azure Data Lake Gen2
50+
+ Azure SQL Database
51+
+ Azure Database for PostgreSQL
52+
+ Databricks File System
53+
+ Azure Database for MySQL
54+
55+
### Datasets
56+
57+
[Create an Azure Machine Learning dataset](how-to-create-register-datasets.md) to interact with data in your datastores and package your data into a consumable object for machine learning tasks. Register the dataset to your workspace to share and reuse it across different experiments without data ingestion complexities.
58+
59+
Datasets can be created from local files, public URLs, [Azure Open Datasets](#open), or specific files in your datastores. To create a dataset from an in-memory pandas DataFrame, write the data to a local file, such as a CSV file, and create your dataset from that file. Datasets aren't copies of your data; they are references that point to the data in your storage service, so no extra storage cost is incurred.
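
As a minimal sketch of how a datastore and a dataset fit together, the following uses the `azureml-core` Python SDK to register a blob container as a datastore, create a TabularDataset from a file in it, and register that dataset to the workspace. The datastore name, container name, file path, and dataset name are placeholders chosen for illustration, and the helper `build_registered_dataset` is hypothetical, not part of the SDK.

```python
# Placeholder names, assumed for illustration only.
DATASTORE_NAME = "my_blob_datastore"
CONTAINER_NAME = "my-container"
CSV_PATH = "weather/2019.csv"
DATASET_NAME = "weather_2019"


def build_registered_dataset(workspace, account_name, account_key):
    """Register a blob datastore, then create and register a TabularDataset."""
    # Imported lazily so this module loads even where azureml-core is absent.
    from azureml.core import Dataset, Datastore

    # The datastore keeps the connection information in the workspace,
    # so later scripts can refer to it by name instead of by credentials.
    datastore = Datastore.register_azure_blob_container(
        workspace=workspace,
        datastore_name=DATASTORE_NAME,
        container_name=CONTAINER_NAME,
        account_name=account_name,
        account_key=account_key,
    )

    # The dataset is a reference to a file in the datastore, not a copy of it.
    dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, CSV_PATH)])

    # Registering the dataset makes it reusable across experiments.
    return dataset.register(
        workspace=workspace,
        name=DATASET_NAME,
        create_new_version=True,
    )
```

To try it, you might load a workspace with `Workspace.from_config()` and pass it in along with your storage account name and key. Other experiments could then retrieve the registered dataset by name with `Dataset.get_by_name`.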
60+
61+
The following diagram shows that if you don't have an Azure storage service, you can create a dataset directly from local files, public URLs, or an Azure Open Dataset. Doing so connects your dataset to the default datastore that was automatically created with your experiment's [Azure Machine Learning workspace](concept-workspace.md).
62+
63+
![Dataset-workflow-diagram](media/concept-data/dataset-workflow.svg)
64+
65+
Additional dataset capabilities can be found in the following documentation:
66+
67+
+ [Version and track](how-to-version-track-datasets.md) dataset lineage.
68+
+ [Monitor your dataset](how-to-monitor-datasets.md) to help with data drift detection.
69+
+ See the following for documentation on the two types of datasets:
70+
+ [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files, which lets you materialize the data into a pandas or Spark DataFrame for further manipulation and cleansing. For a complete list of file formats you can create TabularDatasets from, see the [TabularDatasetFactory class](https://aka.ms/tabulardataset-api-reference).
71+
72+
+ [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public URLs. With a FileDataset, you can download or mount the files of your choosing to your compute target.
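
The difference between the two dataset types shows up in how they're consumed. The sketch below assumes the `azureml-core` Python SDK and an existing datastore object. The helper names `load_table` and `fetch_files`, the file path, the glob pattern, and the target directory are all hypothetical.

```python
def load_table(datastore, csv_path="weather/2019.csv"):
    """Parse a delimited file as a TabularDataset and materialize it."""
    # Imported lazily so this module loads even where azureml-core is absent.
    from azureml.core import Dataset

    tabular = Dataset.Tabular.from_delimited_files(path=[(datastore, csv_path)])
    # A TabularDataset parses the file, so it can be loaded as a pandas DataFrame.
    return tabular.to_pandas_dataframe()


def fetch_files(datastore, pattern="images/*.jpg", target_dir="./data"):
    """Reference files as a FileDataset and download them to a local folder."""
    from azureml.core import Dataset

    files = Dataset.File.from_files(path=[(datastore, pattern)])
    # A FileDataset leaves file contents unparsed; download() returns local paths.
    return files.download(target_path=target_dir, overwrite=True)
```

On a remote compute target, mounting the FileDataset with `mount()` is the alternative to `download()` when you'd rather stream the files than copy them.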
73+
74+
## Work with your data
75+
76+
With datasets, you can accomplish a number of machine learning tasks through seamless integration with Azure Machine Learning features.
77+
78+
+ [Train machine learning models](how-to-train-with-datasets.md).
79+
+ Consume datasets in
80+
+ [automated ML experiments](how-to-create-portal-experiments.md)
81+
+ the [designer](tutorial-designer-automobile-price-train-score.md#import-data)
82+
+ Access datasets for scoring with batch inference in [machine learning pipelines](how-to-create-your-first-pipeline.md).
83+
+ Create a [data labeling project](#label).
84+
+ Set up a dataset monitor for [data drift](#drift) detection.
85+
86+
<a name="open"></a>
87+
88+
## Azure Open Datasets
89+
90+
[Azure Open Datasets](how-to-create-register-datasets.md#create-datasets-with-azure-open-datasets) are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Open Datasets are in the cloud on Microsoft Azure and are integrated into Azure Machine Learning. You can also access the datasets through APIs and use them in other products, such as Power BI and Azure Data Factory.
91+
92+
Azure Open Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions. You can also share your public datasets on Azure Open Datasets.
93+
94+
<a name="label"></a>
95+
96+
## Data labeling
97+
98+
Labeling large amounts of data has often been a headache in machine learning projects. Those with a computer vision component, such as image classification or object detection, generally require thousands of images and corresponding labels.
99+
100+
Azure Machine Learning gives you a central location to create, manage, and monitor labeling projects. Labeling projects help coordinate the data, labels, and team members, allowing you to manage labeling tasks more efficiently. Currently supported tasks are image classification, either multi-label or multi-class, and object identification using bounding boxes.
101+
102+
+ Create a [data labeling project](how-to-create-labeling-projects.md), and output a dataset for use in machine learning experiments.
103+
104+
<a name="drift"></a>
105+
106+
## Data drift
107+
108+
In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It is one of the top reasons model accuracy degrades over time, so monitoring data drift helps detect model performance issues.
109+
To learn more about how to detect and alert on data drift in new data in a dataset, see the [Create a dataset monitor](how-to-monitor-datasets.md) article.
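
To make the idea of drift concrete, here is a small, self-contained sketch that compares a baseline sample of one numeric feature against a newer target sample using the population stability index (PSI). This is a generic illustration and is not the method Azure Machine Learning dataset monitors use. The function name, bin count, and sample values are assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float], target: Sequence[float], bins: int = 10
) -> float:
    """Return the PSI between two samples of a single numeric feature."""
    lo, hi = min(baseline), max(baseline)
    # Equal-width bins over the baseline range; fall back to 1.0 if it is constant.
    width = (hi - lo) / bins or 1.0

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in an edge bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    base, tgt = bin_shares(baseline), bin_shares(target)
    return sum((t - b) * math.log(t / b) for b, t in zip(base, tgt))


# Hypothetical data: the target sample is shifted toward higher values.
baseline_sample = [i / 100 for i in range(100)]
shifted_sample = [0.5 + i / 200 for i in range(100)]
drift_score = population_stability_index(baseline_sample, shifted_sample)
```

A PSI of zero means the binned distributions match, and larger values mean a larger shift. A threshold of roughly 0.2 is a commonly cited rule of thumb for a meaningful shift, but treat it as an assumption to tune for your own data.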
110+
111+
## Next steps
112+
113+
+ To create a dataset in Azure Machine Learning studio or with the Python SDK, [use these steps](how-to-create-register-datasets.md).
114+
+ Try out dataset training examples with our [sample notebooks](https://aka.ms/dataset-tutorial).
115+
+ For data drift examples, see this [data drift tutorial](https://aka.ms/datadrift-notebook).
