articles/machine-learning/service/concept-data.md
35 additions & 19 deletions
@@ -9,55 +9,72 @@ ms.topic: conceptual
9
9
ms.reviewer: nibaccam
10
10
author: nibaccam
11
11
ms.author: nibaccam
12
-
ms.date: 11/25/2019
12
+
ms.date: 11/27/2019
13
13
14
14
---
15
15
16
-
# Data in Azure Machine Learning
16
+
# Data access in Azure Machine Learning
17
17
18
-
In this article, learn about Azure Machine Learning's data integration solutions from data access to data drift.
18
+
In this article, learn about Azure Machine Learning's data management and integration solutions for your machine learning tasks. This article describes a data access workflow that assumes you've already created an [Azure storage account](https://docs.microsoft.com/azure/storage/common/storage-quickstart-create-account?tabs=azure-portal) and [Azure storage service](https://docs.microsoft.com/azure/storage/common/storage-introduction).
19
19
20
-
The following diagram demonstrates the recommended data workflow for Azure Machine Learning. This article and workflow assume you've already created an [Azure storage account](https://docs.microsoft.com/azure/storage/common/storage-quickstart-create-account?tabs=azure-portal) and [service](https://docs.microsoft.com/azure/storage/common/storage-introduction).
21
20
21
+
When you're ready to use the data in your storage, we recommend that you:
22
+
23
+
1. Create an Azure Machine Learning datastore.
24
+
2. From that datastore, create an Azure Machine Learning dataset.
25
+
3. Use that dataset in your machine learning (ML) experiment by either:
26
+
1. Mounting it to your ML experiment's compute target for model training
27
+
28
+
**OR**
29
+
30
+
1. Consuming it directly in Azure Machine Learning solutions like automated machine learning (automated ML) experiment runs, ML pipelines, and the designer.
31
+
4. Create dataset monitors for your model input and output datasets to detect data drift.
32
+
5. If data drift is detected, retrain your model accordingly.
33
+
34
+
The following diagram provides a visual demonstration of this recommended data access workflow.
To access your data in storage, Azure Machine Learning offers datastores and datasets. These solutions allow you to access and reference your data without compromising security and ease of reuse.
40
+
To access your data in your storage account, Azure Machine Learning offers datastores and datasets. Datastores provide a layer of abstraction over your storage service. This abstraction aids security and ease of access to your storage, because connection information is kept in the datastore and not exposed in scripts. Datasets point to the specific file or files in your underlying storage that you want to use for your machine learning experiment. Together, these offer a secure, scalable, and reproducible data delivery workflow for your machine learning tasks.
28
41
29
42
### Datastores
30
43
31
-
An Azure Machine Learning datastore is a storage abstraction over an Azure storage services account. Datastores allow you to easily access your data in Azure storage services by storing connection information, like your subscription ID and token authorization. This way you don't have to hard-code that information in your scripts.
44
+
An Azure Machine Learning datastore is a storage abstraction over an Azure storage services account. Datastores allow you to easily connect to your Azure storage account and access the data in your underlying Azure storage services. The datastore object stores the connection information, like your subscription ID and token authorization, so you don't hard-code that information in your scripts.
32
45
33
46
+ [Register and create datastores](how-to-access-data.md)
34
47
35
48
### Datasets
36
49
37
-
To interact with data in your datastores or to package your data into a consumable object for machine learning tasks, create an Azure Machine Learning dataset. Datasets can be created from local files, public urls, [Azure Open Dataset](#open), or datastores. They aren't copies of your data, but are references that point to the data in your storage service, so no extra storage cost is incurred.
50
+
Create an Azure Machine Learning dataset to interact with data in your datastores or to package your data into a consumable object for machine learning tasks.
38
51
39
-
Create an unregistered dataset in memory for your local experiments, or register it to your workspace to share and reuse it across different machine learning experiments without worrying about data ingestion complexities.
52
+
Datasets can be created from local files, public URLs, [Azure Open Datasets](#open), or specific files in your datastores. They aren't copies of your data, but references that point to the data in your storage service, so no extra storage cost is incurred.
40
53
41
-
+ [Create and register datasets](how-to-create-register-datasets.md)
54
+
The following articles demonstrate additional dataset capabilities.
55
+
56
+
+ [Create and register datasets](how-to-create-register-datasets.md) to your workspace to share and reuse them across different experiments without data ingestion complexities.
42
57
+ [Version and track](how-to-version-track-datasets.md) dataset lineage.
58
+
+ [Monitor your dataset](how-to-monitor-datasets.md) to help with data drift detection.
43
59
44
60
#### Types of datasets
45
61
46
-
You can create a dataset from paths in datastores, public web urls, Azure Open Datasets, and local files. Datasets provide you with the capability to do sampling, exploratory data analysis, and access data for machine learning experiments.
47
-
48
-
There are two different types of datasets
62
+
There are two different types of datasets:
49
63
50
64
+ [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a Pandas or Spark DataFrame for further manipulation and cleansing. For a complete list of files you can create TabularDatasets from, see the [TabularDatasetFactory class](https://aka.ms/tabulardataset-api-reference).
51
65
52
-
+ [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public URLs. By this method, you can download or mount files of your choosing to your compute as a FileDataset object.
66
+
+ [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public URLs. By this method, you can download or mount files of your choosing to your compute target as a FileDataset object.
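
The datastore and dataset concepts above can be sketched with the `azureml-core` Python SDK. This is a minimal sketch, not the article's own sample: it assumes `azureml-core` is installed and that you already have a workspace, and the datastore name, container name, and file paths are hypothetical placeholders. The imports sit inside the functions so that the definitions load even where the SDK isn't installed.

```python
def register_blob_datastore(workspace, account_name, account_key):
    """Register a blob container as a datastore.

    The datastore and container names below are hypothetical placeholders.
    The account key is passed in by the caller so it is not hard-coded here.
    """
    from azureml.core import Datastore  # requires the azureml-core package

    return Datastore.register_azure_blob_container(
        workspace=workspace,
        datastore_name="example_datastore",  # hypothetical name
        container_name="example-container",  # hypothetical container
        account_name=account_name,
        account_key=account_key,
    )


def create_datasets(datastore):
    """Create one dataset of each type from paths in a datastore.

    Both paths are hypothetical; the datasets only reference the files,
    they do not copy them.
    """
    from azureml.core import Dataset  # requires the azureml-core package

    # TabularDataset: parses delimited files into a tabular representation.
    tabular = Dataset.Tabular.from_delimited_files(
        path=[(datastore, "tabular/example.csv")]
    )
    # FileDataset: references files so they can be downloaded or mounted.
    files = Dataset.File.from_files(path=[(datastore, "images")])
    return tabular, files
```

With a real workspace, you might then call `tabular.to_pandas_dataframe()` to materialize the tabular data, or `files.download()` to copy the referenced files to your compute target.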
53
67
54
68
## Work with your data
55
69
56
70
With datasets, you can accomplish a number of machine learning tasks through seamless integration with Azure Machine Learning features.
57
71
58
-
+ Create a [data labeling project](#label)
59
-
+ [Mount or download your dataset for machine learning model training](how-to-train-with-datasets.md).
60
-
+ Consume datasets in your [automated ML experiments](how-to-create-portal-experiments.md), [ML pipelines](how-to-create-your-first-pipeline.md) or the [designer](tutorial-designer-automobile-price-train-score.md#import-data)
+ the [designer](tutorial-designer-automobile-price-train-score.md#import-data)
77
+
+ Create a [data labeling project](#label).
61
78
+ Set up a dataset monitor for [data drift](#drift) detection.
62
79
63
80
<a name="open"></a>
@@ -76,15 +93,14 @@ Labeling large amounts of data has often been a headache in machine learning pro
76
93
77
94
Azure Machine Learning gives you a central location to create, manage, and monitor labeling projects. Labeling projects help coordinate the data, labels, and team members, allowing you to more efficiently manage the labeling tasks. Currently supported tasks are image classification, either multi-label or multi-class, and object identification using bounding boxes.
78
95
79
-
+ Use datasets for a [data labeling project](how-to-create-labeling-projects.md).
96
+
+ Create a [data labeling project](how-to-create-labeling-projects.md), and output a dataset for use in machine learning experiments.
80
97
81
98
<a name="drift"></a>
82
99
83
100
## Data drift
84
101
85
102
In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It is one of the top reasons model accuracy degrades over time, so monitoring data drift helps detect model performance issues.
86
-
87
-
+ [Create a dataset monitor](how-to-monitor-datasets.md) to detect and alert to data drift on new data in a dataset.
103
+
See the [Create a dataset monitor](how-to-monitor-datasets.md) article to learn more about how to detect data drift on new data in a dataset and get alerted to it.
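
To make the idea of data drift concrete, here is a minimal sketch of one simple drift signal: how far the mean of new input data has moved from a baseline, measured in baseline standard deviations. This is an illustrative assumption, not the metric that Azure Machine Learning dataset monitors compute. The function names, sample values, and the `0.5` threshold are all hypothetical.

```python
from statistics import mean, pstdev


def mean_shift_score(baseline, target):
    """Absolute shift of the target mean, in units of baseline standard deviation."""
    spread = pstdev(baseline)
    if spread == 0:
        raise ValueError("baseline has zero variance; the score is undefined")
    return abs(mean(target) - mean(baseline)) / spread


def has_drifted(baseline, target, threshold=0.5):
    """Flag drift when the mean shift exceeds a (hypothetical) threshold."""
    return mean_shift_score(baseline, target) > threshold


# Hypothetical values of a single numeric feature.
baseline = [10.0, 12.0, 11.0, 13.0, 9.0]   # data the model was trained on
stable = [11.0, 10.0, 12.0, 11.0, 11.0]    # new data with the same mean
shifted = [15.0, 16.0, 14.0, 17.0, 15.0]   # new data with a higher mean

print(has_drifted(baseline, stable))   # False
print(has_drifted(baseline, shifted))  # True
```

A single-feature mean comparison like this misses changes in spread or shape, which is one reason to rely on a purpose-built monitor for production models.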