Commit 5c6dbe6

round 3 edits
1 parent a32d925 commit 5c6dbe6

2 files changed

+35
-19
lines changed

articles/machine-learning/service/concept-data.md

Lines changed: 35 additions & 19 deletions
Original file line number | Diff line number | Diff line change
@@ -9,55 +9,72 @@ ms.topic: conceptual
99
ms.reviewer: nibaccam
1010
author: nibaccam
1111
ms.author: nibaccam
12-
ms.date: 11/25/2019
12+
ms.date: 11/27/2019
1313

1414
---
1515

16-
# Data in Azure Machine Learning
16+
# Data access in Azure Machine Learning
1717

18-
In this article, learn about Azure Machine Learning's data integration solutions from data access to data drift.
18+
In this article, learn about Azure Machine Learning's data management and integration solutions for your machine learning tasks. This article describes a data access workflow that assumes you've already created an [Azure storage account](https://docs.microsoft.com/azure/storage/common/storage-quickstart-create-account?tabs=azure-portal) and [Azure storage service](https://docs.microsoft.com/azure/storage/common/storage-introduction).
1919

20-
The following diagram demonstrates the recommended data workflow for Azure Machine Learning. This article and workflow assumes you've already created an [ Azure storage account](https://docs.microsoft.comazure/storage/common/storage-quickstart-create-account?tabs=azure-portal) and [service](https://docs.microsoft.com/azure/storage/common/storage-introduction).
2120

21+
When you're ready to use the data in your storage, we recommend that you:
22+
23+
1. Create an Azure Machine Learning datastore.
24+
2. From that datastore, create an Azure Machine Learning dataset.
25+
3. Use that dataset in your machine learning (ML) experiment by either
26+
1. Mounting it to your ML experiment's compute target for model training
27+
28+
**OR**
29+
30+
1. Consuming it directly in Azure Machine Learning solutions like automated machine learning (automated ML) experiment runs, ML pipelines, and the designer.
31+
4. Create dataset monitors for your model input and output datasets to detect data drift.
32+
5. If data drift is detected, retrain your model accordingly.
33+
34+
The following diagram provides a visual demonstration of this recommended data access workflow.
2235

2336
![Data-concept-diagram](media/concept-data/data-concept-diagram.png)
2437

2538
## Access data in storage
2639

27-
To access your data in storage, Azure Machine Learning offers datastores and datasets. These solutions allow you to access and reference your data without compromising security and ease of reuse.
40+
To access your data in your storage account, Azure Machine Learning offers datastores and datasets. Datastores provide a layer of abstraction over your storage service. This aids security and ease of access to your storage, because connection information is kept in the datastore and not exposed in scripts. Datasets point to the specific file or files in your underlying storage that you want to use for your machine learning experiment. Together, these offer a secure, scalable, and reproducible data delivery workflow for your machine learning tasks.
2841

2942
### Datastores
3043

31-
An Azure Machine Learning datastore is a storage abstraction over an Azure storage services account. Datastores allow you to easily access your data in Azure storage services by storing connection information, like your subscription ID and token authorization. This way you don't have to hard code that information in your scripts.
44+
An Azure Machine Learning datastore is a storage abstraction over an Azure storage services account. Datastores let you connect to your Azure storage account and access the data in your underlying Azure storage services. The datastore object stores the connection information, like your subscription ID and token authorization, so you don't have to hard-code that information in your scripts.
3245

3346
+ [Register and create datastores](how-to-access-data.md)
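
As a rough illustration of the paragraph above, the following sketch registers an Azure blob container as a datastore with the Azure Machine Learning SDK for Python (`azureml-core`). The function name, its arguments, and the `AZURE_STORAGE_KEY` environment variable are hypothetical placeholders, and the sketch assumes a workspace `config.json` is available locally.

```python
import os


def register_blob_datastore(datastore_name, container_name, account_name):
    """Register a blob container as a datastore and return the datastore object."""
    # Imported lazily so this module still loads where the SDK isn't installed.
    from azureml.core import Datastore, Workspace

    # Assumes a config.json for your workspace can be found from the working directory.
    ws = Workspace.from_config()

    # The account key comes from an environment variable (hypothetical name),
    # so it is stored with the datastore rather than hard-coded in the script.
    return Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name=datastore_name,
        container_name=container_name,
        account_name=account_name,
        account_key=os.environ["AZURE_STORAGE_KEY"],
    )
```

Once registered, later scripts can look the datastore up by name and never need to handle the account key themselves.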
3447

3548
### Datasets
3649

37-
To interact with data in your datastores or to package your data into a consumable object for machine learning tasks, create an Azure Machine Learning dataset. Datasets can be created from local files, public urls, [Azure Open Dataset](#open), or datastores. They aren't copies of your data, but are references that point to the data in your storage service, so no extra storage cost is incurred.
50+
Create an Azure Machine Learning dataset to interact with data in your datastores or to package your data into a consumable object for machine learning tasks.
3851

39-
Create an unregistered dataset in memory for your local experiments, or register it to your workspace to share and reuse it across different machine learning experiments without worrying about data ingestion complexities.
52+
Datasets can be created from local files, public URLs, [Azure Open Datasets](#open), or specific files in your datastores. They aren't copies of your data, but are references that point to the data in your storage service, so no extra storage cost is incurred.
4053

41-
+ [Create and register datasets](how-to-create-register-datasets.md)
54+
The following articles demonstrate additional dataset capabilities.
55+
56+
+ [Create and register datasets](how-to-create-register-datasets.md) to your workspace to share and reuse them across different experiments without data ingestion complexities.
4257
+ [Version and track](how-to-version-track-datasets.md) dataset lineage.
58+
+ [Monitor your dataset](how-to-monitor-datasets.md) to help with data drift detection.
4359

4460
#### Types of datasets
4561

46-
You can create a dataset from paths in datastores, public web urls, Azure Open Datasets, and local files. Datasets provide you with the capability to do sampling, exploratory data analysis, and access data for machine learning experiments.
47-
48-
There are two different types of datasets
62+
There are two different types of datasets:
4963

5064
+ [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a Pandas or Spark DataFrame for further manipulation and cleansing. For a complete list of file types you can create TabularDatasets from, see the [TabularDatasetFactory class](https://aka.ms/tabulardataset-api-reference).
5165

52-
+ [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public URLs. By this method, you can download or mount files of your choosing to your compute as a FileDataset object.
66+
+ [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public URLs. By this method, you can download or mount files of your choosing to your compute target as a FileDataset object.
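
The two dataset types above can be sketched with the Azure Machine Learning SDK for Python (`azureml-core`). This is a minimal sketch, not a complete workflow: the function name and its arguments (a registered datastore's name, a CSV path, and a file glob within that datastore) are hypothetical, and it assumes a workspace `config.json` is available locally.

```python
def create_datasets(datastore_name, csv_path, files_glob):
    """Return a (TabularDataset, FileDataset) pair that references data in a datastore."""
    # Imported lazily so this module still loads where the SDK isn't installed.
    from azureml.core import Dataset, Datastore, Workspace

    ws = Workspace.from_config()
    datastore = Datastore.get(ws, datastore_name)

    # TabularDataset: parses delimited files into a tabular representation.
    tabular = Dataset.Tabular.from_delimited_files(path=[(datastore, csv_path)])

    # FileDataset: references the files themselves, with no parsing.
    files = Dataset.File.from_files(path=[(datastore, files_glob)])

    return tabular, files
```

From there, `tabular.to_pandas_dataframe()` materializes the tabular data into a Pandas DataFrame, and `files.download()` or `files.mount()` makes the referenced files available on the compute target. Both datasets remain references, so nothing is copied until one of those calls runs.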
5367

5468
## Work with your data
5569

5670
With datasets, you can accomplish a number of machine learning tasks through seamless integration with Azure Machine Learning features.
5771

58-
+ Create a [data labeling project](#label)
59-
+ [Mount or download your dataset for machine learning model training](how-to-train-with-datasets.md).
60-
+ Consume datasets in your [automated ML experiments](how-to-create-portal-experiments.md), [ML pipelines](how-to-create-your-first-pipeline.md) or the [designer](tutorial-designer-automobile-price-train-score.md#import-data)
72+
+ [Train machine learning models](how-to-train-with-datasets.md).
73+
+ Consume datasets in
74+
+ [automated ML experiments](how-to-create-portal-experiments.md)
75+
+ [ML pipelines](how-to-create-your-first-pipeline.md)
76+
+ the [designer](tutorial-designer-automobile-price-train-score.md#import-data)
77+
+ Create a [data labeling project](#label).
6178
+ Set up a dataset monitor for [data drift](#drift) detection.
6279

6380
<a name="open"></a>
@@ -76,15 +93,14 @@ Labeling large amounts of data has often been a headache in machine learning pro
7693

7794
Azure Machine Learning gives you a central location to create, manage, and monitor labeling projects. Labeling projects help coordinate the data, labels, and team members, allowing you to more efficiently manage the labeling tasks. Currently supported tasks are image classification, either multi-label or multi-class, and object identification using bounding boxes.
7895

79-
+ Use datasets for a [data labeling project](how-to-create-labeling-projects.md).
96+
+ Create a [data labeling project](how-to-create-labeling-projects.md), and output a dataset for use in machine learning experiments.
8097

8198
<a name="drift"></a>
8299

83100
## Data drift
84101

85102
In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It is one of the top reasons model accuracy degrades over time, so monitoring data drift helps detect model performance issues.
86-
87-
+ [Create a dataset monitor](how-to-monitor-datasets.md) to detect and alert to data drift on new data in a dataset.
103+
To learn how to detect and alert on data drift in new data in a dataset, see [Create a dataset monitor](how-to-monitor-datasets.md).
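
To make the idea of drift concrete, here is a minimal, standard-library sketch that compares one numeric feature between a baseline sample and a newer target sample using the two-sample Kolmogorov-Smirnov statistic. This is an illustrative stand-in, not the metric that Azure Machine Learning dataset monitors compute; the function name and sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(baseline, target):
    """Largest gap between the empirical CDFs of two numeric samples (0.0 to 1.0)."""
    a, b = sorted(baseline), sorted(target)
    if not a or not b:
        raise ValueError("both samples must be non-empty")

    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # share of baseline values <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of target values <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical feature values: the target sample has shifted upward.
drift = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])  # 0.5
```

A value near 0 means the two samples are distributed alike, and a value near 1 means they barely overlap. What threshold should trigger retraining depends on the feature and the model, so treat any cutoff as a tuning choice.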
88104

89105
## Next steps
90106
