Commit 60bea6b

committed
concept data draft
1 parent bdd2d1d commit 60bea6b

2 files changed

+79
-0
lines changed

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
1+
---
2+
title: Data in Azure Machine Learning
3+
titleSuffix: Azure Machine Learning
4+
description: Learn how Azure Machine Learning interacts with your data and how it's utilized across your machine learning experiments.
5+
services: machine-learning
6+
ms.service: machine-learning
7+
ms.subservice: core
8+
ms.topic: conceptual
9+
ms.reviewer: nibaccam
10+
author: nibaccam
11+
ms.author: nibaccam
12+
ms.date: 11/20/2019
13+
14+
---
15+
16+
# Data in Azure Machine Learning
17+
18+
In this article, learn how your data is accessed and utilized across your machine learning experiments in Azure Machine Learning.
19+
20+
21+
22+
## Access data in Azure data storage services
23+
24+
When you save your data in [Azure storage services](https://docs.microsoft.com/azure/storage/common/storage-introduction), you are storing your data in a scalable and secure cloud storage location. To access your data in storage, Azure Machine Learning offers solutions like datastores and datasets that allow you to reference your data without compromising security and ease of reuse.
25+
26+
### Datastores
27+
28+
An Azure Machine Learning datastore is a storage abstraction over an Azure storage account. Datastores let you access your data in Azure storage services by storing connection information, like your subscription ID and token authorization, so you don't have to hard-code that information in your scripts.
29+
30+
+ [Register and create datastores](how-to-access-data.md)
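
The following is a minimal sketch of that idea, not a definitive implementation. It assumes the `azureml-core` Python SDK is installed and a workspace `config.json` is available. The environment variable names and the default datastore name `my_blob_datastore` are hypothetical and chosen only for illustration.

```python
import os


def blob_datastore_settings(env=os.environ):
    """Collect connection information from the environment.

    The variable names below are hypothetical; use whatever your
    deployment defines. Nothing secret is written into the script.
    """
    return {
        "datastore_name": env.get("AML_DATASTORE_NAME", "my_blob_datastore"),
        "container_name": env["AML_CONTAINER_NAME"],
        "account_name": env["AML_STORAGE_ACCOUNT"],
        "account_key": env["AML_STORAGE_KEY"],
    }


def register_blob_datastore(settings):
    """Register a blob container as a datastore in the workspace."""
    # Imported here so the helper above works without the SDK installed.
    from azureml.core import Datastore, Workspace

    ws = Workspace.from_config()  # assumes a local config.json
    return Datastore.register_azure_blob_container(workspace=ws, **settings)
```

After the one-time registration, other scripts can look the datastore up by name, for example with `Datastore.get(ws, "my_blob_datastore")`, and never need to handle the account key.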
31+
32+
### Datasets
33+
34+
Azure Machine Learning datasets aren't copies of your data. When you create a dataset, you create a reference point to the data in your storage service, so no extra storage cost is incurred.
35+
36+
Create an unregistered dataset in memory for your local experiments, or register it to your workspace to share and reuse it across different machine learning experiments without worrying about data ingestion complexities.
37+
38+
+ [Create and register datasets](how-to-create-register-datasets.md)
39+
40+
#### What can we do with datasets?
41+
42+
With datasets, you can accomplish a number of machine learning tasks through seamless integration with Azure Machine Learning features.
43+
44+
+ Consume datasets in [automated ML experiments](how-to-create-portal-experiments.md), [ML pipelines](how-to-create-your-first-pipeline.md), and the [designer](tutorial-designer-automobile-price-train-score.md#import-data).
45+
+ Use datasets for a [data labeling project](how-to-create-labeling-projects.md).
46+
+ [Train machine learning models with datasets](how-to-train-with-datasets.md).
47+
+ [Version and track](how-to-track-version-datasets.md) dataset lineage.
48+
+ [Set up a dataset monitor](#drift) for data drift detection.
49+
50+
#### Types of datasets
51+
52+
You can create a dataset from paths in datastores, public web URLs, Azure Open Datasets, and local files. Datasets provide you with the capability to do sampling, exploratory data analysis, and access data for machine learning experiments.
53+
54+
There are two types of datasets:
55+
56+
* [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files. You can then materialize the data into a Pandas or Spark DataFrame for further manipulation and cleansing. For a complete list of files you can create TabularDatasets from, see the [TabularDatasetFactory class](https://aka.ms/tabulardataset-api-reference).
57+
58+
* [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public URLs. With a FileDataset, you can download or mount the files of your choosing to your compute.
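
To make the distinction between the two types concrete, here is a hedged sketch that assumes the `azureml-core` Python SDK. The extension-to-factory mapping and the helper names `tabular_factory_for` and `create_dataset` are hypothetical conventions for this example, not part of the SDK. The mapping covers only two formats, so see the TabularDatasetFactory reference for the full list.

```python
import os

# Hypothetical mapping from file extension to a TabularDatasetFactory
# method name. Anything not listed is treated as an opaque file.
TABULAR_FACTORIES = {
    ".csv": "from_delimited_files",
    ".parquet": "from_parquet_files",
}


def tabular_factory_for(path):
    """Return the factory method name for a path, or None for file data."""
    ext = os.path.splitext(path)[1].lower()
    return TABULAR_FACTORIES.get(ext)


def create_dataset(datastore, relative_path):
    """Create an unregistered dataset that references data in a datastore."""
    # Imported here so the helper above works without the SDK installed.
    from azureml.core import Dataset

    data_path = [(datastore, relative_path)]
    factory = tabular_factory_for(relative_path)
    if factory is None:
        # Images, text files, and so on: reference them as a FileDataset.
        return Dataset.File.from_files(path=data_path)
    # Parseable tabular data: build a TabularDataset.
    return getattr(Dataset.Tabular, factory)(path=data_path)
```

A TabularDataset created this way can be materialized with `to_pandas_dataframe()`, while a FileDataset is consumed with `download()` or `mount()`. In both cases the dataset only references the underlying files.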
59+
60+
## Azure Open Datasets
61+
62+
[Azure Open Datasets](https://docs.microsoft.com/azure/open-datasets/overview-what-are-open-datasets) are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Open Datasets are in the cloud on Microsoft Azure and are integrated into Azure Machine Learning. You can also access the datasets through APIs and use them in other products, such as Power BI and Azure Data Factory.
63+
64+
Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions. You can also share your public datasets on Azure Open Datasets.
65+
66+
<a name="drift"></a>
67+
68+
## Data drift
69+
70+
In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It is one of the top reasons model accuracy degrades over time, so monitoring data drift helps you detect model performance issues early.
71+
72+
+ [Create a dataset monitor](how-to-monitor-datasets.md) to detect and alert to data drift on new data in a dataset.
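
As a conceptual illustration only, the sketch below computes a population stability index (PSI) between a baseline sample and a newer sample of one numeric feature. This is a swapped-in, commonly used drift statistic; it is not the metric that Azure Machine Learning dataset monitors compute. The function name, the bin count, and the `1e-6` floor are assumptions made for this example.

```python
import math


def population_stability_index(baseline, target, bins=10):
    """Compare two samples of one numeric feature; larger means more drift."""
    lo, hi = min(baseline), max(baseline)
    # Fall back to a unit width when every baseline value is identical.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range go to the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids taking the logarithm of zero.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(target)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = [float(v) for v in range(100)]
shifted = [v + 50.0 for v in baseline]
psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)
```

Identical samples give a PSI of zero, and the shifted sample gives a clearly larger value. A common rule of thumb treats values above roughly 0.2 as a sign of meaningful drift, but any alerting threshold should be tuned to your own data.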
73+
74+
## Next steps
75+
76+
* For dataset training examples, see [sample notebooks](https://aka.ms/dataset-tutorial).
77+
* For data drift examples, see this [data drift tutorial](https://aka.ms/datadrift-notebook).

articles/machine-learning/service/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -70,6 +70,8 @@
7070
items:
7171
- name: Workspace
7272
href: concept-workspace.md
73+
- name: Data
74+
href: concept-data.md
7375
- name: Model training
7476
displayName: run config, estimator, machine learning pipeline, ml pipeline, train model
7577
href: concept-train-machine-learning-model.md
