
Commit 31d93e9

Merge pull request #107696 from nibaccam/dsets-concept
Data | Concept article refresh
2 parents 21f9b7a + d32fa5d

File tree

3 files changed (+21 −27 lines)

articles/machine-learning/concept-data.md

Lines changed: 10 additions & 16 deletions
@@ -9,15 +9,15 @@ ms.topic: conceptual
 ms.reviewer: nibaccam
 author: nibaccam
 ms.author: nibaccam
-ms.date: 12/09/2019
+ms.date: 03/15/2020
 
 ---
 
 # Data access in Azure Machine Learning
 
 In this article, you learn about Azure Machine Learning's data management and integration solutions for your machine learning tasks. This article assumes you've already created an [Azure storage account](https://docs.microsoft.com/azure/storage/common/storage-quickstart-create-account?tabs=azure-portal) and [Azure storage service](https://docs.microsoft.com/azure/storage/common/storage-introduction).
 
-When you're ready to use the data in your storage, we recommend you
+When you're ready to use the data in your Azure storage solution, we recommend you
 
 1. Create an Azure Machine Learning datastore.
 2. From that datastore, create an Azure Machine Learning dataset.
@@ -36,13 +36,13 @@ The following diagram provides a visual demonstration of this recommended data a
 
 ## Access data in storage
 
-To access your data in your storage account, Azure Machine Learning offers datastores and datasets. Datastores answer the question: how do I securely connect to my data that's in my Azure Storage? Datastores provide a layer of abstraction over your storage service. This aids in security and ease of access to your storage, since connection information is kept in the datastore and not exposed in scripts.
+To access your data in your storage account, Azure Machine Learning offers datastores and datasets. Datastores answer the question: how do I securely connect to my data that's in my Azure Storage? Datastores save the connection information to your Azure Storage. This aids in security and ease of access to your storage, since connection information is kept in the datastore and not exposed in scripts.
 
 Datasets answer the question: how do I get specific data files in my datastore? Datasets point to the specific file or files in your underlying storage that you want to use for your machine learning experiment. Together, datastores and datasets offer a secure, scalable, and reproducible data delivery workflow for your machine learning tasks.
 
-### Datastores
+## Datastores
 
-An Azure Machine Learning datastore is a storage abstraction over your Azure storage services. [Register and create a datastore](how-to-access-data.md) to easily connect to your Azure storage account, and access the data in your underlying Azure storage services.
+An Azure Machine Learning datastore keeps the connection information to your storage so you don't have to code it in your scripts. [Register and create a datastore](how-to-access-data.md) to easily connect to your Azure storage account, and access the data in your underlying Azure storage services.
 
 Supported Azure storage services that can be registered as datastores:
 + Azure Blob Container
@@ -54,11 +54,11 @@ Supported Azure storage services that can be registered as datastores:
 + Databricks File System
 + Azure Database for MySQL
 
-### Datasets
+## Datasets
 
 [Create an Azure Machine Learning dataset](how-to-create-register-datasets.md) to interact with data in your datastores and package your data into a consumable object for machine learning tasks. Register the dataset to your workspace to share and reuse it across different experiments without data ingestion complexities.
 
-Datasets can be created from local files, public urls, [Azure Open Datasets](#open), or specific file(s) in your datastores. To create a dataset from an in memory pandas dataframe, write the data to a local file, like a csv, and create your dataset from that file. Datasets aren't copies of your data, but are references that point to the data in your storage service, so no extra storage cost is incurred.
+Datasets can be created from local files, public urls, Azure Open Datasets, or specific file(s) in your datastores. To create a dataset from an in memory pandas dataframe, write the data to a local file, like a csv, and create your dataset from that file. Datasets aren't copies of your data, but are references that point to the data in your storage service, so no extra storage cost is incurred.
 
 The following diagram shows that if you don't have an Azure storage service, you can create a dataset directly from local files, public urls, or an Azure Open Dataset. Doing so connects your dataset to the default datastore that was automatically created with your experiment's [Azure Machine Learning workspace](concept-workspace.md).
 
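The datasets hunk above notes that a dataset can't be created straight from an in-memory pandas dataframe: you persist the data to a local file first, then create the dataset from that file. A minimal editorial sketch of that workaround (the azureml-core calls are commented out because they need a live workspace; the datastore path and dataset name are hypothetical):

```python
import csv
import os
import tempfile

# In-memory records standing in for a pandas DataFrame
# (with pandas you would call df.to_csv(path, index=False) instead).
rows = [{"age": 31, "income": 52000}, {"age": 45, "income": 61000}]

# Step 1: persist the in-memory data to a local csv file.
path = os.path.join(tempfile.gettempdir(), "sample.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["age", "income"])
    writer.writeheader()
    writer.writerows(rows)

# Step 2 (hypothetical azureml-core calls; they require a real workspace):
# from azureml.core import Workspace, Dataset
# ws = Workspace.from_config()
# datastore = ws.get_default_datastore()
# datastore.upload_files(files=[path], target_path="data/")
# dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "data/sample.csv"))
# dataset.register(workspace=ws, name="sample-dataset")
```

Only step 1 runs locally; step 2 mirrors the flow the article text describes, and because the resulting dataset is a reference rather than a copy, no extra storage cost is incurred.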
@@ -77,22 +77,16 @@ Additional datasets capabilities can be found in the following documentation:
 
 With datasets, you can accomplish a number of machine learning tasks through seamless integration with Azure Machine Learning features.
 
++ Create a [data labeling project](#label).
++ Create a dataset from an [Azure Open Dataset](how-to-create-register-datasets.md#create-datasets-with-azure-open-datasets).
 + [Train machine learning models](how-to-train-with-datasets.md).
 + Consume datasets in
     + [automated ML experiments](how-to-use-automated-ml-for-ml-models.md)
     + the [designer](tutorial-designer-automobile-price-train-score.md#import-data)
+    + [Azure Machine Learning pipelines](how-to-create-your-first-pipeline.md)
 + Access datasets for scoring with batch inference in [machine learning pipelines](how-to-create-your-first-pipeline.md).
-+ Create a [data labeling project](#label).
 + Set up a dataset monitor for [data drift](#drift) detection.
 
-<a name="open"></a>
-
-## Azure Open Datasets
-
-[Azure Open Datasets](how-to-create-register-datasets.md#create-datasets-with-azure-open-datasets) are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Open Datasets are in the cloud on Microsoft Azure and are integrated into Azure Machine Learning. You can also access the datasets through APIs and use them in other products, such as Power BI and Azure Data Factory.
-
-Azure Open Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions. You can also share your public datasets on Azure Open Datasets.
-
 <a name="label"></a>
 
 ## Data labeling

articles/machine-learning/how-to-access-data.md

Lines changed: 8 additions & 8 deletions
@@ -19,11 +19,12 @@ ms.custom: seodec18
 # Access data in Azure storage services
 [!INCLUDE [aml-applies-to-basic-enterprise-sku](../../includes/aml-applies-to-basic-enterprise-sku.md)]
 
-In this article, learn how to easily access your data in Azure Storage services via Azure Machine Learning datastores. Datastores are used to store connection information, like your subscription ID and token authorization. When you use datastores, you can access your storage without having to hard code connection information in your scripts.
+In this article, learn how to easily access your data in Azure Storage services via Azure Machine Learning datastores. Datastores store connection information, like your subscription ID and token authorization, so you can access your storage without having to hard code them in your scripts.
 
 You can create datastores from [these Azure Storage solutions](#matrix). For unsupported storage solutions, and to save data egress cost during machine learning experiments, we recommend that you [move your data](#move) to supported Azure Storage solutions.
 
 ## Prerequisites
+
 You'll need:
 - An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree).

@@ -59,12 +59,11 @@ Azure&nbsp;Database&nbsp;for&nbsp;MySQL | SQL authentication| | ✓* | ✓* |
 Databricks&nbsp;File&nbsp;System| No authentication | | ✓** | ✓ ** |✓**
 
 *MySQL is only supported for pipeline [DataTransferStep](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.datatransferstep?view=azure-ml-py). <br>
-\**Databricks is only supported for pipeline [DatabricksStep](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py)
+**Databricks is only supported for pipeline [DatabricksStep](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py)
 
 ### Storage guidance
 
-We recommend creating a datastore for an Azure blob container.
-Both standard and premium storage are available for blobs. Although premium storage is more expensive, its faster throughput speeds might improve the speed of your training runs, particularly if you train against a large dataset. For information about the cost of storage accounts, see the [Azure pricing calculator](https://azure.microsoft.com/pricing/calculator/?service=machine-learning-service).
+We recommend creating a datastore for an Azure blob container. Both standard and premium storage are available for blobs. Although premium storage is more expensive, its faster throughput speeds might improve the speed of your training runs, particularly if you train against a large dataset. For information about the cost of storage accounts, see the [Azure pricing calculator](https://azure.microsoft.com/pricing/calculator/?service=machine-learning-service).
 
 When you create a workspace, an Azure blob container and an Azure file share are automatically registered to the workspace. They're named `workspaceblobstore` and `workspacefilestore`, respectively. They store the connection information for the blob container and the file share that are provisioned in the storage account attached to the workspace. The `workspaceblobstore` container is set as the default datastore.

@@ -75,9 +75,9 @@ When you create a workspace, an Azure blob container and an Azure file share are
 When you register an Azure Storage solution as a datastore, you automatically create and register that datastore to a specific workspace. You can create and register datastores to a workspace by using the Python SDK or Azure Machine Learning studio.
 
 >[!IMPORTANT]
-> As part of the current datastore create and register process, Azure Machine Learning validates that the user provided principal (username, service principal or SAS token) has access to the underlying storage service.
+> As part of the initial datastore create and register process, Azure Machine Learning validates that the underlying storage service exists and that the user provided principal (username, service principal or SAS token) has access to that storage. For Azure Data Lake Storage Gen 1 and 2 datastores, however, this validation happens later, when data access methods like [`from_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.filedatasetfactory?view=azure-ml-py) or [`from_delimited_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#from-parquet-files-path--validate-true--include-path-false--set-column-types-none--partition-format-none-) are called.
 <br><br>
-However, for Azure Data Lake Storage Gen 1 and 2 datastores, this validation happens later when data access methods like [`from_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.filedatasetfactory?view=azure-ml-py) or [`from_delimited_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#from-parquet-files-path--validate-true--include-path-false--set-column-types-none--partition-format-none-) are called.
+After datastore creation, this validation is only performed for methods that require access to the underlying storage container, **not** each time datastore objects are retrieved. For example, validation happens if you want to download files from your datastore; but if you just want to change your default datastore, then validation does not happen.
 
 ### Python SDK
 
@@ -93,7 +93,7 @@ Select **Storage Accounts** on the left pane, and choose the storage account tha
 > [!IMPORTANT]
 > If your storage account is in a virtual network, only creation of Blob, File share, ADLS Gen 1 and ADLS Gen 2 datastores **via the SDK** is supported. To grant your workspace access to your storage account, set the parameter `grant_workspace_access` to `True`.
 
-The following examples show how to register an Azure blob container, an Azure file share, and Azure Data Lake Storage Generation 2 as a datastore. For other storage services, please see the [reference documentation for the `register_azure_*` methods](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py#methods).
+The following examples show how to register an Azure blob container, an Azure file share, and Azure Data Lake Storage Generation 2 as a datastore. For other storage services, please see the [reference documentation for the applicable `register_azure_*` methods](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py#methods).
 
 #### Blob container
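The hunk above defers to the `register_azure_*` reference docs for the actual registration code. As a hedged sketch of the blob-container case (every name below is a placeholder, and the azureml-core call itself is commented out because it needs a live `Workspace` object and real credentials):

```python
# Hypothetical values; substitute your own storage details.
datastore_name = "my_blob_datastore"   # name to register the datastore under
container_name = "my-container"        # existing blob container in the account
account_name = "mystorageaccount"      # storage account that holds the container

# azureml-core registration call (commented out; requires a Workspace `ws`
# plus an account key or SAS token for authentication):
# from azureml.core import Datastore
# blob_datastore = Datastore.register_azure_blob_container(
#     workspace=ws,
#     datastore_name=datastore_name,
#     container_name=container_name,
#     account_name=account_name,
#     account_key="<account-key>",  # or sas_token="<sas-token>"
# )

print(datastore_name)
```

Once registered, the datastore can be retrieved anywhere in the workspace by name, which is the point of the abstraction: scripts reference `datastore_name`, never the key.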

@@ -260,7 +260,7 @@ To interact with data in your datastores or to package your data into a consumab
 
 Azure Blob storage has higher throughput speeds than an Azure file share and will scale to large numbers of jobs started in parallel. For this reason, we recommend configuring your runs to use Blob storage for transferring source code files.
 
-The following code example specifies in the run configuration which blob datastore to use for source code transfers:
+The following code example specifies in the run configuration which blob datastore to use for source code transfers.
 
 ```python
 # workspaceblobstore is the default blob storage

articles/machine-learning/toc.yml

Lines changed: 3 additions & 3 deletions
@@ -89,10 +89,10 @@
   href: concept-workspace.md
 - name: Environments
   href: concept-environments.md
-- name: Data access
-  href: concept-data.md
 - name: Data ingestion
-  href: concept-data-ingestion.md
+  href: concept-data-ingestion.md
+- name: Data access
+  href: concept-data.md
 - name: Model training
   displayName: run config, estimator, machine learning pipeline, ml pipeline, train model
   href: concept-train-machine-learning-model.md

0 commit comments