Commit bb1b4b6

Merge pull request #267855 from fbsolo-ms1/freshness-update-branch
Freshness update for the V1 concept-data.md document . . .
2 parents 4c04829 + b60d3be commit bb1b4b6

articles/machine-learning/v1/concept-data.md

Lines changed: 56 additions & 63 deletions
@@ -6,10 +6,10 @@ services: machine-learning
ms.service: machine-learning
ms.subservice: enterprise-readiness
ms.topic: conceptual
ms.reviewer: franksolomon
author: ssalgadodev
ms.author: xunwan
ms.date: 03/01/2024
ms.custom: UpdateFrequency5, data4ml
#Customer intent: As an experienced Python developer, I need to securely access my data in my Azure storage solutions and use it to accomplish my machine learning tasks.
---
@@ -19,116 +19,109 @@ ms.custom: UpdateFrequency5, data4ml
[!INCLUDE [CLI v1](../includes/machine-learning-cli-v1.md)]
[!INCLUDE [SDK v1](../includes/machine-learning-sdk-v1.md)]

Azure Machine Learning makes it easy to connect to your data in the cloud. It provides an abstraction layer over the underlying storage service, so that you can securely access and work with your data without the need to write code specific to your storage type. Azure Machine Learning also provides these data capabilities:

* Interoperability with Pandas and Spark DataFrames
* Versioning and tracking of data lineage
* Data labeling
* Data drift monitoring

## Data workflow

To use the data in your cloud-based storage solution, we recommend this data delivery workflow. The workflow assumes that you have an [Azure storage account](../../storage/common/storage-account-create.md?tabs=azure-portal) and data in an Azure cloud-based storage service.

1. Create an [Azure Machine Learning datastore](#connect-to-storage-with-datastores) to store connection information to your Azure storage

2. From that datastore, create an [Azure Machine Learning dataset](#reference-data-in-storage-with-datasets) to point to a specific file or files in your underlying storage

3. To use that dataset in your machine learning experiment, you can either:

   * Mount the dataset to the compute target of your experiment, for model training

   **OR**

   * Consume the dataset directly in Azure Machine Learning solutions such as automated machine learning (automated ML) experiment runs, machine learning pipelines, or the [Azure Machine Learning designer](concept-designer.md)

4. Create [dataset monitors](#monitor-model-performance-with-data-drift) for your model output dataset to detect data drift

5. If data drift is detected, update your input dataset and retrain your model accordingly

This diagram shows the recommended workflow, and the sketch after it illustrates steps 1 through 3 in code:

:::image type="content" source="./media/concept-data/data-concept-diagram.svg" alt-text="Diagram that shows the Azure Storage Service, which flows into a datastore and then into a dataset." lightbox="./media/concept-data/data-concept-diagram.svg":::
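
As a rough illustration of steps 1 through 3, this Python (SDK v1) sketch connects to a workspace, points a dataset at files behind an already-registered datastore, and registers the dataset for reuse. The `workspace_blob_store` datastore name and the `weather/*.csv` path are placeholder assumptions; substitute your own resources.

```python
from azureml.core import Workspace, Datastore, Dataset

# Step 1 assumes a datastore is already registered in the workspace;
# the next section shows how to register one.
ws = Workspace.from_config()  # reads config.json for your workspace
datastore = Datastore.get(ws, "workspace_blob_store")  # placeholder datastore name

# Step 2: create a dataset that references files in the datastore (no data is copied).
weather_ds = Dataset.Tabular.from_delimited_files(path=(datastore, "weather/*.csv"))

# Step 3: register the dataset so experiments, pipelines, and the designer can consume it,
# or mount it to a compute target for training (shown later in this article).
weather_ds = weather_ds.register(workspace=ws, name="weather-data", create_new_version=True)
```
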
## Connect to storage with datastores

Azure Machine Learning datastores securely host your data storage connection information on Azure, so you don't have to place that information in your scripts. For more information about connecting to a storage account and accessing data in your underlying storage service, visit [Register and create a datastore](../how-to-access-data.md).

These supported Azure cloud-based storage services can register as datastores:

- Azure Blob Container
- Azure File Share
- Azure Data Lake
- Azure Data Lake Gen2
- Azure SQL Database
- Azure Database for PostgreSQL
- Databricks File System
- Azure Database for MySQL

>[!TIP]
> You can create datastores with credential-based authentication to access storage services, for example a service principal or a shared access signature (SAS) token. Users with *Reader* access to the workspace can access these credentials.
>
> If this is a concern, visit [Create a datastore that uses identity-based data access](../how-to-identity-based-data-access.md) for more information about connections to storage services.
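
As a minimal sketch of datastore registration with the SDK v1, this example registers an Azure Blob container by using an account key; the container, account, and key values are placeholders, and a SAS token or identity-based access could be used instead.

```python
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Register a blob container as a datastore (credential-based access).
# All names and secrets below are placeholders for your own values.
blob_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="my_blob_datastore",
    container_name="training-data",
    account_name="mystorageaccount",
    account_key="<storage-account-key>",
)

# Later, retrieve the registered datastore by name instead of re-entering credentials.
blob_datastore = Datastore.get(ws, "my_blob_datastore")
```
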
## Reference data in storage with datasets

Azure Machine Learning datasets aren't copies of your data. Creating a dataset creates a reference to the data in its storage service, along with a copy of its metadata.

Because datasets are lazily evaluated, and the data remains in its existing location, you:

- Incur no extra storage cost
- Don't risk unintentional changes to your original data sources
- Improve ML workflow performance speeds

To interact with your data in storage, [create a dataset](how-to-create-register-datasets.md) to package your data into a consumable object for machine learning tasks. Register the dataset to your workspace to share and reuse it across different experiments without data ingestion complexities.

You can create datasets from local files, public URLs, [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/), or Azure storage services via datastores.

There are two types of datasets, both shown in the sketch after this list:

- A [FileDataset](/python/api/azureml-core/azureml.data.file_dataset.filedataset) references single or multiple files in your datastores or public URLs. If your data is already cleansed and ready for training experiments, you can [download or mount files](how-to-train-with-datasets.md#mount-files-to-remote-compute-targets) referenced by FileDatasets to your compute target.

- A [TabularDataset](/python/api/azureml-core/azureml.data.tabulardataset) represents data in a tabular format by parsing the provided file or list of files. You can load a TabularDataset into a pandas or Spark DataFrame for further manipulation and cleansing. For a complete list of data formats from which you can create TabularDatasets, visit the [TabularDatasetFactory class](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory).
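
As a hedged example (SDK v1), this sketch creates one dataset of each type from a registered datastore; the datastore name and file paths are placeholders.

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, "my_blob_datastore")  # placeholder datastore name

# FileDataset: reference a set of files (for example, images) without copying them.
images_ds = Dataset.File.from_files(path=(datastore, "images/**"))

# TabularDataset: parse delimited files into a tabular view.
sales_ds = Dataset.Tabular.from_delimited_files(path=(datastore, "sales/2024/*.csv"))

# Loading into pandas happens only when you ask for it (lazy evaluation).
sales_df = sales_ds.to_pandas_dataframe()

# Register both datasets so other experiments can reuse them by name.
images_ds.register(workspace=ws, name="product-images", create_new_version=True)
sales_ds.register(workspace=ws, name="sales-data", create_new_version=True)
```
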

These resources offer more information about dataset capabilities:

- [Version and track](how-to-version-track-datasets.md) dataset lineage
- [Monitor your dataset](how-to-monitor-datasets.md) to help with data drift detection

## Work with your data

With datasets, you can accomplish machine learning tasks through seamless integration with Azure Machine Learning features:

- Create a [data labeling project](#label-data-with-data-labeling-projects)
- Train machine learning models (one consumption pattern is sketched after this list) with:
  - [automated ML experiments](../how-to-use-automated-ml-for-ml-models.md)
  - the [designer](tutorial-designer-automobile-price-train-score.md#import-data)
  - [notebooks](how-to-train-with-datasets.md)
  - [Azure Machine Learning pipelines](how-to-create-machine-learning-pipelines.md)
- Access datasets for scoring with [batch inference](../tutorial-pipeline-batch-scoring-classification.md) in [machine learning pipelines](how-to-create-machine-learning-pipelines.md)
- Set up a dataset monitor for [data drift](#monitor-model-performance-with-data-drift) detection
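
As one possible consumption pattern (not the only supported approach), this SDK v1 sketch mounts a registered dataset into a training run through `ScriptRunConfig`; the dataset name, compute target, source directory, and script are placeholder assumptions.

```python
from azureml.core import Dataset, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, name="product-images")  # placeholder registered dataset

# Mount the dataset on the remote compute target; train.py reads the data
# from the mount path passed on the command line.
src = ScriptRunConfig(
    source_directory="./src",
    script="train.py",
    arguments=["--data-path", dataset.as_named_input("training_data").as_mount()],
    compute_target="cpu-cluster",  # placeholder compute target name
)

run = Experiment(ws, "train-with-dataset").submit(src)
run.wait_for_completion(show_output=True)
```
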

## Label data with data labeling projects

Labeling large volumes of data in machine learning projects can become a headache. Projects with a computer vision component, such as image classification or object detection, often require thousands of images and corresponding labels.

Azure Machine Learning provides a central location to create, manage, and monitor labeling projects. Labeling projects help coordinate the data, labels, and team members, so that you can manage the labeling tasks more efficiently. Currently supported tasks are image classification, either multi-label or multi-class, and object identification using bounding boxes.

Create an [image labeling project](../how-to-create-image-labeling-projects.md) or [text labeling project](../how-to-create-text-labeling-projects.md), and output a dataset for use in machine learning experiments.

## Monitor model performance with data drift

In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It's a major reason that model accuracy degrades over time, so data drift monitoring helps detect model performance issues.

For more information about how to detect and alert to data drift on new data in a dataset, visit [Create a dataset monitor](how-to-monitor-datasets.md).
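
As a hedged illustration based on the `azureml-datadrift` package (SDK v1), this sketch sets up a scheduled dataset monitor that compares a target dataset against a baseline; the dataset names, compute target, threshold, and email address are placeholders, and the target dataset is assumed to have a timestamp column.

```python
from azureml.core import Dataset, Workspace
from azureml.datadrift import AlertConfiguration, DataDriftDetector

ws = Workspace.from_config()
baseline = Dataset.get_by_name(ws, name="sales-data")        # placeholder baseline dataset
target = Dataset.get_by_name(ws, name="sales-data-scoring")  # placeholder target dataset with a timestamp column

# Create a weekly drift monitor and email an alert when drift exceeds the threshold.
monitor = DataDriftDetector.create_from_datasets(
    ws,
    "sales-drift-monitor",
    baseline,
    target,
    compute_target="cpu-cluster",  # placeholder compute target
    frequency="Week",
    drift_threshold=0.3,
    alert_config=AlertConfiguration(["you@example.com"]),
)

monitor.enable_schedule()  # start the scheduled monitoring runs
```
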

## Next steps

- [Create a dataset in Azure Machine Learning studio or with the Python SDK](how-to-create-register-datasets.md)
- Try out dataset training examples with our [sample notebooks](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/work-with-data/)
