articles/machine-learning/how-to-manage-inputs-outputs-pipeline.md (2 additions, 2 deletions)
@@ -297,7 +297,7 @@ az ml job download --output-name <OUTPUT_PORT_NAME> -n <JOB_NAME> -g <RESOURCE_G
 ```
 # [Python SDK](#tab/python)

-Before we dive in the code, you need a way to reference your workspace. You create `ml_client` for a handle to the workspace. Refer to [Create handle to workspace](./tutorial-explore-data.md#create-handle-to-workspace) to initialize `ml_client`.
+Before we dive into the code, you need a way to reference your workspace. You create `ml_client` as a handle to the workspace. Refer to [Create a handle to the workspace](./tutorial-explore-data.md#create-a-handle-to-the-workspace) to initialize `ml_client`.

 ```python
 # Download all the outputs of the job
@@ -325,7 +325,7 @@ az ml job download --all -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NA
 # [Python SDK](#tab/python)

-Before we dive in the code, you need a way to reference your workspace. You create `ml_client` for a handle to the workspace. Refer to [Create handle to workspace](./tutorial-explore-data.md#create-handle-to-workspace) to initialize `ml_client`.
+Before we dive into the code, you need a way to reference your workspace. You create `ml_client` as a handle to the workspace. Refer to [Create a handle to the workspace](./tutorial-explore-data.md#create-a-handle-to-the-workspace) to initialize `ml_client`.
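A sketch of the Python SDK call behind this tab, wrapped in a helper so the two download modes (one named output, or all outputs) are explicit. The helper name and defaults are assumptions for illustration, not the article's code; `MLClient.jobs.download` mirrors the `az ml job download` CLI command in the other tab.

```python
def download_job_outputs(ml_client, job_name, download_path="./job-outputs", output_name=None):
    """Download one named output of a pipeline job, or all of its outputs.

    ml_client is an initialized azure.ai.ml.MLClient (see the handle section
    referenced above). With output_name set, only that output port is fetched;
    otherwise all outputs are downloaded, like `az ml job download --all`.
    """
    if output_name is not None:
        return ml_client.jobs.download(
            name=job_name, download_path=download_path, output_name=output_name
        )
    return ml_client.jobs.download(name=job_name, download_path=download_path, all=True)
```

Because the helper only forwards to `ml_client.jobs.download`, it works with any initialized client; errors about missing jobs or outputs surface from the SDK call itself.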
articles/machine-learning/tutorial-explore-data.md (44 additions, 53 deletions)
@@ -1,19 +1,19 @@
 ---
-title: "Tutorial: Upload, access and explore your data"
+title: "Tutorial: Upload, access, and explore your data"
 titleSuffix: Azure Machine Learning
-description: Upload data to cloud storage, create an Azure Machine Learning data asset, create new versions for data assets, use the data for interactive development
+description: Upload data to cloud storage, create an Azure Machine Learning data asset, create new versions for data assets, and use the data for interactive development
 services: machine-learning
 ms.service: machine-learning
 ms.subservice: core
 ms.topic: tutorial
 ms.reviewer: None
 author: fbsolo-ms1
 ms.author: franksolomon
-ms.date: 07/05/2023
+ms.date: 07/25/2024
 #Customer intent: As a data scientist, I want to know how to prototype and develop machine learning models on a cloud workstation.
 ---

-# Tutorial: Upload, access and explore your data in Azure Machine Learning
+# Tutorial: Upload, access, and explore your data in Azure Machine Learning
@@ -25,9 +25,9 @@ In this tutorial you learn how to:
 > * Access your data in a notebook for interactive development
 > * Create new versions of data assets

-The start of a machine learning project typically involves exploratory data analysis (EDA), data-preprocessing (cleaning, feature engineering), and the building of Machine Learning model prototypes to validate hypotheses. This _prototyping_ project phase is highly interactive. It lends itself to development in an IDE or a Jupyter notebook, with a _Python interactive console_. This tutorial describes these ideas.
+A machine learning project typically starts with exploratory data analysis (EDA), data preprocessing (cleaning, feature engineering), and building machine learning model prototypes to validate hypotheses. This _prototyping_ phase is highly interactive. It lends itself to development in an IDE or a Jupyter notebook, with a _Python interactive console_. This tutorial describes these ideas.

-This video shows how to get started in Azure Machine Learning studio so that you can follow the steps in the tutorial. The video shows how to create a notebook, clone the notebook, create a compute instance, and download the data needed for the tutorial. The steps are also described in the following sections.
+This video shows how to get started in Azure Machine Learning studio, so that you can follow the steps in the tutorial. It shows how to create a notebook, clone the notebook, create a compute instance, and download the data needed for the tutorial. The following sections also describe these steps.
 * Or, open **tutorials/get-started-notebooks/explore-data.ipynb** from the **Samples** section of studio. [!INCLUDE [clone notebook](includes/prereq-clone-notebook.md)]

-[!INCLUDE [notebook set kernel](includes/prereq-set-kernel.md)]
+[!INCLUDE [notebook set kernel](includes/prereq-set-kernel.md)]

-For data ingestion, the Azure Data Explorer handles raw data in [these formats](/azure/data-explorer/ingestion-supported-formats). This tutorial uses this [CSV-format credit card client data sample](https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv). We see the steps proceed in an Azure Machine Learning resource. In that resource, we'll create a local folder with the suggested name of **data** directly under the folder where this notebook is located.
+For data ingestion, Azure Data Explorer handles raw data in [these formats](/azure/data-explorer/ingestion-supported-formats). This tutorial uses this [CSV-format credit card client data sample](https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv). The steps proceed in an Azure Machine Learning resource. In that resource, we'll create a local folder, with the suggested name **data**, directly under the folder where this notebook is located.

 > [!NOTE]
-> This tutorial depends on data placed in an Azure Machine Learning resource folder location. For this tutorial, 'local' means a folder location in that Azure Machine Learning resource.
+> This tutorial depends on data placed in an Azure Machine Learning resource folder location. For this tutorial, 'local' means a folder location in that Azure Machine Learning resource.

 1. Select **Open terminal** below the three dots, as shown in this image:

 :::image type="content" source="media/tutorial-cloud-workstation/open-terminal.png" alt-text="Screenshot shows open terminal tool in notebook toolbar.":::

-1. The terminal window opens in a new tab.
-1. Make sure you `cd` to the same folder where this notebook is located. For example, if the notebook is in a folder named **get-started-notebooks**:
+1. The terminal window opens in a new tab.
+1. Make sure you `cd` (**change directory**) to the same folder where this notebook is located. For example, if the notebook is in a folder named **get-started-notebooks**:

 ```bash
 cd get-started-notebooks # modify this to the path where your notebook is located
@@ -73,19 +72,17 @@ For data ingestion, the Azure Data Explorer handles raw data in [these formats](
 ```
 1. You can now close the terminal window.

+For more information about the data in the UC Irvine Machine Learning Repository, visit [this resource](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).

-[Learn more about this data on the UCI Machine Learning Repository.](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)

-## Create handle to workspace
+## Create a handle to the workspace

-Before we dive in the code, you need a way to reference your workspace. You'll create `ml_client` for a handle to the workspace. You'll then use `ml_client` to manage resources and jobs.
+Before we explore the code, you need a way to reference your workspace. You'll create `ml_client` as a handle to the workspace, and then use `ml_client` to manage resources and jobs.
82
80
83
81
In the next cell, enter your Subscription ID, Resource Group name and Workspace name. To find these values:
84
82
85
-
1. In the upper right Azure Machine Learning studio toolbar, selectyour workspace name.
86
-
1. Copy the value for workspace, resource group and subscription ID into the code.
87
-
1. You'll need to copy one value, close the area and paste, then come back for the next one.
88
-
83
+
1. At the upper right Azure Machine Learning studio toolbar, select your workspace name.
84
+
1. Copy the value for workspace, resource group, and subscription ID into the code.
85
+
1. You must individually copy the values one at a time, close the area and paste, then continue to the next one.
89
86
90
87
```python
91
88
from azure.ai.ml import MLClient
@@ -106,31 +103,29 @@ ml_client = MLClient(
 ```

 > [!NOTE]
-> Creating MLClient will not connect to the workspace. The client initialization is lazy, it will wait for the first time it needs to make a call (this will happen in the next code cell).
+> Creating MLClient doesn't connect to the workspace. The client initialization is lazy; it waits until the first time it needs to make a call. This happens in the next code cell.
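The lazy initialization the note describes can be shown with a toy class. This is an illustrative sketch of the pattern only, not the real `MLClient`:

```python
class LazyClient:
    """Toy illustration of MLClient's lazy connection (not the real SDK)."""

    def __init__(self, workspace_name):
        self.workspace_name = workspace_name
        self.connection = None  # constructing the client contacts nothing

    def _ensure_connected(self):
        # The first real call triggers the connection (and any credential errors).
        if self.connection is None:
            self.connection = f"connected:{self.workspace_name}"
        return self.connection

    def list_jobs(self):
        self._ensure_connected()
        return []
```

This is why a typo in the workspace values above isn't reported when the client is created; it only surfaces when the next code cell makes the first call.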
 ## Upload data to cloud storage

-Azure Machine Learning uses Uniform Resource Identifiers (URIs), which point to storage locations in the cloud. A URI makes it easy to access data in notebooks and jobs. Data URI formats look similar to the web URLs that you use in your web browser to access web pages. For example:
+Azure Machine Learning uses Uniform Resource Identifiers (URIs), which point to storage locations in the cloud. A URI makes it easy to access data in notebooks and jobs. Data URIs have a format similar to the web URLs that you use in your web browser to access web pages. For example:

 * Access data from a public https server: `https://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>`
 * Access data from Azure Data Lake Gen 2: `abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>`
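The two URI formats above can be assembled from their parts. A small illustrative sketch (the helper names are assumptions, not part of the tutorial):

```python
def blob_https_uri(account_name, container_name, path):
    """Build a public-https-server style data URI."""
    return f"https://{account_name}.blob.core.windows.net/{container_name}/{path}"


def adls_gen2_uri(file_system, account_name, path):
    """Build an Azure Data Lake Storage Gen 2 style data URI."""
    return f"abfss://{file_system}@{account_name}.dfs.core.windows.net/{path}"


# The tutorial's sample CSV is itself a public-https-server URI:
print(blob_https_uri("azuremlexamples", "datasets",
                     "credit_card/default_of_credit_card_clients.csv"))
# → https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv
```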
 An Azure Machine Learning data asset is similar to web browser bookmarks (favorites). Instead of remembering long storage paths (URIs) that point to your most frequently used data, you can create a data asset, and then access that asset with a friendly name.

-Data asset creation also creates a *reference* to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost, and don't risk data source integrity. You can create Data assets from Azure Machine Learning datastores, Azure Storage, public URLs, and local files.
+Data asset creation also creates a *reference* to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost, and you don't risk data source integrity. You can create data assets from Azure Machine Learning datastores, Azure Storage, public URLs, and local files.

 > [!TIP]
-> For smaller-size data uploads, Azure Machine Learning data asset creation works well for data uploads from local machine resources to cloud storage. This approach avoids the need for extra tools or utilities. However, a larger-size data upload might require a dedicated tool or utility - for example, **azcopy**. The azcopy command-line tool moves data to and from Azure Storage. Learn more about azcopy [here](../storage/common/storage-use-azcopy-v10.md).
+> For smaller data uploads, Azure Machine Learning data asset creation works well to move data from local machine resources to cloud storage. This approach avoids the need for extra tools or utilities. However, a larger data upload might require a dedicated tool or utility - for example, **azcopy**. The azcopy command-line tool moves data to and from Azure Storage. For more information about azcopy, visit [this resource](../storage/common/storage-use-azcopy-v10.md).
-The next notebook cell creates the data asset. The code sample uploads the raw data file to the designated cloud storage resource.
+The next notebook cell creates the data asset. The code sample uploads the raw data file to the designated cloud storage resource.

-Each time you create a data asset, you need a unique version for it. If the version already exists, you'll get an error. In this code, we're using the "initial"for the first read of the data. If that version already exists, we'll skip creating it again.
+Each time you create a data asset, you need a unique version for it. If the version already exists, you get an error. In this code, we use "initial" as the version for the first read of the data. If that version already exists, we don't re-create it.

-You can also omit the **version** parameter, and a version number is generated for you, starting with 1 and then incrementing from there.
-
-In this tutorial, we use the name "initial" as the first version. The [Create production machine learning pipelines](tutorial-pipeline-python-sdk.md) tutorial will also use this version of the data, so here we are using a value that you'll see again in that tutorial.
+You can also omit the **version** parameter. In this case, a version number is generated for you, starting with 1 and then incrementing from there.
+
+This tutorial uses the name "initial" as the first version. The [Create production machine learning pipelines](tutorial-pipeline-python-sdk.md) tutorial also uses this version of the data, so here we use a value that you'll see again in that tutorial.
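The create-if-absent behavior described above can be sketched as follows. This is illustrative only: the helper name is an assumption, and the real notebook would call `ml_client.data.get` (catching the not-found error) before `ml_client.data.create_or_update`:

```python
def ensure_data_asset(existing_versions, name="credit-card", version="initial"):
    """Register the (name, version) pair only if it doesn't already exist.

    existing_versions stands in for the workspace's registered assets;
    the real check queries the workspace through ml_client instead.
    """
    key = (name, version)
    if key in existing_versions:
        return "already exists - skipped"
    existing_versions.add(key)
    return "created"

registry = set()
print(ensure_data_asset(registry))  # → created
print(ensure_data_asset(registry))  # → already exists - skipped
```

Running the cell twice is therefore safe: the second run detects the existing "initial" version and skips creation instead of raising an error.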
-You can see the uploaded data by selecting **Data** on the left. You'll see the data is uploaded and a data asset is created:
+To examine the uploaded data, select **Data** on the left. The data is uploaded and a data asset is created:

-:::image type="content" source="media/tutorial-prepare-data/access-and-explore-data.png" alt-text="Screenshot shows the data in studio.":::
+:::image type="content" source="media/tutorial-explore-data/access-and-explore-data.png" alt-text="Screenshot shows the data in studio.":::

-This data is named **credit-card**, and in the **Data assets** tab, we can see it in the **Name** column. This data uploaded to your workspace's default datastore named **workspaceblobstore**, seen in the **Data source** column.
+This data is named **credit-card**, and in the **Data assets** tab, we can see it in the **Name** column.

 An Azure Machine Learning datastore is a *reference* to an *existing* storage account on Azure. A datastore offers these benefits:

-1. A common and easy-to-use API, to interact with different storage types (Blob/Files/Azure Data Lake Storage) and authentication methods.
+1. A common and easy-to-use API to interact with different storage types (Azure Data Lake Storage, Blob, Files) and authentication methods.
 1. An easier way to discover useful datastores, when working as a team.
 1. In your scripts, a way to hide connection information for credential-based data access (service principal/SAS/key).
## Access your data in a notebook
 Pandas directly supports URIs - this example shows how to read a CSV file from an Azure Machine Learning datastore:

-However, as mentioned previously, it can become hard to remember these URIs. Additionally, you must manually substitute all **<_substring_>** values in the **pd.read_csv**command with the real values for your resources.
+However, as mentioned previously, it can become hard to remember these URIs. Additionally, you must manually substitute all **<_substring_>** values in the **pd.read_csv** command with the real values for your resources.

 You'll want to create data assets for frequently accessed data. Here's an easier way to access the CSV file in Pandas:

 > [!IMPORTANT]
 > In a notebook cell, execute this code to install the `azureml-fsspec` Python library in your Jupyter kernel:
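With `azureml-fsspec` installed, pandas can open the long-form datastore URI directly. A sketch of building that URI (the helper name and placeholder values are assumptions to replace with your own resources):

```python
def azureml_datastore_uri(subscription_id, resource_group, workspace, datastore, path):
    """Long-form azureml:// datastore URI that azureml-fsspec resolves for pandas."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}/paths/{path}"
    )

uri = azureml_datastore_uri(
    "<subscription_id>", "<resource_group>", "<workspace_name>",
    "workspaceblobstore", "data/default_of_credit_card_clients.csv",
)
# With azureml-fsspec installed in the kernel, pandas reads it directly:
# import pandas as pd
# df = pd.read_csv(uri)
```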
-Read [Access data from Azure cloud storage during interactive development](how-to-access-data-interactive.md) to learn more about data access in a notebook.
+For more information about data access in a notebook, visit [Access data from Azure cloud storage during interactive development](how-to-access-data-interactive.md).

 ## Create a new version of the data asset

-You might have noticed that the data needs a little light cleaning, to make it fit to train a machine learning model. It has:
+The data needs some light cleaning to make it fit to train a machine learning model. It has:

 * two headers
 * a client ID column; we wouldn't use this feature in machine learning
 * spaces in the response variable name

-Also, compared to the CSV format, the Parquet file format becomes a better way to store this data. Parquet offers compression, and it maintains schema. Therefore, to clean the data and store it in Parquet, use:
+Also, compared to the CSV format, the Parquet file format is a better way to store this data. Parquet offers compression, and it maintains the schema. To clean the data and store it in Parquet, use:

 ```python
 # read in data again, this time using the 2nd row as the header
@@ -250,9 +247,7 @@ This table shows the structure of the data in the original **default_of_credit_c
 |X18-23 | Explanatory | Amount of previous payment (NT dollar) from April to September 2005. |

-Next, create a new _version_ of the data asset (the data automatically uploads to cloud storage). For this version, we'll add a time value, so that each time this code is run, a different version number will be created.
+Next, create a new _version_ of the data asset (the data automatically uploads to cloud storage). For this version, add a time value, so that each time this code runs, a different version number is created.
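The cleaning steps and the time-based version described above can be sketched as follows. The helper name and the column names are illustrative assumptions based on the dataset description, not the tutorial's exact code:

```python
import datetime

import pandas as pd


def clean_credit_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the client ID column and remove spaces from column names."""
    df = df.drop(columns=["ID"], errors="ignore")  # client ID isn't a useful ML feature
    df = df.rename(columns={c: c.replace(" ", "_") for c in df.columns})  # no spaces
    return df


# A time value in the version string, so each run registers a distinct version:
version = "cleaned-" + datetime.datetime.now().strftime("%Y.%m.%d.%H%M%S")

# Typical use (paths are assumptions): read with the second row as the header,
# clean, then store as Parquet, which compresses and keeps the schema:
# cleaned = clean_credit_data(pd.read_csv("./data/default_of_credit_card_clients.csv", header=1))
# cleaned.to_parquet("./data/cleaned-credit-card.parquet")
```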