Commit 69715b0

Merge pull request #281941 from fbsolo-ms1/catch-up-freshness-updates
Freshness update for tutorial-explore-data.md . . .
2 parents 4deb469 + 4ca5e40 commit 69715b0

File tree: 4 files changed (+46 −55 lines)


articles/machine-learning/how-to-manage-inputs-outputs-pipeline.md

Lines changed: 2 additions & 2 deletions
@@ -297,7 +297,7 @@ az ml job download --output-name <OUTPUT_PORT_NAME> -n <JOB_NAME> -g <RESOURCE_G
 ```
 # [Python SDK](#tab/python)

-Before we dive in the code, you need a way to reference your workspace. You create `ml_client` for a handle to the workspace. Refer to [Create handle to workspace](./tutorial-explore-data.md#create-handle-to-workspace) to initialize `ml_client`.
+Before we dive in the code, you need a way to reference your workspace. You create `ml_client` for a handle to the workspace. Refer to [Create handle to workspace](./tutorial-explore-data.md#create-a-handle-to-the-workspace) to initialize `ml_client`.

 ```python
 # Download all the outputs of the job
@@ -325,7 +325,7 @@ az ml job download --all -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NA

 # [Python SDK](#tab/python)

-Before we dive in the code, you need a way to reference your workspace. You create `ml_client` for a handle to the workspace. Refer to [Create handle to workspace](./tutorial-explore-data.md#create-handle-to-workspace) to initialize `ml_client`.
+Before we dive in the code, you need a way to reference your workspace. You create `ml_client` for a handle to the workspace. Refer to [Create handle to workspace](./tutorial-explore-data.md#create-a-handle-to-the-workspace) to initialize `ml_client`.

 ```python
 # List all child jobs in the job
Binary file changed (136 KB) not shown.

articles/machine-learning/tutorial-explore-data.md

Lines changed: 44 additions & 53 deletions
@@ -1,19 +1,19 @@
 ---
-title: "Tutorial: Upload, access and explore your data"
+title: "Tutorial: upload, access, and explore your data"
 titleSuffix: Azure Machine Learning
-description: Upload data to cloud storage, create an Azure Machine Learning data asset, create new versions for data assets, use the data for interactive development
+description: Upload data to cloud storage, create an Azure Machine Learning data asset, create new versions for data assets, and use the data for interactive development
 services: machine-learning
 ms.service: machine-learning
 ms.subservice: core
 ms.topic: tutorial
 ms.reviewer: None
 author: fbsolo-ms1
 ms.author: franksolomon
-ms.date: 07/05/2023
+ms.date: 07/25/2024
 #Customer intent: As a data scientist, I want to know how to prototype and develop machine learning models on a cloud workstation.
 ---

-# Tutorial: Upload, access and explore your data in Azure Machine Learning
+# Tutorial: Upload, access, and explore your data in Azure Machine Learning

 [!INCLUDE [sdk v2](includes/machine-learning-sdk-v2.md)]

@@ -25,9 +25,9 @@ In this tutorial you learn how to:
 > * Access your data in a notebook for interactive development
 > * Create new versions of data assets

-The start of a machine learning project typically involves exploratory data analysis (EDA), data-preprocessing (cleaning, feature engineering), and the building of Machine Learning model prototypes to validate hypotheses. This _prototyping_ project phase is highly interactive. It lends itself to development in an IDE or a Jupyter notebook, with a _Python interactive console_. This tutorial describes these ideas.
+A machine learning project typically starts with exploratory data analysis (EDA), data-preprocessing (cleaning, feature engineering), and building Machine Learning model prototypes to validate hypotheses. This _prototyping_ project phase is highly interactive. It lends itself to development in an IDE or a Jupyter notebook, with a _Python interactive console_. This tutorial describes these ideas.

-This video shows how to get started in Azure Machine Learning studio so that you can follow the steps in the tutorial. The video shows how to create a notebook, clone the notebook, create a compute instance, and download the data needed for the tutorial. The steps are also described in the following sections.
+This video shows how to get started in Azure Machine Learning studio, so that you can follow the steps in the tutorial. The video shows how to create a notebook, clone the notebook, create a compute instance, and download the data needed for the tutorial. The steps are also described in the following sections.

 > [!VIDEO https://learn-video.azurefd.net/vod/player?id=514a29e2-0ae7-4a5d-a537-8f10681f5545]
@@ -41,24 +41,23 @@ This video shows how to get started in Azure Machine Learning studio so that you
 * [!INCLUDE [new notebook](includes/prereq-new-notebook.md)]
 * Or, open **tutorials/get-started-notebooks/explore-data.ipynb** from the **Samples** section of studio. [!INCLUDE [clone notebook](includes/prereq-clone-notebook.md)]

-[!INCLUDE [notebook set kernel](includes/prereq-set-kernel.md)]
+[!INCLUDE [notebook set kernel](includes/prereq-set-kernel.md)]

 <!-- nbstart https://raw.githubusercontent.com/Azure/azureml-examples/main/tutorials/get-started-notebooks/explore-data.ipynb -->

-
 ## Download the data used in this tutorial

-For data ingestion, the Azure Data Explorer handles raw data in [these formats](/azure/data-explorer/ingestion-supported-formats). This tutorial uses this [CSV-format credit card client data sample](https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv). We see the steps proceed in an Azure Machine Learning resource. In that resource, we'll create a local folder with the suggested name of **data** directly under the folder where this notebook is located.
+For data ingestion, the Azure Data Explorer handles raw data in [these formats](/azure/data-explorer/ingestion-supported-formats). This tutorial uses this [CSV-format credit card client data sample](https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv). The steps proceed in an Azure Machine Learning resource. In that resource, we'll create a local folder, with the suggested name of **data**, directly under the folder where this notebook is located.

 > [!NOTE]
-> This tutorial depends on data placed in an Azure Machine Learning resource folder location. For this tutorial, 'local' means a folder location in that Azure Machine Learning resource.
+> This tutorial depends on data placed in an Azure Machine Learning resource folder location. For this tutorial, 'local' means a folder location in that Azure Machine Learning resource.

 1. Select **Open terminal** below the three dots, as shown in this image:

    :::image type="content" source="media/tutorial-cloud-workstation/open-terminal.png" alt-text="Screenshot shows open terminal tool in notebook toolbar.":::

-1. The terminal window opens in a new tab.
-1. Make sure you `cd` to the same folder where this notebook is located. For example, if the notebook is in a folder named **get-started-notebooks**:
+1. The terminal window opens in a new tab.
+1. Make sure you `cd` (**Change Directory**) to the same folder where this notebook is located. For example, if the notebook is in a folder named **get-started-notebooks**:

    ```bash
    cd get-started-notebooks # modify this to the path where your notebook is located
@@ -73,19 +72,17 @@ For data ingestion, the Azure Data Explorer handles raw data in [these formats](
    ```
 1. You can now close the terminal window.

+For more information about the data in the UC Irvine Machine Learning Repository, visit [this resource](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).

-[Learn more about this data on the UCI Machine Learning Repository.](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)
-
-## Create handle to workspace
+## Create a handle to the workspace

-Before we dive in the code, you need a way to reference your workspace. You'll create `ml_client` for a handle to the workspace. You'll then use `ml_client` to manage resources and jobs.
+Before we explore the code, you need a way to reference your workspace. You'll create `ml_client` for a handle to the workspace. You then use `ml_client` to manage resources and jobs.

 In the next cell, enter your Subscription ID, Resource Group name and Workspace name. To find these values:

-1. In the upper right Azure Machine Learning studio toolbar, select your workspace name.
-1. Copy the value for workspace, resource group and subscription ID into the code.
-1. You'll need to copy one value, close the area and paste, then come back for the next one.
-
+1. At the upper right Azure Machine Learning studio toolbar, select your workspace name.
+1. Copy the value for workspace, resource group, and subscription ID into the code.
+1. You must individually copy the values one at a time, close the area and paste, then continue to the next one.

 ```python
 from azure.ai.ml import MLClient
@@ -106,31 +103,29 @@ ml_client = MLClient(
 ```

 > [!NOTE]
-> Creating MLClient will not connect to the workspace. The client initialization is lazy, it will wait for the first time it needs to make a call (this will happen in the next code cell).
-
+> Creation of MLClient will not connect to the workspace. The client initialization is lazy. It waits for the first time it needs to make a call. This happens in the next code cell.

 ## Upload data to cloud storage

-Azure Machine Learning uses Uniform Resource Identifiers (URIs), which point to storage locations in the cloud. A URI makes it easy to access data in notebooks and jobs. Data URI formats look similar to the web URLs that you use in your web browser to access web pages. For example:
+Azure Machine Learning uses Uniform Resource Identifiers (URIs), which point to storage locations in the cloud. A URI makes it easy to access data in notebooks and jobs. Data URIs have a format similar to the web URLs that you use in your web browser to access web pages. For example:

 * Access data from public https server: `https://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>`
 * Access data from Azure Data Lake Gen 2: `abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>`

 An Azure Machine Learning data asset is similar to web browser bookmarks (favorites). Instead of remembering long storage paths (URIs) that point to your most frequently used data, you can create a data asset, and then access that asset with a friendly name.

-Data asset creation also creates a *reference* to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost, and don't risk data source integrity. You can create Data assets from Azure Machine Learning datastores, Azure Storage, public URLs, and local files.
+Data asset creation also creates a *reference* to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost, and you don't risk data source integrity. You can create Data assets from Azure Machine Learning datastores, Azure Storage, public URLs, and local files.

 > [!TIP]
-> For smaller-size data uploads, Azure Machine Learning data asset creation works well for data uploads from local machine resources to cloud storage. This approach avoids the need for extra tools or utilities. However, a larger-size data upload might require a dedicated tool or utility - for example, **azcopy**. The azcopy command-line tool moves data to and from Azure Storage. Learn more about azcopy [here](../storage/common/storage-use-azcopy-v10.md).
+> For smaller-size data uploads, Azure Machine Learning data asset creation works well for data uploads from local machine resources to cloud storage. This approach avoids the need for extra tools or utilities. However, a larger-size data upload might require a dedicated tool or utility - for example, **azcopy**. The azcopy command-line tool moves data to and from Azure Storage. For more information about azcopy, visit [this resource](../storage/common/storage-use-azcopy-v10.md).

-The next notebook cell creates the data asset. The code sample uploads the raw data file to the designated cloud storage resource.
+The next notebook cell creates the data asset. The code sample uploads the raw data file to the designated cloud storage resource.

-Each time you create a data asset, you need a unique version for it. If the version already exists, you'll get an error. In this code, we're using the "initial" for the first read of the data. If that version already exists, we'll skip creating it again.
+Each time you create a data asset, you need a unique version for it. If the version already exists, you'll get an error. In this code, we use "initial" for the first read of the data. If that version already exists, we don't recreate it.

-You can also omit the **version** parameter, and a version number is generated for you, starting with 1 and then incrementing from there.
-
-In this tutorial, we use the name "initial" as the first version. The [Create production machine learning pipelines](tutorial-pipeline-python-sdk.md) tutorial will also use this version of the data, so here we are using a value that you'll see again in that tutorial.
+You can also omit the **version** parameter. In this case, a version number is generated for you, starting with 1 and then incrementing from there.

+This tutorial uses the name "initial" as the first version. The [Create production machine learning pipelines](tutorial-pipeline-python-sdk.md) tutorial also uses this version of the data, so here we use a value that you'll see again in that tutorial.

 ```python
 from azure.ai.ml.entities import Data
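The two URI styles listed in the hunk above are plain string templates. As a runnable aside (not part of the commit), here is a stdlib-only sketch that fills them in; every account, container, and path name below is a hypothetical placeholder:

```python
# Compose the two cloud-storage URI styles the tutorial text describes.
# All names used here (account, container, filesystem, paths) are made-up examples.

def blob_https_uri(account: str, container: str, path: str) -> str:
    """Public HTTPS form: https://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>"""
    return f"https://{account}.blob.core.windows.net/{container}/{path}"

def adls_gen2_uri(filesystem: str, account: str, path: str) -> str:
    """Data Lake Gen 2 form: abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>"""
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/{path}"

print(blob_https_uri("mystorage", "datasets", "credit_card/default_of_credit_card_clients.csv"))
print(adls_gen2_uri("myfs", "mystorage", "credit_card/default_of_credit_card_clients.csv"))
```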
@@ -162,19 +157,24 @@ except:
     print(f"Data asset created. Name: {my_data.name}, version: {my_data.version}")
 ```

-You can see the uploaded data by selecting **Data** on the left. You'll see the data is uploaded and a data asset is created:
+To examine the uploaded data, select **Data** on the left. The data is uploaded and a data asset is created:

-:::image type="content" source="media/tutorial-prepare-data/access-and-explore-data.png" alt-text="Screenshot shows the data in studio.":::
+:::image type="content" source="media/tutorial-explore-data/access-and-explore-data.png" alt-text="Screenshot shows the data in studio.":::

-This data is named **credit-card**, and in the **Data assets** tab, we can see it in the **Name** column. This data uploaded to your workspace's default datastore named **workspaceblobstore**, seen in the **Data source** column.
+This data is named **credit-card**, and in the **Data assets** tab, we can see it in the **Name** column.

 An Azure Machine Learning datastore is a *reference* to an *existing* storage account on Azure. A datastore offers these benefits:

-1. A common and easy-to-use API, to interact with different storage types (Blob/Files/Azure Data Lake Storage) and authentication methods.
+1. A common and easy-to-use API, to interact with different storage types
+
+   - Azure Data Lake Storage
+   - Blob
+   - Files
+
+   and authentication methods.
 1. An easier way to discover useful datastores, when working as a team.
 1. In your scripts, a way to hide connection information for credential-based data access (service principal/SAS/key).

-
 ## Access your data in a notebook

 Pandas directly support URIs - this example shows how to read a CSV file from an Azure Machine Learning Datastore:
@@ -185,19 +185,17 @@ import pandas as pd
 df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
 ```

-However, as mentioned previously, it can become hard to remember these URIs. Additionally, you must manually substitute all **<_substring_>** values in the **pd.read_csv** command with the real values for your resources.
+However, as mentioned previously, it can become hard to remember these URIs. Additionally, you must manually substitute all **<_substring_>** values in the **pd.read_csv** command with the real values for your resources.

 You'll want to create data assets for frequently accessed data. Here's an easier way to access the CSV file in Pandas:

 > [!IMPORTANT]
 > In a notebook cell, execute this code to install the `azureml-fsspec` Python library in your Jupyter kernel:

-
 ```python
 %pip install -U azureml-fsspec
 ```

-
 ```python
 import pandas as pd

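The diff truncates this notebook cell just after `import pandas as pd`. As a hedged, runnable sketch of the access pattern the tutorial describes, the following uses an in-memory CSV as a stand-in for `data_asset.path` (the asset name and version in the comments are the tutorial's; the data values are invented):

```python
import io
import pandas as pd

# In the tutorial, the path comes from the registered data asset:
#   data_asset = ml_client.data.get(name="credit-card", version="initial")
#   df = pd.read_csv(data_asset.path)  # an azureml:// URI, resolved by azureml-fsspec
# Here an in-memory CSV stands in for data_asset.path so the sketch runs anywhere.
csv_stand_in = io.StringIO(
    "ID,LIMIT_BAL,default payment next month\n"
    "1,20000,1\n"
    "2,120000,0\n"
)
df = pd.read_csv(csv_stand_in)
print(df.head())
```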
@@ -211,18 +209,17 @@ df = pd.read_csv(data_asset.path)
 df.head()
 ```

-Read [Access data from Azure cloud storage during interactive development](how-to-access-data-interactive.md) to learn more about data access in a notebook.
+For more information about data access in a notebook, visit [Access data from Azure cloud storage during interactive development](how-to-access-data-interactive.md).

 ## Create a new version of the data asset

-You might have noticed that the data needs a little light cleaning, to make it fit to train a machine learning model. It has:
+The data needs some light cleaning, to make it fit to train a machine learning model. It has:

 * two headers
 * a client ID column; we wouldn't use this feature in Machine Learning
 * spaces in the response variable name

-Also, compared to the CSV format, the Parquet file format becomes a better way to store this data. Parquet offers compression, and it maintains schema. Therefore, to clean the data and store it in Parquet, use:
-
+Also, compared to the CSV format, the Parquet file format becomes a better way to store this data. Parquet offers compression, and it maintains schema. To clean the data and store it in Parquet, use:

 ```python
 # read in data again, this time using the 2nd row as the header
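The diff shows only the first line of the cleanup cell. A self-contained sketch of the three cleaning steps the bullet list names (second row as header, drop the client ID, remove spaces from the response name) on a tiny synthetic stand-in for the credit-card data; the column names here are illustrative, not the dataset's full schema:

```python
import io
import pandas as pd

# Synthetic stand-in for the credit-card CSV: an extra top header row,
# a client ID column, and a response column with spaces in its name.
raw = io.StringIO(
    "X1,X2,Y\n"                                  # first (throwaway) header row
    "ID,LIMIT_BAL,default payment next month\n"  # real header row
    "1,20000,1\n"
    "2,120000,0\n"
)

# Read the data again, this time using the 2nd row as the header.
df = pd.read_csv(raw, header=1)
# Drop the client ID column; it isn't a useful ML feature.
df = df.drop(columns=["ID"])
# Remove the spaces from the response variable name.
df = df.rename(columns={"default payment next month": "default_payment_next_month"})
# The tutorial's cell then writes the cleaned frame to Parquet, roughly:
#   df.to_parquet("./data/cleaned-credit-card.parquet")  # needs pyarrow or fastparquet
print(df.columns.tolist())
```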
@@ -250,9 +247,7 @@ This table shows the structure of the data in the original **default_of_credit_c
 |X18-23 | Explanatory | Amount of previous payment (NT dollar) from April to September 2005. |
 |Y | Response | Default payment (Yes = 1, No = 0) |

-Next, create a new _version_ of the data asset (the data automatically uploads to cloud storage). For this version, we'll add a time value, so that each time this code is run, a different version number will be created.
-
-
+Next, create a new _version_ of the data asset (the data automatically uploads to cloud storage). For this version, add a time value, so that each time this code runs, a different version number is created.

 ```python
 from azure.ai.ml.entities import Data
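The "time value" version mentioned above is a one-liner; here is a minimal sketch of the idiom (the exact format string in the notebook may differ):

```python
import time

# Derive the data asset version from the current time, so each run of the
# cell registers a fresh version instead of colliding with an existing one.
v2 = "cleaned" + time.strftime("%Y.%m.%d.%H%M%S", time.localtime())
print(v2)  # e.g. "cleaned2024.07.25.101530"
```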
@@ -283,7 +278,6 @@ print(f"Data asset created. Name: {my_data.name}, version: {my_data.version}")

 The cleaned parquet file is the latest version data source. This code shows the CSV version result set first, then the Parquet version:

-
 ```python
 import pandas as pd

@@ -307,16 +301,13 @@ print(v2df.head(5))

 <!-- nbend -->

-
-
-
 ## Clean up resources

 If you plan to continue now to other tutorials, skip to [Next steps](#next-steps).

 ### Stop compute instance

-If you're not going to use it now, stop the compute instance:
+If you don't plan to use it now, stop the compute instance:

 1. In the studio, in the left navigation area, select **Compute**.
 1. In the top tabs, select **Compute instances**
@@ -329,11 +320,11 @@ If you're not going to use it now, stop the compute instance:

 ## Next steps

-Read [Create data assets](how-to-create-data-assets.md) for more information about data assets.
+For more information about data assets, visit [Create data assets](how-to-create-data-assets.md).

-Read [Create datastores](how-to-datastore.md) to learn more about datastores.
+For more information about datastores, visit [Create datastores](how-to-datastore.md).

-Continue with tutorials to learn how to develop a training script.
+Continue with the next tutorial to learn how to develop a training script:

 > [!div class="nextstepaction"]
 > [Model development on a cloud workstation](tutorial-cloud-workstation.md)
