articles/machine-learning/how-to-access-data-interactive.md (28 additions & 29 deletions)
@@ -9,7 +9,7 @@ ms.topic: how-to
author: fbsolo-ms1
ms.author: franksolomon
ms.reviewer: samkemp
- ms.date: 09/05/2023
+ ms.date: 07/24/2024
ms.custom: sdkv2
#Customer intent: As a professional data scientist, I want to know how to build and deploy a model with Azure Machine Learning by using Python in a Jupyter Notebook.
- A machine learning project typically starts with exploratory data analysis (EDA), data-preprocessing (cleaning, feature engineering), and includes building prototypes of ML models to validate hypotheses. This *prototyping* project phase is highly interactive in nature, and it lends itself to development in a Jupyter notebook, or an IDE with a *Python interactive console*. In this article you'll learn how to:
+ A machine learning project typically starts with exploratory data analysis (EDA), data-preprocessing (cleaning, feature engineering), and it includes building ML model prototypes to validate hypotheses. This *prototyping* project phase is highly interactive in nature, and it lends itself to development in a Jupyter notebook, or in an IDE with a *Python interactive console*. In this article, learn how to:
> [!div class="checklist"]
- > * Access data from a Azure Machine Learning Datastores URI as if it were a file system.
- > * Materialize data into Pandas using `mltable` Python library.
- > * Materialize Azure Machine Learning data assets into Pandas using `mltable` Python library.
+ > * Access data from an Azure Machine Learning Datastores URI as if it were a file system.
+ > * Materialize data into Pandas using the `mltable` Python library.
+ > * Materialize Azure Machine Learning data assets into Pandas using the `mltable` Python library.
> * Materialize data through an explicit download with the `azcopy` utility.
## Prerequisites
- * An Azure Machine Learning workspace. For more information, see [Manage Azure Machine Learning workspaces in the portal or with the Python SDK (v2)](how-to-manage-workspace.md).
- * An Azure Machine Learning Datastore. For more information, see [Create datastores](how-to-datastore.md).
+ * An Azure Machine Learning workspace. For more information, visit [Manage Azure Machine Learning workspaces in the portal or with the Python SDK (v2)](how-to-manage-workspace.md).
+ * An Azure Machine Learning Datastore. For more information, visit [Create datastores](how-to-datastore.md).
> [!TIP]
- > The guidance in this article describes data access during interactive development. It applies to any host that can run a Python session. This can include your local machine, a cloud VM, a GitHub Codespace, etc. We recommend use of an Azure Machine Learning compute instance - a fully managed and pre-configured cloud workstation. For more information, see [Create an Azure Machine Learning compute instance](how-to-create-compute-instance.md).
+ > The guidance in this article describes data access during interactive development. It applies to any host that can run a Python session. This can include your local machine, a cloud VM, a GitHub Codespace, etc. We recommend use of an Azure Machine Learning compute instance - a fully managed and pre-configured cloud workstation. For more information, visit [Create an Azure Machine Learning compute instance](how-to-create-compute-instance.md).
> [!IMPORTANT]
- > Ensure you have the latest `azure-fsspec` and `mltable` python libraries installed in your python environment:
+ > Ensure you have the latest `azureml-fsspec` and `mltable` Python libraries installed in your Python environment:
>
> ```bash
> pip install -U azureml-fsspec mltable
@@ -66,10 +66,9 @@ path_on_datastore = '<path>'
uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}'
```
- These Datastore URIs are a known implementation of the [Filesystem spec](https://filesystem-spec.readthedocs.io/en/latest/index.html) (`fsspec`): a unified pythonic interface to local, remote and embedded file systems and bytes storage.
- You can pip install the `azureml-fsspec` package and its dependency `azureml-dataprep` package. Then, you can use the Azure Machine Learning Datastore `fsspec` implementation.
+ These Datastore URIs are a known implementation of the [Filesystem spec](https://filesystem-spec.readthedocs.io/en/latest/index.html) (`fsspec`): a unified pythonic interface to local, remote, and embedded file systems and bytes storage. First, pip install the `azureml-fsspec` package and its dependency `azureml-dataprep` package. Then, you can use the Azure Machine Learning Datastore `fsspec` implementation.
- The Azure Machine Learning Datastore `fsspec` implementation automatically handles the credential/identity passthrough that the Azure Machine Learning datastore uses. You can avoid both account key exposure in your scripts, and additional sign-in procedures, on a compute instance.
+ The Azure Machine Learning Datastore `fsspec` implementation automatically handles the credential/identity passthrough that the Azure Machine Learning datastore uses. You can avoid both account key exposure in your scripts, and extra sign-in procedures, on a compute instance.
For example, you can directly use Datastore URIs in Pandas. This example shows how to read a CSV file:
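As a minimal sketch of that direct read (the URI segments shown are assumed placeholders, not values taken from this diff):

```python
import pandas as pd

# With azureml-fsspec installed, Pandas can resolve azureml:// datastore URIs directly
df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<file_name>.csv")
df.head()
```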
@@ -81,7 +80,7 @@ df.head()
```
> [!TIP]
- > Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI with these steps:
+ > To avoid remembering the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI with these steps:
> 1. Select **Data** from the left-hand menu, then select the **Datastores** tab.
> 1. Select your datastore name, and then **Browse**.
> 1. Find the file/folder you want to read into Pandas, and select the ellipsis (**...**) next to it. Select **Copy URI** from the menu. You can select the **Datastore URI** to copy into your notebook/script.
`lpath` is the local path, and `rpath` is the remote path.
- If the folders you specify in `rpath` do not exist yet, we will create the folders for you.
+ If the folders you specify in `rpath` don't yet exist, we create the folders for you.
We support three 'overwrite' modes:
- - APPEND: if a file with the same name exists in the destination path, this keeps the original file
- - FAIL_ON_FILE_CONFLICT: if a file with the same name exists in the destination path, this throws an error
- - MERGE_WITH_OVERWRITE: if a file with the same name exists in the destination path, this overwrites that existing file with the new file
+ - APPEND: if a file with the same name exists in the destination path, APPEND keeps the original file
+ - FAIL_ON_FILE_CONFLICT: if a file with the same name exists in the destination path, FAIL_ON_FILE_CONFLICT throws an error
+ - MERGE_WITH_OVERWRITE: if a file with the same name exists in the destination path, MERGE_WITH_OVERWRITE overwrites that existing file with the new file
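As a minimal sketch of how an overwrite mode is passed on upload (the datastore URI, `lpath`, and `rpath` values here are assumed placeholders):

```python
from azureml.fsspec import AzureMachineLearningFileSystem

# Instantiate the filesystem for a datastore (placeholder URI)
fs = AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>')

# Upload a local folder; MERGE_WITH_OVERWRITE replaces any files that already exist under rpath
fs.upload(lpath='data/upload_files', rpath='data/fsspec', recursive=True, **{'overwrite': 'MERGE_WITH_OVERWRITE'})
```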
### Download files via AzureMachineLearningFileSystem
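A minimal sketch of a download call, assuming the same `fs` filesystem object as in the upload sketch above and placeholder paths:

```python
# Download a single file from the datastore to a local folder (placeholder paths)
fs.download(rpath='<folder_on_datastore>/<file_name>.csv', lpath='data/download_files/', recursive=False)
```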
- The Pandas `read_csv()` method doesn't support reading a folder of CSV files. You must glob csv paths, and concatenate them to a data frame with the Pandas `concat()` method. The next code sample shows how to achieve this concatenation with the Azure Machine Learning filesystem:
+ The Pandas `read_csv()` method doesn't support reading a folder of CSV files. To handle this, glob the csv paths, and concatenate them to a data frame with the Pandas `concat()` method. The next code sample shows how to achieve this concatenation with the Azure Machine Learning filesystem:
```python
import pandas as pd
@@ -196,9 +195,9 @@ df.head()
#### Read a folder of parquet files into Pandas
As part of an ETL process, Parquet files are typically written to a folder, which can then emit files relevant to the ETL such as progress, commits, etc. This example shows files created from an ETL process (files beginning with `_`) which then produce a parquet file of data.
- :::image type="content" source="media/how-to-access-data-ci/parquet-auxillary.png" alt-text="Screenshot showing the parquet etl process.":::
+ :::image type="content" source="media/how-to-access-data-ci/parquet-auxillary.png" alt-text="Screenshot showing the parquet ETL process.":::
- In these scenarios, you'll only read the parquet files in the folder, and ignore the ETL process files. This code sample shows how glob patterns can read only parquet files in a folder:
+ In these scenarios, you only read the parquet files in the folder, and ignore the ETL process files. This code sample shows how glob patterns can read only parquet files in a folder:
```python
import pandas as pd
@@ -225,18 +224,18 @@ df.head()
Filesystem spec (`fsspec`) has a range of [known implementations](https://filesystem-spec.readthedocs.io/en/stable/_modules/index.html), including the Databricks Filesystem (`dbfs`).
- To access data from `dbfs` you need:
+ To access data from the `dbfs` resource, you need:
- **Instance name**, in the form of `adb-<some-number>.<two digits>.azuredatabricks.net`. You can find this value in the URL of your Azure Databricks workspace.
- - **Personal Access Token (PAT)**; for more information about PAT creation, see [Authentication using Azure Databricks personal access tokens](/azure/databricks/dev-tools/api/latest/authentication)
+ - **Personal Access Token (PAT)**; for more information about PAT creation, visit [Authentication using Azure Databricks personal access tokens](/azure/databricks/dev-tools/api/latest/authentication)
- With these values, you must create an environment variable on your compute instance for the PAT token:
+ With these values, you must create an environment variable for the PAT token on your compute instance:
```bash
export ADB_PAT=<pat_token>
```
- You can then access data in Pandas as shown in this example:
+ You can then access data in Pandas, as shown in this example:
```python
import os
@@ -334,7 +333,7 @@ class CustomImageDataset(Dataset):
return image, label
```
- You can then instantiate the dataset as shown here:
+ You can then instantiate the dataset, as shown here:
```python
from azureml.fsspec import AzureMachineLearningFileSystem
@@ -398,7 +397,7 @@ The `mltable` library supports reading of tabular data from different path types
> [!NOTE]
> `mltable` does user credential passthrough for paths on Azure Storage and Azure Machine Learning datastores. If you do not have permission to access the data on the underlying storage, you cannot access the data.
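As a minimal sketch of materializing a tabular file into Pandas with `mltable` (the datastore URI segments shown are assumed placeholders):

```python
import mltable

# Define a table from a delimited file on a datastore, then load it into a Pandas data frame
path = {'file': 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<file_name>.csv'}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
```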
- ### Files, folders and globs
+ ### Files, folders, and globs
`mltable` supports reading from:
@@ -486,7 +485,7 @@ df.head()
```
> [!TIP]
- > Instead of remembering the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI with these steps:
+ > To avoid remembering the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI with these steps:
> 1. Select **Data** from the left-hand menu, then select the **Datastores** tab.
> 1. Select your datastore name, and then **Browse**.
> 1. Find the file/folder you want to read into Pandas, and select the ellipsis (**...**) next to it. Select **Copy URI** from the menu. You can select the **Datastore URI** to copy into your notebook/script.
@@ -650,7 +649,7 @@ df.head()
## A note on reading and processing large data volumes with Pandas
> [!TIP]
- > Pandas is not designed to handle large datasets - Pandas can only process data that can fit into the memory of the compute instance.
+ > Pandas is not designed to handle large datasets. Pandas can only process data that can fit into the memory of the compute instance.
>
> For large datasets, we recommend use of Azure Machine Learning managed Spark. This provides the [PySpark Pandas API](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html).
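As a minimal sketch of that pandas-on-Spark API (the path is an assumed placeholder, and the code assumes it runs inside an active Spark session, such as Azure Machine Learning managed Spark):

```python
import pyspark.pandas as ps

# pandas-like API, but computation is distributed across the Spark cluster rather than held in one node's memory
df = ps.read_parquet('<path_to_parquet_folder>')
df.head()
```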
@@ -678,7 +677,7 @@ You can also take subsets of large data with these operations:
## Downloading data using the `azcopy` utility
- Use the `azcopy` utility to download the data to the local SSD of your host (local machine, cloud VM, Azure Machine Learning Compute Instance), into the local filesystem. The `azcopy` utility, which is pre-installed on an Azure Machine Learning compute instance, will handle this. If you **don't** use an Azure Machine Learning compute instance or a Data Science Virtual Machine (DSVM), you may need to install `azcopy`. See [azcopy](../storage/common/storage-ref-azcopy.md) for more information.
+ Use the `azcopy` utility to download the data to the local SSD of your host (local machine, cloud VM, Azure Machine Learning Compute Instance, etc.), into the local filesystem. The `azcopy` utility, which is preinstalled on an Azure Machine Learning compute instance, handles the data download. If you **don't** use an Azure Machine Learning compute instance or a Data Science Virtual Machine (DSVM), you might need to install `azcopy`. For more information, visit [azcopy](../storage/common/storage-ref-azcopy.md).
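A minimal sketch of such a download (the storage URL and local path are assumed placeholders; `azcopy login` assumes you authenticate with Microsoft Entra ID rather than a SAS token):

```bash
# Authenticate, create a local data folder on the SSD, and copy a folder down recursively
azcopy login
mkdir -p /home/azureuser/data
azcopy copy 'https://<storage_account>.blob.core.windows.net/<container>/<path>' '/home/azureuser/data' --recursive
```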
> [!CAUTION]
> We don't recommend data downloads into the `/home/azureuser/cloudfiles/code` location on a compute instance. This location is designed to store notebook and code artifacts, **not** data. Reading data from this location will incur significant performance overhead when training. Instead, we recommend data storage in the `home/azureuser`, which is the local SSD of the compute node.