
Commit 0f7a820

Merge pull request #281801 from fbsolo-ms1/main
Freshness update for how-to-access-data-interactive.md . . .
2 parents cd5e16a + df3cf77

1 file changed: +28 -29 lines

articles/machine-learning/how-to-access-data-interactive.md

Lines changed: 28 additions & 29 deletions
@@ -9,7 +9,7 @@ ms.topic: how-to
 author: fbsolo-ms1
 ms.author: franksolomon
 ms.reviewer: samkemp
-ms.date: 09/05/2023
+ms.date: 07/24/2024
 ms.custom: sdkv2
 #Customer intent: As a professional data scientist, I want to know how to build and deploy a model with Azure Machine Learning by using Python in a Jupyter Notebook.
 ---
@@ -18,24 +18,24 @@ ms.custom: sdkv2
 
 [!INCLUDE [sdk v2](includes/machine-learning-sdk-v2.md)]
 
-A machine learning project typically starts with exploratory data analysis (EDA), data-preprocessing (cleaning, feature engineering), and includes building prototypes of ML models to validate hypotheses. This *prototyping* project phase is highly interactive in nature, and it lends itself to development in a Jupyter notebook, or an IDE with a *Python interactive console*. In this article you'll learn how to:
+A machine learning project typically starts with exploratory data analysis (EDA), data-preprocessing (cleaning, feature engineering), and it includes building ML model prototypes to validate hypotheses. This *prototyping* project phase is highly interactive in nature, and it lends itself to development in a Jupyter notebook, or in an IDE with a *Python interactive console*. In this article, learn how to:
 
 > [!div class="checklist"]
-> * Access data from a Azure Machine Learning Datastores URI as if it were a file system.
-> * Materialize data into Pandas using `mltable` Python library.
-> * Materialize Azure Machine Learning data assets into Pandas using `mltable` Python library.
+> * Access data from an Azure Machine Learning Datastores URI as if it were a file system.
+> * Materialize data into Pandas using the `mltable` Python library.
+> * Materialize Azure Machine Learning data assets into Pandas using the `mltable` Python library.
 > * Materialize data through an explicit download with the `azcopy` utility.
 
 ## Prerequisites
 
-* An Azure Machine Learning workspace. For more information, see [Manage Azure Machine Learning workspaces in the portal or with the Python SDK (v2)](how-to-manage-workspace.md).
-* An Azure Machine Learning Datastore. For more information, see [Create datastores](how-to-datastore.md).
+* An Azure Machine Learning workspace. For more information, visit [Manage Azure Machine Learning workspaces in the portal or with the Python SDK (v2)](how-to-manage-workspace.md).
+* An Azure Machine Learning Datastore. For more information, visit [Create datastores](how-to-datastore.md).
 
 > [!TIP]
-> The guidance in this article describes data access during interactive development. It applies to any host that can run a Python session. This can include your local machine, a cloud VM, a GitHub Codespace, etc. We recommend use of an Azure Machine Learning compute instance - a fully managed and pre-configured cloud workstation. For more information, see [Create an Azure Machine Learning compute instance](how-to-create-compute-instance.md).
+> The guidance in this article describes data access during interactive development. It applies to any host that can run a Python session. This can include your local machine, a cloud VM, a GitHub Codespace, etc. We recommend use of an Azure Machine Learning compute instance - a fully managed and pre-configured cloud workstation. For more information, visit [Create an Azure Machine Learning compute instance](how-to-create-compute-instance.md).
 
 > [!IMPORTANT]
-> Ensure you have the latest `azure-fsspec` and `mltable` python libraries installed in your python environment:
+> Ensure you have the latest `azure-fsspec` and `mltable` python libraries installed in your Python environment:
 >
 > ```bash
 > pip install -U azureml-fsspec mltable
@@ -66,10 +66,9 @@ path_on_datastore = '<path>'
 uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}'
 ```
 
-These Datastore URIs are a known implementation of the [Filesystem spec](https://filesystem-spec.readthedocs.io/en/latest/index.html) (`fsspec`): a unified pythonic interface to local, remote and embedded file systems and bytes storage.
-You can pip install the `azureml-fsspec` package and its dependency `azureml-dataprep` package. Then, you can use the Azure Machine Learning Datastore `fsspec` implementation.
+These Datastore URIs are a known implementation of the [Filesystem spec](https://filesystem-spec.readthedocs.io/en/latest/index.html) (`fsspec`): a unified pythonic interface to local, remote, and embedded file systems and bytes storage. First, pip install the `azureml-fsspec` package and its dependency `azureml-dataprep` package. Then, you can use the Azure Machine Learning Datastore `fsspec` implementation.
 
-The Azure Machine Learning Datastore `fsspec` implementation automatically handles the credential/identity passthrough that the Azure Machine Learning datastore uses. You can avoid both account key exposure in your scripts, and additional sign-in procedures, on a compute instance.
+The Azure Machine Learning Datastore `fsspec` implementation automatically handles the credential/identity passthrough that the Azure Machine Learning datastore uses. You can avoid both account key exposure in your scripts, and extra sign-in procedures, on a compute instance.
 
 For example, you can directly use Datastore URIs in Pandas. This example shows how to read a CSV file:
 
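The Pandas snippet itself is elided from this hunk's context. As a minimal sketch of such a read, assuming a hypothetical CSV under the datastore path (substitute every `<placeholder>`):

```python
import pandas as pd

# Hypothetical datastore URI; fill in your subscription, resource group,
# workspace, datastore, and file path.
df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<file>.csv")
df.head()
```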
@@ -81,7 +80,7 @@ df.head()
 ```
 
 > [!TIP]
-> Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI with these steps:
+> To avoid remembering the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI with these steps:
 > 1. Select **Data** from the left-hand menu, then select the **Datastores** tab.
 > 1. Select your datastore name, and then **Browse**.
 > 1. Find the file/folder you want to read into Pandas, and select the ellipsis (**...**) next to it. Select **Copy URI** from the menu. You can select the **Datastore URI** to copy into your notebook/script.
@@ -126,12 +125,12 @@ fs.upload(lpath='data/upload_files/crime-spring.csv', rpath='data/fsspec', recur
 fs.upload(lpath='data/upload_folder/', rpath='data/fsspec_folder', recursive=True, **{'overwrite': 'MERGE_WITH_OVERWRITE'})
 ```
 `lpath` is the local path, and `rpath` is the remote path.
-If the folders you specify in `rpath` do not exist yet, we will create the folders for you.
+If the folders you specify in `rpath` don't yet exist, we create the folders for you.
 
 We support three 'overwrite' modes:
-- APPEND: if a file with the same name exists in the destination path, this keeps the original file
-- FAIL_ON_FILE_CONFLICT: if a file with the same name exists in the destination path, this throws an error
-- MERGE_WITH_OVERWRITE: if a file with the same name exists in the destination path, this overwrites that existing file with the new file
+- APPEND: if a file with the same name exists in the destination path, APPEND keeps the original file
+- FAIL_ON_FILE_CONFLICT: if a file with the same name exists in the destination path, FAIL_ON_FILE_CONFLICT throws an error
+- MERGE_WITH_OVERWRITE: if a file with the same name exists in the destination path, MERGE_WITH_OVERWRITE overwrites that existing file with the new file
 
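As a minimal sketch of selecting one of these modes on an upload, assuming an `AzureMachineLearningFileSystem` rooted at a datastore URI (the URI placeholders and file paths here are hypothetical):

```python
from azureml.fsspec import AzureMachineLearningFileSystem

# Hypothetical datastore URI; fill in your own identifiers.
fs = AzureMachineLearningFileSystem(
    "azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/"
)

# Raise an error instead of overwriting if the remote file already exists.
fs.upload(
    lpath='data/upload_files/crime-spring.csv',
    rpath='data/fsspec',
    recursive=False,
    **{'overwrite': 'FAIL_ON_FILE_CONFLICT'}
)
```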
 ### Download files via AzureMachineLearningFileSystem
 ```python
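The hunk truncates the download sample. As a minimal sketch, assuming the `fs` object from the upload sketch above; `fs.download` mirrors `fs.upload`, with `rpath` as the remote source and `lpath` as the local destination (both paths are hypothetical):

```python
# Download one remote file into a local folder.
fs.download(rpath='data/fsspec/crime-spring.csv', lpath='data/download_files/', recursive=False)
```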
@@ -159,7 +158,7 @@ df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/worksp
 
 #### Read a folder of CSV files into Pandas
 
-The Pandas `read_csv()` method doesn't support reading a folder of CSV files. You must glob csv paths, and concatenate them to a data frame with the Pandas `concat()` method. The next code sample shows how to achieve this concatenation with the Azure Machine Learning filesystem:
+The Pandas `read_csv()` method doesn't support reading a folder of CSV files. To handle this, glob the csv paths, and concatenate them to a data frame with the Pandas `concat()` method. The next code sample shows how to achieve this concatenation with the Azure Machine Learning filesystem:
 
 ```python
 import pandas as pd
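The hunk truncates the sample. As a minimal sketch of the glob-and-concat pattern it describes, assuming the `fs` object from the earlier sketch and a hypothetical folder of CSV files:

```python
import pandas as pd

# Glob the CSV paths on the datastore, then concatenate into one data frame.
csv_paths = fs.glob('data/csvs/*.csv')
df = pd.concat((pd.read_csv(fs.open(p)) for p in csv_paths), ignore_index=True)
df.head()
```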
@@ -196,9 +195,9 @@ df.head()
 #### Read a folder of parquet files into Pandas
 As part of an ETL process, Parquet files are typically written to a folder, which can then emit files relevant to the ETL such as progress, commits, etc. This example shows files created from an ETL process (files beginning with `_`) which then produce a parquet file of data.
 
-:::image type="content" source="media/how-to-access-data-ci/parquet-auxillary.png" alt-text="Screenshot showing the parquet etl process.":::
+:::image type="content" source="media/how-to-access-data-ci/parquet-auxillary.png" alt-text="Screenshot showing the parquet ETL process.":::
 
-In these scenarios, you'll only read the parquet files in the folder, and ignore the ETL process files. This code sample shows how glob patterns can read only parquet files in a folder:
+In these scenarios, you only read the parquet files in the folder, and ignore the ETL process files. This code sample shows how glob patterns can read only parquet files in a folder:
 
 ```python
 import pandas as pd
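The hunk truncates the sample. As a minimal sketch of the glob pattern it describes, assuming the `fs` object from the earlier sketch and a hypothetical ETL output folder; `*.parquet` matches only the parquet data files and skips the underscore-prefixed ETL bookkeeping files:

```python
import pandas as pd

# Read only the parquet files; ETL progress/commit files don't match the glob.
parquet_paths = fs.glob('data/etl_output/*.parquet')
df = pd.concat((pd.read_parquet(fs.open(p)) for p in parquet_paths), ignore_index=True)
df.head()
```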
@@ -225,18 +224,18 @@ df.head()
 
 Filesystem spec (`fsspec`) has a range of [known implementations](https://filesystem-spec.readthedocs.io/en/stable/_modules/index.html), including the Databricks Filesystem (`dbfs`).
 
-To access data from `dbfs` you need:
+To access data from the `dbfs` resource, you need:
 
 - **Instance name**, in the form of `adb-<some-number>.<two digits>.azuredatabricks.net`. You can find this value in the URL of your Azure Databricks workspace.
-- **Personal Access Token (PAT)**; for more information about PAT creation, see [Authentication using Azure Databricks personal access tokens](/azure/databricks/dev-tools/api/latest/authentication)
+- **Personal Access Token (PAT)**; for more information about PAT creation, visit [Authentication using Azure Databricks personal access tokens](/azure/databricks/dev-tools/api/latest/authentication)
 
-With these values, you must create an environment variable on your compute instance for the PAT token:
+With these values, you must create an environment variable for the PAT token on your compute instance:
 
 ```bash
 export ADB_PAT=<pat_token>
 ```
 
-You can then access data in Pandas as shown in this example:
+You can then access data in Pandas, as shown in this example:
 
 ```python
 import os
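The hunk truncates the example. As a minimal sketch of a `dbfs` read through `fsspec`, assuming the `ADB_PAT` variable exported above; the instance name and CSV path are hypothetical placeholders:

```python
import os
import pandas as pd

# fsspec's Databricks implementation needs the workspace instance and a PAT.
storage_options = {
    'instance': 'adb-<some-number>.<two digits>.azuredatabricks.net',
    'token': os.getenv('ADB_PAT'),
}
df = pd.read_csv('dbfs://model/dependencies/data.csv', storage_options=storage_options)
df.head()
```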
@@ -334,7 +333,7 @@ class CustomImageDataset(Dataset):
 return image, label
 ```
 
-You can then instantiate the dataset as shown here:
+You can then instantiate the dataset, as shown here:
 
 ```python
 from azureml.fsspec import AzureMachineLearningFileSystem
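The hunk truncates the instantiation code. As a minimal sketch, assuming the `CustomImageDataset` defined above accepts a filesystem object plus annotation-file and image-folder paths (the constructor arguments and paths here are hypothetical), wrapped in a standard PyTorch `DataLoader`:

```python
from torch.utils.data import DataLoader

# Hypothetical datastore URI; fill in your own identifiers.
fs = AzureMachineLearningFileSystem(
    'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/'
)

dataset = CustomImageDataset(fs, 'data/annotations.csv', 'data/images/')
loader = DataLoader(dataset, batch_size=64, shuffle=True)
```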
@@ -398,7 +397,7 @@ The `mltable` library supports reading of tabular data from different path types
 > [!NOTE]
 > `mltable` does user credential passthrough for paths on Azure Storage and Azure Machine Learning datastores. If you do not have permission to access the data on the underlying storage, you cannot access the data.
 
-### Files, folders and globs
+### Files, folders, and globs
 
 `mltable` supports reading from:
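The list of supported path types follows in the unchanged article text. For context, a minimal sketch of materializing a delimited file into Pandas with `mltable` (the path is a hypothetical placeholder):

```python
import mltable

# Hypothetical datastore path; fill in your own identifiers.
paths = [{'file': 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<file>.csv'}]
tbl = mltable.from_delimited_files(paths)
df = tbl.to_pandas_dataframe()
df.head()
```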

@@ -486,7 +485,7 @@ df.head()
 ```
 
 > [!TIP]
-> Instead of remembering the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI with these steps:
+> To avoid remembering the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI with these steps:
 > 1. Select **Data** from the left-hand menu, then select the **Datastores** tab.
 > 1. Select your datastore name, and then **Browse**.
 > 1. Find the file/folder you want to read into Pandas, and select the ellipsis (**...**) next to it. Select **Copy URI** from the menu. You can select the **Datastore URI** to copy into your notebook/script.
@@ -650,7 +649,7 @@ df.head()
 
 ## A note on reading and processing large data volumes with Pandas
 > [!TIP]
-> Pandas is not designed to handle large datasets - Pandas can only process data that can fit into the memory of the compute instance.
+> Pandas is not designed to handle large datasets. Pandas can only process data that can fit into the memory of the compute instance.
 >
 > For large datasets, we recommend use of Azure Machine Learning managed Spark. This provides the [PySpark Pandas API](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html).
 
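For context on that recommendation, a minimal sketch of the PySpark Pandas API, run inside a managed Spark session (the storage path is a hypothetical placeholder):

```python
import pyspark.pandas as ps

# Pandas-like API backed by Spark, so data larger than single-node memory
# stays workable; substitute a real storage path.
df = ps.read_csv('abfss://<container>@<account>.dfs.core.windows.net/<folder>/')
df.head()
```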
@@ -678,7 +677,7 @@ You can also take subsets of large data with these operations:
 
 ## Downloading data using the `azcopy` utility
 
-Use the `azcopy` utility to download the data to the local SSD of your host (local machine, cloud VM, Azure Machine Learning Compute Instance), into the local filesystem. The `azcopy` utility, which is pre-installed on an Azure Machine Learning compute instance, will handle this. If you **don't** use an Azure Machine Learning compute instance or a Data Science Virtual Machine (DSVM), you may need to install `azcopy`. See [azcopy](../storage/common/storage-ref-azcopy.md) for more information.
+Use the `azcopy` utility to download the data to the local SSD of your host (local machine, cloud VM, Azure Machine Learning Compute Instance, etc.), into the local filesystem. The `azcopy` utility, which is preinstalled on an Azure Machine Learning compute instance, handles the data download. If you **don't** use an Azure Machine Learning compute instance or a Data Science Virtual Machine (DSVM), you might need to install `azcopy`. For more information, visit [azcopy](../storage/common/storage-ref-azcopy.md).
 
 > [!CAUTION]
 > We don't recommend data downloads into the `/home/azureuser/cloudfiles/code` location on a compute instance. This location is designed to store notebook and code artifacts, **not** data. Reading data from this location will incur significant performance overhead when training. Instead, we recommend data storage in the `home/azureuser`, which is the local SSD of the compute node.
