
Commit 1eae127

samuel100 authored and facebook-github-bot committed
azureml datastore example (#946)
Summary:

### Changes
- Added a section to the tutorial documentation on Accessing Azure ML Datastores with ``fsspec`` DataPipes.

Fixes #946

Pull Request resolved: #946
Reviewed By: ejguan
Differential Revision: D43875104
Pulled By: NivekT
fbshipit-source-id: fa88b8540700a91a15c11dc718b29bd4f8638f46
1 parent 1d81a61 commit 1eae127

File tree: 1 file changed

docs/source/dp_tutorial.rst (+107, -0 lines changed)
@@ -436,3 +436,110 @@ directory ``curated/covid-19/ecdc_cases/latest``, belonging to account ``pandemi

If necessary, you can also access data in Azure Data Lake Storage Gen1 and Gen2 by using URIs starting with
``adl://`` and ``abfs://``, respectively, as described in the `README of the adlfs repo <https://github.com/fsspec/adlfs/blob/main/README.md>`_
Accessing Azure ML Datastores with ``fsspec`` DataPipes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An Azure ML datastore is a *reference* to an existing storage account on Azure. The key benefits of creating and using an Azure ML datastore are:

- A common and easy-to-use API to interact with different storage types in Azure (Blob/Files/ADLS).
- Easier discovery of useful datastores when working as a team.
- Authentication is handled automatically: both *credential-based* access (service principal/SAS/key) and *identity-based* access (Azure Active Directory/managed identity) are supported. With credential-based authentication, you do not need to expose secrets in your code.
This requires the installation of the library ``azureml-fsspec``
(`documentation <https://learn.microsoft.com/python/api/azureml-fsspec/?view=azure-ml-py>`_).
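The library is published on PyPI; a typical installation (assuming a standard Python environment) would be:

.. code:: bash

    pip install azureml-fsspec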
You can access data in an Azure ML datastore by providing URIs starting with ``azureml://``.
For example,
`FSSpecFileLister <generated/torchdata.datapipes.iter.FSSpecFileLister.html>`_ (``.list_files_by_fsspec(...)``)
can be used to list files in a directory in a container:
.. code:: python

    from torchdata.datapipes.iter import IterableWrapper

    # set the subscription_id, resource_group, and AzureML workspace_name
    subscription_id = "<subscription_id>"
    resource_group = "<resource_group>"
    workspace_name = "<workspace_name>"

    # set the datastore name and path on the datastore
    datastore_name = "<datastore_name>"
    path_on_datastore = "<path_on_datastore>"

    uri = f"azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace_name}/datastores/{datastore_name}/paths/{path_on_datastore}"

    dp = IterableWrapper([uri]).list_files_by_fsspec()
    print(list(dp))
    # ['azureml:///<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore>/paths/<folder>/file1.txt',
    #  'azureml:///<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore>/paths/<folder>/file2.txt', ...]
You can also open files using `FSSpecFileOpener <generated/torchdata.datapipes.iter.FSSpecFileOpener.html>`_
(``.open_files_by_fsspec(...)``) and stream them
(if supported by the file format).
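For instance, the listing from the previous example can be chained directly into ``.open_files_by_fsspec(...)``; a minimal sketch (assuming ``<path_on_datastore>`` points at a folder of small text files, with ``uri`` as defined above):

.. code:: python

    from torchdata.datapipes.iter import IterableWrapper

    # chain the lister into the opener; the pipe yields (path, stream) pairs
    dp = IterableWrapper([uri]).list_files_by_fsspec().open_files_by_fsspec(mode="rb")
    for path, stream in dp:
        print(path, stream.read()[:80])  # print the first 80 bytes of each file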
Here is an example of loading a tar file from the default Azure ML datastore ``workspaceblobstore``, where the path is ``/cifar-10-python.tar.gz`` (top-level folder).
.. code:: python

    from torchdata.datapipes.iter import IterableWrapper

    # set the subscription_id, resource_group, and AzureML workspace_name
    subscription_id = "<subscription_id>"
    resource_group = "<resource_group>"
    workspace_name = "<workspace_name>"

    # set the datastore name and path on the datastore
    datastore_name = "workspaceblobstore"
    path_on_datastore = "cifar-10-python.tar.gz"

    uri = f"azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace_name}/datastores/{datastore_name}/paths/{path_on_datastore}"

    dp = IterableWrapper([uri]) \
        .open_files_by_fsspec(mode="rb") \
        .load_from_tar()

    for path, filestream in dp:
        print(path)
    # ['azureml:/subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore>/paths/cifar-10-python.tar.gz/cifar-10-batches-py/data_batch_4',
    #  'azureml:/subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore>/paths/cifar-10-python.tar.gz/cifar-10-batches-py/readme.html',
    #  'azureml:/subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore>/paths/cifar-10-python.tar.gz/cifar-10-batches-py/test_batch',
    #  'azureml:/subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore>/paths/cifar-10-python.tar.gz/cifar-10-batches-py/data_batch_3',
    #  'azureml:/subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore>/paths/cifar-10-python.tar.gz/cifar-10-batches-py/batches.meta',
    #  'azureml:/subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore>/paths/cifar-10-python.tar.gz/cifar-10-batches-py/data_batch_2',
    #  'azureml:/subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore>/paths/cifar-10-python.tar.gz/cifar-10-batches-py/data_batch_5',
    #  'azureml:/subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore>/paths/cifar-10-python.tar.gz/cifar-10-batches-py/data_batch_1']
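Each ``filestream`` is a readable file-like object, so the extracted members can be decoded in place; a minimal sketch (assuming the standard pickled CIFAR-10 "python version" batch format, with ``dp`` from the example above):

.. code:: python

    import pickle

    # unpickle each CIFAR-10 data batch directly from the tar stream
    for path, filestream in dp:
        if "data_batch" in path:
            batch = pickle.load(filestream, encoding="latin1")
            print(path, batch["data"].shape)  # e.g. (10000, 3072)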
Here is an example of loading a CSV file - the famous Titanic dataset (`download <https://raw.githubusercontent.com/Azure/azureml-examples/main/cli/assets/data/sample-data/titanic.csv>`_) - from the Azure ML datastore ``workspaceblobstore``, where the path is ``/titanic.csv`` (top-level folder).
.. code:: python

    import numpy as np

    from torchdata.datapipes.iter import IterableWrapper

    # set the subscription_id, resource_group, and AzureML workspace_name
    subscription_id = "<subscription_id>"
    resource_group = "<resource_group>"
    workspace_name = "<workspace_name>"

    # set the datastore name and path on the datastore
    datastore_name = "workspaceblobstore"
    path_on_datastore = "titanic.csv"

    uri = f"azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace_name}/datastores/{datastore_name}/paths/{path_on_datastore}"

    def row_processor(row):
        # return the label and data (the class and age of the passenger)
        # if the age is missing, set it to 50
        if row[5] == "":
            row[5] = 50.0
        return {"label": np.array(row[1], np.int32), "data": np.array([row[2], row[5]], dtype=np.float32)}

    dp = IterableWrapper([uri]) \
        .open_files_by_fsspec() \
        .parse_csv(delimiter=",", skip_lines=1) \
        .map(row_processor)

    print(list(dp)[:3])
    # [{'label': array(0, dtype=int32), 'data': array([ 3., 22.], dtype=float32)},
    #  {'label': array(1, dtype=int32), 'data': array([ 1., 38.], dtype=float32)},
    #  {'label': array(1, dtype=int32), 'data': array([ 3., 26.], dtype=float32)}]
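Finally, because the result is an ``IterDataPipe``, it can be handed to a standard PyTorch ``DataLoader``; a minimal sketch (``batch_size=10`` is an arbitrary illustrative choice):

.. code:: python

    from torch.utils.data import DataLoader

    # the default collate function batches the dicts of numpy arrays into tensors
    dl = DataLoader(dataset=dp, batch_size=10)
    first = next(iter(dl))
    print(first["label"].shape)  # e.g. torch.Size([10])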
