
Commit 0b41545

Merge pull request #105152 from nibaccam/adls-g2
Data | address bugs and clarifications
2 parents 4d5a630 + 6a67af2 commit 0b41545

File tree: 3 files changed (+21 −8 lines)

articles/machine-learning/how-to-access-data.md

Lines changed: 15 additions & 5 deletions
@@ -76,7 +76,7 @@ When you register an Azure Storage solution as a datastore, you automatically cr

>[!IMPORTANT]
> As part of the current datastore create and register process, Azure Machine Learning validates that the user-provided principal (username, service principal, or SAS token) has access to the underlying storage service.
- <br>
+ <br><br>
However, for Azure Data Lake Storage Gen 1 and 2 datastores, this validation happens later, when data access methods like [`from_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.filedatasetfactory?view=azure-ml-py) or [`from_delimited_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#from-parquet-files-path--validate-true--include-path-false--set-column-types-none--partition-format-none-) are called.
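For example, with an already registered ADLS Gen 2 datastore, the credential check only fires when the dataset factory method runs. A minimal sketch (the datastore name and path are placeholders, not from the article):

```python
from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()

# Registering an ADLS Gen 1/2 datastore succeeds even if the service
# principal cannot yet reach the underlying storage...
adls_datastore = Datastore.get(ws, "adlsgen2_datastore")  # placeholder datastore name

# ...the access validation happens here, when the data is first referenced.
dataset = Dataset.File.from_files(path=(adls_datastore, "datasets/iris/*.csv"))  # placeholder path
```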

### Python SDK
@@ -86,10 +86,13 @@ All the register methods are on the [`Datastore`](https://docs.microsoft.com/pyt

You can find the information that you need to populate the `register()` method by using the [Azure portal](https://portal.azure.com):

1. Select **Storage Accounts** on the left pane, and choose the storage account that you want to register.
- 2. For information like the account name, container, and file share name, go to the **Overview** page. For authentication information, like account key or SAS token, go to **Access Keys** on the **Settings** pane.
+ 2. For information like the account name, container, and file share name, go to the **Overview** page.
+ 3. For authentication information, like account key or SAS token, go to **Access Keys** on the **Settings** pane.
+ 4. For service principal items, like tenant ID and client ID, go to the **Overview** page of your **App registrations**.

> [!IMPORTANT]
- > If your storage account is in a virtual network, only the creation of an Azure blob datastore is supported. To grant your workspace access to your storage account, set the parameter `grant_workspace_access` to `True`.
+ > If your storage account is in a virtual network, only the creation of Blob, File share, ADLS Gen 1, and ADLS Gen 2 datastores **via the SDK** is supported. To grant your workspace access to your storage account, set the parameter `grant_workspace_access` to `True`.

The following examples show how to register an Azure blob container, an Azure file share, and Azure Data Lake Storage Generation 2 as a datastore. For other storage services, see the [reference documentation for the `register_azure_*` methods](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py#methods).
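For reference, a minimal sketch of the blob container registration the note above applies to (the account, key, and container names are placeholders; the article's full examples sit in the unchanged lines of the file, not shown in this diff):

```python
from azureml.core import Workspace
from azureml.core.datastore import Datastore

ws = Workspace.from_config()

blob_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="blob_datastore_name",      # placeholder datastore name
    container_name="my-container",             # placeholder container
    account_name="mystorageaccount",           # placeholder storage account
    account_key="<account-key>",               # placeholder key
    grant_workspace_access=True)               # needed when the account sits behind a virtual network
```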

@@ -133,7 +136,7 @@ file_datastore = Datastore.register_azure_file_share(workspace=ws,
#### Azure Data Lake Storage Generation 2

- For an Azure Data Lake Storage Generation 2 (ADLS Gen 2) datastore, use [register_azure_data_lake_gen2()](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py#register-azure-data-lake-gen2-workspace--datastore-name--filesystem--account-name--tenant-id--client-id--client-secret--resource-url-none--authority-url-none--protocol-none--endpoint-none--overwrite-false-) to register a credential datastore connected to an Azure DataLake Gen 2 storage with [service principal permissions](https://docs.microsoft.com/azure/active-directory/develop/howto-create-service-principal-portal). Learn more about [access control set up for ADLS Gen 2](https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-access-control).
+ For an Azure Data Lake Storage Generation 2 (ADLS Gen 2) datastore, use [register_azure_data_lake_gen2()](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py#register-azure-data-lake-gen2-workspace--datastore-name--filesystem--account-name--tenant-id--client-id--client-secret--resource-url-none--authority-url-none--protocol-none--endpoint-none--overwrite-false-) to register a credential datastore connected to an Azure Data Lake Gen 2 storage with [service principal permissions](https://docs.microsoft.com/azure/active-directory/develop/howto-create-service-principal-portal). To use your service principal, you first need to [register your application](https://docs.microsoft.com/azure/active-directory/develop/app-objects-and-service-principals). Learn more about [access control set up for ADLS Gen 2](https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-access-control).

The following code creates and registers the `adlsgen2_datastore_name` datastore to the `ws` workspace. This datastore accesses the file system `test` on the `account_name` storage account by using the provided service principal credentials.
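A minimal sketch of that call (the tenant ID, client ID, and client secret are placeholders for your registered application's values; the article's full snippet sits in the unchanged lines referenced by the next hunk):

```python
from azureml.core import Workspace
from azureml.core.datastore import Datastore

ws = Workspace.from_config()

adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name="adlsgen2_datastore_name",
    filesystem="test",                     # file system (container) named in the article
    account_name="account_name",
    tenant_id="<tenant-id>",               # placeholder: directory (tenant) ID of the service principal
    client_id="<client-id>",               # placeholder: application (client) ID
    client_secret="<client-secret>")       # placeholder: client secret
```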

@@ -161,12 +164,19 @@ adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(workspace=ws,
Create a new datastore in a few steps in Azure Machine Learning studio:

+ > [!IMPORTANT]
+ > If your storage account is in a virtual network, only the creation of datastores [via the SDK](#python-sdk) is supported.

1. Sign in to [Azure Machine Learning studio](https://ml.azure.com/).
1. Select **Datastores** on the left pane under **Manage**.
1. Select **+ New datastore**.
1. Complete the form for a new datastore. The form intelligently updates itself based on your selections for Azure Storage type and authentication type.

- You can find the information that you need to populate the form on the [Azure portal](https://portal.azure.com). Select **Storage Accounts** on the left pane, and choose the storage account that you want to register. The **Overview** page provides information such as the account name, container, and file share name. For authentication items, like account key or SAS token, go to **Account Keys** on the **Settings** pane.
+ You can find the information that you need to populate the form on the [Azure portal](https://portal.azure.com). Select **Storage Accounts** on the left pane, and choose the storage account that you want to register. The **Overview** page provides information such as the account name, container, and file share name.
+ * For authentication items, like account key or SAS token, go to **Account Keys** on the **Settings** pane.
+ * For service principal items, like tenant ID and client ID, go to the **Overview** page of your **App registrations**.

The following example demonstrates what the form looks like when you create an Azure blob datastore:

articles/machine-learning/how-to-create-your-first-pipeline.md

Lines changed: 2 additions & 0 deletions
@@ -337,6 +337,8 @@ pipeline1 = Pipeline(workspace=ws, steps=steps)
To use either a `TabularDataset` or `FileDataset` in your pipeline, you need to turn it into a [DatasetConsumptionConfig](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_consumption_config.datasetconsumptionconfig?view=azure-ml-py) object by calling [as_named_input(name)](https://docs.microsoft.com/python/api/azureml-core/azureml.data.abstract_dataset.abstractdataset?view=azure-ml-py#as-named-input-name-). You pass this `DatasetConsumptionConfig` object as one of the `inputs` to your pipeline step.

+ Datasets created from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL can be used as input to any pipeline step. With the exception of writing output to a [DataTransferStep](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.datatransferstep?view=azure-ml-py) or [DatabricksStep](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py), output data ([PipelineData](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py)) can only be written to Azure Blob and Azure File share datastores.
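To illustrate the output side of that note, here is a rough sketch (the compute target, script, and folder names are placeholders, not from the article) of a `PipelineData` output bound to the workspace's default blob datastore; the article's own input example continues below:

```python
from azureml.core import Workspace
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# PipelineData output must land on a blob or file share datastore;
# the workspace default datastore is an Azure Blob container.
prepared_data = PipelineData("prepared_data", datastore=ws.get_default_datastore())

prep_step = PythonScriptStep(
    script_name="prep.py",              # placeholder script
    source_directory="scripts",         # placeholder folder
    arguments=["--output", prepared_data],
    outputs=[prepared_data],
    compute_target="cpu-cluster")       # placeholder compute target
```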
```python
dataset_consuming_step = PythonScriptStep(
    script_name="iris_train.py",

articles/machine-learning/how-to-train-with-datasets.md

Lines changed: 4 additions & 3 deletions
@@ -25,7 +25,7 @@ In this article, you learn the two ways to consume [Azure Machine Learning datas
- Option 2: If you have unstructured data, create a FileDataset and mount or download files to a remote compute for training.

- Azure Machine Learning datasets provide a seamless integration with Azure Machine Learning training products like [ScriptRun](https://docs.microsoft.com/python/api/azureml-core/azureml.core.scriptrun?view=azure-ml-py), [Estimator](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.estimator?view=azure-ml-py) and [HyperDrive](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.hyperdrive?view=azure-ml-py).
+ Azure Machine Learning datasets provide a seamless integration with Azure Machine Learning training products like [ScriptRun](https://docs.microsoft.com/python/api/azureml-core/azureml.core.scriptrun?view=azure-ml-py), [Estimator](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.estimator?view=azure-ml-py), [HyperDrive](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.hyperdrive?view=azure-ml-py), and [Azure Machine Learning pipelines](how-to-create-your-first-pipeline.md).

## Prerequisites
@@ -98,11 +98,12 @@ experiment_run = experiment.submit(est)
experiment_run.wait_for_completion(show_output=True)
```

## Option 2: Mount files to a remote compute target

If you want to make your data files available on the compute target for training, use [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) to mount or download the files it references.

- ### Mount v.s. Download
+ ### Mount vs. Download

When you mount a dataset, you attach the files referenced by the dataset to a directory (mount point) and make them available on the compute target. Mounting is supported for Linux-based computes, including Azure Machine Learning Compute, virtual machines, and HDInsight. If your data size exceeds the compute disk size, or you only load part of the dataset in your script, mounting is recommended: downloading a dataset bigger than the disk size will fail, whereas mounting only loads the part of the data your script uses at the time of processing.

When you download a dataset, all the files referenced by the dataset are downloaded to the compute target. Downloading is supported for all compute types. If your script processes all the files referenced by the dataset and your full dataset fits on the compute disk, downloading is recommended to avoid the overhead of streaming data from storage services.
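As a rough illustration of the trade-off (the dataset, compute target, script, and folder names are placeholders), the same `FileDataset` can be handed to an estimator either mounted or downloaded:

```python
from azureml.core import Dataset, Workspace
from azureml.train.estimator import Estimator

ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, name="mnist_files")  # placeholder registered FileDataset

# Mount: files are streamed on demand at the mount point (Linux-based compute only).
mounted_input = dataset.as_named_input("mnist").as_mount()

# Download: every referenced file is copied to the compute disk before training starts.
downloaded_input = dataset.as_named_input("mnist").as_download()

est = Estimator(source_directory="scripts",     # placeholder folder
                entry_script="train.py",        # placeholder training script
                compute_target="gpu-cluster",   # placeholder compute target
                inputs=[mounted_input])         # swap in downloaded_input to download instead
```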
@@ -197,4 +198,4 @@ The [dataset notebooks](https://aka.ms/dataset-tutorial) demonstrate and expand
* [Train image classification models](https://aka.ms/filedataset-samplenotebook) with FileDatasets

- * [Create and manage environments for training and deployment](how-to-use-environments.md)
+ * [Train with datasets using pipelines](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datasets-tutorial/pipeline-with-datasets/pipeline-for-image-classification.ipynb)
