 # Customer intent: As an experienced Python developer, I need to make my data in Azure Storage available to my compute for training my machine learning models.
 ---

 # Connect to storage by using identity-based data access with SDK v1

-In this article, you learn how to connect to storage services on Azure by using identity-based data access and Azure Machine Learning datastores via the [Azure Machine Learning SDK for Python](/python/api/overview/azure/ml/intro).
+In this article, you'll learn how to connect to storage services on Azure, with identity-based data access and Azure Machine Learning datastores via the [Azure Machine Learning SDK for Python](/python/api/overview/azure/ml/intro).
+
+Typically, datastores use **credential-based authentication** to confirm you have permission to access the storage service. They keep connection information, like your subscription ID and token authorization, in the [key vault](https://azure.microsoft.com/services/key-vault/) that's associated with the workspace. When you create a datastore that uses **identity-based data access**, your Azure account ([Azure Active Directory token](../../active-directory/fundamentals/active-directory-whatis.md)) is used to confirm you have permission to access the storage service. In the **identity-based data access** scenario, no authentication credentials are saved. Only the storage account information is stored in the datastore.

-Typically, datastores use **credential-based authentication** to confirm you have permission to access the storage service. They keep connection information, like your subscription ID and token authorization, in the [key vault](https://azure.microsoft.com/services/key-vault/) that's associated with the workspace. When you create a datastore that uses **identity-based data access**, your Azure account ([Azure Active Directory token](../../active-directory/fundamentals/active-directory-whatis.md)) is used to confirm you have permission to access the storage service. In the **identity-based data access** scenario, no authentication credentials are saved. Only the storage account information is stored in the datastore.
-
 To create datastores with **identity-based** data access via the Azure Machine Learning studio UI, see [Connect to data with the Azure Machine Learning studio](how-to-connect-data-ui.md#create-datastores).

 To create datastores that use **credential-based** authentication, like access keys or service principals, see [Connect to storage services on Azure](how-to-access-data.md).
@@ -36,11 +36,11 @@ There are two scenarios in which you can apply identity-based data access in Azu
 ### Accessing storage services

-You can connect to storage services via identity-based data access with Azure Machine Learning datastores or [Azure Machine Learning datasets](how-to-create-register-datasets.md).
+You can connect to storage services via identity-based data access with Azure Machine Learning datastores or [Azure Machine Learning datasets](how-to-create-register-datasets.md).

-Your authentication credentials are usually kept in a datastore, which is used to ensure you have permission to access the storage service. When these credentials are registered via datastores, any user with the workspace Reader role can retrieve them. That scale of access can be a security concern for some organizations. [Learn more about the workspace Reader role.](../how-to-assign-roles.md#default-roles)
+Your authentication credentials are kept in a datastore, which is used to ensure you have permission to access the storage service. When these credentials are registered via datastores, any user with the workspace Reader role can retrieve them. That scale of access can be a security concern for some organizations. [Learn more about the workspace Reader role.](../how-to-assign-roles.md#default-roles)

-When you use identity-based data access, Azure Machine Learning prompts you for your Azure Active Directory token for data access authentication instead of keeping your credentials in the datastore. That approach allows for data access management at the storage level and keeps credentials confidential.
+When you use identity-based data access, Azure Machine Learning prompts you for your Azure Active Directory token for data access authentication, instead of keeping your credentials in the datastore. That approach allows for data access management at the storage level and keeps credentials confidential.

 The same behavior applies when you:
@@ -58,7 +58,7 @@ Certain machine learning scenarios involve training models with private data. In
 - An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://azure.microsoft.com/free/).

-- An Azure storage account with a supported storage type. These storage types are supported:
+- An Azure storage account with a supported storage type. These storage types are supported:
     - [Azure Data Lake Storage Gen1](../../data-lake-store/index.yml)
     - [Azure Data Lake Storage Gen2](../../storage/blobs/data-lake-storage-introduction.md)
@@ -68,18 +68,18 @@ Certain machine learning scenarios involve training models with private data. In
 - An Azure Machine Learning workspace.

-Either [create an Azure Machine Learning workspace](../how-to-manage-workspace.md) or use an [existing one via the Python SDK](../how-to-manage-workspace.md#connect-to-a-workspace).
+Either [create an Azure Machine Learning workspace](../how-to-manage-workspace.md) or use an [existing one via the Python SDK](../how-to-manage-workspace.md#connect-to-a-workspace).

 ## Create and register datastores

-When you register a storage service on Azure as a datastore, you automatically create and register that datastore to a specific workspace. See [Storage access permissions](#storage-access-permissions) for guidance on required permission types. You also have the option to manually create the storage you want to connect to without any special permissions, and you just need the name.
+When you register a storage service on Azure as a datastore, you automatically create and register that datastore to a specific workspace. See [Storage access permissions](#storage-access-permissions) for guidance on required permission types. You can also manually create the storage you want to connect to without any special permissions, and you just need the name.

 See [Work with virtual networks](#work-with-virtual-networks) for details on how to connect to data storage behind virtual networks.

 In the following code, notice the absence of authentication parameters like `sas_token`, `account_key`, `subscription_id`, and the service principal `client_id`. This omission indicates that Azure Machine Learning will use identity-based data access for authentication. Creation of datastores typically happens interactively in a notebook or via the studio. So your Azure Active Directory token is used for data access authentication.

 > [!NOTE]
-> Datastore names should consist only of lowercase letters, numbers, and underscores.
+> Datastore names should consist only of lowercase letters, numbers, and underscores.
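As a minimal sketch of the registration described above, the following Python passes only names to the registration call, so no `sas_token` or `account_key` is stored with the datastore. It assumes the SDK v1 `azureml-core` package and its `Datastore.register_azure_blob_container` method. The helper names and the example datastore, container, and account names are hypothetical placeholders. The name check mirrors the note about allowed datastore names.

```python
import re

# Credential parameters that stay unset for identity-based data access.
CREDENTIAL_PARAMS = ("sas_token", "account_key")


def identity_based_blob_args(datastore_name, container_name, account_name):
    """Build registration arguments that carry names only, never secrets."""
    # Datastore names: lowercase letters, numbers, and underscores only.
    if not re.fullmatch(r"[a-z0-9_]+", datastore_name):
        raise ValueError(f"invalid datastore name: {datastore_name!r}")
    return {
        "datastore_name": datastore_name,
        "container_name": container_name,
        "account_name": account_name,
    }


def register_identity_based_blob(workspace, **names):
    """Register the datastore; requires azureml-core (SDK v1) to be installed."""
    # Imported lazily so the helper above can run without the SDK.
    from azureml.core import Datastore

    return Datastore.register_azure_blob_container(
        workspace=workspace, **identity_based_blob_args(**names)
    )


# Hypothetical names, for illustration only.
args = identity_based_blob_args(
    "credentialless_blob", "my_container_name", "my_account_name"
)
```

With a workspace object `ws` in hand, `register_identity_based_blob(ws, datastore_name=..., container_name=..., account_name=...)` would perform the actual registration.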
To help ensure that you securely connect to your storage service on Azure, Azure Machine Learning requires that you have permission to access the corresponding data storage.
@@ -153,25 +152,25 @@ Identity-based data access supports connections to **only** the following storag
 * Azure Data Lake Storage Gen2
 * Azure SQL Database

-To access these storage services, you must have at least [Storage Blob Data Reader](../../role-based-access-control/built-in-roles.md#storage-blob-data-reader) access to the storage account. Only storage account owners can [change your access level via the Azure portal](../../storage/blobs/assign-azure-role-data-access.md).
+To access these storage services, you must have at least [Storage Blob Data Reader](../../role-based-access-control/built-in-roles.md#storage-blob-data-reader) access to the storage account. Only storage account owners can [change your access level via the Azure portal](../../storage/blobs/assign-azure-role-data-access.md).

-If you prefer to not use your user identity (Azure Active Directory), you also have the option to grant a workspace managed-system identity (MSI) permission to create the datastore. To do so, you must have Owner permissions to the storage account and add the `grant_workspace_access= True` parameter to your data register method.
+If you prefer to not use your user identity (Azure Active Directory), you can also grant a workspace managed-system identity (MSI) permission to create the datastore. To do so, you must have Owner permissions to the storage account and add the `grant_workspace_access= True` parameter to your data register method.

 If you're training a model on a remote compute target and want to access the data for training, the compute identity must be granted at least the Storage Blob Data Reader role from the storage service. Learn how to [set up managed identity on a compute cluster](how-to-create-attach-compute-cluster.md#set-up-managed-identity).
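A hedged sketch of the workspace MSI option: the same registration call with `grant_workspace_access=True`. It assumes `Datastore.register_azure_blob_container` from `azureml-core` (SDK v1) and a caller with Owner permissions on the storage account. The function name and its `register` parameter are hypothetical, added here only so the call can be tried without a live workspace.

```python
def register_blob_with_workspace_msi(workspace, datastore_name, container_name,
                                     account_name, register=None):
    """Register a blob datastore and grant the workspace managed identity access.

    `register` defaults to the SDK v1 method; it is a parameter only so the
    call can be exercised without a live workspace.
    """
    if register is None:
        # Requires azureml-core (SDK v1); imported lazily.
        from azureml.core import Datastore
        register = Datastore.register_azure_blob_container
    return register(
        workspace=workspace,
        datastore_name=datastore_name,
        container_name=container_name,
        account_name=account_name,
        # Needs Owner permissions on the storage account.
        grant_workspace_access=True,
    )
```

No `account_key` or `sas_token` is passed here either, so the datastore still holds no stored credentials.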

 ## Work with virtual networks

 By default, Azure Machine Learning can't communicate with a storage account that's behind a firewall or in a virtual network.

-You can configure storage accounts to allow access only from within specific virtual networks. This configuration requires additional steps to ensure data isn't leaked outside of the network. This behavior is the same for credential-based data access. For more information, see [How to configure virtual network scenarios](how-to-access-data.md#virtual-network).
+You can configure storage accounts to allow access only from within specific virtual networks. This configuration requires more steps, to ensure that data doesn't leak outside of the network. This behavior is the same for credential-based data access. For more information, see [How to configure virtual network scenarios](how-to-access-data.md#virtual-network).
+
+If your storage account has virtual network settings, they dictate the needed identity type and permissions access. For example, for data preview and data profile, the virtual network settings determine what type of identity is used to authenticate data access.

-If your storage account has virtual network settings, that dictates what identity type and permissions access is needed. For example, for data preview and data profile, the virtual network settings determine what type of identity is used to authenticate data access.
-
 * In scenarios where only certain IPs and subnets are allowed to access the storage, then Azure Machine Learning uses the workspace MSI to accomplish data previews and profiles.

-* If your storage is ADLS Gen 2 or Blob and has virtual network settings, customers can use either user identity or workspace MSI depending on the datastore settings defined during creation.
+* If your storage is ADLS Gen 2 or Blob and has virtual network settings, customers can use either user identity or workspace MSI depending on the datastore settings defined during creation.

-* If the virtual network setting is “Allow Azure services on the trusted services list to access this storage account”, then Workspace MSI is used.
+* If the virtual network setting is “Allow Azure services on the trusted services list to access this storage account”, then Workspace MSI is used.

 ## Use data in storage
@@ -182,12 +181,12 @@ We recommend that you use [Azure Machine Learning datasets](how-to-create-regist
 Datasets package your data into a lazily evaluated consumable object for machine learning tasks like training. Also, with datasets you can [download or mount](how-to-train-with-datasets.md#mount-vs-download) files of any format from Azure storage services like Azure Blob Storage and Azure Data Lake Storage to a compute target.

-To create a dataset, you can reference paths from datastores that also use identity-based data access .
+To create a dataset, you can reference paths from datastores that also use identity-based data access.

-* If your underlying storage account type is Blob or ADLS Gen 2, your user identity needs the Blob Reader role.
-* If your underlying storage is ADLS Gen 1, permissions can be set via the storage's Access Control List (ACL).
+* If your underlying storage account type is Blob or ADLS Gen 2, your user identity needs the Blob Reader role.
+* If your underlying storage is ADLS Gen 1, permissions can be set via the storage's Access Control List (ACL).

-In the following example, `blob_datastore` already exists and uses identity-based data access.
+In the following example, `blob_datastore` already exists and uses identity-based data access.
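A rough sketch of what such an example might look like, under stated assumptions: the datastore is retrieved by name with the SDK v1 `Datastore.get` method, and a tabular dataset is created from a path on it with `Dataset.Tabular.from_delimited_files`. The datastore name `credentialless_blob`, the file `test.csv`, and both helper functions are hypothetical.

```python
def dataset_paths(datastore, relative_paths):
    """Pair a datastore with relative file paths, as (datastore, path) tuples."""
    return [(datastore, path) for path in relative_paths]


def load_tabular_dataset(workspace, datastore_name="credentialless_blob",
                         files=("test.csv",)):
    """Create a TabularDataset from delimited files; requires azureml-core (SDK v1)."""
    # Imported lazily: only needed for the live call.
    from azureml.core import Dataset, Datastore

    blob_datastore = Datastore.get(workspace, datastore_name)
    # No credentials are passed; access is checked against the caller's identity.
    return Dataset.Tabular.from_delimited_files(
        path=dataset_paths(blob_datastore, files)
    )
```

Calling `load_tabular_dataset(ws)` interactively would authenticate with the caller's Azure Active Directory token, because the datastore holds no stored credentials.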
 When you submit a training job that consumes a dataset created with identity-based data access, the managed identity of the training compute is used for data access authentication. Your Azure Active Directory token isn't used. For this scenario, ensure that the managed identity of the compute is granted at least the Storage Blob Data Reader role from the storage service. For more information, see [Set up managed identity on compute clusters](how-to-create-attach-compute-cluster.md#set-up-managed-identity).
-
-## Access data for training jobs on compute clusters (preview)
-When training on [Azure Machine Learning compute clusters](how-to-create-attach-compute-cluster.md#what-is-a-compute-cluster), you can authenticate to storage with your Azure Active Directory token.
-
-This authentication mode allows you to:
-* Set up fine-grained permissions, where different workspace users can have access to different storage accounts or folders within storage accounts.
-* Audit storage access because the storage logs show which identities were used to access data.
-
-> [!IMPORTANT]
-> This functionality has the following limitations
-> * Feature is only supported for experiments submitted via the [Azure Machine Learning CLI](../how-to-configure-cli.md)
-> * Only CommandJobs, and PipelineJobs with CommandSteps and AutoMLSteps are supported
-> * User identity and compute managed identity cannot be used for authentication within same job.
-
-> [!WARNING]
-> This feature is __public preview__ and is __not secure for production workloads__. Ensure that only trusted users have permissions to access your workspace and storage accounts.
->
-> Preview features are provided without a service-level agreement, and are not recommended for production workloads. Certain features might not be supported or might have constrained capabilities.
->
-> For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).
-
-The following steps outline how to set up identity-based data access for training jobs on compute clusters.
-
-1. Grant the user identity access to storage resources. For example, grant StorageBlobReader access to the specific storage account you want to use or grant ACL-based permission to specific folders or files in Azure Data Lake Gen 2 storage.
-
-1. Create an Azure Machine Learning datastore without cached credentials for the storage account. If a datastore has cached credentials, such as storage account key, those credentials are used instead of user identity.
-
-1. Submit a training job with property **identity** set to **type: user_identity**, as shown in following job specification. During the training job, the authentication to storage happens via the identity of the user that submits the job.
-
-> [!NOTE]
-> If the **identity** property is left unspecified and datastore does not have cached credentials, then compute managed identity becomes the fallback option.