Commit fae2bbd

Merge pull request #194288 from nibaccam/train-id-access
ID based data access | Train with compute clusters
2 parents e479d77 + df29fd2 commit fae2bbd

1 file changed: +44 -0 lines changed

articles/machine-learning/how-to-identity-based-data-access.md

Lines changed: 44 additions & 0 deletions
@@ -201,6 +201,50 @@ blob_dset = Dataset.File.from_files('https://myblob.blob.core.windows.net/may/ke

When you submit a training job that consumes a dataset created with identity-based data access, the managed identity of the training compute is used for data access authentication. Your Azure Active Directory token isn't used. For this scenario, ensure that the managed identity of the compute is granted at least the Storage Blob Data Reader role from the storage service. For more information, see [Set up managed identity on compute clusters](how-to-create-attach-compute-cluster.md#managed-identity).

## Access data for training jobs on compute clusters (preview)

[!INCLUDE [cli v2](../../includes/machine-learning-cli-v2.md)]

[!INCLUDE [preview disclaimer](../../includes/machine-learning-preview-generic-disclaimer.md)]

When training on [Azure Machine Learning compute clusters](how-to-create-attach-compute-cluster.md#what-is-a-compute-cluster), you can authenticate to storage with your Azure Active Directory token.

This authentication mode allows you to:
* Set up fine-grained permissions, where different workspace users can have access to different storage accounts or folders within storage accounts.
* Audit storage access because the storage logs show which identities were used to access data.

> [!WARNING]
> This functionality has the following limitations:
> * This feature is supported only for experiments submitted via the [Azure Machine Learning CLI v2 (preview)](how-to-configure-cli.md).
> * Only CommandJobs, and PipelineJobs with CommandSteps and AutoMLSteps, are supported.
> * User identity and compute managed identity can't both be used for authentication within the same job.

The following steps outline how to set up identity-based data access for training jobs on compute clusters.

1. Grant the user identity access to storage resources. For example, grant the Storage Blob Data Reader role on the specific storage account you want to use, or grant ACL-based permissions to specific folders or files in Azure Data Lake Storage Gen2.

1. Create an Azure Machine Learning datastore without cached credentials for the storage account. If a datastore has cached credentials, such as a storage account key, those credentials are used instead of the user identity.

1. Submit a training job with the **identity** property set to **type: user_identity**, as shown in the following job specification. During the training job, authentication to storage happens via the identity of the user who submits the job.

> [!NOTE]
> If the **identity** property is left unspecified and the datastore doesn't have cached credentials, the compute managed identity becomes the fallback option.

```yaml
command: |
  echo "--census-csv: ${{inputs.census_csv}}"
  python hello-census.py --census-csv ${{inputs.census_csv}}
code: src
inputs:
  census_csv:
    type: uri_file
    path: azureml://datastores/mydata/paths/census.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster
identity:
  type: user_identity
```
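
The job specification runs `hello-census.py` with a `--census-csv` argument, but the script itself isn't part of this change. The following is a minimal, hypothetical stand-in that shows how such a script could consume the input. It assumes that `${{inputs.census_csv}}` resolves to a readable file path at run time and that the CSV file has a header row; the function names and the printed format are illustrative only.

```python
import argparse
import csv


def count_rows(csv_path: str) -> int:
    """Count the data rows in a CSV file, skipping one header row (assumed)."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row if the file isn't empty
        return sum(1 for _ in reader)


def main(argv=None) -> None:
    """Parse --census-csv and print the number of data rows in that file."""
    parser = argparse.ArgumentParser(description="Hypothetical hello-census.py")
    parser.add_argument(
        "--census-csv",
        required=True,
        help="Path to the census CSV file passed in by the job",
    )
    args = parser.parse_args(argv)
    print(f"rows: {count_rows(args.census_csv)}")
```

To use this sketch as the job's entry point, call `main()` at the bottom of the file under an `if __name__ == "__main__":` guard. Because the file is opened with ordinary Python file APIs, the script contains no credential handling of its own.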
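
The precedence among cached datastore credentials, user identity, and compute managed identity described in this section can be summarized as a small decision function. This is only a plain-Python model of the rules stated in this article, not the service's implementation. The function name and the returned labels are hypothetical.

```python
from typing import Optional


def resolve_storage_auth(
    identity_type: Optional[str],
    datastore_has_cached_credentials: bool,
) -> str:
    """Return a label for the credential used to reach storage.

    Models only the rules stated in this section; the labels are illustrative.
    """
    # Cached datastore credentials (for example, an account key) are used
    # instead of the user identity. Applying the same rule when no identity
    # is specified is an assumption of this sketch.
    if datastore_has_cached_credentials:
        return "datastore_credentials"
    # identity: type: user_identity -> the submitting user's identity is used.
    if identity_type == "user_identity":
        return "user_identity"
    # identity left unspecified -> compute managed identity is the fallback.
    if identity_type is None:
        return "compute_managed_identity"
    # Other identity types aren't described in this section.
    raise ValueError(f"identity type not covered by this sketch: {identity_type!r}")
```

For example, `resolve_storage_auth("user_identity", True)` returns `"datastore_credentials"`, which is why step 2 asks for a datastore without cached credentials.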
## Next steps

* [Create an Azure Machine Learning dataset](how-to-create-register-datasets.md)