Commit e07c317

Merge pull request #278970 from vijaysr/junepdates
Update apache-spark-secure-credentials-with-tokenlibrary.md
2 parents 1846b00 + b8019b2 commit e07c317

File tree: 1 file changed (+83, -32 lines changed)
articles/synapse-analytics/spark/apache-spark-secure-credentials-with-tokenlibrary.md

Lines changed: 83 additions & 32 deletions
@@ -1,14 +1,15 @@
---
title: Secure access credentials with Linked Services in Apache Spark for Azure Synapse Analytics
description: This article provides concepts on how to securely integrate Apache Spark for Azure Synapse Analytics with other services using linked services and token library
-author: vijaysr
-ms.service: synapse-analytics
-ms.topic: overview
-ms.subservice: spark
-ms.custom: devx-track-python
-ms.date: 10/31/2023
+author: vijaysr
ms.author: vijaysr
ms.reviewer: shravan
+ms.date: 06/24/2024
+ms.service: synapse-analytics
+ms.subservice: spark
+ms.topic: overview
+ms.custom:
+  - devx-track-python
zone_pivot_groups: programming-languages-spark-all-minus-sql-r
---

@@ -20,7 +21,7 @@ Azure Synapse Analytics uses Microsoft Entra passthrough by default for authenti

Microsoft Entra passthrough uses permissions assigned to you as a user in Microsoft Entra ID, rather than permissions assigned to Synapse or a separate service principal. For example, if you want to use Microsoft Entra passthrough to access a blob in a storage account, then you should go to that storage account and assign the blob contributor role to yourself.

-When retrieving secrets from Azure Key Vault, we recommend creating a linked service to your Azure Key Vault. Ensure that the Synapse workspace managed service identity (MSI) has Secret Get privileges on your Azure Key Vault. Synapse will authenticate to Azure Key Vault using the Synapse workspace managed service identity. If you connect directly to Azure Key Vault without a linked service, you will authenticate using your user Microsoft Entra credential.
+When retrieving secrets from Azure Key Vault, we recommend creating a linked service to your Azure Key Vault. Ensure that the Synapse workspace managed service identity (MSI) has Secret Get privileges on your Azure Key Vault. Synapse will authenticate to Azure Key Vault using the Synapse workspace managed service identity. If you connect directly to Azure Key Vault without a linked service, you authenticate using your user Microsoft Entra credential.

For more information, see [linked services](../../data-factory/concepts-linked-services.md?context=/azure/synapse-analytics/context/context).

@@ -69,7 +70,7 @@ Get result:
putSecretWithLS(linkedService: String, secretName: String, secretValue: String): puts AKV secret for a given linked service, secretName
```

-## Accessing Azure Data Lake Storage Gen2
+## <a id="accessing-azure-data-lake-storage-gen2"></a> Access Azure Data Lake Storage Gen2

#### ADLS Gen2 Primary Storage

@@ -97,9 +98,12 @@ display(df.limit(10))

Azure Synapse Analytics provides an integrated linked services experience when connecting to Azure Data Lake Storage Gen2. Linked services can be configured to authenticate using an **Account Key**, **Service Principal**, **Managed Identity**, or **Credential**.

-When the linked service authentication method is set to **Account Key**, the linked service will authenticate using the provided storage account key, request a SAS key, and automatically apply it to the storage request using the **LinkedServiceBasedSASProvider**.
+When the linked service authentication method is set to **Account Key**, the linked service authenticates using the provided storage account key, requests a SAS key, and automatically applies it to the storage request using the **LinkedServiceBasedSASProvider**.
+
+Synapse allows users to set the linked service for a particular storage account. This makes it possible to read/write data from **multiple storage accounts** in a single Spark application/query. Once you set `spark.storage.synapse.{source_full_storage_account_name}.linkedServiceName` for each storage account that will be used, Synapse figures out which linked service to use for a particular read/write operation. However, if your Spark job only deals with a single storage account, you can omit the storage account name and use `spark.storage.synapse.linkedServiceName`.

-Synapse allows users to set the linked service for a particular storage account. This makes it possible to read/write data from **multiple storage accounts** in a single spark application/query. Once we set **spark.storage.synapse.{source_full_storage_account_name}.linkedServiceName** for each storage account that will be used, Synapse figures out which linked service to use for a particular read/write operation. However if our spark job only deals with a single storage account, we can simply omit the storage account name and use **spark.storage.synapse.linkedServiceName**
+> [!NOTE]
+> It is not possible to change the authentication method of the default ABFS storage container.

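The per-account setting described above is just a Spark configuration key derived from the storage endpoint. A minimal sketch of building that key (the helper name and the sample endpoint are illustrative, not part of the Synapse API):

```python
from typing import Optional

def linked_service_conf_key(source_full_storage_account_name: Optional[str] = None) -> str:
    """Build the Synapse linked-service configuration key described above.

    With an account name, returns the per-account form
    'spark.storage.synapse.<account>.linkedServiceName'; without one,
    the single-account form 'spark.storage.synapse.linkedServiceName'.
    """
    if source_full_storage_account_name:
        return f"spark.storage.synapse.{source_full_storage_account_name}.linkedServiceName"
    return "spark.storage.synapse.linkedServiceName"

# In a notebook you would then set, for example (placeholder values):
# spark.conf.set(linked_service_conf_key("teststorage.dfs.core.windows.net"),
#                "<LINKED SERVICE NAME>")
```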
::: zone pivot = "programming-language-scala"

@@ -135,7 +139,7 @@ df.show()

::: zone-end

-When the linked service authentication method is set to **Managed Identity** or **Service Principal**, the linked service will use the Managed Identity or Service Principal token with the **LinkedServiceBasedTokenProvider** provider.
+When the linked service authentication method is set to **Managed Identity** or **Service Principal**, the linked service uses the Managed Identity or Service Principal token with the **LinkedServiceBasedTokenProvider** provider.

::: zone pivot = "programming-language-scala"
@@ -168,6 +172,15 @@ df.show()

::: zone-end

+### <a id="setting-authentication-settings-through-spark-configuration"></a> Set authentication settings through Spark configuration
+
+Authentication settings can also be specified through Spark configurations, instead of running Spark statements. All Spark configurations should be prefixed with `spark.`, and all Hadoop configurations should be prefixed with `spark.hadoop.`.
+
+| Spark config name | Config value |
+|-------------------|--------------|
+| `spark.storage.synapse.teststorage.dfs.core.windows.net.linkedServiceName` | LINKED SERVICE NAME |
+| `spark.hadoop.fs.azure.account.oauth.provider.type.teststorage.dfs.core.windows.net` | `microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider` |
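The prefix rule above can be captured in a tiny helper (illustrative only; the Hadoop key in the test usage is the one from the table):

```python
def as_spark_submit_conf(name: str, is_hadoop_conf: bool = False) -> str:
    """Apply the prefix rule above: Hadoop configurations are exposed
    through Spark config as 'spark.hadoop.<name>'; Spark configurations
    already carry the 'spark.' prefix and pass through unchanged."""
    return f"spark.hadoop.{name}" if is_hadoop_conf else name
```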
#### ADLS Gen2 storage without linked services

Connect to ADLS Gen2 storage directly by using a SAS key. Use the `ConfBasedSASProvider` and provide the SAS key to the `spark.storage.synapse.sas` configuration setting. SAS tokens can be set at the container level, account level, or global level. We do not recommend setting SAS keys at the global level, as the job will not be able to read/write from more than one storage account.
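The three scopes can be pictured as a most-specific-first lookup. The sketch below is illustrative: only the global `spark.storage.synapse.sas` key is named in the text above, and the container- and account-scoped key shapes are assumptions made for this example.

```python
from typing import Mapping, Optional

def resolve_sas_token(conf: Mapping[str, str], account: str, container: str) -> Optional[str]:
    """Pick the most specific SAS token available: container scope,
    then account scope, then the global 'spark.storage.synapse.sas'.
    The container/account key shapes are assumed for illustration."""
    for key in (
        f"spark.storage.synapse.{container}.{account}.sas",  # assumed shape
        f"spark.storage.synapse.{account}.sas",              # assumed shape
        "spark.storage.synapse.sas",                         # global scope (from the doc)
    ):
        if key in conf:
            return conf[key]
    return None
```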
@@ -271,7 +284,44 @@ display(df.limit(10))

::: zone-end

-#### ADLS Gen2 storage with Azure Key Vault
+#### Use MSAL to acquire tokens (using custom app credentials)
+
+When the ABFS storage driver is [configured](https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html) to use MSAL directly for authentication, the provider doesn't cache tokens, which can result in reliability issues. We recommend using the `ClientCredsTokenProvider`, which is part of Synapse Spark.
+
+::: zone pivot = "programming-language-scala"
+
+```scala
+%%spark
+val source_full_storage_account_name = "teststorage.dfs.core.windows.net"
+sc.hadoopConfiguration.set("fs.azure.sas.token.provider.type", "com.microsoft.azure.synapse.tokenlibrary.ClientCredsTokenProvider")
+spark.conf.set(s"fs.azure.account.oauth2.client.id.$source_full_storage_account_name", "<Entra AppId>")
+spark.conf.set(s"fs.azure.account.oauth2.client.secret.$source_full_storage_account_name", "<Entra app secret>")
+spark.conf.set(s"fs.azure.account.oauth2.client.endpoint.$source_full_storage_account_name", "https://login.microsoftonline.com/<tenantid>")
+
+val df = spark.read.csv("abfss://<CONTAINER>@<ACCOUNT>.dfs.core.windows.net/<FILE PATH>")
+
+display(df.limit(10))
+```
+
+::: zone-end
+
+::: zone pivot = "programming-language-python"
+
+```python
+%%pyspark
+source_full_storage_account_name = "teststorage.dfs.core.windows.net"
+sc._jsc.hadoopConfiguration().set("fs.azure.sas.token.provider.type", "com.microsoft.azure.synapse.tokenlibrary.ClientCredsTokenProvider")
+spark.conf.set(f"fs.azure.account.oauth2.client.id.{source_full_storage_account_name}", "<Entra AppId>")
+spark.conf.set(f"fs.azure.account.oauth2.client.secret.{source_full_storage_account_name}", "<Entra app secret>")
+spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{source_full_storage_account_name}", "https://login.microsoftonline.com/<tenantid>")
+
+df = spark.read.csv('abfss://<CONTAINER>@<ACCOUNT>.dfs.core.windows.net/<FILE PATH>')
+display(df.limit(10))
+```
+
+::: zone-end
+
+### ADLS Gen2 storage with SAS token (from Azure Key Vault)

Connect to ADLS Gen2 storage using a SAS token stored in an Azure Key Vault secret.

@@ -313,7 +363,7 @@ To connect to other linked services, you can make a direct call to the TokenLibr

#### getConnectionString()

-To retrieve the connection string, use the **getConnectionString** function and pass in the **linked service name**.
+To retrieve the connection string, use the `getConnectionString` function and pass in the **linked service name**.

::: zone pivot = "programming-language-scala"

@@ -378,11 +428,11 @@ The output will look like

To retrieve a secret stored in Azure Key Vault, we recommend that you create a linked service to Azure Key Vault within the Synapse workspace. The Synapse workspace managed service identity will need to be granted **GET** Secrets permission to the Azure Key Vault. The linked service will use the managed service identity to connect to the Azure Key Vault service to retrieve the secret. Otherwise, connecting directly to Azure Key Vault will use the user's Microsoft Entra credential. In this case, the user will need to be granted the Get Secret permissions in Azure Key Vault.

-In government clouds, please provide the fully qualified domain name of the keyvault.
+In government clouds, provide the fully qualified domain name of the key vault.

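For example, a government-cloud key vault is addressed by its full DNS name rather than the bare vault name. A sketch of building that name (the DNS suffixes below are assumptions to verify for your cloud, and `contosovault` is a placeholder):

```python
# Illustrative: build the fully qualified vault name to pass as the first
# argument of getSecret. The suffixes below are assumptions to verify.
KEY_VAULT_DNS_SUFFIX = {
    "public": "vault.azure.net",
    "usgov": "vault.usgovcloudapi.net",
}

def key_vault_fqdn(vault_name: str, cloud: str = "public") -> str:
    """Return '<vault>.<cloud DNS suffix>' for use in government clouds."""
    return f"{vault_name}.{KEY_VAULT_DNS_SUFFIX[cloud]}"
```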
`mssparkutils.credentials.getSecret("<AZURE KEY VAULT NAME>", "<SECRET KEY>" [, <LINKED SERVICE NAME>])`

-To retrieve a secret from Azure Key Vault, use the **mssparkutils.credentials.getSecret()** function.
+To retrieve a secret from Azure Key Vault, use the `mssparkutils.credentials.getSecret()` function.

::: zone pivot = "programming-language-scala"

@@ -415,7 +465,7 @@ Console.WriteLine(connectionString);

#### Linked service connections supported from the Spark runtime

-While Azure Synapse Analytics supports a variety of linked service connections (from pipelines and other Azure products), not all of them are supported from the Spark runtime. Here is the list of supported linked services:
+While Azure Synapse Analytics supports various linked service connections (from pipelines and other Azure products), not all of them are supported from the Spark runtime. Here is the list of supported linked services:

- Azure Blob Storage
- Azure AI services
@@ -434,33 +484,34 @@ While Azure Synapse Analytics supports a variety of linked service connections (
#### mssparkutils.credentials.getToken()
When you need an OAuth bearer token to access services directly, you can use the `getToken` method. The following resources are supported:

-| Service Name | String literal to be used in API call |
+| Service Name | String literal to be used in API call |
|-------------------------------------------------------|---------------------------------------|
-| Azure Storage | `Storage` |
-| Azure Key Vault | `Vault` |
-| Azure Management | `AzureManagement` |
-| Azure SQL Data Warehouse (Dedicated and Serverless) | `DW` |
-| Azure Synapse | `Synapse` |
-| Azure Data Lake Store | `DataLakeStore` |
-| Azure Data Factory | `ADF` |
-| Azure Data Explorer | `AzureDataExplorer` |
-| Azure Database for MySQL | `AzureOSSDB` |
-| Azure Database for MariaDB | `AzureOSSDB` |
-| Azure Database for PostgreSQL | `AzureOSSDB` |
+| `Azure Storage` | `Storage` |
+| `Azure Key Vault` | `Vault` |
+| `Azure Management` | `AzureManagement` |
+| `Azure SQL Data Warehouse (Dedicated and Serverless)` | `DW` |
+| `Azure Synapse` | `Synapse` |
+| `Azure Data Lake Store` | `DataLakeStore` |
+| `Azure Data Factory` | `ADF` |
+| `Azure Data Explorer` | `AzureDataExplorer` |
+| `Azure Database for MySQL` | `AzureOSSDB` |
+| `Azure Database for MariaDB` | `AzureOSSDB` |
+| `Azure Database for PostgreSQL` | `AzureOSSDB` |

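The table above can be expressed as a small lookup for use in notebooks. The mapping values come from the table; the dictionary and helper themselves are illustrative, not part of the Synapse API:

```python
# Map of service display name -> string literal accepted by
# mssparkutils.credentials.getToken(). Values are from the table above;
# this helper itself is illustrative, not part of the Synapse API.
GETTOKEN_AUDIENCES = {
    "Azure Storage": "Storage",
    "Azure Key Vault": "Vault",
    "Azure Management": "AzureManagement",
    "Azure SQL Data Warehouse (Dedicated and Serverless)": "DW",
    "Azure Synapse": "Synapse",
    "Azure Data Lake Store": "DataLakeStore",
    "Azure Data Factory": "ADF",
    "Azure Data Explorer": "AzureDataExplorer",
    "Azure Database for MySQL": "AzureOSSDB",
    "Azure Database for MariaDB": "AzureOSSDB",
    "Azure Database for PostgreSQL": "AzureOSSDB",
}

def audience_for(service: str) -> str:
    """Return the getToken string literal for a service display name."""
    try:
        return GETTOKEN_AUDIENCES[service]
    except KeyError:
        raise ValueError(f"getToken does not support {service!r}") from None
```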
#### Unsupported linked service access from the Spark runtime

The following methods of accessing the linked services are not supported from the Spark runtime:

- Passing arguments to parameterized linked service
-- Connections with User assigned managed identities (UAMI)
-- System Assigned Managed identities are not supported on Keyvault resource
+- Connections with user-assigned managed identities (UAMI)
+- Getting the bearer token to the Key Vault resource when your Notebook / SparkJobDefinition runs as a managed identity
+  - As an alternative, instead of getting an access token, you can create a linked service to Key Vault and get the secret from your Notebook / batch job
- For Azure Cosmos DB connections, key-based access alone is supported. Token-based access is not supported.

-While running a notebook or a Spark job, requests to get a token / secret using a linked service might fail with an error message that indicates 'BadRequest'. This is often caused by a configuration issue with the linked service. If you see this error message, please check the configuration of your linked service. If you have any questions, please contact Microsoft Azure Support at the [Azure portal](https://portal.azure.com).
+While running a notebook or a Spark job, requests to get a token / secret using a linked service might fail with an error message that indicates 'BadRequest'. This is often caused by a configuration issue with the linked service. If you see this error message, check the configuration of your linked service. If you have any questions, contact Microsoft Azure Support at the [Azure portal](https://portal.azure.com).

## Related content

-- [Write to dedicated SQL pool](./synapse-spark-sql-pool-import-export.md)
+- [Write to dedicated SQL pool](synapse-spark-sql-pool-import-export.md)
- [Apache Spark in Azure Synapse Analytics](apache-spark-overview.md)
- [Introduction to Microsoft Spark Utilities](microsoft-spark-utilities.md)