
Commit bfcfc37

Merge pull request #275180 from TheovanKraay/cosmos-spark-entra-id-howto
how to for using cosmos spark connector with entra id
2 parents 8b61429 + 1e99424 commit bfcfc37

File tree

2 files changed: +307 −0 lines changed

articles/cosmos-db/nosql/TOC.yml

Lines changed: 2 additions & 0 deletions

@@ -569,6 +569,8 @@
   items:
     - name: Spark 3.x online transaction processing (OLTP) connector
       href: sdk-java-spark-v3.md
+    - name: Use a service principal with the OLTP Spark 3 connector
+      href: how-to-spark-service-principal.md
     - name: Throughput control
       href: throughput-control-spark.md
     - name: ASP.NET session state and cache provider
articles/cosmos-db/nosql/how-to-spark-service-principal.md

Lines changed: 305 additions & 0 deletions

@@ -0,0 +1,305 @@
---
title: Use a service principal with Spark
titleSuffix: Azure Cosmos DB for NoSQL
description: Use a Microsoft Entra service principal to authenticate to Azure Cosmos DB for NoSQL from Spark.
author: seesharprun
ms.author: sidandrews
ms.service: cosmos-db
ms.subservice: nosql
ms.topic: how-to
ms.date: 04/01/2024
zone_pivot_groups: programming-languages-spark-all-minus-sql-r-csharp
#CustomerIntent: As a data scientist, I want to connect to Azure Cosmos DB for NoSQL using Spark and a service principal, so that I can avoid using connection strings.
---

# Use a service principal with the Spark 3 connector for Azure Cosmos DB for NoSQL

In this article, you learn how to create a Microsoft Entra application and service principal that can be used with role-based access control. You can then use this service principal to connect to an Azure Cosmos DB for NoSQL account from Spark 3.

## Prerequisites

- An existing Azure Cosmos DB for NoSQL account.
  - If you have an existing Azure subscription, [create a new account](how-to-create-account.md?tabs=azure-portal).
  - No Azure subscription? You can [try Azure Cosmos DB free](../try-free.md) with no credit card required.
- An existing Azure Databricks workspace.
- A registered Microsoft Entra application and service principal.
  - If you don't have a service principal and application, [register an application using the Azure portal](/entra/identity-platform/howto-create-service-principal-portal).

## Create a secret and record credentials

In this section, you create a client secret and record its value for use later in this guide.

1. Open the Azure portal (<https://portal.azure.com>).

1. Navigate to your existing Microsoft Entra application.

1. Navigate to the **Certificates & secrets** page. Then, create a new secret. Save the **Client Secret** value to use later in this guide.

1. Navigate to the **Overview** page. Locate and record the values for **Application (client) ID**, **Object ID**, and **Directory (tenant) ID**. You also use these values later in this guide.

1. Navigate to your existing Azure Cosmos DB for NoSQL account.

1. Record the **URI** value on the **Overview** page. Also record the **Subscription ID** and **Resource Group** values. You use these values later in this guide.

## Create definition and assignment

In this section, you create a Microsoft Entra ID role definition and assign that role permissions to read and write items in the account's containers.

1. Create a role by using the `az cosmosdb sql role definition create` command. Pass in the Azure Cosmos DB for NoSQL account name and resource group, followed by a body of JSON that defines the custom role. The role is also scoped to the account level by using `/`. Ensure that you provide a unique name for your role by using the `RoleName` property of the request body.

```azurecli
az cosmosdb sql role definition create \
    --resource-group "<resource-group-name>" \
    --account-name "<account-name>" \
    --body '{
        "RoleName": "<role-definition-name>",
        "Type": "CustomRole",
        "AssignableScopes": ["/"],
        "Permissions": [{
            "DataActions": [
                "Microsoft.DocumentDB/databaseAccounts/readMetadata",
                "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/items/*",
                "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/*"
            ]
        }]
    }'
```

1. List the role definition that you created to fetch its unique identifier in the JSON output. Record the `id` value of the JSON output.

```azurecli
az cosmosdb sql role definition list \
    --resource-group "<resource-group-name>" \
    --account-name "<account-name>"
```

```json
[
  {
    ...,
    "id": "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.DocumentDB/databaseAccounts/<account-name>/sqlRoleDefinitions/<role-definition-id>",
    ...,
    "permissions": [
      {
        "dataActions": [
          "Microsoft.DocumentDB/databaseAccounts/readMetadata",
          "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/items/*",
          "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/*"
        ],
        "notDataActions": []
      }
    ],
    ...
  }
]
```

1. Use `az cosmosdb sql role assignment create` to create a role assignment. Replace `<aad-principal-id>` with the **Object ID** you recorded earlier in this guide. Also, replace `<role-definition-id>` with the `id` value fetched from running the `az cosmosdb sql role definition list` command in a previous step.

```azurecli
az cosmosdb sql role assignment create \
    --resource-group "<resource-group-name>" \
    --account-name "<account-name>" \
    --scope "/" \
    --principal-id "<aad-principal-id>" \
    --role-definition-id "<role-definition-id>"
```
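
Optionally, before moving on to Spark, you can check that the assignment works by connecting to the account as the service principal directly. The following is a minimal Python sketch, not part of the original steps: it assumes the `azure-identity` and `azure-cosmos` packages are installed and reuses the values you recorded earlier in this guide.

```python
# Optional sanity check: authenticate as the service principal and list databases.
# Assumes: pip install azure-identity azure-cosmos
from azure.identity import ClientSecretCredential
from azure.cosmos import CosmosClient

credential = ClientSecretCredential(
    tenant_id="<entra-tenant-id>",
    client_id="<entra-app-client-id>",
    client_secret="<entra-app-client-secret>",
)

client = CosmosClient(url="<nosql-account-endpoint>", credential=credential)

# Listing databases exercises the readMetadata data action granted by the custom role.
for database in client.list_databases():
    print(database["id"])
```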

## Use service principal

Now that you've created a Microsoft Entra application and service principal, created a custom role, and assigned that role permissions to your Azure Cosmos DB for NoSQL account, you should be able to run a notebook.

1. Open your Azure Databricks workspace.

1. In the workspace interface, create a new **cluster**. Configure the cluster with these settings, at a minimum:

| **Setting** | **Value** |
| --- | --- |
| **Runtime version** | `13.3 LTS (Scala 2.12, Spark 3.4.1)` |

1. Use the workspace interface to search for **Maven** packages from **Maven Central** with a **Group Id** of `com.azure.cosmos.spark`. Install the package specific to Spark 3.4, with an **Artifact Id** prefixed with `azure-cosmos-spark_3-4`, to the cluster.

1. Finally, create a new **notebook**.

> [!TIP]
> By default, the notebook is attached to the recently created cluster.

1. Within the notebook, set the Azure Cosmos DB Spark connector configuration settings for the NoSQL account endpoint, database name, and container name. Use the **Subscription ID**, **Resource Group**, **Application (client) ID**, **Directory (tenant) ID**, and **Client Secret** values recorded earlier in this guide.

::: zone pivot="programming-language-python"

```python
# Set configuration settings
config = {
    "spark.cosmos.accountEndpoint": "<nosql-account-endpoint>",
    "spark.cosmos.auth.type": "ServicePrincipal",
    "spark.cosmos.account.subscriptionId": "<subscription-id>",
    "spark.cosmos.account.resourceGroupName": "<resource-group-name>",
    "spark.cosmos.account.tenantId": "<entra-tenant-id>",
    "spark.cosmos.auth.aad.clientId": "<entra-app-client-id>",
    "spark.cosmos.auth.aad.clientSecret": "<entra-app-client-secret>",
    "spark.cosmos.database": "<database-name>",
    "spark.cosmos.container": "<container-name>"
}
```

::: zone-end

::: zone pivot="programming-language-scala"

```scala
// Set configuration settings
val config = Map(
    "spark.cosmos.accountEndpoint" -> "<nosql-account-endpoint>",
    "spark.cosmos.auth.type" -> "ServicePrincipal",
    "spark.cosmos.account.subscriptionId" -> "<subscription-id>",
    "spark.cosmos.account.resourceGroupName" -> "<resource-group-name>",
    "spark.cosmos.account.tenantId" -> "<entra-tenant-id>",
    "spark.cosmos.auth.aad.clientId" -> "<entra-app-client-id>",
    "spark.cosmos.auth.aad.clientSecret" -> "<entra-app-client-secret>",
    "spark.cosmos.database" -> "<database-name>",
    "spark.cosmos.container" -> "<container-name>"
)
```

::: zone-end

1. Configure the Catalog API to manage API for NoSQL resources using Spark.

::: zone pivot="programming-language-python"

```python
# Configure Catalog Api
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", "<nosql-account-endpoint>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.type", "ServicePrincipal")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.subscriptionId", "<subscription-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.resourceGroupName", "<resource-group-name>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.tenantId", "<entra-tenant-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientId", "<entra-app-client-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientSecret", "<entra-app-client-secret>")
```

::: zone-end

::: zone pivot="programming-language-scala"

```scala
// Configure Catalog Api
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", "<nosql-account-endpoint>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.type", "ServicePrincipal")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.subscriptionId", "<subscription-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.resourceGroupName", "<resource-group-name>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.tenantId", "<entra-tenant-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientId", "<entra-app-client-id>")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientSecret", "<entra-app-client-secret>")
```

::: zone-end

1. Create a new database using `CREATE DATABASE IF NOT EXISTS`. Ensure that you provide your database name.

::: zone pivot="programming-language-python"

```python
# Create a database using the Catalog API
spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.{};".format("<database-name>"))
```

::: zone-end

::: zone pivot="programming-language-scala"

```scala
// Create a database using the Catalog API
spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.<database-name>;")
```

::: zone-end

1. Create a new container using the database name, container name, partition key path, and throughput values that you specify.

::: zone pivot="programming-language-python"

```python
# Create a products container using the Catalog API
spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.{}.{} USING cosmos.oltp TBLPROPERTIES(partitionKeyPath = '{}', manualThroughput = '{}')".format("<database-name>", "<container-name>", "<partition-key-path>", "<throughput>"))
```

::: zone-end

::: zone pivot="programming-language-scala"

```scala
// Create a products container using the Catalog API
spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.<database-name>.<container-name> USING cosmos.oltp TBLPROPERTIES(partitionKeyPath = '<partition-key-path>', manualThroughput = '<throughput>')")
```

::: zone-end

1. Create a sample data set.

::: zone pivot="programming-language-python"

```python
# Create sample data
products = (
    ("68719518391", "gear-surf-surfboards", "Yamba Surfboard", 12, 850.00, False),
    ("68719518371", "gear-surf-surfboards", "Kiama Classic Surfboard", 25, 790.00, True)
)
```

::: zone-end

::: zone pivot="programming-language-scala"

```scala
// Create sample data
val products = Seq(
    ("68719518391", "gear-surf-surfboards", "Yamba Surfboard", 12, 850.00, false),
    ("68719518371", "gear-surf-surfboards", "Kiama Classic Surfboard", 25, 790.00, true)
)
```

::: zone-end

1. Use `spark.createDataFrame` and the previously saved OLTP configuration to add the sample data to the target container.

::: zone pivot="programming-language-python"

```python
# Ingest sample data
spark.createDataFrame(products) \
    .toDF("id", "category", "name", "quantity", "price", "clearance") \
    .write \
    .format("cosmos.oltp") \
    .options(**config) \
    .mode("APPEND") \
    .save()
```

::: zone-end

::: zone pivot="programming-language-scala"

```scala
// Ingest sample data
spark.createDataFrame(products)
    .toDF("id", "category", "name", "quantity", "price", "clearance")
    .write
    .format("cosmos.oltp")
    .options(config)
    .mode("APPEND")
    .save()
```

::: zone-end
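
To confirm the write succeeded, you can read the items back with the same service principal configuration. The following is a minimal Python sketch under the same assumptions as the steps above; `spark.cosmos.read.inferSchema.enabled` is a connector read option used here so the items come back with an inferred schema.

```python
# Read the ingested items back using the same service principal configuration
df = spark.read.format("cosmos.oltp") \
    .options(**config) \
    .option("spark.cosmos.read.inferSchema.enabled", "true") \
    .load()

# Show the two sample products ingested earlier
df.select("id", "name", "quantity", "price", "clearance").show()
```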

> [!TIP]
> In this example, credentials are assigned to variables in clear text. For security, we recommend that you use secrets. For more information on configuring secrets, see [add secrets to your Spark configuration](/azure/databricks/security/secrets/secrets#read-a-secret).
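
For example, in an Azure Databricks notebook you can fetch the client secret from a secret scope at runtime instead of pasting it into the configuration. In this sketch (shown for the Python configuration), the scope name `cosmos-scope` and key name `entra-client-secret` are illustrative and assume you created them beforehand.

```python
# Fetch the client secret from a Databricks secret scope at runtime.
# "cosmos-scope" and "entra-client-secret" are illustrative names.
client_secret = dbutils.secrets.get(scope="cosmos-scope", key="entra-client-secret")

# Use the fetched value instead of a hard-coded literal.
config["spark.cosmos.auth.aad.clientSecret"] = client_secret
```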

## Related content

- [Tutorial: Connect using Spark 3](tutorial-spark-connector.md)
- [Quickstart: Java](quickstart-java.md)
