---
title: Use a service principal with Spark
titleSuffix: Azure Cosmos DB for NoSQL
description: Use a Microsoft Entra service principal to authenticate to Azure Cosmos DB for NoSQL from Spark.
author: seesharprun
ms.author: sidandrews
ms.service: cosmos-db
ms.subservice: nosql
ms.topic: how-to
ms.date: 04/01/2024
zone_pivot_groups: programming-languages-spark-all-minus-sql-r-csharp
#CustomerIntent: As a data scientist, I want to connect to Azure Cosmos DB for NoSQL using Spark and a service principal, so that I can avoid using connection strings.
---

# Use a service principal with the Spark 3 connector for Azure Cosmos DB for NoSQL

In this article, you learn how to create a Microsoft Entra application and service principal that can be used with role-based access control. You can then use this service principal to connect to an Azure Cosmos DB for NoSQL account from Spark 3.

## Prerequisites

- An existing Azure Cosmos DB for NoSQL account.
  - If you have an existing Azure subscription, [create a new account](how-to-create-account.md?tabs=azure-portal).
  - No Azure subscription? You can [try Azure Cosmos DB free](../try-free.md) with no credit card required.
- An existing Azure Databricks workspace.
- A registered Microsoft Entra application and service principal.
  - If you don't have a service principal and application, [register an application using the Azure portal](/entra/identity-platform/howto-create-service-principal-portal). Alternatively, you can register the application with the Azure CLI, as shown in the sketch after this list.

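
If you prefer the Azure CLI, the following sketch registers an application, creates its service principal, and issues a client secret in a single step. The display name `spark-cosmos-app` is only an example; the command prints the `appId` (client ID), `password` (client secret), and `tenant` values.

```azurecli
# Register an application, create its service principal, and issue a client secret
az ad sp create-for-rbac --display-name "spark-cosmos-app"
```
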
## Create secret and record credentials

In this section, you create a client secret and record the value for use later in this guide.

1. Open the Azure portal (<https://portal.azure.com>).

1. Navigate to your existing Microsoft Entra application.

1. Navigate to the **Certificates & secrets** page. Then, create a new secret. Save the **Client Secret** value to use later in this guide.

1. Navigate to the **Overview** page. Locate and record the values for **Application (client) ID**, **Object ID**, and **Directory (tenant) ID**. You also use these values later in this guide.

1. Navigate to your existing Azure Cosmos DB for NoSQL account.

1. Record the **URI** value on the **Overview** page. Also record the **Subscription ID** and **Resource Group** values. You use these values later in this guide too. If you prefer the Azure CLI, you can gather all of these values with the commands shown after these steps.

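
As an alternative to the portal, you can gather the same values with the Azure CLI. This sketch assumes the application and service principal already exist; `<app-client-id>`, `<account-name>`, and `<resource-group-name>` are placeholders for your own values.

```azurecli
# Create (or reset) a client secret for the application and print its value
az ad app credential reset --id "<app-client-id>"

# Get the object ID of the service principal for the application
az ad sp show --id "<app-client-id>" --query "id" --output tsv

# Get the current subscription ID
az account show --query "id" --output tsv

# Get the URI (document endpoint) of the API for NoSQL account
az cosmosdb show \
    --resource-group "<resource-group-name>" \
    --name "<account-name>" \
    --query "documentEndpoint" \
    --output tsv
```
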
## Create definition and assignment

In this section, you create a role definition with permissions to read and write items in the account's containers, and then assign that role to the Microsoft Entra service principal.

1. Create a role using the `az cosmosdb sql role definition create` command. Pass in the Azure Cosmos DB for NoSQL account name and resource group, followed by a body of JSON that defines the custom role. The role is also scoped to the account level using `/`. Ensure that you provide a unique name for your role using the `RoleName` property of the request body.

    ```azurecli
    az cosmosdb sql role definition create \
        --resource-group "<resource-group-name>" \
        --account-name "<account-name>" \
        --body '{
            "RoleName": "<role-definition-name>",
            "Type": "CustomRole",
            "AssignableScopes": ["/"],
            "Permissions": [{
                "DataActions": [
                    "Microsoft.DocumentDB/databaseAccounts/readMetadata",
                    "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/items/*",
                    "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/*"
                ]
            }]
        }'
    ```

1. List the role definitions for your account to fetch the unique identifier of the role you just created. Record the `id` value from the JSON output.

    ```azurecli
    az cosmosdb sql role definition list \
        --resource-group "<resource-group-name>" \
        --account-name "<account-name>"
    ```

    ```json
    [
      {
        ...,
        "id": "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.DocumentDB/databaseAccounts/<account-name>/sqlRoleDefinitions/<role-definition-id>",
        ...
        "permissions": [
          {
            "dataActions": [
              "Microsoft.DocumentDB/databaseAccounts/readMetadata",
              "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/items/*",
              "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/*"
            ],
            "notDataActions": []
          }
        ],
        ...
      }
    ]
    ```

1. Use `az cosmosdb sql role assignment create` to create a role assignment. Replace `<aad-principal-id>` with the **Object ID** you recorded earlier in this guide. Also, replace `<role-definition-id>` with the `id` value fetched by running the `az cosmosdb sql role definition list` command in a previous step.

    ```azurecli
    az cosmosdb sql role assignment create \
        --resource-group "<resource-group-name>" \
        --account-name "<account-name>" \
        --scope "/" \
        --principal-id "<aad-principal-id>" \
        --role-definition-id "<role-definition-id>"
    ```

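
    Optionally, verify the assignment by listing all role assignments for the account and checking that the output contains the service principal's **Object ID**:

    ```azurecli
    az cosmosdb sql role assignment list \
        --resource-group "<resource-group-name>" \
        --account-name "<account-name>"
    ```
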
## Use service principal

Now that you created a Microsoft Entra application and service principal, created a custom role, and assigned that role permissions to your Azure Cosmos DB for NoSQL account, you should be able to run a notebook.

1. Open your Azure Databricks workspace.

1. In the workspace interface, create a new **cluster**. Configure the cluster with these settings, at a minimum:

    | Setting | Value |
    | --- | --- |
    | **Runtime version** | `13.3 LTS (Scala 2.12, Spark 3.4.1)` |

1. Use the workspace interface to search for **Maven** packages from **Maven Central** with a **Group Id** of `com.azure.cosmos.spark`. Install the package specific to Spark 3.4, with an **Artifact Id** prefixed with `azure-cosmos-spark_3-4`, to the cluster.

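
    For example, on a Spark 3.4 cluster that runs Scala 2.12, the Maven coordinate follows this pattern. The `<version>` placeholder stands for the latest released version of the connector, so check Maven Central for the current release:

    ```text
    com.azure.cosmos.spark:azure-cosmos-spark_3-4_2-12:<version>
    ```
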
1. Finally, create a new **notebook**.

    > [!TIP]
    > By default, the notebook will be attached to the recently created cluster.

1. Within the notebook, set the Azure Cosmos DB Spark connector configuration settings for the NoSQL account endpoint, database name, and container name. Use the **Subscription ID**, **Resource Group**, **Application (client) ID**, **Directory (tenant) ID**, and **Client Secret** values recorded earlier in this guide.

    ::: zone pivot="programming-language-python"

    ```python
    # Set configuration settings
    config = {
        "spark.cosmos.accountEndpoint": "<nosql-account-endpoint>",
        "spark.cosmos.auth.type": "ServicePrincipal",
        "spark.cosmos.account.subscriptionId": "<subscription-id>",
        "spark.cosmos.account.resourceGroupName": "<resource-group-name>",
        "spark.cosmos.account.tenantId": "<entra-tenant-id>",
        "spark.cosmos.auth.aad.clientId": "<entra-app-client-id>",
        "spark.cosmos.auth.aad.clientSecret": "<entra-app-client-secret>",
        "spark.cosmos.database": "<database-name>",
        "spark.cosmos.container": "<container-name>"
    }
    ```

    ::: zone-end

    ::: zone pivot="programming-language-scala"

    ```scala
    // Set configuration settings
    val config = Map(
        "spark.cosmos.accountEndpoint" -> "<nosql-account-endpoint>",
        "spark.cosmos.auth.type" -> "ServicePrincipal",
        "spark.cosmos.account.subscriptionId" -> "<subscription-id>",
        "spark.cosmos.account.resourceGroupName" -> "<resource-group-name>",
        "spark.cosmos.account.tenantId" -> "<entra-tenant-id>",
        "spark.cosmos.auth.aad.clientId" -> "<entra-app-client-id>",
        "spark.cosmos.auth.aad.clientSecret" -> "<entra-app-client-secret>",
        "spark.cosmos.database" -> "<database-name>",
        "spark.cosmos.container" -> "<container-name>"
    )
    ```

    ::: zone-end

1. Configure the Catalog API to manage API for NoSQL resources using Spark.

    ::: zone pivot="programming-language-python"

    ```python
    # Configure Catalog Api
    spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", "<nosql-account-endpoint>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.type", "ServicePrincipal")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.subscriptionId", "<subscription-id>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.resourceGroupName", "<resource-group-name>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.tenantId", "<entra-tenant-id>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientId", "<entra-app-client-id>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientSecret", "<entra-app-client-secret>")
    ```

    ::: zone-end

    ::: zone pivot="programming-language-scala"

    ```scala
    // Configure Catalog Api
    spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", "<nosql-account-endpoint>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.type", "ServicePrincipal")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.subscriptionId", "<subscription-id>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.resourceGroupName", "<resource-group-name>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.tenantId", "<entra-tenant-id>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientId", "<entra-app-client-id>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientSecret", "<entra-app-client-secret>")
    ```

    ::: zone-end

1. Create a new database using `CREATE DATABASE IF NOT EXISTS`. Ensure that you provide your database name.

    ::: zone pivot="programming-language-python"

    ```python
    # Create a database using the Catalog API
    spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.{};".format("<database-name>"))
    ```

    ::: zone-end

    ::: zone pivot="programming-language-scala"

    ```scala
    // Create a database using the Catalog API
    spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.<database-name>;")
    ```

    ::: zone-end

1. Create a new container using database name, container name, partition key path, and throughput values that you specify.

    ::: zone pivot="programming-language-python"

    ```python
    # Create a products container using the Catalog API
    spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.{}.{} USING cosmos.oltp TBLPROPERTIES(partitionKeyPath = '{}', manualThroughput = '{}')".format("<database-name>", "<container-name>", "<partition-key-path>", "<throughput>"))
    ```

    ::: zone-end

    ::: zone pivot="programming-language-scala"

    ```scala
    // Create a products container using the Catalog API
    spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.<database-name>.<container-name> USING cosmos.oltp TBLPROPERTIES(partitionKeyPath = '<partition-key-path>', manualThroughput = '<throughput>')")
    ```

    ::: zone-end

1. Create a sample data set.

    ::: zone pivot="programming-language-python"

    ```python
    # Create sample data
    products = (
        ("68719518391", "gear-surf-surfboards", "Yamba Surfboard", 12, 850.00, False),
        ("68719518371", "gear-surf-surfboards", "Kiama Classic Surfboard", 25, 790.00, True)
    )
    ```

    ::: zone-end

    ::: zone pivot="programming-language-scala"

    ```scala
    // Create sample data
    val products = Seq(
        ("68719518391", "gear-surf-surfboards", "Yamba Surfboard", 12, 850.00, false),
        ("68719518371", "gear-surf-surfboards", "Kiama Classic Surfboard", 25, 790.00, true)
    )
    ```

    ::: zone-end

1. Use `spark.createDataFrame` and the previously saved OLTP configuration to add sample data to the target container.

    ::: zone pivot="programming-language-python"

    ```python
    # Ingest sample data
    spark.createDataFrame(products) \
        .toDF("id", "category", "name", "quantity", "price", "clearance") \
        .write \
        .format("cosmos.oltp") \
        .options(**config) \
        .mode("APPEND") \
        .save()
    ```

    ::: zone-end

    ::: zone pivot="programming-language-scala"

    ```scala
    // Ingest sample data
    spark.createDataFrame(products)
        .toDF("id", "category", "name", "quantity", "price", "clearance")
        .write
        .format("cosmos.oltp")
        .options(config)
        .mode("APPEND")
        .save()
    ```

    ::: zone-end

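
    Optionally, you can confirm that the service principal can also read data by loading the container back into a DataFrame. This Python sketch reuses the `config` map from the earlier step; the `spark.cosmos.read.inferSchema.enabled` option asks the connector to infer a schema from the stored items:

    ```python
    # Read the ingested items back to verify the service principal's data access
    df = spark.read.format("cosmos.oltp") \
        .options(**config) \
        .option("spark.cosmos.read.inferSchema.enabled", "true") \
        .load()

    df.show()
    ```
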
    > [!TIP]
    > In this example, credentials are assigned to variables in clear text. For security, we recommend that you use secrets. For more information on configuring secrets, see [add secrets to your Spark configuration](/azure/databricks/security/secrets/secrets#read-a-secret).

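
    For example, here's a minimal Python sketch that reads the client secret from a Databricks secret scope at run time instead of hard-coding it. The scope name `cosmos-secrets` and key name `sp-client-secret` are hypothetical placeholders for a scope and secret that you create yourself:

    ```python
    # Read the client secret from a Databricks secret scope (names are examples)
    client_secret = dbutils.secrets.get(scope="cosmos-secrets", key="sp-client-secret")

    # Build the connector configuration without a clear-text secret
    config = {
        "spark.cosmos.accountEndpoint": "<nosql-account-endpoint>",
        "spark.cosmos.auth.type": "ServicePrincipal",
        "spark.cosmos.account.subscriptionId": "<subscription-id>",
        "spark.cosmos.account.resourceGroupName": "<resource-group-name>",
        "spark.cosmos.account.tenantId": "<entra-tenant-id>",
        "spark.cosmos.auth.aad.clientId": "<entra-app-client-id>",
        "spark.cosmos.auth.aad.clientSecret": client_secret,
        "spark.cosmos.database": "<database-name>",
        "spark.cosmos.container": "<container-name>"
    }
    ```
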
## Related content

- [Tutorial: Connect using Spark 3](tutorial-spark-connector.md)
- [Quickstart: Java](quickstart-java.md)