# Use a service principal with the Spark 3 connector for Azure Cosmos DB for NoSQL

In this article, you learn how to create a Microsoft Entra application and service principal that can be used with role-based access control. You can then use this service principal to connect to an Azure Cosmos DB for NoSQL account from Spark 3.

## Prerequisites

1. Use the workspace interface to search for **Maven** packages from **Maven Central** with a **Group ID** of `com.azure.cosmos.spark`. Install the package for Spark 3.4 with an **Artifact ID** prefixed with `azure-cosmos-spark_3-4` to the cluster.

1. Finally, create a new **notebook**.

> [!TIP]
> By default, the notebook is attached to the recently created cluster.

1. Within the notebook, set Azure Cosmos DB Spark connector configuration settings for the NoSQL account endpoint, database name, and container name. Use the **Subscription ID**, **Resource Group**, **Application (client) ID**, **Directory (tenant) ID**, and **Client Secret** values recorded earlier in this article.
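
   A minimal PySpark sketch of these settings follows. It assumes the connector's service principal configuration keys (`spark.cosmos.auth.type`, `spark.cosmos.auth.aad.clientId`, and so on) and uses placeholder values in angle brackets; substitute the values you recorded earlier and confirm the key names against the configuration reference for your connector version.

   ```python
   # Sketch only: placeholder values shown in angle brackets.
   config = {
       "spark.cosmos.accountEndpoint": "https://<nosql-account-name>.documents.azure.com:443/",
       "spark.cosmos.database": "<database-name>",
       "spark.cosmos.container": "<container-name>",
       # Service principal (Microsoft Entra) authentication
       "spark.cosmos.auth.type": "ServicePrincipal",
       "spark.cosmos.account.subscriptionId": "<subscription-id>",
       "spark.cosmos.account.tenantId": "<directory-tenant-id>",
       "spark.cosmos.account.resourceGroupName": "<resource-group-name>",
       "spark.cosmos.auth.aad.clientId": "<application-client-id>",
       "spark.cosmos.auth.aad.clientSecret": "<client-secret>",
   }
   ```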
> [!TIP]
> In this quickstart example, credentials are assigned to variables in clear text. For security, we recommend that you use secrets. For more information on how to configure secrets, see [Add secrets to your Spark configuration](/azure/databricks/security/secrets/secrets#read-a-secret).
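
For example, a Databricks notebook can read the client secret from a secret scope instead of hard-coding it. The scope and key names in this sketch are hypothetical; create your own scope and secret first.

```python
# Hypothetical scope and key names; dbutils is available inside Databricks notebooks.
client_secret = dbutils.secrets.get(scope="<scope-name>", key="<secret-key-name>")
```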
---
title: 'Azure Cosmos DB Spark connector: Throughput control'
description: Learn how you can control throughput for bulk data movements in the Azure Cosmos DB Spark connector.
author: TheovanKraay
ms.service: cosmos-db
ms.subservice: nosql
ms.author: thvankra
---

The [Spark connector](quickstart-spark.md) allows you to communicate with Azure Cosmos DB by using [Apache Spark](https://spark.apache.org/). This article describes how the throughput control feature works. Check out our [Spark samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples) to get started using throughput control.

This article documents the use of global throughput control groups in the Azure Cosmos DB Spark connector, but the functionality is also available in the [Java SDK](./sdk-java-v4.md). In the SDK, you can use global and local throughput control groups to limit the request unit (RU) consumption in the context of a single client connection instance. For example, you can apply this approach to different operations within a single microservice, or to a single data loading program. For more information, see how to [use throughput control](quickstart-java.md) in the Java SDK.

> [!WARNING]
> Throughput control isn't supported for gateway mode. Currently, for [serverless Azure Cosmos DB accounts](../serverless.md), attempting to use `targetThroughputThreshold` to define a percentage results in failure. You can only provide an absolute value for target throughput/RU by using `spark.cosmos.throughputControl.targetThroughput`.

## Why is throughput control important?

Throughput control helps to isolate the performance needs of applications that run against a container. Throughput control limits the number of [RUs](../request-units.md) that a specific Spark client can consume.

Several advanced scenarios benefit from client-side throughput control:

- **Different operations and tasks have different priorities:** There can be a need to prevent normal transactions from being throttled because of data ingestion or copy activities. Some operations or tasks aren't sensitive to latency and are more tolerant of being throttled than others.

- **Provide fairness/isolation to different users or tenants:** An application usually has many users. Some users might send too many requests, which consume all available throughput and cause others to get throttled.

- **Load balancing of throughput between different Azure Cosmos DB clients:** In some use cases, it's important to make sure all the clients get a fair (equal) share of the throughput.

Throughput control enables more granular RU rate limiting, as needed.

## How does throughput control work?

To configure throughput control for the Spark connector, you first create a container that defines throughput control metadata. The partition key is `groupId`, and `ttl` is enabled. Here, you create this container by using Spark SQL and call it `ThroughputControl`:

```sql
%sql
-- The database name and throughput value are placeholders; substitute your own.
-- The partition key path must be /groupId, and time to live (ttl) must be enabled.
CREATE TABLE IF NOT EXISTS cosmosCatalog.`database-v4`.ThroughputControl
USING cosmos.oltp
TBLPROPERTIES(partitionKeyPath = '/groupId', autoScaleMaxThroughput = '4000', defaultTtlInSeconds = '-1')
```

The preceding example creates a container with [autoscale](../provision-throughput-autoscale.md). If you prefer standard provisioning, you can replace `autoScaleMaxThroughput` with `manualThroughput`.
> [!IMPORTANT]
> The partition key must be defined as `/groupId`, and `ttl` must be enabled for the throughput control feature to work.

Within the Spark configuration of a specific application, you can then specify parameters for the workload. The following example enables throughput control and defines a throughput control group `name` parameter and a `targetThroughputThreshold` parameter. It also defines the `database` and `container` parameters in which the throughput control group is maintained:

```scala
// The group name and metadata database are placeholders; substitute your own values.
"spark.cosmos.throughputControl.enabled" -> "true",
"spark.cosmos.throughputControl.name" -> "SourceContainerThroughputControl",
"spark.cosmos.throughputControl.targetThroughputThreshold" -> "0.95",
"spark.cosmos.throughputControl.globalControl.database" -> "database-v4",
"spark.cosmos.throughputControl.globalControl.container" -> "ThroughputControl",
```
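
If you configure the connector from PySpark instead, the same settings can be passed as an ordinary Python dictionary. This is a sketch that reuses the placeholder group name and metadata database from the previous block:

```python
# Same placeholder values as the Scala example; substitute your own.
throughput_control_config = {
    "spark.cosmos.throughputControl.enabled": "true",
    "spark.cosmos.throughputControl.name": "SourceContainerThroughputControl",
    "spark.cosmos.throughputControl.targetThroughputThreshold": "0.95",
    "spark.cosmos.throughputControl.globalControl.database": "database-v4",
    "spark.cosmos.throughputControl.globalControl.container": "ThroughputControl",
}
```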

In the preceding examples, the `targetThroughputThreshold` parameter is defined as **0.95**. Rate limiting occurs (and requests are retried) when clients consume more than 95 percent (+/- 5-10 percent) of the throughput allocated to the container. This configuration is stored as a document in the throughput container.

Throughput control doesn't do RU precalculation of each operation. Instead, it tracks the RU usage *after* the operation, based on the response header. As such, throughput control is based on an approximation and *doesn't guarantee* that the target amount of throughput is available for the group at any given time.

For this reason, if the configured RU is so low that a single operation can use it all, throughput control can't prevent RU consumption from exceeding the configured limit. Throughput control works best when the configured limit is higher than any single operation that a client in the specific control group can execute.

When you read via query or change feed, you should configure the page size in `spark.cosmos.read.maxItemCount` (default 1000) to be a modest amount. In this way, the client throughput control can be recalculated with higher frequency and reflected more accurately at any specific time. When you use throughput control for a bulk write job, the number of documents executed in a single request is automatically tuned based on the throttling rate to allow throughput control to begin as early as possible.
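
As an illustration, a read that combines the account settings, the throughput control settings, and a smaller page size might look like the following sketch. It assumes `config` holds your account settings (endpoint, database, container) and that `throughput_control_config` is the dictionary sketched previously; neither name is required by the connector.

```python
# Sketch: combine account, throughput control, and page-size options on a read.
df = (
    spark.read.format("cosmos.oltp")
    .options(**config)                      # account endpoint, database, container
    .options(**throughput_control_config)   # throughput control group settings
    .option("spark.cosmos.read.maxItemCount", "100")  # smaller page than the default 1000
    .load()
)
```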
> [!WARNING]
> The `targetThroughputThreshold` parameter is *immutable*. If you change the target throughput threshold value, a new throughput control group is created. (If you use version 4.10.0 or later, it can have the same name.) You need to restart all Spark jobs that are using the group if you want to ensure that they all consume the new threshold immediately. Otherwise, they pick up the new threshold after the next restart.

For each Spark client that uses the throughput control group, a record is created in the `ThroughputControl` container, with a `ttl` of a few seconds. As a result, the documents vanish quickly if a Spark client isn't actively running anymore.

In each client record, the `loadFactor` attribute represents the load on the specific client.

## Related content

* See [Spark samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples).
* Learn how to [manage data with Azure Cosmos DB Spark 3 OLTP connector for API for NoSQL](quickstart-spark.md).
* Learn more about [Apache Spark](https://spark.apache.org/).

In this tutorial, you use the Azure Cosmos DB Spark connector to read or write data from an Azure Cosmos DB for NoSQL account. This tutorial uses Azure Databricks and a Jupyter notebook to illustrate how to integrate with the API for NoSQL from Spark. This tutorial focuses on Python and Scala, although you can use any language or interface supported by Spark.

In this tutorial, you learn how to:

1. Use the workspace interface to search for **Maven** packages from **Maven Central** with a **Group ID** of `com.azure.cosmos.spark`. Install the package for Spark 3.4 with an **Artifact ID** prefixed with `azure-cosmos-spark_3-4` to the cluster.

1. Finally, create a new **notebook**.

> [!TIP]
> By default, the notebook is attached to the recently created cluster.

1. Within the notebook, set online transaction processing (OLTP) configuration settings for the NoSQL account endpoint, database name, and container name.
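
   A minimal sketch of such a configuration follows. The endpoint and key are placeholders, and the `cosmicworks` database name is illustrative; use your own account's URI, key, and the database and container names you create later in this tutorial.

   ```python
   # Placeholder account values; substitute your own.
   config = {
       "spark.cosmos.accountEndpoint": "https://<nosql-account-name>.documents.azure.com:443/",
       "spark.cosmos.accountKey": "<nosql-account-key>",
       "spark.cosmos.database": "cosmicworks",
       "spark.cosmos.container": "products",
   }
   ```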

Use the Catalog API to manage account resources such as databases and containers.

1. Create a new container named `products` by using `CREATE TABLE IF NOT EXISTS`. Ensure that you set the partition key path to `/category` and enable autoscale throughput with a maximum throughput of `1000` request units (RUs) per second.
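
   In Python, one way to run this statement through the Catalog API is with `spark.sql`. This sketch assumes a Cosmos DB catalog named `cosmosCatalog` was registered earlier in the notebook and uses the illustrative `cosmicworks` database name:

   ```python
   # Assumes a catalog named cosmosCatalog is already configured for the account.
   spark.sql(
       "CREATE TABLE IF NOT EXISTS cosmosCatalog.cosmicworks.products "
       "USING cosmos.oltp "
       "TBLPROPERTIES(partitionKeyPath = '/category', autoScaleMaxThroughput = '1000')"
   )
   ```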
1. Create another container named `employees` by using a hierarchical partition key configuration. Use `/organization`, `/department`, and `/team` as the set of partition key paths. Follow that specific order. Also, set the throughput to a manual amount of `400` RUs.
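
   A sketch of the same pattern for the hierarchical partition key follows, again assuming the `cosmosCatalog` catalog and the illustrative `cosmicworks` database; the paths are supplied in order as a comma-separated list.

   ```python
   # Hierarchical partition key paths, in order, as a comma-separated list.
   spark.sql(
       "CREATE TABLE IF NOT EXISTS cosmosCatalog.cosmicworks.employees "
       "USING cosmos.oltp "
       "TBLPROPERTIES(partitionKeyPath = '/organization,/department,/team', manualThroughput = '400')"
   )
   ```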

Create a sample dataset. Then use OLTP to ingest that data to the API for NoSQL container.

Load OLTP data into a data frame to perform common queries on the data. You can use various syntaxes to filter or query data.

1. Use `spark.read` to load the OLTP data into a data frame object. Use the same configuration you used earlier in this tutorial. Also, set `spark.cosmos.read.inferSchema.enabled` to `true` to allow the Spark connector to infer the schema by sampling existing items.
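
   For example, assuming the `config` dictionary sketched earlier in this tutorial:

   ```python
   # Load items from the container into a data frame, inferring the schema from sampled items.
   df = (
       spark.read.format("cosmos.oltp")
       .options(**config)
       .option("spark.cosmos.read.inferSchema.enabled", "true")
       .load()
   )
   ```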
1. Render the schema of the data loaded in the data frame by using `printSchema`.
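
   For example, assuming the data frame from the previous step is named `df`:

   ```python
   # Print the schema that the connector inferred from the sampled items.
   df.printSchema()
   ```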

When you work with API for NoSQL data in Spark, you can perform partial updates.

380
380
381
381
1. To perform a partial update of an item:

    1. Copy the existing `config` configuration variable and modify the properties in the new copy. Specifically, configure the write strategy to `ItemPatch`. Then disable bulk support. Set the columns and mapped operations. Finally, set the default operation type to `Set`.
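
       A sketch of such a patch configuration in Python follows. It assumes `config` is the dictionary used for earlier writes, the key names follow the connector's write and patch options, and the column name in the patch column config is illustrative.

       ```python
       # Copy the write configuration and switch it to partial (patch) updates.
       config_patch = dict(config)
       config_patch["spark.cosmos.write.strategy"] = "ItemPatch"             # patch instead of full overwrite
       config_patch["spark.cosmos.write.bulk.enabled"] = "false"             # disable bulk support
       config_patch["spark.cosmos.write.patch.defaultOperationType"] = "Set" # default operation type
       config_patch["spark.cosmos.write.patch.columnConfigs"] = "[col(name).op(set)]"  # illustrative column mapping
       ```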
0 commit comments