
Commit dbcf781

Merge pull request #202473 from TheovanKraay/throughput-control-spark
spark throughput control
2 parents f2df0a0 + 39995ec

File tree

3 files changed: 115 lines added, 0 removed


articles/cosmos-db/sql/TOC.yml

Lines changed: 4 additions & 0 deletions

```diff
@@ -209,6 +209,10 @@
       href: ../security-controls-policy.md?toc=/azure/cosmos-db/sql/toc.json
   - name: Security baseline
     href: /security/benchmark/azure/baselines/cosmos-db-security-baseline?context=/azure/cosmos-db/context/context
+  - name: Spark Connector
+    items:
+      - name: Throughput control
+        href: throughput-control-spark.md
   - name: Analytics with Azure Synapse Link
     items:
       - name: Azure Synapse Link for Cosmos DB
```

articles/cosmos-db/sql/create-sql-api-spark.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -169,4 +169,5 @@ If you are using our older Spark 2.4 Connector, you can find out how to migrate
 
 * Azure Cosmos DB Apache Spark 3 OLTP Connector for Core (SQL) API: [Release notes and resources](sql-api-sdk-java-spark-v3.md)
 * Learn more about [Apache Spark](https://spark.apache.org/).
+* Learn how to configure [throughput control](throughput-control-spark.md).
 * Check out more [samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples).
```
articles/cosmos-db/sql/throughput-control-spark.md

Lines changed: 110 additions & 0 deletions

---
title: Azure Cosmos DB Spark Connector - Throughput Control
description: Learn about controlling throughput for bulk data movements in the Azure Cosmos DB Spark Connector.
author: TheovanKraay
ms.service: cosmos-db
ms.subservice: cosmosdb-sql
ms.topic: how-to
ms.date: 06/22/2022
ms.author: thvankra
---

# Azure Cosmos DB Spark Connector - throughput control
[!INCLUDE[appliesto-sql-api](../includes/appliesto-sql-api.md)]

The [Spark Connector](create-sql-api-spark.md) allows you to communicate with Azure Cosmos DB using [Apache Spark](https://spark.apache.org/). This article describes how the throughput control feature works. To get started with throughput control, check out our [Spark samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples).

## Why is throughput control important?

Throughput control helps to isolate the performance needs of applications running against a container, by limiting the amount of [request units](../request-units.md) that can be consumed by a given Spark client.

There are several advanced scenarios that benefit from client-side throughput control:

- **Different operations and tasks have different priorities** - there can be a need to prevent normal transactions from being throttled due to data ingestion or copy activities. Some operations and/or tasks aren't sensitive to latency, and are more tolerant to being throttled than others.
- **Provide fairness/isolation to different end users/tenants** - An application will usually have many end users. Some users may send too many requests, consuming all available throughput and causing others to get throttled.
- **Load balancing of throughput between different Azure Cosmos DB clients** - in some use cases, it's important to make sure all the clients get a fair (equal) share of the throughput.

Throughput control enables more granular, client-side RU rate limiting as needed.
## How does throughput control work?

To configure throughput control for the Spark Connector, first create a container that will hold the throughput control metadata, with a partition key of `groupId` and `ttl` enabled. Here we create this container using Spark SQL, and call it `ThroughputControl`:

```sql
%sql
CREATE TABLE IF NOT EXISTS cosmosCatalog.`database-v4`.ThroughputControl
USING cosmos.oltp
OPTIONS(spark.cosmos.database = 'database-v4')
TBLPROPERTIES(partitionKeyPath = '/groupId', autoScaleMaxThroughput = '4000', indexingPolicy = 'AllProperties', defaultTtlInSeconds = '-1');
```

> [!NOTE]
> The above example creates a container with [autoscale](../provision-throughput-autoscale.md). If you prefer standard provisioning, you can replace `autoScaleMaxThroughput` with `manualThroughput` instead.
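For example, the same table definition with standard (manual) provisioning might look like the following sketch, swapping in `manualThroughput` as described in the note above (the 4000 RU/s value is illustrative):

```sql
%sql
CREATE TABLE IF NOT EXISTS cosmosCatalog.`database-v4`.ThroughputControl
USING cosmos.oltp
OPTIONS(spark.cosmos.database = 'database-v4')
TBLPROPERTIES(partitionKeyPath = '/groupId', manualThroughput = '4000', indexingPolicy = 'AllProperties', defaultTtlInSeconds = '-1');
```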

> [!IMPORTANT]
> The partition key must be defined as `/groupId`, and `ttl` must be enabled, for the throughput control feature to work.
Within the Spark config of a given application, we can then specify parameters for our workload. The below example sets throughput control as `enabled`, as well as defining a throughput control group `name` and a `targetThroughputThreshold`. We also define the `database` and `container` in which the throughput control group is maintained:

```scala
"spark.cosmos.throughputControl.enabled" -> "true",
"spark.cosmos.throughputControl.name" -> "SourceContainerThroughputControl",
"spark.cosmos.throughputControl.targetThroughputThreshold" -> "0.95",
"spark.cosmos.throughputControl.globalControl.database" -> "database-v4",
"spark.cosmos.throughputControl.globalControl.container" -> "ThroughputControl"
```

In the above example, the `targetThroughputThreshold` is defined as **0.95**, so rate limiting will occur (and requests will be retried) when clients consume more than 95% (+/- 5-10 percent) of the throughput that is allocated to the container. This configuration is stored as a document in the throughput control container that looks like the following:

```json
{
    "id": "ZGF0YWJhc2UtdjQvY3VzdG9tZXIvU291cmNlQ29udGFpbmVyVGhyb3VnaHB1dENvbnRyb2w.info",
    "groupId": "database-v4/customer/SourceContainerThroughputControl.config",
    "targetThroughput": "",
    "targetThroughputThreshold": "0.95",
    "isDefault": true,
    "_rid": "EHcYAPolTiABAAAAAAAAAA==",
    "_self": "dbs/EHcYAA==/colls/EHcYAPolTiA=/docs/EHcYAPolTiABAAAAAAAAAA==/",
    "_etag": "\"2101ea83-0000-1100-0000-627503dd0000\"",
    "_attachments": "attachments/",
    "_ts": 1651835869
}
```
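The throughput control settings are applied alongside the usual connection options when reading from (or writing to) the source container. The following is a minimal sketch only; the account endpoint and key are placeholders, and the `customer` container name is taken from the sample `groupId` above:

```scala
// Sketch: combining standard connection options with the throughput control
// settings shown above. Endpoint and key values are placeholders.
val cosmosConfig = Map(
  "spark.cosmos.accountEndpoint" -> "https://<your-account>.documents.azure.com:443/",
  "spark.cosmos.accountKey" -> "<your-account-key>",
  "spark.cosmos.database" -> "database-v4",
  "spark.cosmos.container" -> "customer",
  "spark.cosmos.throughputControl.enabled" -> "true",
  "spark.cosmos.throughputControl.name" -> "SourceContainerThroughputControl",
  "spark.cosmos.throughputControl.targetThroughputThreshold" -> "0.95",
  "spark.cosmos.throughputControl.globalControl.database" -> "database-v4",
  "spark.cosmos.throughputControl.globalControl.container" -> "ThroughputControl"
)

// Reads through this configuration are rate limited by the throughput control group.
val df = spark.read.format("cosmos.oltp").options(cosmosConfig).load()
```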
78+
> [!NOTE]
79+
> Throughput control does not do RU pre-calculation of each operation. Instead, it tracks the RU usages after the operation based on the response header. As such, throughput control is based on an approximation - and does not guarantee that amount of throughput will be available for the group at any given time.
80+
81+
> [!WARNING]
82+
> The `targetThroughputThreshold` is **immutable**. If you change the target throughput threshold value, this will create a new throughput control group (but as long as you use Version 4.10.0 or later it can have the same name). You need to restart all Spark jobs that are using the group if you want to ensure they all consume the new threshold immediately (otherwise they will pick-up the new threshold after the next restart).
83+
84+
For each Spark client that uses the throughput control group, a record is created in the `ThroughputControl` container, with a `ttl` of a few seconds, so the documents vanish quickly once a Spark client is no longer actively running. A client record looks like the following:

```json
{
    "id": "Zhjdieidjojdook3osk3okso3ksp3ospojsp92939j3299p3oj93pjp93jsps939pkp9ks39kp9339skp",
    "groupId": "database-v4/customer/SourceContainerThroughputControl.config",
    "ttl": 10,
    "initializeTime": "2022-06-26T02:24:40.054Z",
    "loadFactor": 0.97636377638898,
    "allocatedThroughput": 484.89444487847,
    "_rid": "EHcYAPolTiABAAAAAAAAAA==",
    "_self": "dbs/EHcYAA==/colls/EHcYAPolTiA=/docs/EHcYAPolTiABAAAAAAAAAA==/",
    "_etag": "\"2101ea83-0000-1100-0000-627503dd0000\"",
    "_attachments": "attachments/",
    "_ts": 1651835869
}
```

In each client record, the `loadFactor` attribute represents the load on the given client, relative to other clients in the throughput control group. The `allocatedThroughput` attribute shows how many RUs are currently allocated to this client. The Spark Connector adjusts the allocated throughput for each client based on its load. This ensures that each client gets a share of the available throughput proportional to its load, and that all clients together don't consume more than the total allocated to the throughput control group they belong to.
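The load-proportional split can be illustrated with the following sketch. This shows the principle only, not the connector's internal algorithm, and the client names and RU values are hypothetical:

```scala
// Illustration of load-proportional throughput allocation: each client's share
// of the group's RU budget is proportional to its reported loadFactor.
def allocateThroughput(groupBudgetRu: Double, loadFactors: Map[String, Double]): Map[String, Double] = {
  val totalLoad = loadFactors.values.sum
  loadFactors.map { case (client, load) => client -> groupBudgetRu * (load / totalLoad) }
}

// Two clients sharing a 3800 RU/s budget (a 4000 RU/s container at a 0.95
// threshold), one under twice the load of the other:
val shares = allocateThroughput(3800, Map("clientA" -> 0.9, "clientB" -> 0.45))
// clientA receives twice the allocation of clientB
```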

## Next steps

* [Spark samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples).
* [Manage data with Azure Cosmos DB Spark 3 OLTP Connector for SQL API](create-sql-api-spark.md).
* Learn more about [Apache Spark](https://spark.apache.org/).
