---
title: Azure Cosmos DB Spark Connector - Throughput Control
description: Learn about controlling throughput for bulk data movements in the Azure Cosmos DB Spark Connector
author: TheovanKraay
ms.service: cosmos-db
ms.subservice: cosmosdb-sql
ms.topic: how-to
ms.date: 06/22/2022
ms.author: thvankra
---

# Azure Cosmos DB Spark Connector - throughput control
[!INCLUDE[appliesto-sql-api](../includes/appliesto-sql-api.md)]

The [Spark Connector](create-sql-api-spark.md) allows you to communicate with Azure Cosmos DB using [Apache Spark](https://spark.apache.org/). This article describes how the throughput control feature works. Check out our [Spark samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples) to get started using throughput control.

## Why is throughput control important?

Throughput control helps to isolate the performance needs of applications running against a container, by limiting the number of [request units](../request-units.md) that a given Spark client can consume.

Several advanced scenarios benefit from client-side throughput control:

- **Different operations and tasks have different priorities** - there can be a need to prevent normal transactions from being throttled due to data ingestion or copy activities. Some operations and/or tasks aren't sensitive to latency, and are more tolerant to being throttled than others.

- **Provide fairness/isolation to different end users/tenants** - An application will usually have many end users. Some users may send too many requests, which consume all available throughput, causing others to get throttled.

- **Load balancing of throughput between different Azure Cosmos DB clients** - in some use cases, it's important to make sure all the clients get a fair (equal) share of the throughput.

Throughput control enables more granular RU rate limiting where it's needed.

## How does throughput control work?

Throughput control for the Spark Connector is configured by first creating a container that holds throughput control metadata, with a partition key of `groupId` and `ttl` enabled. Here we create this container using Spark SQL, and call it `ThroughputControl`:

```sql
%sql
CREATE TABLE IF NOT EXISTS cosmosCatalog.`database-v4`.ThroughputControl
USING cosmos.oltp
OPTIONS(spark.cosmos.database = 'database-v4')
TBLPROPERTIES(partitionKeyPath = '/groupId', autoScaleMaxThroughput = '4000', indexingPolicy = 'AllProperties', defaultTtlInSeconds = '-1');
```
| 45 | + |
| 46 | +> [!NOTE] |
| 47 | +> The above example creates a container with [autoscale](../provision-throughput-autoscale.md). If you prefer standard provisioning, you can replace `autoScaleMaxThroughput` with `manualThroughput` instead. |
| 48 | +
|
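As a sketch, a standard (manually) provisioned variant of the same container creation statement might look like the following; the `400` RU/s value here is an illustrative placeholder, not a value from this article:

```sql
%sql
CREATE TABLE IF NOT EXISTS cosmosCatalog.`database-v4`.ThroughputControl
USING cosmos.oltp
OPTIONS(spark.cosmos.database = 'database-v4')
TBLPROPERTIES(partitionKeyPath = '/groupId', manualThroughput = '400', indexingPolicy = 'AllProperties', defaultTtlInSeconds = '-1');
```
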
> [!IMPORTANT]
> The partition key must be defined as `/groupId`, and `ttl` must be enabled, for the throughput control feature to work.

Within the Spark config of a given application, we can then specify parameters for our workload. The following example sets throughput control as `enabled`, and defines a throughput control group `name` and a `targetThroughputThreshold`. We also define the `database` and `container` in which the throughput control group is maintained:

```scala
"spark.cosmos.throughputControl.enabled" -> "true",
"spark.cosmos.throughputControl.name" -> "SourceContainerThroughputControl",
"spark.cosmos.throughputControl.targetThroughputThreshold" -> "0.95",
"spark.cosmos.throughputControl.globalControl.database" -> "database-v4",
"spark.cosmos.throughputControl.globalControl.container" -> "ThroughputControl"
```

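For context, here's a sketch of how these throughput control settings might sit alongside the usual connection options when reading from a container. The account endpoint, account key, and source container name are placeholders, not values from this article:

```scala
// Hypothetical end-to-end configuration; endpoint, key, and the
// "customer" container name are illustrative placeholders.
val cosmosConfig = Map(
  "spark.cosmos.accountEndpoint" -> "https://<account>.documents.azure.com:443/",
  "spark.cosmos.accountKey" -> "<account-key>",
  "spark.cosmos.database" -> "database-v4",
  "spark.cosmos.container" -> "customer",
  // Throughput control settings from this article
  "spark.cosmos.throughputControl.enabled" -> "true",
  "spark.cosmos.throughputControl.name" -> "SourceContainerThroughputControl",
  "spark.cosmos.throughputControl.targetThroughputThreshold" -> "0.95",
  "spark.cosmos.throughputControl.globalControl.database" -> "database-v4",
  "spark.cosmos.throughputControl.globalControl.container" -> "ThroughputControl"
)

val df = spark.read.format("cosmos.oltp").options(cosmosConfig).load()
```
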
In the above example, the `targetThroughputThreshold` is defined as **0.95**, so rate limiting will occur (and requests will be retried) when clients consume more than 95% (+/- 5-10 percent) of the throughput that is allocated to the container. This configuration is stored as a document in the throughput control container that looks like the following:

```json
{
    "id": "ZGF0YWJhc2UtdjQvY3VzdG9tZXIvU291cmNlQ29udGFpbmVyVGhyb3VnaHB1dENvbnRyb2w.info",
    "groupId": "database-v4/customer/SourceContainerThroughputControl.config",
    "targetThroughput": "",
    "targetThroughputThreshold": "0.95",
    "isDefault": true,
    "_rid": "EHcYAPolTiABAAAAAAAAAA==",
    "_self": "dbs/EHcYAA==/colls/EHcYAPolTiA=/docs/EHcYAPolTiABAAAAAAAAAA==/",
    "_etag": "\"2101ea83-0000-1100-0000-627503dd0000\"",
    "_attachments": "attachments/",
    "_ts": 1651835869
}
```
> [!NOTE]
> Throughput control does not do RU pre-calculation of each operation. Instead, it tracks the RU usage after the operation, based on the response header. As such, throughput control is based on an approximation, and does not guarantee that the configured amount of throughput will be available for the group at any given time.

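The post-hoc tracking described above can be illustrated with a simplified, hypothetical sketch (none of these names come from the connector): a limiter that only learns each operation's RU charge from the response, and throttles once the budget for the current one-second window is exhausted.

```python
import time

class ApproximateRuLimiter:
    """Toy model of approximate rate limiting: RU charges are only
    recorded AFTER each response, so the budget can be overshot."""

    def __init__(self, target_throughput: float):
        self.target = target_throughput      # RU/s budget for this group
        self.window_start = time.monotonic()
        self.consumed = 0.0                  # RUs observed this window

    def record_response(self, ru_charge: float) -> None:
        # The charge comes from the response header, after the fact.
        self.consumed += ru_charge

    def should_throttle(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 1.0:   # start a new 1-second window
            self.window_start = now
            self.consumed = 0.0
        return self.consumed >= self.target

limiter = ApproximateRuLimiter(target_throughput=100.0)
limiter.record_response(60.0)
print(limiter.should_throttle())  # False: 60 of 100 RU used this window
limiter.record_response(50.0)
print(limiter.should_throttle())  # True: 110 RU observed, budget exhausted
```

Because the charge is only known after the operation completes, the limiter can briefly exceed the budget before throttling kicks in, which is exactly why throughput control is an approximation rather than a hard guarantee.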
> [!WARNING]
> The `targetThroughputThreshold` is **immutable**. If you change the target throughput threshold value, a new throughput control group will be created (as long as you use version 4.10.0 or later, it can have the same name). You need to restart all Spark jobs that are using the group if you want to ensure they all consume the new threshold immediately (otherwise they will pick up the new threshold after their next restart).

For each Spark client that uses the throughput control group, a record is created in the `ThroughputControl` container. The record has a `ttl` of a few seconds, so the documents vanish quickly once a Spark client is no longer actively running. Such a record looks like the following:

```json
{
    "id": "Zhjdieidjojdook3osk3okso3ksp3ospojsp92939j3299p3oj93pjp93jsps939pkp9ks39kp9339skp",
    "groupId": "database-v4/customer/SourceContainerThroughputControl.config",
    "ttl": 10,
    "initializeTime": "2022-06-26T02:24:40.054Z",
    "loadFactor": 0.97636377638898,
    "allocatedThroughput": 484.89444487847,
    "_rid": "EHcYAPolTiABAAAAAAAAAA==",
    "_self": "dbs/EHcYAA==/colls/EHcYAPolTiA=/docs/EHcYAPolTiABAAAAAAAAAA==/",
    "_etag": "\"2101ea83-0000-1100-0000-627503dd0000\"",
    "_attachments": "attachments/",
    "_ts": 1651835869
}
```

In each client record, the `loadFactor` attribute represents the load on the given client, relative to other clients in the throughput control group. The `allocatedThroughput` attribute shows how many RUs are currently allocated to this client. The Spark Connector adjusts the allocated throughput for each client based on its load. This ensures that each client gets a share of the available throughput that is proportional to its load, and that all clients together don't consume more than the total allocated for the throughput control group to which they belong.

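The proportional-sharing idea can be sketched as follows. This is a simplified, hypothetical model of the behavior described above, not the connector's actual rebalancing logic:

```python
def allocate_throughput(group_budget: float,
                        load_factors: dict[str, float]) -> dict[str, float]:
    """Split a throughput control group's RU budget across clients in
    proportion to each client's load factor (toy model)."""
    total_load = sum(load_factors.values())
    if total_load == 0:
        # No load reported yet: split the budget evenly.
        share = group_budget / len(load_factors)
        return {client: share for client in load_factors}
    return {client: group_budget * load / total_load
            for client, load in load_factors.items()}

# Two clients sharing a 950 RU/s budget (e.g. 95% of a 1000 RU/s container):
# the heavily loaded client receives a proportionally larger allocation,
# and the shares always sum to the group budget.
shares = allocate_throughput(950.0, {"client-a": 0.9, "client-b": 0.1})
print(shares)
```
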
## Next steps

* [Spark samples in GitHub](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples).
* [Manage data with Azure Cosmos DB Spark 3 OLTP Connector for SQL API](create-sql-api-spark.md).
* Learn more about [Apache Spark](https://spark.apache.org/).