You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/cosmos-db/cassandra/connect-spark-configuration.md
+26-26Lines changed: 26 additions & 26 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -23,13 +23,13 @@ This article is one among a series of articles on Azure Cosmos DB Cassandra API
23
23
24
24
## Dependencies for connectivity
25
25
***Spark connector for Cassandra:**
26
-
Spark connector is used to connect to Azure Cosmos DB Cassandra API. Identify and use the version of the connector located in [Maven central](https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector) that is compatible with the Spark and Scala versions of your Spark environment. We recommend an environment which supports Spark 3.0 or higher, and the spark connector available at maven coordinates `com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.0.0`. If using Spark 2.x, we recommend an environment with Spark version 2.4.5, using spark connector at maven coordinates `com.datastax.spark:spark-cassandra-connector_2.11:2.4.3`.
26
+
Spark connector is used to connect to Azure Cosmos DB Cassandra API. Identify and use the version of the connector located in [Maven central](https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector-assembly) that is compatible with the Spark and Scala versions of your Spark environment. We recommend an environment that supports Spark 3.2.1 or higher, and the spark connector available at maven coordinates `com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0`. If using Spark 2.x, we recommend an environment with Spark version 2.4.5, using spark connector at maven coordinates `com.datastax.spark:spark-cassandra-connector_2.11:2.4.3`.
27
27
28
28
29
29
***Azure Cosmos DB helper library for Cassandra API:**
30
-
If you are using a version Spark 2.x then in addition to the Spark connector, you need another library called [azure-cosmos-cassandra-spark-helper](https://search.maven.org/artifact/com.microsoft.azure.cosmosdb/azure-cosmos-cassandra-spark-helper/1.2.0/jar) with maven coordinates `com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.2.0` from Azure Cosmos DB in order to handle [rate limiting](./scale-account-throughput.md#handling-rate-limiting-429-errors). This library contains custom connection factory and retry policy classes.
30
+
If you're using a version Spark 2.x, then in addition to the Spark connector, you need another library called [azure-cosmos-cassandra-spark-helper](https://search.maven.org/artifact/com.microsoft.azure.cosmosdb/azure-cosmos-cassandra-spark-helper/1.2.0/jar) with maven coordinates `com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.2.0` from Azure Cosmos DB in order to handle [rate limiting](./scale-account-throughput.md#handling-rate-limiting-429-errors). This library contains custom connection factory and retry policy classes.
31
31
32
-
The retry policy in Azure Cosmos DB is configured to handle HTTP status code 429("Request Rate Large") exceptions. The Azure Cosmos DB Cassandra API translates these exceptions into overloaded errors on the Cassandra native protocol, and you can retry with back-offs. Because Azure Cosmos DB uses provisioned throughput model, request rate limiting exceptions occur when the ingress/egress rates increase. The retry policy protects your spark jobs against data spikes that momentarily exceed the throughput allocated for your container. If using the Spark 3.x connector, implementing this library is not required.
32
+
The retry policy in Azure Cosmos DB is configured to handle HTTP status code 429("Request Rate Large") exceptions. The Azure Cosmos DB Cassandra API translates these exceptions into overloaded errors on the Cassandra native protocol, and you can retry with back-offs. Because Azure Cosmos DB uses provisioned throughput model, request rate limiting exceptions occur when the ingress/egress rates increase. The retry policy protects your spark jobs against data spikes that momentarily exceed the throughput allocated for your container. If using the Spark 3.x connector, implementing this library isn't required.
33
33
34
34
> [!NOTE]
35
35
> The retry policy can protect your spark jobs against momentary spikes only. If you have not configured enough RUs required to run your workload, then the retry policy is not applicable and the retry policy class rethrows the exception.
@@ -40,20 +40,20 @@ This article is one among a series of articles on Azure Cosmos DB Cassandra API
40
40
41
41
Listed in the next section are all the relevant parameters for controlling throughput using the Spark Connector for Cassandra. In order to optimize parameters to maximize throughput for spark jobs, the `spark.cassandra.output.concurrent.writes`, `spark.cassandra.concurrent.reads`, and `spark.cassandra.input.reads_per_sec` configs needs to be correctly configured, in order to avoid too much throttling and back-off (which in turn can lead to lower throughput).
42
42
43
-
The optimal value of these configurations depends on 4 factors:
43
+
The optimal value of these configurations depends on four factors:
44
44
45
45
- The amount of throughput (Request Units) configured for the table that data is being ingested into.
46
46
- The number of workers in your Spark cluster.
47
47
- The number of executors configured for your spark job (which can be controlled using `spark.cassandra.connection.connections_per_executor_max` or `spark.cassandra.connection.remoteConnectionsPerExecutor` depending on Spark version)
48
-
- The average latency of each request to cosmos DB, if you are collocated in the same Data Center. Assume this value to be 10 ms for writes and 3 ms for reads.
48
+
- The average latency of each request to Cosmos DB, if you're collocated in the same Data Center. Assume this value to be 10 ms for writes and 3 ms for reads.
49
49
50
-
As an example, if we have 5 workers and a value of `spark.cassandra.output.concurrent.writes`= 1, and a value of `spark.cassandra.connection.remoteConnectionsPerExecutor` = 1, then we have 5 workers that are concurrently writing into the table, each with 1 thread. If it takes 10 ms to perform a single write, then we can send 100 requests (1000 milliseconds divided by 10) per second, per thread. With 5 workers, this would be 500 writes per second. At an average cost of 5 request units (RUs) per write, the target table would need a minimum 2500 request units provisioned (5 RUs x 500 writes per second).
50
+
As an example, if we have five workers and a value of `spark.cassandra.output.concurrent.writes`= 1, and a value of `spark.cassandra.connection.remoteConnectionsPerExecutor` = 1, then we have five workers that are concurrently writing into the table, each with one thread. If it takes 10 ms to perform a single write, then we can send 100 requests (1000 milliseconds divided by 10) per second, per thread. With five workers, this would be 500 writes per second. At an average cost of five request units (RUs) per write, the target table would need a minimum 2500 request units provisioned (5 RUs x 500 writes per second).
51
51
52
52
Increasing the number of executors can increase the number of threads in a given job, which can in turn increase throughput. However, the exact impact of this can be variable depending on the job, while controlling throughput with number of workers is more deterministic. You can also determine the exact cost of a given request by profiling it to get the Request Unit (RU) charge. This will help you to be more accurate when provisioning throughput for your table or keyspace. Have a look at our article [here](./find-request-unit-charge-cassandra.md) to understand how to get request unit charges at a per request level.
53
53
54
54
### Scaling throughput in the database
55
55
56
-
The Cassandra Spark connector will saturate throughput in Azure Cosmos DB very efficiently. As a result, even with effective retries, you will need to ensure you have sufficient throughput (RUs) provisioned at the table or keyspace level to prevent rate limiting related errors. The minimum setting of 400 RUs in a given table or keyspace will not be sufficient. Even at minimum throughput configuration settings, the Spark connector can write at a rate corresponding to around **6000 request units** or more.
56
+
The Cassandra Spark connector will saturate throughput in Azure Cosmos DB efficiently. As a result, even with effective retries, you'll need to ensure you have sufficient throughput (RUs) provisioned at the table or keyspace level to prevent rate limiting related errors. The minimum setting of 400 RUs in a given table or keyspace won't be sufficient. Even at minimum throughput configuration settings, the Spark connector can write at a rate corresponding to around **6000 request units** or more.
57
57
58
58
If the RU setting required for data movement using Spark is higher than what is required for your steady state workload, you can easily scale throughput up and down systematically in Azure Cosmos DB to meet the needs of your workload for a given time period. Read our article on [elastic scale in Cassandra API](scale-account-throughput.md) to understand the different options for scaling programmatically and dynamically.
59
59
@@ -67,21 +67,21 @@ The following table lists Azure Cosmos DB Cassandra API-specific throughput conf
| spark.cassandra.output.batch.size.rows | 1 |Number of rows per single batch. Set this parameter to 1. This parameter is used to achieve higher throughput for heavy workloads. |
70
-
| spark.cassandra.connection.connections_per_executor_max (Spark 2.x) spark.cassandra.connection.remoteConnectionsPerExecutor (Spark 3.x) | None | Maximum number of connections per node per executor. 10*n is equivalent to 10 connections per node in an n-node Cassandra cluster. So, if you require 5 connections per node per executor for a 5 node Cassandra cluster, then you should set this configuration to 25. Modify this value based on the degree of parallelism or the number of executors that your spark jobs are configured for. |
70
+
| spark.cassandra.connection.connections_per_executor_max (Spark 2.x) spark.cassandra.connection.remoteConnectionsPerExecutor (Spark 3.x) | None | Maximum number of connections per node per executor. 10*n is equivalent to 10 connections per node in an n-node Cassandra cluster. So, if you require five connections per node per executor for a five node Cassandra cluster, then you should set this configuration to 25. Modify this value based on the degree of parallelism or the number of executors that your spark jobs are configured for. |
71
71
| spark.cassandra.output.concurrent.writes | 100 | Defines the number of parallel writes that can occur per executor. Because you set "batch.size.rows" to 1, make sure to scale up this value accordingly. Modify this value based on the degree of parallelism or the throughput that you want to achieve for your workload. |
72
72
| spark.cassandra.concurrent.reads | 512 | Defines the number of parallel reads that can occur per executor. Modify this value based on the degree of parallelism or the throughput that you want to achieve for your workload |
73
73
| spark.cassandra.output.throughput_mb_per_sec | None | Defines the total write throughput per executor. This parameter can be used as an upper limit for your spark job throughput, and base it on the provisioned throughput of your Cosmos container. |
74
74
| spark.cassandra.input.reads_per_sec| None | Defines the total read throughput per executor. This parameter can be used as an upper limit for your spark job throughput, and base it on the provisioned throughput of your Cosmos container. |
75
75
| spark.cassandra.output.batch.grouping.buffer.size | 1000 | Defines the number of batches per single spark task that can be stored in memory before sending to Cassandra API |
76
76
| spark.cassandra.connection.keep_alive_ms | 60000 | Defines the period of time until which unused connections are available. |
77
77
78
-
Adjust the throughput and degree of parallelism of these parameters based on the workload you expect for your spark jobs, and the throughput you have provisioned for your Cosmos DB account.
78
+
Adjust the throughput and degree of parallelism of these parameters based on the workload you expect for your spark jobs, and the throughput you've provisioned for your Cosmos DB account.
79
79
80
80
81
81
## Connecting to Azure Cosmos DB Cassandra API from Spark
82
82
83
83
### cqlsh
84
-
The following commands detail how to connect to Azure CosmosDB Cassandra API from cqlsh. This is useful for validation as you run through the samples in Spark.<br>
84
+
The following commands detail how to connect to Azure Cosmos DB Cassandra API from cqlsh. This is useful for validation as you run through the samples in Spark.<br>
0 commit comments