Optimize ScyllaDB's batch writes #4047
Conversation
This stack of pull requests is managed by Graphite.
Do we need this if we never have batches that affect multiple shards? Can we enforce that each view is on a single shard, and then just not batch anything else?
Yes, even if all queries in the batch are for the same partition key, the Rust driver will still just send the batch to a random node by default.
Yes, but that's a much bigger change, I think 😅 both enforcing that each view is on a single shard, and not batching.
```rust
const KEYSPACE: &str = "kv";

/// The default size of the cache for the load balancing policies.
const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: usize = 50_000;
```
Does this work now, as of Rust 1.83.0?
```diff
-const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: usize = 50_000;
+const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: NonZeroUsize = NonZeroUsize::new(50_000).unwrap();
```
Seems like it does!
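For reference, `Option::unwrap` became callable in `const` contexts as of Rust 1.83, which is what makes the suggested constant compile. A minimal sketch (the constant name mirrors the one in the diff above):

```rust
use std::num::NonZeroUsize;

// Compile-time construction: a zero value would fail the build instead of panicking at runtime.
// Requires Rust 1.83+ (const-stable `Option::unwrap`).
const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: NonZeroUsize =
    NonZeroUsize::new(50_000).unwrap();

fn main() {
    println!("cache size: {}", DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE);
}
```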
```rust
    policy
}
Err(error) => {
    // Cache that the policy creation failed, so we don't try again too soon, and don't
```
Is this expected? Should we log if that happens?
I can do a WARN here I think, but it shouldn't happen too many times AFAIU
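A minimal sketch of what that WARN could look like, assuming the crate already depends on `tracing` (the function name, the stripped-down entry type, and the message wording are illustrative, not the PR's actual code):

```rust
use std::time::Instant;

// Placeholder cache entry standing in for the real one.
enum LoadBalancingPolicyCacheEntry {
    NotReady(Instant),
}

// Sketch of the suggested warning on policy-creation failure.
fn on_policy_creation_failure(error: &dyn std::error::Error) -> LoadBalancingPolicyCacheEntry {
    // This path should be rare (the driver just hasn't refreshed its metadata yet),
    // so a WARN per failed attempt shouldn't be noisy.
    tracing::warn!("creating the sticky load balancing policy failed: {error}; will retry later");
    // Cache the failure time so the next attempt waits before retrying.
    LoadBalancingPolicyCacheEntry::NotReady(Instant::now())
}
```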
```rust
match policy {
    LoadBalancingPolicyCacheEntry::Ready(policy) => policy.clone(),
    LoadBalancingPolicyCacheEntry::NotReady(timestamp, token) => {
        if Timestamp::now().delta_since(*timestamp)
```
Maybe we should use Instance here instead of Timestamp? Our linera_base timestamp type is just there to define and serialize timestamps in the protocol as u64s. Since this is only used locally, I'd go with the standard library. Also, this expression could then be timestamp.elapsed().
Sorry, I meant Instant.
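A minimal sketch of the `Instant`-based check, under the assumption that the retry delay eventually comes from the client config (the constant and function names here are illustrative):

```rust
use std::time::{Duration, Instant};

// Illustrative retry delay; the real value comes from the configuration.
const RETRY_DELAY: Duration = Duration::from_secs(2);

fn should_retry(last_attempt: Instant) -> bool {
    // `Instant` is monotonic and purely local, so it fits better than the
    // protocol-level `Timestamp` type; `elapsed()` replaces `delta_since`.
    last_attempt.elapsed() >= RETRY_DELAY
}

fn main() {
    let last_attempt = Instant::now();
    assert!(!should_retry(last_attempt));
}
```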
```rust
pub struct ScyllaDbClientConfig {
    /// The delay before the sticky load balancing policy creation is retried.
    pub delay_before_sticky_load_balancing_policy_retry_ms: u64,
}
```
Now that we have a ScyllaDbClientConfig, why not put the DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE in it?
Also, the replication_factor could have its place here. It does not have to be in this PR, though.
I could, sure! And on the replication_factor, yeah, that's on my TODO list 😅 I noticed a while back that it doesn't belong in the common config, as it's actually specific to ScyllaDb.
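A minimal sketch of what moving those values into the config could look like (only the retry delay field exists in this PR; the other fields are hypothetical additions discussed in this thread, not the final shape):

```rust
use std::num::NonZeroUsize;

/// Sketch of an extended client config.
pub struct ScyllaDbClientConfig {
    /// The delay before the sticky load balancing policy creation is retried.
    pub delay_before_sticky_load_balancing_policy_retry_ms: u64,
    /// Hypothetical: size of the LRU cache for the load balancing policies.
    pub load_balancing_policy_cache_size: NonZeroUsize,
    /// Hypothetical: replication factor, currently living in the common config.
    pub replication_factor: u32,
}
```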
```rust
enum LoadBalancingPolicyCacheEntry {
    Ready(Arc<dyn LoadBalancingPolicy>),
    // The timestamp of the last time the policy creation was attempted.
    NotReady(Instant, Option<Token>),
}
```
I have a maybe-stupid concern, but I think the sharding policy in ScyllaDb is dynamic.
So I wonder whether this affects the caching.
Good question. For a given partition_key, the Token is always unique, as it's always calculated by the same hash function, so the Token won't change. As for the endpoints, AFAIU we only reshuffle token ranges (causing the nodes that hold a given token to change) when we scale ScyllaDB up or down, i.e. add or remove VMs in the NodePool, run ALTER TABLE commands, things like that.
We currently don't do that, and it would complicate the code a bit to add support for it now, so I would rather leave it for when it's needed.
Wait, we don't support changes in the sharding assignment on the ScyllaDb side?
This sticky load balancing policy part currently doesn't, but we can add support for that in a follow up PR, I think :)
No obstacle on my side, but I would like to see benchmarks confirming the improvements before merging.
Benchmarks can be added to the PR description.
```rust
if let Some(policy) = cache.get(partition_key) {
    match policy {
        LoadBalancingPolicyCacheEntry::Ready(policy) => policy.clone(),
```
No need for nesting if let and match:
```rust
match cache.get(partition_key) {
    Some(LoadBalancingPolicyCacheEntry::Ready(policy)) => ...
    ...
    None => ...
}
```
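For completeness, a minimal self-contained sketch of the flattened match, using a plain `HashMap` and stand-in types in place of the real cache and driver policy (all names are illustrative; the real code uses an LRU cache):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Instant;

// Stand-ins so the example compiles on its own; the real types come from the driver.
trait LoadBalancingPolicy {}
type Token = i64;

enum LoadBalancingPolicyCacheEntry {
    Ready(Arc<dyn LoadBalancingPolicy>),
    NotReady(Instant, Option<Token>),
}

fn lookup(
    cache: &HashMap<Vec<u8>, LoadBalancingPolicyCacheEntry>,
    partition_key: &[u8],
) -> Option<Arc<dyn LoadBalancingPolicy>> {
    // A single `match` on the Option, instead of `if let Some(..)` wrapping another `match`.
    match cache.get(partition_key) {
        Some(LoadBalancingPolicyCacheEntry::Ready(policy)) => Some(policy.clone()),
        Some(LoadBalancingPolicyCacheEntry::NotReady(_timestamp, _token)) => None,
        None => None,
    }
}
```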
Let's experiment with this on a separate branch first. The fact that this doesn't support changes in the shard assignments by ScyllaDb is a problem.
I can add support for that in this PR, or in a follow-up one, to not inflate the size of this one even more (I won't merge this without it; I'll probably merge the full stack at once). I was thinking of creating a stack that we know works, then merging the full stack once we agree it gets things to a good state (instead of doing a separate branch).
Closing this, as we went with a simpler approach.
Motivation
It is a known issue that batches are not token aware in ScyllaDB's Rust driver. With the default load balancing policies, a batch is sent to a random node, which then forwards the statements to the proper nodes; that extra network hop on most batch requests is what makes batches not token aware.
So currently, if someone using the Rust driver needs atomicity, they can use batches, but they take a performance hit because the batch won't be token aware.
To get the best performance on batches while keeping the per-partition atomicity they guarantee, we would need shard-aware batching, which isn't yet supported in the Rust driver.
There is some work being attempted on "shard aware batching", but one of the reviewers is arguing that there are ways of solving this problem that don't involve user code, namely by creating a custom Load Balancing Policy, which is what I'm doing in this PR.
Proposal
Build a custom "Sticky" Load Balancing Policy. This policy is specific to a given partition: given the partition, it remembers the (node, shard) pairs of all the replicas containing that partition's data. Every batch we send for that partition then goes to one of those replicas, in a round-robin fashion, to spread load across the replicas.
We'll have an LRU cache keyed on the partition key that contains either a Ready value or a NotReady value. The reason for this is that there are cases where we try to get the endpoint information for a token, but the Rust driver hasn't updated its metadata about the table yet, so that information isn't available. If we have a Ready value, we already have the actual "sticky" policy with the (node, shard) endpoints, and we're good to go. If we have a NotReady value, it holds the timestamp of the last time we attempted to get the endpoints. We always wait at least 2 seconds before trying again, to give the driver time to update itself and to avoid overloading it with these endpoint requests. Until then we use the default policy and take a small performance hit, but only for a very limited time.
The NotReady state can also already contain the Token for that partition, in case we managed to calculate it in the last attempt. The Token is calculated by taking a Murmur3 hash of the table specs and the partition key; if the table doesn't change, that Token will never change for this partition key. Since hashing is involved, we cache it to avoid repeating that work.
If we ever decide to auto-scale our ScyllaDB deployment based on load, we'll need to add a mechanism here to invalidate these cache entries when that happens.
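As a rough, self-contained sketch of the mechanism described above (the types, the 2-second constant, and the round-robin counter are stand-ins for illustration, not the PR's exact code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{Duration, Instant};

// Stand-ins for the driver's types so the sketch compiles on its own.
type Token = i64;
#[derive(Clone, Copy)]
struct ReplicaEndpoint {
    node: usize,
    shard: u32,
}

/// The "sticky" policy for one partition: it remembers the partition's replicas
/// and rotates through them to spread load.
struct StickyPolicy {
    replicas: Vec<ReplicaEndpoint>,
    next: AtomicUsize,
}

impl StickyPolicy {
    /// Round-robin over the replicas that own this partition's data.
    /// Assumes `replicas` is non-empty.
    fn pick_replica(&self) -> ReplicaEndpoint {
        let index = self.next.fetch_add(1, Ordering::Relaxed) % self.replicas.len();
        self.replicas[index]
    }
}

/// Cache entry: either the sticky policy is ready, or we remember when we last
/// tried to build it (plus the Token, if we already computed it).
enum CacheEntry {
    Ready(StickyPolicy),
    NotReady(Instant, Option<Token>),
}

/// Illustrative retry delay; the real value comes from the client config.
const RETRY_DELAY: Duration = Duration::from_secs(2);

/// Whether a NotReady entry is due for another attempt at building the policy.
fn should_retry(entry: &CacheEntry) -> bool {
    match entry {
        CacheEntry::Ready(_) => false,
        CacheEntry::NotReady(last_attempt, _token) => last_attempt.elapsed() >= RETRY_DELAY,
    }
}
```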
Test Plan
CI, plus I won't merge before benchmarking this code together with the new key space partitioning PR, to make sure the performance is what we expect.
Release Plan