Add Bound{Statement,Batch} #1321

nrxus · 2025-04-14T05:53:13Z

This PR adds a version of BoundStatement and BoundBatch to hopefully help inspire your future design of these structs.

As discussed in: #941, you all wish to design these features yourselves (for totally understandable reasons) so this PR is here just to help inspire the future implementation, perhaps discuss decisions/API, or whatever else I may be able to help with. I will likely use this branch in my own work to solve my need for this API temporarily until an official version is up.

`BoundStatement`

pub struct BoundStatement {
    pub(crate) prepared: PreparedStatement,
    pub(crate) values: SerializedValues,
}

This is probably the easiest to understand. It is just a PreparedStatement with a bound SerializedValues. The only public way to create this struct is through PreparedStatement::bind(self, impl: SerializeRow), thus making it externally type safe at creation, while internally it erases the type and just keeps the byte buffer (SerializedValues).

This struct is useful to:

Keep a prepared statement and its values together in a non-generic way by type erasing at the moment of creation.
Allow users to get the Token related to the prepared statement + values combo without needing to double-serialize (once to get the token and then again when executing the statement).

All existing internal APIs where a PreparedStatement and its relevant values were passed in together are rewritten to use BoundStatement instead, while all external APIs remain the same so as to not introduce breaking changes.
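To make the type-erasure idea concrete, here is a minimal sketch using toy stand-ins rather than the driver's types. `ToyRow`, `ToyPrepared`, `ToyBound`, the byte encoding, and `toy_token` are all assumptions made up for illustration; the real `SerializeRow`, `SerializedValues`, and token calculation work differently.

```rust
/// Hypothetical stand-in for `SerializeRow`: anything that can write itself into a byte buffer.
trait ToyRow {
    fn serialize(&self, out: &mut Vec<u8>);
}

impl<'a> ToyRow for (i32, &'a str) {
    fn serialize(&self, out: &mut Vec<u8>) {
        // Toy encoding: big-endian int, then a length-prefixed string.
        out.extend_from_slice(&self.0.to_be_bytes());
        out.extend_from_slice(&(self.1.len() as u32).to_be_bytes());
        out.extend_from_slice(self.1.as_bytes());
    }
}

/// Hypothetical prepared statement.
#[derive(Debug, Clone)]
struct ToyPrepared {
    query: String,
}

/// Non-generic: the row type is gone, only the serialized bytes remain.
#[derive(Debug, Clone)]
struct ToyBound {
    prepared: ToyPrepared,
    values: Vec<u8>,
}

impl ToyPrepared {
    /// Generic at the call site, type-erased in the return type.
    fn bind(self, row: &impl ToyRow) -> ToyBound {
        let mut values = Vec::new();
        row.serialize(&mut values);
        ToyBound { prepared: self, values }
    }
}

impl ToyBound {
    /// Toy "token" computed from the already-serialized bytes, so the row is never serialized twice.
    fn toy_token(&self) -> u64 {
        self.values
            .iter()
            .fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u64))
    }
}
```

The point of the sketch is only the shape: `bind` is the single generic entry point, and everything after it (token calculation, execution) works on bytes.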

`BoundBatch`

pub struct BoundBatch {
    pub(crate) buffer: Vec<u8>,
    pub(crate) prepared: HashMap<Bytes, PreparedStatement>,
    first_prepared: Option<(PreparedStatement, Token)>,
    pub(crate) statements_len: u16,
    
    /* snip the less relevant fields */
}

This is where things begin to get interesting. While a naive implementation of BoundBatch that simply keeps a Vec<PreparedStatement> could work, that'd take away some of the current niceties of how Batch is serialized. Namely, when serializing a batch and its values, it doesn't create a small buffer for each serialized row but instead serializes every statement + value into one joined buffer, thus avoiding multiple small allocations. To let this optimization live on while still allowing each statement + value to be passed in sequentially, the BoundBatch instead keeps a single buffer where it serializes each statement + value as they are passed in. When the bound batch is executed, this large buffer is copied into the request buffer in one go. Since we are serializing as we go, we need to keep track of a few extra details:

A map of prepared statements, keyed by ID: this allows the batch to re-prepare its statements when the DB returns an error saying that a statement it expected to be prepared is not.
The first statement and its token: if that statement was prepared, so that the batch can use it for sharding purposes (batch goes to the shard of the first prepared statement)
The length of its statements: needed to make sure we haven't reached the limit and so we can serialize the length when making the request.
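The bookkeeping described above can be sketched roughly as follows. This is not the PR's code: `ToyStmt`, `ToyBoundBatch`, `TooManyStatements`, the framing written into the buffer, and passing the token in by hand are all simplifying assumptions; the real struct serializes in the CQL batch format and computes the token itself.

```rust
use std::collections::HashMap;

/// Hypothetical prepared statement: an id plus the query text.
#[derive(Debug, Clone, PartialEq)]
struct ToyStmt {
    id: Vec<u8>,
    query: String,
}

#[derive(Debug, Default)]
struct ToyBoundBatch {
    /// Every statement + values pair, serialized back to back.
    buffer: Vec<u8>,
    /// Prepared statements by id, kept so they could be re-prepared later.
    prepared: HashMap<Vec<u8>, ToyStmt>,
    /// First prepared statement and its (toy) token, used for routing.
    first_prepared: Option<(ToyStmt, u64)>,
    /// Number of statements appended so far.
    statements_len: u16,
}

#[derive(Debug, PartialEq)]
struct TooManyStatements;

impl ToyBoundBatch {
    /// Serializes one statement and its values straight into the shared buffer.
    fn append(&mut self, stmt: &ToyStmt, values: &[u8], token: u64) -> Result<(), TooManyStatements> {
        // Refuse to grow past what a u16 counter can express.
        let new_len = self.statements_len.checked_add(1).ok_or(TooManyStatements)?;
        // Toy framing: [id length][id][values length][values].
        self.buffer.extend_from_slice(&(stmt.id.len() as u16).to_be_bytes());
        self.buffer.extend_from_slice(&stmt.id);
        self.buffer.extend_from_slice(&(values.len() as u32).to_be_bytes());
        self.buffer.extend_from_slice(values);
        self.prepared.entry(stmt.id.clone()).or_insert_with(|| stmt.clone());
        if self.first_prepared.is_none() {
            self.first_prepared = Some((stmt.clone(), token));
        }
        self.statements_len = new_len;
        Ok(())
    }

    /// On execution, the statement count is written and the whole buffer is copied in one go.
    fn write_request(&self, request: &mut Vec<u8>) {
        request.extend_from_slice(&self.statements_len.to_be_bytes());
        request.extend_from_slice(&self.buffer);
    }
}
```

Appending the same statement twice only grows the buffer and the counter; the map still holds one entry per statement id.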

This struct is useful to:

Keep a batch and its values together in a non-generic way by type erasing at the moment of creation.
More fool-proof in that the number of statements and values never go out-of-sync
Allows adding a BoundStatement to avoid re-serializing values. This is useful because it is common for batches to be more efficient when they are all for the same token. So now a user can make a BoundStatement, use it to calculate its token, and based on that decide what BoundBatch instance to put the BoundStatement into. BoundBatch can easily copy the serialized bytes out of BoundStatement, thus saving CPU resources.
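The "compute the token first, then pick the batch" flow from the last point could look something like the sketch below. `ToyBoundStmt`, `ToyTokenBatch`, `append_bound`, and `group_by_token` are hypothetical names, and the token is stored as a plain field here instead of being calculated from the values.

```rust
use std::collections::HashMap;

/// Hypothetical bound statement: values are already serialized, the token may be unknown.
#[derive(Debug, Clone)]
struct ToyBoundStmt {
    values: Vec<u8>,
    token: Option<i64>,
}

/// Hypothetical batch that stores serialized statements in one buffer.
#[derive(Debug, Default)]
struct ToyTokenBatch {
    buffer: Vec<u8>,
    statements_len: u16,
}

impl ToyTokenBatch {
    /// Copies the already-serialized bytes; no row is serialized again.
    fn append_bound(&mut self, bound: &ToyBoundStmt) {
        self.buffer.extend_from_slice(&bound.values);
        self.statements_len += 1;
    }
}

/// Routes each bound statement to the batch that shares its token.
fn group_by_token(bound: &[ToyBoundStmt]) -> HashMap<Option<i64>, ToyTokenBatch> {
    let mut batches: HashMap<Option<i64>, ToyTokenBatch> = HashMap::new();
    for b in bound {
        batches.entry(b.token).or_default().append_bound(b);
    }
    batches
}
```

Statements without a token end up together under the `None` key, which mirrors keying the map by an `Option`.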

Note that this implementation of BoundBatch allows unprepared statements to be added if and only if there are no values associated with the unprepared statement. This follows the existing logic of Batch, where any unprepared statement that had values would be prepared before the request was made; now we are just making it explicit that the user has to prepare it if they want to use BoundBatch. The internal implementation of executing a Batch has been rewritten to first prepare the statements in the Batch (just as it used to do, but earlier now) and then create a BoundBatch. This keeps the logic of executing a batch all in one place, so that all the existing tests of Batch end up testing BoundBatch execution as well.

`Execute` trait

This is a sealed trait (users of the crate can call its methods but not implement the trait) for types that can be executed on a Session without any additional values. For now this is directly implemented only on BoundBatch. Existing methods have been rewritten to use this trait, but externally nothing has changed other than this trait being usable publicly (so someone can call my_bound_batch.execute(&session)).
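For readers unfamiliar with sealing, the usual pattern is a public trait with a supertrait that lives in a private module. The sketch below shows that pattern only; `ToySession`, `ToyExecBatch`, and the synchronous `execute` returning a byte count are assumptions for illustration and do not match the PR's async signatures.

```rust
mod sealed {
    /// Code outside this crate cannot name this trait, so it cannot implement `Execute`.
    pub trait Sealed {}
}

/// Hypothetical session that just records what was sent.
#[derive(Debug, Default)]
pub struct ToySession {
    pub sent: Vec<Vec<u8>>,
}

/// Callable by anyone, implementable only where `sealed::Sealed` is reachable.
pub trait Execute: sealed::Sealed {
    fn execute(&self, session: &mut ToySession) -> usize;
}

/// Hypothetical bound batch holding pre-serialized bytes.
#[derive(Debug, Default)]
pub struct ToyExecBatch {
    pub buffer: Vec<u8>,
}

impl sealed::Sealed for ToyExecBatch {}

impl Execute for ToyExecBatch {
    /// "Sends" the buffer and reports how many bytes went out.
    fn execute(&self, session: &mut ToySession) -> usize {
        session.sent.push(self.buffer.clone());
        self.buffer.len()
    }
}
```

Sealing keeps the freedom to add methods to the trait later without that being a breaking change for downstream implementors, since there cannot be any.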

`ExecutePageable` trait

Another sealed trait for types that can be executed on a Session but are aware of pagination. This is implemented directly on BoundStatement and (Statement, impl SerializeValues). Any type that implements ExecutePageable will also auto-implement Execute, with the implementation calling the pageable methods with no page limit and no initial paging state. This allows BoundStatement and (Statement, impl SerializeValues) to have the same three ways to be called now: no pagination from the start, no pagination from a saved point, and pagination from a saved point. This is done via a single method that has a const generic, but it could easily be refactored into two methods instead (that call the generic method under the hood).
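A rough sketch of the blanket-impl-plus-const-generic arrangement, under heavy simplification: the traits here are unsealed and synchronous, `ToyPagingState` is just a row index, and `ToyRows` returns rows from an in-memory vector. None of these signatures are taken from the PR.

```rust
/// Hypothetical paging state: index of the next row to return.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ToyPagingState(pub usize);

pub trait ExecutePageable {
    /// `SINGLE_PAGE = true` returns at most `page_size` rows; `false` ignores the page size.
    fn execute_pageable<const SINGLE_PAGE: bool>(
        &self,
        page_size: usize,
        state: ToyPagingState,
    ) -> (Vec<u32>, Option<ToyPagingState>);
}

pub trait Execute {
    fn execute(&self) -> Vec<u32>;
}

/// Blanket impl: every pageable type is executable with no page limit and a fresh paging state.
impl<T: ExecutePageable> Execute for T {
    fn execute(&self) -> Vec<u32> {
        self.execute_pageable::<false>(0, ToyPagingState(0)).0
    }
}

/// Hypothetical bound statement whose "result set" is a fixed list of rows.
pub struct ToyRows {
    pub rows: Vec<u32>,
}

impl ExecutePageable for ToyRows {
    fn execute_pageable<const SINGLE_PAGE: bool>(
        &self,
        page_size: usize,
        state: ToyPagingState,
    ) -> (Vec<u32>, Option<ToyPagingState>) {
        let start = state.0.min(self.rows.len());
        let end = if SINGLE_PAGE {
            (start + page_size).min(self.rows.len())
        } else {
            self.rows.len()
        };
        // A paging state is returned only when rows remain after this call.
        let next = (end < self.rows.len()).then(|| ToyPagingState(end));
        (self.rows[start..end].to_vec(), next)
    }
}
```

With this shape, `execute()` covers the unpaged call from the start, while `execute_pageable::<false>` and `execute_pageable::<true>` cover resuming from a saved point without and with a page limit.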

Other random notes

Because the scylla-cql crate is already in v1.x, and the Batch struct in that crate has its definition fully public, I couldn't change it at all, so instead I had to create a new BatchV2. I left the current implementation of Batch (de)serialization in place and instead made requests with the batch opcode (de)serialize using BatchV2. This allows code to compile, but it is technically perhaps a breaking change: if a foreign crate relies on scylla-cql deserializing to Batch instead of the new BatchV2, then their code will break at runtime. I can't think of a reason why someone would do that, but I figured it was worth bringing up anyway.

Fixes: #941

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

github-actions · 2025-04-14T05:55:24Z

cargo semver-checks found no API-breaking changes in this PR.
Checked commit: 2b5f6c6

wprzytula · 2025-04-14T11:40:13Z

Other random notes

Because the scylla-cql crate is already in v1.x, and the Batch struct in that crate has its definition fully public, I couldn't change it at all, so instead I had to create a new BatchV2. I left the current implementation of Batch (de)serialization in place and instead made requests with the batch opcode (de)serialize using BatchV2. This allows code to compile, but it is technically perhaps a breaking change: if a foreign crate relies on scylla-cql deserializing to Batch instead of the new BatchV2, then their code will break at runtime. I can't think of a reason why someone would do that, but I figured it was worth bringing up anyway.

Your concern is justified. The scenario indeed may surprise the users with unexpected runtime behaviour.

There exists a general solution to this problem: all affected items from scylla-cql (by affected I mean having their logical behaviour altered) must be replaced with v2 versions. In this case, we could duplicate Request along with all functions that operate on it. Then, scylla would only use v2, whereas scylla-cql users would stay at the previous versions.

If you say that it's a lot of duplicated code, then I agree. If you say it's a huge effort to duplicate everything, I also agree.

Therefore, we likely prefer a non-general solution in this case, but rather one that is suited for this particular case. @Lorak-mmk has some idea IIUC.

nrxus · 2025-04-17T22:41:15Z

I've begun using this branch at my job, and we've noticed a very clear drop in how long it takes us to process our data; we've reduced our number of scaled instances as well.

For context, the specific workload is: we get a giant file with lots of different types of events that we want to save onto different tables, but there are often multiple events that go into the same table and partition. To do this we make batches that we keep in essentially a HashMap<Option<u64>, Batch>. Using this branch, we've also been able to optimize some of our code to place values into the BoundBatch as we transform our data into our struct that implements SerializeRow, instead of needing to keep a vector of all of them around until the batch gets executed, as we had to do with the previous API.

Our average processing time for a file went from ~150ms to ~50ms, so a drop of roughly 66%. Of course this is anecdotal and specific to my use case of batching, so YMMV.

wprzytula

I stopped the review on the third commit for now.

scylla/src/client/pager.rs

wprzytula · 2025-04-29T14:38:45Z

scylla/src/client/session.rs

        values: impl SerializeRow,
        paging_state: PagingState,
    ) -> Result<(QueryResult, PagingStateResponse), ExecutionError> {
-        self.do_execute_single_page(prepared, values, paging_state)
-            .await
+        let statement = prepared.clone().bind(&values)?;
+        self.do_execute_single_page(&statement, paging_state).await
    }


💭 I think we'd prefer not to require cloning PreparedStatement (even though it should be quite cheap) in execute_{unpaged,single_page}(). Could we instead introduce (for internal, but perhaps also external use) borrowed BoundStatements? Ones that would borrow PreparedStatements, but own SerializedValues.

User might like to do something like this:

let prepared = session.prepare(...).await?;
let mut bound = Vec::new();
bound.push(prepared.bind_by_ref(...));
bound.push(prepared.bind_by_ref(...));
...
let results = futures::future::try_join_all(
    bound.into_iter()
        .map(|bound| session.execute_bound(bound))
).await?;

Yeah that should be very doable. I was assuming that cloning a PreparedStatement was cheap enough as to not matter but I can add a BoundStatementRef or something like that.

I have modified the existing BoundStatement to hold a Cow<PreparedStatement> such that it allows a borrowed or an owned version of PreparedStatement. I then modified the existing PreparedStatement::bind to borrow the prepared statement by reference and added a new method that moves it instead.
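As a rough illustration of the Cow-based shape described here, with toy types only: `ToyPreparedStmt`, `ToyCowBound`, `bind`, `into_bind`, and `into_owned` below are hypothetical stand-ins, and the "values" are raw bytes rather than a serialized row.

```rust
use std::borrow::Cow;

/// Hypothetical prepared statement.
#[derive(Debug, Clone, PartialEq)]
struct ToyPreparedStmt {
    query: String,
}

/// Bound statement that either borrows or owns its prepared statement.
#[derive(Debug, Clone)]
struct ToyCowBound<'p> {
    prepared: Cow<'p, ToyPreparedStmt>,
    values: Vec<u8>,
}

impl ToyPreparedStmt {
    /// Borrows the prepared statement; no clone happens here.
    fn bind(&self, values: &[u8]) -> ToyCowBound<'_> {
        ToyCowBound { prepared: Cow::Borrowed(self), values: values.to_vec() }
    }

    /// Moves the prepared statement into the bound statement.
    fn into_bind(self, values: &[u8]) -> ToyCowBound<'static> {
        ToyCowBound { prepared: Cow::Owned(self), values: values.to_vec() }
    }
}

impl ToyCowBound<'_> {
    /// Detaches from the borrow, cloning only if the statement was borrowed.
    fn into_owned(self) -> ToyCowBound<'static> {
        ToyCowBound {
            prepared: Cow::Owned(self.prepared.into_owned()),
            values: self.values,
        }
    }
}
```

Several bound statements can share one borrowed prepared statement this way, and any of them can be turned into a `'static` value when it needs to outlive the original.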

wprzytula · 2025-04-29T15:12:54Z

scylla/src/statement/bound.rs

+/// Represents a statement that already had all its values bound
+#[derive(Debug, Clone)]
+pub struct BoundStatement {
+    pub(crate) prepared: PreparedStatement,
+    pub(crate) values: SerializedValues,
+}
+
+impl BoundStatement {
+    pub(crate) fn new(
+        prepared: PreparedStatement,
+        values: &impl SerializeRow,
+    ) -> Result<BoundStatement, SerializationError> {
+        let values = prepared.serialize_values(values)?;
+        Ok(Self { prepared, values })
+    }
+
+    /// Determines which values constitute the partition key and puts them in order.
+    ///
+    /// This is a preparation step necessary for calculating token based on a prepared statement.
+    pub(crate) fn pk(&self) -> Result<PartitionKey<'_>, PartitionKeyExtractionError> {
+        PartitionKey::new(self.prepared.get_prepared_metadata(), &self.values)
+    }
+
+    pub(crate) fn pk_and_token(
+        &self,
+    ) -> Result<Option<(PartitionKey<'_>, Token)>, PartitionKeyError> {
+        if !self.prepared.is_token_aware() {
+            return Ok(None);
+        }
+
+        let partition_key = self.pk()?;
+        let token = partition_key.calculate_token(self.prepared.get_partitioner_name())?;
+        Ok(Some((partition_key, token)))
+    }
+
+    /// Calculates the token for the prepared statement and its bound values
+    ///
+    /// Returns the token that would be computed for executing the provided prepared statement with
+    /// the provided values.
+    pub fn token(&self) -> Result<Option<Token>, PartitionKeyError> {
+        self.pk_and_token().map(|p| p.map(|(_, t)| t))
+    }
+
+    /// Returns the prepared statement behind the `BoundStatement`
+    pub fn prepared(&self) -> &PreparedStatement {
+        &self.prepared
+    }


❓ The commit message says that the BoundStatement is internal-only. At the same time, it's pub and it has pub methods.

There is no public way to create it, the only method PreparedStatement::bind is pub(crate). But I am fine making the struct itself pub(crate) as well for this commit and then undoing it in the commit that makes it public.

wprzytula · 2025-04-29T15:26:20Z

scylla/src/statement/bound.rs

+    /// Determines which values constitute the partition key and puts them in order.
+    ///
+    /// This is a preparation step necessary for calculating token based on a prepared statement.
+    pub(crate) fn pk(&self) -> Result<PartitionKey<'_>, PartitionKeyExtractionError> {
+        PartitionKey::new(self.prepared.get_prepared_metadata(), &self.values)
+    }
+
+    pub(crate) fn pk_and_token(
+        &self,
+    ) -> Result<Option<(PartitionKey<'_>, Token)>, PartitionKeyError> {
+        if !self.prepared.is_token_aware() {
+            return Ok(None);
+        }
+
+        let partition_key = self.pk()?;
+        let token = partition_key.calculate_token(self.prepared.get_partitioner_name())?;
+        Ok(Some((partition_key, token)))
+    }
+
+    /// Calculates the token for the prepared statement and its bound values
+    ///
+    /// Returns the token that would be computed for executing the provided prepared statement with
+    /// the provided values.
+    pub fn token(&self) -> Result<Option<Token>, PartitionKeyError> {
+        self.pk_and_token().map(|p| p.map(|(_, t)| t))
+    }


🔧 The previous method names (e.g., extract_partition_key_and_calculate_token()) were verbose but informative. Their names warned that they required computations; the new names suggest a cheap, getter-like operation.

Fair enough, and I am okay changing it back, but if I could push back just slightly: the previous one was more expensive because it also serialized the values as part of that calculation. For this method the values have already been serialized, so it is a cheaper calculation. Additionally, based on the types, it is already implied that some sort of calculation needs to happen at runtime, since the operation can fail. Although... looking at the code, it is still a bit complex, so I am fine adding calculate back in there or something like that.

wprzytula · 2025-04-29T15:38:21Z

scylla/src/client/session.rs

+    async fn last_minute_prepare_batch<'b>(
+        &self,
+        init_batch: &'b Batch,
+        values: impl BatchValues,
+    ) -> Result<Cow<'b, Batch>, PrepareError> {


❓ What does the name of this method mean?

I couldn't think of a good name for it and I wanted to get the idea out there rather than bikeshed on it, my bad.

Basically this is the preparing of the batch that occurs at the moment a batch gets executed, if any of its statements with values weren't already prepared. This logic already existed, but it was done in a later step and I had to move it here, since I wanted the execution of Batch to be built upon BoundBatch to make it easier to catch bugs instead of duplicating logic. BoundBatch does not allow unprepared statements with values, so this "last minute prepare" was moved here.
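The `Cow<'b, Batch>` return type is what lets the common case (everything already prepared) avoid a copy. A toy sketch of that idea follows; `ToyStatement`, `ToyPlainBatch`, `needs_prepare`, and `prepare_before_execute` are hypothetical, the function is synchronous, and "preparing" is reduced to switching an enum variant.

```rust
use std::borrow::Cow;

/// Hypothetical statement that is either raw query text or already prepared.
#[derive(Debug, Clone, PartialEq)]
enum ToyStatement {
    Unprepared(String),
    Prepared(String),
}

/// Hypothetical batch: just a list of statements.
#[derive(Debug, Clone, PartialEq)]
struct ToyPlainBatch {
    statements: Vec<ToyStatement>,
}

/// An unprepared statement must be prepared only if it comes with values.
fn needs_prepare(stmt: &ToyStatement, has_values: bool) -> bool {
    matches!(stmt, ToyStatement::Unprepared(_)) && has_values
}

/// Borrows the batch if nothing needs preparing; otherwise returns a prepared copy.
fn prepare_before_execute<'b>(batch: &'b ToyPlainBatch, has_values: &[bool]) -> Cow<'b, ToyPlainBatch> {
    let any_unprepared = batch
        .statements
        .iter()
        .zip(has_values)
        .any(|(stmt, has)| needs_prepare(stmt, *has));
    if !any_unprepared {
        return Cow::Borrowed(batch);
    }
    let statements = batch
        .statements
        .iter()
        .zip(has_values)
        .map(|(stmt, has)| match stmt {
            ToyStatement::Unprepared(q) if *has => ToyStatement::Prepared(q.clone()),
            other => other.clone(),
        })
        .collect();
    Cow::Owned(ToyPlainBatch { statements })
}
```

An unprepared statement without values is left alone in both branches, matching the rule that such statements are allowed in a BoundBatch.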

wprzytula · 2025-04-29T15:44:39Z

General thought:
You made PreparedStatement execution use BoundStatement internally. We probably prefer not to do that, because if we don't have to incur the performance penalty involved in BoundStatement's type erasure, we don't want to.
We prefer to use corresponding trait impls (SerializeRow, SerializeValue) to serialize those values straight to the networking buffer in scylla-cql, bypassing the middle man (buffer in BoundStatement).

nrxus · 2025-05-01T05:51:18Z

General thought: You made PreparedStatement execution use BoundStatement internally. We probably prefer not to do that, because if we don't have to incur the performance penalty involved in BoundStatement's type erasure, we don't want to. We prefer to use corresponding trait impls (SerializeRow, SerializeValue) to serialize those values straight to the networking buffer in scylla-cql, bypassing the middle man (buffer in BoundStatement).

That doesn't seem to match the existing implementation (in main) for executing a prepared statement. As far as I can tell, one of the first things the code does is serialize the values.

Prepared statement unpaged:

scylla-rust-driver/scylla/src/client/session.rs

Line 1299 in 0085612

let serialized_values = prepared.serialize_values(&values)?;

Prepared statement single page:

scylla-rust-driver/scylla/src/client/session.rs

Line 1318 in 0085612

let serialized_values = prepared.serialize_values(&values)?;

Prepared statement iterating:

scylla-rust-driver/scylla/src/client/session.rs

Line 1438 in 0085612

let serialized_values = prepared.serialize_values(&values)?;

In fact, even for unprepared statements, as long as there are values, they eventually get prepared and serialized, instead of serializing straight into the networking buffer:

scylla-rust-driver/scylla/src/client/session.rs

Lines 1085 to 1086 in 0085612

    
let prepared = connection.prepare(statement).await?;
let serialized = prepared.serialize_values(values_ref)?;

The only case I see where we don't pre-serialize values is unprepared statements without values. But BoundStatement is explicitly for statements that were prepared and have a value so I don't think that matters here.

That all being said, you all do in fact do that for Batch, which is why I implemented the slightly complicated strategy of serializing as statements are pushed into one buffer to avoid the cost of many small buffers as you mentioned.

wprzytula assigned nrxus Apr 14, 2025

wprzytula added the enhancement New feature or request label Apr 14, 2025

wprzytula added this to the 1.3.0 milestone Apr 14, 2025

nrxus changed the title ~~Add internal boundstatement~~ Add Bound{Statement,Batch} Apr 14, 2025

nrxus force-pushed the add-internal-boundstatement branch 4 times, most recently from 156aab0 to ee9b930 Compare April 16, 2025 23:00

wprzytula requested review from Lorak-mmk and wprzytula April 19, 2025 10:59

wprzytula reviewed Apr 29, 2025

nrxus force-pushed the add-internal-boundstatement branch from ee9b930 to 8481e96 Compare May 1, 2025 03:16

nrxus force-pushed the add-internal-boundstatement branch from 8481e96 to f9ec0a2 Compare May 2, 2025 00:53

nrxus requested a review from wprzytula May 2, 2025 00:55

nrxus force-pushed the add-internal-boundstatement branch from f9ec0a2 to fe53d08 Compare May 28, 2025 02:20

Lorak-mmk assigned wprzytula May 28, 2025

nrxus force-pushed the add-internal-boundstatement branch 2 times, most recently from 49d7424 to 1a2e7f3 Compare June 27, 2025 18:08

wprzytula modified the milestones: 1.3.0, 1.4.0 Jul 8, 2025

nrxus added 4 commits July 8, 2025 18:37

Add an internal only BoundStatement

62aa961

this struct keeps track of a PreparedStatement and SerializedValues

Add an internal only BoundBatch

dd7d532

switch scylla-cql's request::Batch to use new BatchV2 version

80308b5

this is the version that the top crate (scylla) will use to send batches

expose creation of BoundBatch

8979702

it implements Default same as `Batch`, and it also allows for override of the batch_type same as `Batch`

nrxus force-pushed the add-internal-boundstatement branch from 1a2e7f3 to bafcdf9 Compare July 9, 2025 01:37

nrxus added 2 commits July 8, 2025 19:04

Allow adding statements to a boundbatch and executing it

af88283

add BoundStatement::into_owned

2b5f6c6

allows users to transform an existing bound statement into one that doesn't borrow the prepared statement (by cloning it)

nrxus force-pushed the add-internal-boundstatement branch from bafcdf9 to 2b5f6c6 Compare July 9, 2025 02:05

Lorak-mmk marked this pull request as draft September 22, 2025 13:16

Lorak-mmk modified the milestones: 1.4.0, 1.5.0 Sep 23, 2025
