session: fix closing semantics #328

wprzytula · 2025-06-25T09:36:46Z

Note: generated with GPT-4o and manually edited.

Fix Session Closing Semantics and Enable `AsyncTests` Suite

Summary

This pull request introduces critical improvements to the session closing mechanism. Additionally, it enables the AsyncTests::Close test suite, aligns the behavior with the expectations of the CPP Driver, and satisfies the contract defined in the cassandra.h documentation regarding cass_session_free() and cass_session_close().

Key Changes

Empirical Proof of Flaws in Current Implementation:
- A temporary commit (913a7c2b) was introduced to empirically demonstrate flaws in the current implementation. The AsyncTests::Close test was tuned to fail by increasing concurrent requests, adding sleep times, and switching to a multi-threaded runtime with a hardcoded number of worker threads. This highlights issues with session closure that runs concurrently with in-flight requests.
- Subsequent commits address these flaws by ensuring synchronous read-locking of the session upon scheduling a request.
- Note: This temporary commit will be removed before merging the PR into the master branch.
Simplified Future Logic:
- Introduced CassFuture::make_ready_raw() to streamline the creation of ready futures.
- Ensured that session connectivity is checked synchronously before creating operation futures, reducing complexity and potential errors.
Improved Session Closing Logic:
- The RwLock mechanism now protects the session from premature drops: every running request synchronously takes a read lock. This guarantees that cass_session_close() and cass_session_free() block until all requests are completed, aligning with the expectations of the AsyncTests::Close suite. The lock is taken synchronously but without blocking, which prevents the issue described in #329 (Panic when cass_future_error_code() is called from a future callback).
- The session closing process now ensures (by virtue of RwLock) that all in-flight requests are completed before the session is freed. This prevents potential data loss or inconsistencies during session closure.
Enabled AsyncTests::Close Suite:
- The Session::execute(_batch) methods have already been cloning the Session's Arc, preventing use-after-free (UAF) scenarios when the session is closed while requests are still running [introduced in c1e40d7].
Implemented a unit test for cass_session_free:
- It ensures the function synchronously waits for all in-flight requests to complete.
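The blocking behaviour listed above can be illustrated with a minimal sketch. It is an assumption-laden model, not the wrapper's code: it uses `std::sync::RwLock` and plain threads instead of the Tokio lock, and `SessionInner`, `start_request`, `close_session`, and `close_while_requests_run` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::{Arc, RwLock};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the connected session state.
struct SessionInner {
    completed: AtomicUsize,
}

type SharedSession = Arc<RwLock<Option<SessionInner>>>;

/// Starts a simulated request. The read lock is already held when this
/// function returns and is released only once the request has finished.
fn start_request(session: &SharedSession, work: Duration) -> JoinHandle<()> {
    let session = Arc::clone(session);
    let (locked_tx, locked_rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let guard = session.read().unwrap();
        locked_tx.send(()).unwrap();
        thread::sleep(work); // pretend to talk to the database
        if let Some(inner) = guard.as_ref() {
            inner.completed.fetch_add(1, Ordering::SeqCst);
        }
        // `guard` is dropped here, releasing the read lock.
    });
    locked_rx.recv().unwrap(); // wait until the worker holds the read lock
    handle
}

/// Simulated close/free: taking the write lock blocks until every read
/// guard has been dropped, i.e. until all in-flight requests are done.
fn close_session(session: &SharedSession) -> usize {
    let mut guard = session.write().unwrap();
    guard
        .take()
        .map_or(0, |inner| inner.completed.load(Ordering::SeqCst))
}

/// Starts `requests` requests, closes the session right away, and returns
/// how many requests had completed by the time the close went through.
fn close_while_requests_run(requests: usize) -> usize {
    let session: SharedSession = Arc::new(RwLock::new(Some(SessionInner {
        completed: AtomicUsize::new(0),
    })));
    let handles: Vec<_> = (0..requests)
        .map(|_| start_request(&session, Duration::from_millis(50)))
        .collect();
    let completed = close_session(&session);
    for handle in handles {
        handle.join().unwrap();
    }
    completed
}
```

Calling `close_while_requests_run(3)` closes the session while three requests are still sleeping; because the writer has to wait for all readers, the close observes all three as completed.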

Notes to reviewers

Verify that the AsyncTests::Close test's semantics are satisfied.
Ensure that session operations behave as expected under concurrent request and session closure scenarios.
Make sure that the session access/modify semantics are correct:
- preparing statements,
- executing statements,
- executing batches,
- connecting session,
- closing session.
Confirm that no deadlocks occur in the current_thread runtime.
Validate that the temporary commit (913a7c2b) is removed before merging.

Fixes: #304
Fixes: #329

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
PR description sums up the changes and reasons why they should be introduced.
I have implemented Rust unit tests for the features/changes introduced.
I have enabled appropriate tests in Makefile in {SCYLLA,CASSANDRA}_(NO_VALGRIND_)TEST_FILTER.
I added appropriate Fixes: annotations to PR description.

wprzytula · 2025-06-30T16:54:06Z

The approach taken is wrong: callbacks will panic due to the #329 problem.
This must wait until I find an approach that will fix #329 and support current_thread runtime at the same time.

wprzytula · 2025-07-06T12:51:40Z

Rebased on master.

wprzytula · 2025-07-08T11:50:31Z

Reprioritized to P0 due to relation to #329.

Lorak-mmk

I think I found a serious error - lmk if I'm right.

scylla-rust-wrapper/src/session.rs

Lorak-mmk · 2025-07-08T14:40:21Z

+        let prev = cass_session.connected.compare_and_swap(
+            &None::<Arc<CassConnectedSession>>,
+            Some(Arc::new(CassConnectedSession {
+                session,
+                exec_profile_map,
+            })),
+        );
+        if prev.is_some() {
+            return Err(error());
+        }
+


A bit sad that if there are multiple connection attempts, all of them will happen. Previously, the others would wait on the lock.
I don't think it is an important scenario, so it is acceptable.

The caller is responsible for not connecting session concurrently.

scylla-rust-wrapper/src/session.rs

Lorak-mmk · 2025-07-08T15:17:56Z

+        // We start by setting the Notify, so that we won't lose a wakeup
+        // if the last request finishes before we set it.
+        cass_session_connected
+            .requests_finished_notify
+            .set(Notify::new())
+            .expect(
+                "The swap guarantees that only one thread takes the connected session. \
+                And this should never be an issue IRL, because session should be closed \
+                from a single thread, only once",
+            );
+
        let fut = async move {
-            // TODO: add waiting for the pending requests to finish.
-            let _ = cass_session_connected;
+            while cass_session_connected.has_pending_requests() {
+                // Wait for all pending requests to finish.
+                // This will block until the last request finishes and calls `requests_finished_notify.notify_one()`.
+                cass_session_connected
+                    .requests_finished_notify
+                    .get()
+                    .expect("We have initialized the OnceLock prior")
+                    .notified()
+                    .await;
+            }



Isn't the following race possible?

1. No requests are pending.
2. Thread 1 starts cass_session_execute and performs load_full on the session.
3. Thread 2 performs cass_session_free.
4. Thread 2 sets the notify, checks that the session has no pending requests, and finishes.
5. Thread 1 pends a request and performs it.

Yep. I forgot to reason about multi-threaded concurrency.

I think the described scenario can be fixed by a "check-set-confirm" (my own crafted name) operation, by requiring the session to not be closed yet after the request is pended. What do I mean?

1. Thread 1 starts cass_session_execute. It performs load_full on session.
2. Thread 2 performs cass_session_free.
3. Thread 2 sets notify, checks that session has no pending requests, finishes.
4. Thread 1 pends a request.
5. (the changed part) Thread 1 verifies that the session is still open. If it is, it performs the request. If it is not, it unpends the request and returns an error ("session has been closed").

Why is this correct?
cass_session_close's future must wait for running requests. It's not clear who was first - session closing or a pended request - in a parallel scenario. Then, I believe we are free to refuse starting the request that encountered a closed session after it has been pended.
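To make the "check-set-confirm" idea concrete, here is a minimal sketch using only standard-library atomics. It is not the code that was pushed to the PR: `SessionState` and its methods are hypothetical, and the wakeup of a waiting closer (the `Notify` part) is left out.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical bookkeeping shared by request-making and closing code.
struct SessionState {
    pending: AtomicUsize,
    closed: AtomicBool,
}

impl SessionState {
    fn new() -> Self {
        SessionState {
            pending: AtomicUsize::new(0),
            closed: AtomicBool::new(false),
        }
    }

    /// "Set": registers a request as pending.
    fn pend(&self) {
        self.pending.fetch_add(1, Ordering::SeqCst);
    }

    /// "Confirm": re-checks that the session is still open after pending.
    /// If it was closed in the meantime, the request is unpended and refused.
    fn confirm(&self) -> bool {
        if self.closed.load(Ordering::SeqCst) {
            self.pending.fetch_sub(1, Ordering::SeqCst);
            return false;
        }
        true
    }

    /// Full check-set-confirm sequence. Returns `true` if the request may run;
    /// the caller must then call `finish_request` once it is done.
    fn try_begin_request(&self) -> bool {
        if self.closed.load(Ordering::SeqCst) {
            return false; // "check": session already closed
        }
        self.pend();
        self.confirm()
    }

    fn finish_request(&self) {
        self.pending.fetch_sub(1, Ordering::SeqCst);
    }

    /// Marks the session closed and returns how many requests the closer
    /// still has to wait for.
    fn begin_close(&self) -> usize {
        self.closed.store(true, Ordering::SeqCst);
        self.pending.load(Ordering::SeqCst)
    }
}
```

With sequentially consistent ordering, either the closer's load sees the pended request and waits for it, or the request's confirming load sees the closed flag and backs out. In a full implementation the back-out path would also have to wake a closer that had already counted the request.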

Note: we must prevent the ABA problem, i.e., distinguish the old session from the newly connected, different session.

I believe this can be done by merely using Arc::ptr_eq on the old and the new Arc<CassConnectedSession>.

If you see a way to break this, we could always use a static usize counter that will be unique for every connected session.
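A small sketch of the `Arc::ptr_eq` check, under the assumption that the request keeps its own clone of the `Arc` it loaded. `ConnectedSession` and `is_same_session` are hypothetical names used only for illustration.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `CassConnectedSession`.
#[allow(dead_code)]
struct ConnectedSession {
    keyspace: String,
}

/// Returns `true` only if the currently installed session is the very same
/// allocation that the request observed when it started.
fn is_same_session(
    seen: &Arc<ConnectedSession>,
    current: Option<&Arc<ConnectedSession>>,
) -> bool {
    current.map_or(false, |cur| Arc::ptr_eq(seen, cur))
}
```

Because the request still holds `seen`, that allocation cannot be freed and reused while the comparison is made, so a newly connected session with identical contents still compares as different.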

WDYT?

I've pushed a commit which I believe solves the problem. Please verify it.

wprzytula · 2025-07-14T12:22:14Z

@Lorak-mmk I've redone this in a massively simplified way, leveraging the existing RwLock. In short, nonblocking variants of read() functions were used to prevent #329 issues, resulting in working code that passes both tests and the "callbacks" example. LMK if this looks correct to you.

Copilot

Pull Request Overview

This PR refines session closing semantics to ensure all in-flight requests complete before freeing, streamlines future creation, updates the Tokio runtime to multi-threaded, and fully enables the AsyncTests::Close suite with associated C++ and Rust tests.

Introduce make_ready_raw() for simpler ready-future creation and refactor async session operations to use non-blocking try_read/write_owned locks.
Update the global Tokio runtime to a multi-threaded builder and adjust tests to validate session-free blocking behavior.
Enable and tune the AsyncTests suite, add sleep delays in C++ async tests, and implement a Rust integration test for cass_session_free.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/src/integration/tests/test_async.cpp | Reduced concurrent request count to 5 and added `insert.set_sleep_time(100)` to force overlap in close test. |
| scylla-rust-wrapper/src/session.rs | Refactored session connect/execute/close calls to use `try_read_owned`/`write_owned`, added null/disconnect checks. |
| scylla-rust-wrapper/src/lib.rs | Changed `RUNTIME` to use a multi-threaded Tokio runtime with 2 worker threads. |
| scylla-rust-wrapper/src/future.rs | Added `make_ready_raw()` helper and removed an unused import. |
| Makefile | Enabled `AsyncTests.*` filter for both Scylla and Cassandra test targets. |

scylla-rust-wrapper/src/session.rs

tests/src/integration/tests/test_async.cpp

Lorak-mmk

It looks really good now! I just left a few minor comments.

scylla-rust-wrapper/src/session.rs

Lorak-mmk · 2025-07-14T19:02:06Z


    fn connect(
-        session: Arc<RwLock<CassSessionInner>>,
+        session: Arc<CassSession>,
        cluster: &CassCluster,
        keyspace: Option<String>,
    ) -> CassOwnedSharedPtr<CassFuture, CMut> {
        let session_builder = cluster.build_session_builder();
        let exec_profile_map = cluster.execution_profile_map().clone();
        let host_filter = cluster.build_host_filter();
-
-        let mut session_guard = RUNTIME.block_on(session.write_owned());
-
-        if let Some(cluster_client_id) = cluster.get_client_id() {
-            // If the user set a client id, use it instead of the random one.
-            session_guard.client_id = cluster_client_id;
-        }
+        let cluster_client_id = cluster.get_client_id();

        let fut = Self::connect_fut(
-            session_guard,
+            session,
            session_builder,
+            cluster_client_id,
            exec_profile_map,
            host_filter,
            keyspace,


I really like the new approach. The only thing I don't yet understand is why we need to revert the approach to handling the client id. connect is not something that would be used in callbacks, right? In that case it should be fine to use block_on here. What am I missing?

I'm asking because I know you wanted client id to work synchronously, so it is surprising that you abandoned the idea.

> connect is not something that would be used in callbacks, right?

I won't assume so.

> I'm asking because I know you wanted client id to work synchronously, so it is surprising that you abandoned the idea.

I haven't found a way to make it work with the RwLock, while retaining full API usability in callbacks.

wprzytula · 2025-07-15T07:57:43Z

Addressed the comments.

Lorak-mmk

Looks good. Please remember to remove the TMP commit before merging!
I'd be perfectly ok with cass_session_connect not working in callbacks (and it would give us the desired semantics for client id), but I leave that up to you.

The mechanism around locking the session has been problematic in a number of ways.

**1. Late locking**: The primary issue was that requests would not lock the session for reading until their futures were polled, which meant that the following code could lead to request failure due to the session having been closed:

```c
CassFuture *exec_fut = cass_session_execute(cass_session, cass_statement);
CassFuture *close_fut = cass_session_close(cass_session);
cass_future_wait(exec_fut);
cass_future_wait(close_fut);
```

This is because locking the session for reading is done asynchronously to the code that follows the `cass_session_execute` invocation. The same issue was present in all request-making functions, that is, `cass_session_execute`, `cass_session_execute_batch`, and all of the `cass_session_prepare*` family of functions.

**2. Thread-blocking locking**: The second issue was that, in some cases, the session was locked for reading by idly blocking the current thread. This was the case for `cass_session_get_metrics`, `cass_session_get_schema_meta`, and `cass_session_get_client_id`. This could lead to deadlocks, especially when the number of threads in the thread pool was low (with the `current_thread` tokio executor being the most vulnerable case).

**3. Runtime-blocking locking**: The third issue was that, in the case of the `cass_session_connect*` functions, the session was locked for writing by blocking the current thread as the executor thread for the awaited future. While this was designed with the `current_thread` executor in mind and worked perfectly for its case, it turned out to cause a panic when called by a tokio executor thread.

**Solution**: This commit addresses all of the above issues by adopting an asymmetric locking mechanism for the session. The session is now locked for reading in advance yet fallibly, by calling `try_read(_owned)` on the session rwlock. This is done in the request-making functions, so that the session is guaranteed to be locked for reading when the request future is returned. The session is still locked for writing (upon connecting or closing) asynchronously, by calling and awaiting `write(_owned)` on the rwlock. This is done in the `cass_session_connect*` functions and in `cass_session_close`. A downside of this approach is that the session is not guaranteed to be locked for writing when the `cass_session_connect*` or `cass_session_close` futures are returned, but this is not a problem because closing and connecting are considered to be "long-running", complex operations and thus are not expected to have conducted a specific part of their logic by the time their future is returned.

**Results**: All enabled tests still pass, while the `callbacks` example now passes, too! The best part is that the number and complexity of the required changes is minimal, and the code is now much more robust. I hope @Lorak-mmk will be happy with this solution, as compared to the complex requests pending mechanism and atomics.

Note that the test for `cass_session_get_client_id` had to be adjusted. This is because the session has the client ID set only in the connect future instead of in the connect function synchronously, so the test (which did not await the connect future) would fail after the changes.
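The "in advance yet fallibly" read locking from the commit message above can be sketched with the standard library's `RwLock::try_read`. This is only an illustration under stated assumptions: the wrapper uses Tokio's lock, and `Session`, `RequestError`, `current_keyspace`, and the choice of error returned on contention are all hypothetical.

```rust
use std::sync::{RwLock, TryLockError};

/// Hypothetical error type for a request that could not take the read lock.
#[derive(Debug, PartialEq)]
enum RequestError {
    /// A writer (connect or close) currently holds the lock.
    SessionBusy,
    /// The lock was free, but no session is connected.
    NotConnected,
    /// A previous holder of the lock panicked.
    Poisoned,
}

/// Hypothetical stand-in for the connected session.
struct Session {
    keyspace: String,
}

/// Reads session state without ever blocking the calling thread: if the lock
/// cannot be taken immediately, an error is returned instead of waiting.
fn current_keyspace(session: &RwLock<Option<Session>>) -> Result<String, RequestError> {
    match session.try_read() {
        Ok(guard) => guard
            .as_ref()
            .map(|s| s.keyspace.clone())
            .ok_or(RequestError::NotConnected),
        Err(TryLockError::WouldBlock) => Err(RequestError::SessionBusy),
        Err(TryLockError::Poisoned(_)) => Err(RequestError::Poisoned),
    }
}
```

Since the call never waits, it is safe to make from a thread that must not block, which is the property the commit relies on to keep the API usable from future callbacks.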

Since the commit c1e40d7, the Session `execute(_batch)` methods now clone the Session's Arc, which prevents UAF if the Session is closed while the requests are still running. That commit's message says: "we cannot enable `AsyncTests::Close` yet since it expects that prematurely dropped session awaits all async tasks before closing". This is now taken care of by the previous commits. Thus, we can enable the `AsyncTests::Close` test suite.

wprzytula · 2025-07-15T11:56:44Z

Rebased on master and dropped the temporary commit.

wprzytula self-assigned this Jun 25, 2025

wprzytula added this to the 0.6 milestone Jun 25, 2025

wprzytula added bug Something isn't working P1 P1 priority item - very important labels Jun 25, 2025

wprzytula requested review from Copilot and Lorak-mmk June 25, 2025 10:12

wprzytula force-pushed the fix-session-close branch from fead714 to b477893 Compare June 25, 2025 10:14

wprzytula marked this pull request as ready for review June 25, 2025 10:37

wprzytula mentioned this pull request Jun 27, 2025

Panic when cass_future_error_code() is called from a future callback #329

Closed

wprzytula removed the request for review from Lorak-mmk June 30, 2025 16:53

wprzytula marked this pull request as draft July 1, 2025 06:36

wprzytula modified the milestones: 0.6, 0.5.1 Jul 6, 2025

wprzytula force-pushed the fix-session-close branch from b477893 to e3c2b0a Compare July 6, 2025 12:51

wprzytula mentioned this pull request Jul 6, 2025

cass_session_free waits for session close #338

Merged

4 tasks

wprzytula force-pushed the fix-session-close branch from e3c2b0a to 1c31e9c Compare July 8, 2025 11:35

wprzytula added P0 P0 item - absolute must have and removed P1 P1 priority item - very important labels Jul 8, 2025

wprzytula marked this pull request as ready for review July 8, 2025 12:06

wprzytula requested review from Lorak-mmk and Copilot July 8, 2025 12:06

wprzytula force-pushed the fix-session-close branch 2 times, most recently from 8b3237b to 41c4046 Compare July 8, 2025 12:20

Lorak-mmk requested changes Jul 8, 2025

wprzytula force-pushed the fix-session-close branch from 41c4046 to e848d07 Compare July 8, 2025 18:12

wprzytula requested a review from Lorak-mmk July 8, 2025 20:04

wprzytula force-pushed the fix-session-close branch 4 times, most recently from 4a93ded to 5d0b098 Compare July 14, 2025 12:05

wprzytula requested review from Lorak-mmk and removed request for Lorak-mmk July 14, 2025 12:11

wprzytula requested a review from Copilot July 14, 2025 12:22

Copilot AI reviewed Jul 14, 2025

wprzytula force-pushed the fix-session-close branch from 5d0b098 to 6ee294a Compare July 14, 2025 12:36

Lorak-mmk reviewed Jul 14, 2025

wprzytula force-pushed the fix-session-close branch from 6ee294a to afb9683 Compare July 15, 2025 07:57

wprzytula requested a review from Lorak-mmk July 15, 2025 11:38

Lorak-mmk approved these changes Jul 15, 2025

wprzytula added 7 commits July 15, 2025 13:56

session: use close_fut in cass_session_close

151f07a

By accident, the code was not deduplicated to use the `close_fut` function.

session: test_with_one_proxy_one -> "_one$"/""

04784a5

The function mistakenly had the `one` part of its name duplicated.

session: introduce mock_init_rules

96f1c21

As this set of rules is going to be useful for more than one test, it makes sense to make it a separate function.

future: introduce CassFuture::make_ready_raw()

6c5b2a2

session: unit test for cass_session_free

42d7cc4

This tests that `cass_session_free` synchronously waits for all in-flight requests to complete before freeing the session.

wprzytula force-pushed the fix-session-close branch from afb9683 to 42d7cc4 Compare July 15, 2025 11:56

wprzytula requested a review from Lorak-mmk July 15, 2025 11:56

Lorak-mmk approved these changes Jul 15, 2025

wprzytula merged commit 85d4eb8 into scylladb:master Jul 15, 2025
11 checks passed

wprzytula deleted the fix-session-close branch July 15, 2025 12:25

wprzytula mentioned this pull request Jul 15, 2025

Release 0.5.1 #346

Merged
