Conversation

wprzytula
Collaborator

@wprzytula wprzytula commented Jun 25, 2025

Note: generated with GPT-4o and manually edited.

Fix Session Closing Semantics and Enable AsyncTests Suite

Summary

This pull request introduces critical improvements to the session closing mechanism. Additionally, it enables the AsyncTests::Close test suite, aligns the behavior with the expectations of the CPP Driver, and satisfies the contract defined in the cassandra.h documentation regarding cass_session_free() and cass_session_close().

Key Changes

  1. Empirical Proof of Flaws in Current Implementation:

    • A temporary commit (913a7c2b) was introduced to empirically demonstrate flaws in the current implementation. The AsyncTests::Close test was tuned to fail by increasing concurrent requests, adding sleep times, and switching to a multi-threaded runtime with a hardcoded number of worker threads. This highlights issues that arise when the session is closed concurrently with running requests.
    • Subsequent commits address these flaws by ensuring synchronous read-locking of the session upon scheduling a request.
    • Note: This temporary commit will be removed before merging the PR into the master branch.
  2. Simplified Future Logic:

    • Introduced CassFuture::make_ready_raw() to streamline the creation of ready futures.
    • Ensured that session connectivity is checked synchronously before creating operation futures, reducing complexity and potential errors.
  3. Improved Session Closing Logic:

    • The RwLock mechanism now ensures that the session is protected from premature drops by synchronously taking a read lock for all running requests. This guarantees that cass_session_close() and cass_session_free() block until all requests are completed, aligning with the expectations of the AsyncTests::Close suite. The lock is taken synchronously but without blocking, which prevents the issue described in #329 ("Panic when cass_future_error_code() is called from a future callback").
    • The session closing process now ensures (by virtue of RwLock) that all in-flight requests are completed before the session is freed. This prevents potential data loss or inconsistencies during session closure.
  4. Enabled AsyncTests::Close Suite:

    • The Session::execute(_batch) methods have already been cloning the Session's Arc, preventing use-after-free (UAF) scenarios when the session is closed while requests are still running [introduced in c1e40d7].
  5. Implemented a unit test for cass_session_free:

    • It ensures the function synchronously waits for all in-flight requests to complete.
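
The closing contract described in the list above (requests register synchronously and without blocking, while closing waits for all of them) can be modelled with the standard library alone. This is only a sketch under stated assumptions: `Gate`, `RequestGuard`, `try_enter`, and `close` are hypothetical names invented for illustration, and the PR itself relies on the session's existing RwLock rather than on a counter and a condition variable.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the session lock: counts in-flight requests.
#[derive(Default)]
struct Gate {
    state: Mutex<State>,
    idle: Condvar,
}

#[derive(Default)]
struct State {
    in_flight: usize,
    closed: bool,
}

/// RAII token held for the whole lifetime of one request.
struct RequestGuard(Arc<Gate>);

impl Gate {
    /// Synchronous and non-blocking: fails instead of waiting once closing has begun.
    fn try_enter(self: &Arc<Self>) -> Option<RequestGuard> {
        let mut st = self.state.lock().unwrap();
        if st.closed {
            return None;
        }
        st.in_flight += 1;
        Some(RequestGuard(Arc::clone(self)))
    }

    /// Blocks the caller until every outstanding guard has been dropped.
    fn close(&self) {
        let mut st = self.state.lock().unwrap();
        st.closed = true;
        while st.in_flight > 0 {
            st = self.idle.wait(st).unwrap();
        }
    }
}

impl Drop for RequestGuard {
    fn drop(&mut self) {
        let mut st = self.0.state.lock().unwrap();
        st.in_flight -= 1;
        if st.in_flight == 0 {
            self.0.idle.notify_all();
        }
    }
}

/// Returns (request finished before `close` returned, new request refused after close).
fn demo() -> (bool, bool) {
    let gate = Arc::new(Gate::default());
    let done = Arc::new(Mutex::new(false));

    // The guard is taken before the "request" is handed to another thread.
    let guard = gate.try_enter().expect("gate is open");
    let done_in_worker = Arc::clone(&done);
    let worker = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50)); // simulated request latency
        *done_in_worker.lock().unwrap() = true;
        drop(guard); // request finished: release the gate
    });

    gate.close(); // returns only after the guard above has been dropped
    let finished = *done.lock().unwrap();
    let refused = gate.try_enter().is_none();
    worker.join().unwrap();
    (finished, refused)
}

fn main() {
    let (finished, refused) = demo();
    println!("finished = {finished}, refused = {refused}");
}
```

Because the guard is acquired before the work is handed off, the closing side cannot miss a request that has already been scheduled, which is the property the AsyncTests::Close suite checks for.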

Notes to reviewers

  • Verify that the AsyncTests::Close test's semantics are satisfied.
  • Ensure that session operations behave as expected under concurrent request and session closure scenarios.
  • Make sure that the session access/modify semantics are correct:
    • preparing statements,
    • executing statements,
    • executing batches,
    • connecting session,
    • closing session.
  • Confirm that no deadlocks occur in the current_thread runtime.
  • Validate that the temporary commit (913a7c2b) is removed before merging.

Fixes: #304
Fixes: #329

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have implemented Rust unit tests for the features/changes introduced.
  • I have enabled appropriate tests in Makefile in {SCYLLA,CASSANDRA}_(NO_VALGRIND_)TEST_FILTER.
  • I added appropriate Fixes: annotations to PR description.

@wprzytula wprzytula self-assigned this Jun 25, 2025
@wprzytula wprzytula added this to the 0.6 milestone Jun 25, 2025
@wprzytula wprzytula added bug Something isn't working P1 P1 priority item - very important labels Jun 25, 2025
@wprzytula wprzytula requested review from Copilot and Lorak-mmk June 25, 2025 10:12
Copilot

This comment was marked as outdated.

@wprzytula wprzytula force-pushed the fix-session-close branch from fead714 to b477893 Compare June 25, 2025 10:14
@wprzytula wprzytula marked this pull request as ready for review June 25, 2025 10:37
@wprzytula wprzytula removed the request for review from Lorak-mmk June 30, 2025 16:53
@wprzytula
Collaborator Author

wprzytula commented Jun 30, 2025

The approach taken is wrong: callbacks will panic due to the #329 problem.
This must wait until I find an approach that will fix #329 and support current_thread runtime at the same time.

@wprzytula wprzytula marked this pull request as draft July 1, 2025 06:36
@wprzytula wprzytula modified the milestones: 0.6, 0.5.1 Jul 6, 2025
@wprzytula wprzytula force-pushed the fix-session-close branch from b477893 to e3c2b0a Compare July 6, 2025 12:51
@wprzytula
Collaborator Author

Rebased on master.

@wprzytula wprzytula force-pushed the fix-session-close branch from e3c2b0a to 1c31e9c Compare July 8, 2025 11:35
@wprzytula wprzytula added P0 P0 item - absolute must have and removed P1 P1 priority item - very important labels Jul 8, 2025
@wprzytula
Collaborator Author

Reprioritized to P0 due to relation to #329.

@wprzytula wprzytula marked this pull request as ready for review July 8, 2025 12:06
@wprzytula wprzytula requested review from Lorak-mmk and Copilot July 8, 2025 12:06
Copilot

This comment was marked as outdated.

@wprzytula wprzytula force-pushed the fix-session-close branch 2 times, most recently from 8b3237b to 41c4046 Compare July 8, 2025 12:20
Collaborator

@Lorak-mmk Lorak-mmk left a comment

I think I found a serious error - lmk if I'm right.

Comment on lines 153 to 341
let prev = cass_session.connected.compare_and_swap(
&None::<Arc<CassConnectedSession>>,
Some(Arc::new(CassConnectedSession {
session,
exec_profile_map,
})),
);
if prev.is_some() {
return Err(error());
}

Collaborator

A bit sad that if there are multiple connection attempts, all of them will happen. Previously, the others would wait on the lock.
I don't think it is an important scenario, so it is acceptable.

Collaborator Author

The caller is responsible for not connecting session concurrently.

Comment on lines 317 to 172
// We start by setting the Notify, so that we won't lose a wakeup
// if the last request finishes before we set it.
cass_session_connected
.requests_finished_notify
.set(Notify::new())
.expect(
"The swap guarantees that only one thread takes the connected session. \
And this should never be an issue IRL, because session should be closed \
from a single thread, only once",
);

let fut = async move {
// TODO: add waiting for the pending requests to finish.
let _ = cass_session_connected;
while cass_session_connected.has_pending_requests() {
// Wait for all pending requests to finish.
// This will block until the last request finishes and calls `requests_finished_notify.notify_one()`.
cass_session_connected
.requests_finished_notify
.get()
.expect("We have initialized the OnceLock prior")
.notified()
.await;
}

Collaborator

Isn't the following race possible?

  • No requests are pending
  • Thread 1 starts cass_session_execute. It performs load_full on session.
  • Thread 2 performs cass_session_free
  • Thread 2 sets notify, checks that session has no pending requests, finishes
  • Thread 1 pends a request and performs it.

Collaborator Author

@wprzytula wprzytula Jul 8, 2025

Yep. I forgot to reason about multi-threaded concurrency.

Collaborator Author

@wprzytula wprzytula Jul 8, 2025

I think the described scenario can be fixed by a "check-set-confirm" (my own crafted name) operation, by requiring the session to not be closed yet after the request is pended. What do I mean?

  • Thread 1 starts cass_session_execute. It performs load_full on session.
  • Thread 2 performs cass_session_free
  • Thread 2 sets notify, checks that session has no pending requests, finishes
  • Thread 1 pends a request.
  • (the changed part) Thread 1 verifies that the session is still open. If it is, it performs the request. If it is not, it unpends the request and returns an error ("session has been closed").

Why is this correct?
cass_session_close's future must wait for running requests. It's not clear who was first - session closing or a pended request - in a parallel scenario. Then, I believe we are free to refuse starting the request that encountered a closed session after it has been pended.
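
A minimal sketch of the "check-set-confirm" idea, assuming a pending-request counter and a closed flag kept as atomics. `PendingTracker`, `begin_request`, `finish_request`, `begin_close`, and `SessionClosed` are hypothetical names used only for illustration; they are not taken from the PR's code.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical bookkeeping for one connected session.
#[derive(Default)]
struct PendingTracker {
    pending: AtomicUsize,
    closed: AtomicBool,
}

#[derive(Debug, PartialEq)]
struct SessionClosed;

impl PendingTracker {
    /// Set, then confirm: pend the request first, re-check `closed` afterwards.
    fn begin_request(&self) -> Result<(), SessionClosed> {
        self.pending.fetch_add(1, Ordering::SeqCst); // set: pend the request
        if self.closed.load(Ordering::SeqCst) {
            // Confirm failed: the session was closed in the meantime, so unpend.
            self.pending.fetch_sub(1, Ordering::SeqCst);
            return Err(SessionClosed);
        }
        Ok(())
    }

    /// Called when a successfully started request completes.
    fn finish_request(&self) {
        self.pending.fetch_sub(1, Ordering::SeqCst);
    }

    /// Marks the session closed and returns how many requests are still pending.
    fn begin_close(&self) -> usize {
        self.closed.store(true, Ordering::SeqCst);
        self.pending.load(Ordering::SeqCst)
    }
}

fn main() {
    let tracker = PendingTracker::default();
    assert_eq!(tracker.begin_request(), Ok(())); // started before close
    assert_eq!(tracker.begin_close(), 1); // close must wait for one request
    assert_eq!(tracker.begin_request(), Err(SessionClosed)); // late request refused
    tracker.finish_request();
    assert_eq!(tracker.pending.load(Ordering::SeqCst), 0);
}
```

With sequentially consistent ordering, the closing side stores `closed` before loading `pending`, and the requesting side increments `pending` before loading `closed`, so at least one of them observes the other. A real implementation would still need the closing side to wait (and be woken) until the counter drops to zero.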

Note: we must prevent the ABA problem, i.e., distinguish the old session from the newly connected, different session.

  1. I believe this can be done by merely using Arc::ptr_eq on the old and the new Arc<CassConnectedSession>.
  2. If you see a way to break this, we could always use a static usize counter that will be unique for every connected session.
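
The first option can be illustrated as follows. This is a sketch under assumptions: `ConnectedSession` and `is_same_session` are hypothetical placeholders, and the session slot is modelled as a plain `Option<Arc<_>>` rather than the atomic container used in the PR.

```rust
use std::sync::Arc;

/// Hypothetical placeholder for the connected-session state.
struct ConnectedSession {
    name: &'static str,
}

/// True only if `current` holds the very same allocation that the request captured.
fn is_same_session(
    captured: &Arc<ConnectedSession>,
    current: &Option<Arc<ConnectedSession>>,
) -> bool {
    match current {
        Some(cur) => Arc::ptr_eq(captured, cur),
        None => false,
    }
}

fn main() {
    let first = Arc::new(ConnectedSession { name: "first" });
    let captured = Arc::clone(&first); // what a request grabbed when it started
    let mut slot = Some(first);
    assert!(is_same_session(&captured, &slot));

    // Simulate close followed by reconnect: a different allocation.
    slot = Some(Arc::new(ConnectedSession { name: "second" }));
    assert!(!is_same_session(&captured, &slot));

    // Simulate a closed session.
    slot = None;
    assert!(!is_same_session(&captured, &slot));
    println!("captured session: {}", captured.name);
}
```

As long as the request keeps its captured `Arc` alive, that allocation cannot be freed and reused, so a newly connected session cannot end up at the same address during the comparison.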

WDYT?

Collaborator Author

@wprzytula wprzytula Jul 8, 2025

I've pushed a commit which I believe solves the problem. Please verify it.

@wprzytula wprzytula force-pushed the fix-session-close branch from 41c4046 to e848d07 Compare July 8, 2025 18:12
@wprzytula wprzytula requested a review from Lorak-mmk July 8, 2025 20:04
@wprzytula wprzytula force-pushed the fix-session-close branch 4 times, most recently from 4a93ded to 5d0b098 Compare July 14, 2025 12:05
@wprzytula wprzytula requested review from Lorak-mmk and removed request for Lorak-mmk July 14, 2025 12:11
@wprzytula
Collaborator Author

@Lorak-mmk I've redone this in a massively simplified way, leveraging the existing RwLock. In short, nonblocking variants of read() functions were used to prevent #329 issues, resulting in working code that passes both tests and the "callbacks" example. LMK if this looks correct to you.

@wprzytula wprzytula requested a review from Copilot July 14, 2025 12:22

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refines session closing semantics to ensure all in-flight requests complete before freeing, streamlines future creation, updates the Tokio runtime to multi-threaded, and fully enables the AsyncTests::Close suite with associated C++ and Rust tests.

  • Introduce make_ready_raw() for simpler ready-future creation and refactor async session operations to use non-blocking try_read/write_owned locks.
  • Update the global Tokio runtime to a multi-threaded builder and adjust tests to validate session-free blocking behavior.
  • Enable and tune the AsyncTests suite, add sleep delays in C++ async tests, and implement a Rust integration test for cass_session_free.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/src/integration/tests/test_async.cpp | Reduced concurrent request count to 5 and added insert.set_sleep_time(100) to force overlap in close test. |
| scylla-rust-wrapper/src/session.rs | Refactored session connect/execute/close calls to use try_read_owned/write_owned, added null/disconnect checks. |
| scylla-rust-wrapper/src/lib.rs | Changed RUNTIME to use a multi-threaded Tokio runtime with 2 worker threads. |
| scylla-rust-wrapper/src/future.rs | Added make_ready_raw() helper and removed an unused import. |
| Makefile | Enabled AsyncTests.* filter for both Scylla and Cassandra test targets. |

@wprzytula wprzytula force-pushed the fix-session-close branch from 5d0b098 to 6ee294a Compare July 14, 2025 12:36
Collaborator

@Lorak-mmk Lorak-mmk left a comment

It looks really good now! I just left a few minor comments.

Comment on lines 70 to 87

fn connect(
session: Arc<RwLock<CassSessionInner>>,
session: Arc<CassSession>,
cluster: &CassCluster,
keyspace: Option<String>,
) -> CassOwnedSharedPtr<CassFuture, CMut> {
let session_builder = cluster.build_session_builder();
let exec_profile_map = cluster.execution_profile_map().clone();
let host_filter = cluster.build_host_filter();

let mut session_guard = RUNTIME.block_on(session.write_owned());

if let Some(cluster_client_id) = cluster.get_client_id() {
// If the user set a client id, use it instead of the random one.
session_guard.client_id = cluster_client_id;
}
let cluster_client_id = cluster.get_client_id();

let fut = Self::connect_fut(
session_guard,
session,
session_builder,
cluster_client_id,
exec_profile_map,
host_filter,
keyspace,
Collaborator

I really like the new approach. The only thing I don't yet understand is why we need to revert the approach to handling client id. connect is not something that would be used in callbacks, right? In which case it should be fine to use block_on here. What am I missing?

I'm asking because I know you wanted client id to work synchronously, so it is surprising that you abandoned the idea.

Collaborator Author

> connect is not something that would be used in callbacks, right?

I won't assume so.

> I'm asking because I know you wanted client id to work synchronously, so it is surprising that you abandoned the idea.

I haven't found a way to make it work with the RwLock, while retaining full API usability in callbacks.

@wprzytula wprzytula force-pushed the fix-session-close branch from 6ee294a to afb9683 Compare July 15, 2025 07:57
@wprzytula
Collaborator Author

Addressed the comments.

@wprzytula wprzytula requested a review from Lorak-mmk July 15, 2025 11:38
Collaborator

@Lorak-mmk Lorak-mmk left a comment

Looks good. Please remember to remove the TMP commit before merging!
I'd be perfectly ok with cass_session_connect not working in callbacks (and it would give us the desired semantics for client id), but I leave that up to you.

By accident, the code was not deduplicated to use the `close_fut`
function.

The function mistakenly had the `one` part of its name duplicated.

As this set of rules is going to be useful for more than one test, it
makes sense to make it a separate function.

The mechanism around locking the session has been problematic in a
number of ways.

**1. Late locking**:
The primary issue was that requests would not lock
the session for reading until their futures were polled, which
meant that the following code could lead to request failure due to
the session having been closed:

```c
CassFuture *exec_fut = cass_session_execute(cass_session, cass_statement);
CassFuture *close_fut = cass_session_close(cass_session);
cass_future_wait(exec_fut);
cass_future_wait(close_fut);
```

This is because locking the session for reading is done asynchronously
to the code that follows `cass_session_execute` invocation.
The same issue was present in all request-making functions, that is,
`cass_session_execute`, `cass_session_execute_batch`,
and all of the `cass_session_prepare*` family of functions.

**2. Thread-blocking locking**:
The second issue was that, in some cases, the session was locked for
reading by blocking the current thread idly. This was the case for
`cass_session_get_metrics`, `cass_session_get_schema_meta`,
and `cass_session_get_client_id`. This could lead to deadlocks,
especially when the number of threads in the thread pool was low
(with the `current_thread` tokio executor being the most vulnerable case).

**3. Runtime-blocking locking**:
The third issue was that, in the case of `cass_session_connect*`
functions, the session was locked for writing by blocking the current
thread as the executor thread for the awaited future. While this was
designed with the `current_thread` executor in mind and worked perfectly
for its case, it turned out to cause a panic when called from a tokio
executor thread.

**Solution**:
This commit addresses all of the above issues by adopting an asymmetric
locking mechanism for the session.

The session is now locked for reading in advance yet fallibly, by
calling `try_read(_owned)` on the session rwlock. This is done in
the request-making functions, so that the session is guaranteed to be
locked for reading when the request future is returned.
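
The difference between a blocking and a fallible read lock can be sketched with `std::sync::RwLock`, used here only as a stand-in for the asynchronous rwlock in the wrapper. `SessionSlot`, `ExecError`, and `try_start_request` are hypothetical names, and how the real code reports a busy or disconnected session is an assumption of this sketch.

```rust
use std::sync::{RwLock, TryLockError};

/// Hypothetical session slot: `None` means "not connected".
type SessionSlot = RwLock<Option<String>>;

#[derive(Debug, PartialEq)]
enum ExecError {
    /// A writer (connect or close) currently holds the lock.
    Busy,
    /// The lock was free, but no session is connected.
    NotConnected,
}

/// Takes the read lock without blocking and reports why a request cannot start.
fn try_start_request(slot: &SessionSlot) -> Result<String, ExecError> {
    match slot.try_read() {
        Ok(guard) => (*guard).clone().ok_or(ExecError::NotConnected),
        Err(TryLockError::WouldBlock) => Err(ExecError::Busy),
        Err(TryLockError::Poisoned(_)) => panic!("session lock poisoned"),
    }
}

fn main() {
    let slot: SessionSlot = RwLock::new(None);
    assert_eq!(try_start_request(&slot), Err(ExecError::NotConnected));

    // While a writer holds the lock, the request fails fast instead of waiting.
    let mut writer = slot.write().unwrap();
    assert_eq!(try_start_request(&slot), Err(ExecError::Busy));
    *writer = Some("connected".to_string());
    drop(writer);

    assert_eq!(try_start_request(&slot), Ok("connected".to_string()));
}
```

Failing fast keeps the calling thread free, which matters when that thread is itself an executor thread, as in the callback scenario from #329.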

The session is still locked for writing (upon connecting or closing)
asynchronously, by calling and awaiting `write(_owned)` on the rwlock.
This is done in the `cass_session_connect*` and `cass_session_close` functions.
A downside of this approach is that the session is not guaranteed to be
locked for writing when the `cass_session_connect*` or
`cass_session_close` futures are returned, but this is not a problem
because closing and connecting are considered to be "long-running",
complex operations and thus are not expected to have conducted a
specific part of their logic by the time their future is returned.

**Results**:
All enabled tests still pass, while the `callbacks` example now passes,
too! The best part is that the number and complexity of the required
changes are minimal, and the code is now much more robust.
I hope @Lorak-mmk will be happy with this solution, as compared to the
complex requests pending mechanism and atomics.

Note that the test for `cass_session_get_client_id` had to be adjusted.
This is because the session has the client ID set only in the connect
future instead of in the connect function synchronously, so the test
(which did not await the connect future) would fail after the changes.

Since commit c1e40d7, the Session
`execute(_batch)` methods clone the Session's Arc, which prevents
UAF if the Session is closed while the requests are still running.

That commit's message says: "we cannot enable `AsyncTests::Close` yet
since it expects that prematurely dropped session awaits all async tasks
before closing". This is now taken care of by the previous commits.
Thus, we can enable the `AsyncTests::Close` test suite.

This tests that `cass_session_free` synchronously waits for all
in-flight requests to complete before freeing the session.
@wprzytula wprzytula force-pushed the fix-session-close branch from afb9683 to 42d7cc4 Compare July 15, 2025 11:56
@wprzytula
Copy link
Collaborator Author

Rebased on master and dropped the temporary commit.

@wprzytula wprzytula requested a review from Lorak-mmk July 15, 2025 11:56
@wprzytula wprzytula merged commit 85d4eb8 into scylladb:master Jul 15, 2025
11 checks passed
@wprzytula wprzytula deleted the fix-session-close branch July 15, 2025 12:25
@wprzytula wprzytula mentioned this pull request Jul 15, 2025
Labels
bug Something isn't working P0 P0 item - absolute must have
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic when cass_future_error_code() is called from a future callback
fix AsyncTests.*_Close (await pending futures when closing the session)
2 participants