
Lorak-mmk
Collaborator

A race can happen when multiple calls with different statements (none of which are in the cache) execute concurrently while the cache is full.

It is possible that all of them will remove the same entry from the cache and then add their own entries. The size of the cache then exceeds self.max_capacity. Because eviction only happens when the size is exactly equal to self.max_capacity ("=="), we will never remove from the cache again.
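The sketch below models this interleaving in a single thread so the failure mode is concrete. It is a model, not the driver's code: a std `HashMap` stands in for the dashmap-backed cache, and `simulate_race`, the `old-`/`new-` keys and the choice of victim are all hypothetical. Every racing miss observes a full cache before anyone mutates it. They all evict the same entry and each inserts its own statement.

```rust
use std::collections::HashMap;

/// Hypothetical, single-threaded model of the racy eviction (not the driver's
/// code): a std `HashMap` stands in for the dashmap-backed cache, and the
/// racy interleaving is replayed step by step so the result is deterministic.
fn simulate_race(max_capacity: usize, racing_misses: usize) -> usize {
    assert!(max_capacity > 0, "the model needs a non-empty full cache");
    // Start with a full cache.
    let mut cache: HashMap<String, u32> = (0..max_capacity)
        .map(|i| (format!("old-{}", i), 0))
        .collect();

    // Step 1: every racing miss checks `len == max_capacity` before anyone
    // has modified the map, so all of them see a full cache and pick the
    // same victim.
    let victim = cache.keys().next().cloned().unwrap();
    let wants_eviction: Vec<bool> = (0..racing_misses)
        .map(|_| cache.len() == max_capacity)
        .collect();

    // Step 2: each miss removes the shared victim (only the first removal
    // actually deletes anything) and inserts its own, distinct statement.
    for (i, evict) in wants_eviction.into_iter().enumerate() {
        if evict {
            cache.remove(&victim);
        }
        cache.insert(format!("new-{}", i), 0);
    }

    cache.len()
}
```

With a capacity of 4 and 3 racing misses, `simulate_race(4, 3)` returns 6. Since 6 is never equal to 4, a "=="-based check would not evict again.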

The first fix I thought of was to change "==" to "<=". With that fix, we would still be able to remove elements from the map after the race happens. Note, however, that we only remove an element when inserting a new one, so this would never shrink the overflow unless 2 cache misses with the same query happen at the same time. So the memory leak would still happen, just much more slowly.
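A single-threaded sketch makes the "<=" point visible. It assumes a std `HashMap` in place of dashmap, a hypothetical `insert_on_miss_le` helper and an arbitrary victim. Every miss on an overflowed cache evicts one entry and inserts one, so the size stays wherever the race left it.

```rust
use std::collections::HashMap;

/// Hypothetical model of a cache-miss insert using the "<=" check
/// (`max_capacity <= len`) followed by a single eviction.
fn insert_on_miss_le(cache: &mut HashMap<String, u32>, max_capacity: usize, key: String) {
    if max_capacity <= cache.len() {
        // Evict one arbitrary entry; the real victim choice may differ.
        if let Some(victim) = cache.keys().next().cloned() {
            cache.remove(&victim);
        }
    }
    cache.insert(key, 0);
}

/// Start from a cache holding `len` entries (e.g. left over after a race)
/// and apply `misses` sequential misses with distinct, new statements.
fn len_after_misses(len: usize, max_capacity: usize, misses: usize) -> usize {
    let mut cache: HashMap<String, u32> =
        (0..len).map(|i| (format!("old-{}", i), 0)).collect();
    for i in 0..misses {
        insert_on_miss_le(&mut cache, max_capacity, format!("new-{}", i));
    }
    cache.len()
}
```

For example, `len_after_misses(6, 4, 100)` returns 6: after 100 misses the cache is still two entries over capacity.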

So, in addition, I changed "if cache full: remove" to "while cache full: remove". In some cases this could evict queries for no reason, hurting performance. There is also a slim possibility of a thread being starved for a while. Neither issue seems very likely, so maybe it is better to risk them than the memory leak? I'm honestly not sure. Maybe we should remove the loop and keep only "<=".
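The "while cache full: remove" variant can be sketched under the same simplifying assumptions: a std `HashMap` stand-in, hypothetical helper names and an arbitrary victim. In this model, any overflow is drained on the next miss.

```rust
use std::collections::HashMap;

/// Hypothetical model of a cache-miss insert that keeps evicting while the
/// cache is at or above capacity, then inserts the new statement.
fn insert_on_miss_loop(cache: &mut HashMap<String, u32>, max_capacity: usize, key: String) {
    while max_capacity <= cache.len() {
        match cache.keys().next().cloned() {
            Some(victim) => {
                cache.remove(&victim);
            }
            // Nothing left to evict (only possible with max_capacity == 0).
            None => break,
        }
    }
    cache.insert(key, 0);
}

/// Start from a cache holding `len` entries and apply `misses` sequential
/// misses with distinct, new statements.
fn len_after_misses_loop(len: usize, max_capacity: usize, misses: usize) -> usize {
    let mut cache: HashMap<String, u32> =
        (0..len).map(|i| (format!("old-{}", i), 0)).collect();
    for i in 0..misses {
        insert_on_miss_loop(&mut cache, max_capacity, format!("new-{}", i));
    }
    cache.len()
}
```

`len_after_misses_loop(6, 4, 1)` returns 4. In this sequential model, the loop removes exactly as many entries as needed. Any extra evictions would come from several threads running the loop at the same time.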

I also considered checking the cache size in the happy path, but len() is not a cheap operation in dashmap: it goes through all shards and sums their lengths. I didn't benchmark it, but I'm afraid it could hurt performance.
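The cost concern about len() can be illustrated with a simplified sharded map. This is a hypothetical model, not dashmap's implementation. The map keeps no global counter, so `len()` has to visit (and here read-lock) every shard, while an insert touches only one shard.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// Simplified, hypothetical sharded concurrent map (not dashmap's code).
struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V> ShardedMap<K, V> {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0);
        ShardedMap {
            shards: (0..shard_count).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    /// Picks the single shard responsible for `key`.
    fn shard_for(&self, key: &K) -> &RwLock<HashMap<K, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        &self.shards[(hasher.finish() as usize) % self.shards.len()]
    }

    /// Locks only one shard.
    fn insert(&self, key: K, value: V) -> Option<V> {
        self.shard_for(&key).write().unwrap().insert(key, value)
    }

    /// Visits and read-locks every shard, so its cost grows with the
    /// number of shards rather than being a single counter read.
    fn len(&self) -> usize {
        self.shards.iter().map(|s| s.read().unwrap().len()).sum()
    }
}
```

Calling something like `len()` on every cache hit would add a pass over all shards to the happy path, which is the overhead the description is worried about.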

Maybe in the future we should research how dedicated cache libraries approach this?

Fixes: #1420

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes. (Not done: writing tests for this race seems quite difficult.)
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

Lorak-mmk self-assigned this on Aug 13, 2025
Lorak-mmk requested review from wprzytula and Copilot on August 13, 2025 19:02

Copilot AI left a comment


Pull Request Overview

This PR fixes a race condition in the CachingSession that could lead to a memory leak. The issue occurs when multiple threads execute cache misses simultaneously on a full cache, potentially causing all threads to remove the same cache entry but add different ones, resulting in cache overflow that would never be cleaned up.

  • Changed the cache size check from equality (==) to less-than-or-equal (<=) to handle overflow conditions
  • Replaced the single cache eviction with a while loop to ensure proper cache size management after race conditions
  • Added detailed comments explaining the race condition and potential trade-offs of the solution


// If we don't have a loop here, then this overflow would never disappear during typical
// operation of caching session.
// The loop has downsides: it could evict more entries than strictly necessary, or starve
// some thread for a bit. If this becomes a problem then maye we should research how

Copilot AI Aug 13, 2025


There's a typo in the comment: 'maye' should be 'maybe'.

Suggested change
// some thread for a bit. If this becomes a problem then maye we should research how
// some thread for a bit. If this becomes a problem then maybe we should research how


github-actions bot added the semver-checks-breaking label (cargo-semver-checks reports that this PR introduces breaking API changes) on Aug 13, 2025

cargo semver-checks detected some API incompatibilities in this PR.
Checked commit: a84ed29

See the following report for details:

cargo semver-checks output
./scripts/semver-checks.sh --baseline-rev 3da897d6e81e785c9be145da47f31248c33a4f30
+ cargo semver-checks -p scylla -p scylla-cql --baseline-rev 3da897d6e81e785c9be145da47f31248c33a4f30
     Cloning 3da897d6e81e785c9be145da47f31248c33a4f30
    Building scylla v1.3.1 (current)
       Built [  35.258s] (current)
     Parsing scylla v1.3.1 (current)
      Parsed [   0.114s] (current)
    Building scylla v1.3.1 (baseline)
error: running cargo-doc on crate 'scylla' failed with output:
-----
   Compiling libc v0.2.175
   Compiling proc-macro2 v1.0.97
   Compiling unicode-ident v1.0.18
   Compiling shlex v1.3.0
   Compiling autocfg v1.5.0
    Checking cfg-if v1.0.1
   Compiling fs_extra v1.3.0
   Compiling dunce v1.0.5
   Compiling num-traits v0.2.19
   Compiling getrandom v0.3.3
   Compiling quote v1.0.40
    Checking pin-project-lite v0.2.16
   Compiling syn v2.0.105
   Compiling pkg-config v0.3.32
   Compiling jobserver v0.1.33
    Checking zeroize v1.8.1
   Compiling vcpkg v0.2.15
   Compiling cc v1.2.32
    Checking once_cell v1.21.3
   Compiling aws-lc-rs v1.13.3
   Compiling ident_case v1.0.1
   Compiling strsim v0.11.1
   Compiling cmake v0.1.54
   Compiling zerocopy v0.8.26
   Compiling fnv v1.0.7
    Checking num-integer v0.1.46
    Checking mio v1.0.4
    Checking socket2 v0.6.0
    Checking bytes v1.10.1
   Compiling aws-lc-sys v0.30.0
   Compiling openssl-sys v0.9.109
    Checking futures-sink v0.3.31
    Checking futures-core v0.3.31
   Compiling libm v0.2.15
    Checking futures-channel v0.3.31
    Checking rand_core v0.9.3
    Checking rustls-pki-types v1.12.0
   Compiling lock_api v0.4.13
   Compiling bigdecimal v0.4.8
   Compiling num-bigint v0.3.3
   Compiling synstructure v0.13.2
   Compiling darling_core v0.20.11
   Compiling snap v1.1.1
    Checking untrusted v0.9.0
    Checking slab v0.4.11
    Checking futures-io v0.3.31
   Compiling thiserror v1.0.69
    Checking powerfmt v0.2.0
    Checking pin-utils v0.1.0
   Compiling openssl v0.10.73
   Compiling thiserror v2.0.14
   Compiling parking_lot_core v0.9.11
   Compiling tokio-macros v2.5.0
   Compiling zerofrom-derive v0.1.6
    Checking tokio v1.47.1
   Compiling darling_macro v0.20.11
   Compiling futures-macro v0.3.31
    Checking memchr v2.7.5
   Compiling rustls v0.23.31
   Compiling crossbeam-utils v0.8.21
    Checking foreign-types-shared v0.1.1
    Checking futures-task v0.3.31
    Checking futures-util v0.3.31
    Checking foreign-types v0.3.2
   Compiling darling v0.20.11
    Checking zerofrom v0.1.6
   Compiling yoke-derive v0.8.0
   Compiling thiserror-impl v2.0.14
   Compiling thiserror-impl v1.0.69
   Compiling openssl-macros v0.1.1
    Checking deranged v0.4.0
    Checking ppv-lite86 v0.2.21
    Checking num-bigint v0.4.6
    Checking iana-time-zone v0.1.63
    Checking either v1.15.0
    Checking twox-hash v2.1.1
    Checking scopeguard v1.2.0
    Checking log v0.4.27
    Checking stable_deref_trait v1.2.0
    Checking time-core v0.1.4
   Compiling tokio-openssl v0.6.5
    Checking bitflags v2.9.1
    Checking subtle v2.6.1
    Checking num-conv v0.1.0
    Checking smallvec v1.15.1
    Checking time v0.3.41
    Checking yoke v0.8.0
    Checking itertools v0.14.0
    Checking lz4_flex v0.11.5
    Checking chrono v0.4.41
    Checking rand_chacha v0.9.0
    Checking futures-executor v0.3.31
   Compiling scylla-macros v1.3.1 (/home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-3da897d6e81e785c9be145da47f31248c33a4f30/c3ed706bfaabbfdac8a0ac777c681ed8709c2256/scylla-macros)
   Compiling tracing-attributes v0.1.30
    Checking uuid v1.18.0
    Checking tracing-core v0.1.34
    Checking secrecy v0.8.0
    Checking hashbrown v0.14.5
    Checking foldhash v0.1.5
    Checking equivalent v1.0.2
    Checking byteorder v1.5.0
    Checking allocator-api2 v0.2.21
    Checking tracing v0.1.41
    Checking dashmap v6.1.0
    Checking hashbrown v0.15.5
    Checking scylla-cql v1.3.1 (/home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-3da897d6e81e785c9be145da47f31248c33a4f30/c3ed706bfaabbfdac8a0ac777c681ed8709c2256/scylla-cql)
    Checking futures v0.3.31
    Checking histogram v0.11.3
    Checking rand v0.9.2
   Compiling async-trait v0.1.88
    Checking rand_pcg v0.9.0
    Checking socket2 v0.5.10
    Checking arc-swap v1.7.1
    Checking rustls-webpki v0.103.4
    Checking tokio-rustls v0.26.2
 Documenting scylla v1.3.1 (/home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-3da897d6e81e785c9be145da47f31248c33a4f30/c3ed706bfaabbfdac8a0ac777c681ed8709c2256/scylla)
error: couldn't read `/home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-3da897d6e81e785c9be145da47f31248c33a4f30/c3ed706bfaabbfdac8a0ac777c681ed8709c2256/scylla/src/deserialize/README.md`: No such file or directory (os error 2)
   --> /home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-3da897d6e81e785c9be145da47f31248c33a4f30/c3ed706bfaabbfdac8a0ac777c681ed8709c2256/scylla/src/lib.rs:203:14
    |
203 |     #![doc = include_str!("deserialize/README.md")]
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error: could not document `scylla`

-----

error: failed to build rustdoc for crate scylla v1.3.1
note: this is usually due to a compilation error in the crate,
      and is unlikely to be a bug in cargo-semver-checks
note: the following command can be used to reproduce the compilation error:
      cargo new --lib example &&
          cd example &&
          echo '[workspace]' >> Cargo.toml &&
          cargo add --path /home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-3da897d6e81e785c9be145da47f31248c33a4f30/c3ed706bfaabbfdac8a0ac777c681ed8709c2256/scylla --features bigdecimal-04,chrono-04,default,full-serialization,metrics,num-bigint-03,num-bigint-04,openssl-010,rustls-023,secrecy-08,time-03 &&
          cargo check

    Building scylla-cql v1.3.1 (current)
       Built [  10.773s] (current)
     Parsing scylla-cql v1.3.1 (current)
      Parsed [   0.039s] (current)
    Building scylla-cql v1.3.1 (baseline)
       Built [  10.601s] (baseline)
     Parsing scylla-cql v1.3.1 (baseline)
      Parsed [   0.038s] (baseline)
    Checking scylla-cql v1.3.1 -> v1.3.1 (no change; assume patch)
     Checked [   0.342s] 165 checks: 165 pass, 13 skip
     Summary no semver update required
    Finished [  22.578s] scylla-cql
error: aborting due to failure to build rustdoc for crate scylla v1.3.1

Stack backtrace:
   0: anyhow::error::<impl anyhow::Error>::msg
   1: anyhow::__private::format_err
   2: cargo_semver_checks::data_generation::generate::generate_rustdoc
   3: cargo_semver_checks::data_generation::request::CrateDataRequest::resolve
   4: cargo_semver_checks::rustdoc_gen::StatefulRustdocGenerator<cargo_semver_checks::rustdoc_gen::ReadyState>::load_rustdoc
   5: cargo_semver_checks::Check::check_release
   6: cargo_semver_checks::exit_on_error
   7: cargo_semver_checks::main
   8: std::sys::backtrace::__rust_begin_short_backtrace
   9: main
make: *** [Makefile:73: semver-rev] Error 1

Collaborator

dkropachev commented Aug 13, 2025

It's probably worth emitting a warning when the cache is full.
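One way to act on this suggestion without flooding the logs is to warn only the first time the cache is found full. The sketch below assumes that design. `FullCacheWarning` and `maybe_warn` are hypothetical names. `eprintln!` keeps the sketch dependency-free, whereas the driver would more likely use its logging facade (`tracing` appears in the dependency list above).

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical one-shot warning for a full statement cache.
struct FullCacheWarning {
    warned: AtomicBool,
}

impl FullCacheWarning {
    const fn new() -> Self {
        FullCacheWarning { warned: AtomicBool::new(false) }
    }

    /// Returns true if this call emitted the warning.
    fn maybe_warn(&self, len: usize, max_capacity: usize) -> bool {
        if len < max_capacity {
            return false;
        }
        // `swap` returns the previous value atomically, so only the first
        // caller that sees a full cache gets `false` back and prints.
        if !self.warned.swap(true, Ordering::Relaxed) {
            eprintln!(
                "warning: statement cache is full ({}/{}); consider a larger max_capacity",
                len, max_capacity
            );
            true
        } else {
            false
        }
    }
}
```

Warning once keeps the signal without turning every eviction on a busy, full cache into a log line. A rate-limited warning would be an alternative trade-off.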

Labels
semver-checks-breaking cargo-semver-checks reports that this PR introduces breaking API changes
Development

Successfully merging this pull request may close these issues.

Potential race condition in CachingSession
3 participants