Create TTL map with time wheel architecture #92
Conversation
Force-pushed from 1988a4e to fe7cfe8
```rust
impl<K, V> TTLMap<K, V>
where
    K: Eq + Hash + Send + Sync + Clone + 'static,
    V: Default + Clone + Send + Sync + 'static,
```
I can explain why `V` is `Clone` below.
```rust
        value
    }
    Entry::Occupied(entry) => entry.get().clone(),
};
```
This is tricky.
When you hold an Entry, you are locking the map, and the dashmap docs specifically say: "**Locking behaviour:** May deadlock if called when holding any sort of reference into the map".
I believe holding the lock over any .await point may deadlock because another task may be scheduled and try to access the same entry (Tbh I think this may happen if the other task accesses a different entry on the same shard).
Alternatives considered:
- we return a reference to V. This is risky because the caller might hold it across an .await and deadlock
- we pass a closure which operates on V while we hold the lock. This is not ideal because it increases lock contention
Thus, I went with Clone. If the caller wants a shared reference, they can wrap their type in an Arc.
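For illustration, here's a minimal sketch of the clone-on-read pattern described above (the function name and signature are hypothetical, not the exact code in this PR):

```rust
use std::hash::Hash;

use dashmap::mapref::entry::Entry;
use dashmap::DashMap;

/// Hypothetical helper: get the value for `key`, inserting a default if absent,
/// and return it by value so no shard lock escapes this function.
fn get_or_default<K, V>(map: &DashMap<K, V>, key: K) -> V
where
    K: Eq + Hash,
    V: Default + Clone,
{
    // The Entry guard (a shard lock) lives only inside this match and is
    // dropped before we return, so the caller can freely `.await` afterwards
    // without holding any reference into the map.
    match map.entry(key) {
        Entry::Vacant(entry) => {
            let value = V::default();
            entry.insert(value.clone());
            value
        }
        Entry::Occupied(entry) => entry.get().clone(),
    }
}
```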
I think requiring Clone for the values is a fair ask, and it's completely fine for our use case 👍
Force-pushed from fe7cfe8 to 3e4dd65
Force-pushed from 3e4dd65 to ad74bb8
This change adds a DashMap-like struct which has a background task to clean up entries that have outlived a configurable TTL.

This struct is similar to https://github.com/moka-rs/moka, which also uses time wheels. Having our own module avoids introducing a large dependency, which keeps this project closer to vanilla DataFusion.

This change is meant to be useful for #89, where it's possible for `ExecutionStages` to be orphaned in `ArrowFlightEndpoint`. We need an async task to clean up old entries.

Informs: #90
Force-pushed from ad74bb8 to 4346b69
```rust
/*
TTLMap is a DashMap that automatically removes entries after a specified time-to-live (TTL).

How the Time Wheel Works

Time Buckets: [0] [1] [2] [3] [4] [5] [6] [7] ...
```
In Rust, module level docs are defined with //!. That way, documentation compilers and IDEs can interpret them as pieces of documentation and display them appropriately.
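For example, the block comment above could be expressed as module-level docs, roughly like this (a sketch, not the exact wording in the PR):

```rust
//! TTLMap is a DashMap that automatically removes entries after a specified
//! time-to-live (TTL).
//!
//! # How the Time Wheel Works
//!
//! Time Buckets: [0] [1] [2] [3] [4] [5] [6] [7] ...
```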
```rust
use std::time::Duration;
use tokio::sync::Mutex;

// TTLMap is a key-value store that automatically removes entries after a specified time-to-live.
```
This should probably be a doc comment with a triple ///, otherwise IDEs will ignore it while hovering over the struct
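Something like this (a sketch; the real fields are elided and `PhantomData` just keeps the snippet compiling):

```rust
use std::marker::PhantomData;

/// TTLMap is a key-value store that automatically removes entries after a
/// specified time-to-live.
pub struct TTLMap<K, V> {
    // Real fields elided in this sketch.
    _marker: PhantomData<(K, V)>,
}
```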
```rust
        value
    }
    Entry::Occupied(entry) => entry.get().clone(),
};
```
I think requiring Clone for the values is a fair ask, and it's completely fine for our use case 👍
```rust
{
    let mut buckets = self.buckets.lock().await;
```
🤔 this looks like it's introducing a global lock shared by all the new requests. This means that all new requests will be competing for the same lock, and it's the main thing the dashmap tried to address in the first place.
I think we should be capable of finding a solution that does not imply a global lock. A hint I know is: spawning tokio tasks is almost free. What to do with that hint? no idea yet
Thanks for the review!
I've attempted an alternative implementation here, using your hint: https://github.com/datafusion-contrib/datafusion-distributed/pull/96/files#diff-835dd4f31132680b017b402166e24979a012bb5f1975c95c34de467706b964d2
- Each shard in the time wheel has its own task and `KeySet` and a clone of the `Arc<DashMap>`. Each task is responsible for managing some subset of keys in the map. This avoids the global mutex over all the `KeySet`s.
- Now, when we insert a key into the `TTLMap`, we send the key on a channel to the task which is responsible for its shard.
- There's one task responsible for issuing Clear commands (also over the channel) to the time wheel tasks.
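Roughly, each per-shard task could look like this (a simplified sketch of the design described above; `ShardCommand` and `spawn_shard_task` are illustrative names, not the actual code in #96):

```rust
use std::collections::HashSet;
use std::hash::Hash;
use std::sync::Arc;

use dashmap::DashMap;
use tokio::sync::mpsc;

// Illustrative command type; the real implementation in #96 may differ.
enum ShardCommand<K> {
    Track(K), // a key was inserted and belongs to this shard
    Clear,    // the GC task says this shard's bucket has expired
}

fn spawn_shard_task<K, V>(map: Arc<DashMap<K, V>>, mut rx: mpsc::Receiver<ShardCommand<K>>)
where
    K: Eq + Hash + Send + Sync + 'static,
    V: Send + Sync + 'static,
{
    tokio::spawn(async move {
        // Each task owns its own key set, so there is no global mutex.
        let mut keys: HashSet<K> = HashSet::new();
        while let Some(cmd) = rx.recv().await {
            match cmd {
                ShardCommand::Track(key) => {
                    keys.insert(key);
                }
                ShardCommand::Clear => {
                    for key in keys.drain() {
                        map.remove(&key);
                    }
                }
            }
        }
    });
}
```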
I pushed a benchmark commit to this branch (and the same one to the other branch above) to see if I improved lock contention. Unfortunately, the results look similar regardless of what I do.
There's a lot of variance, but generally I see 17,531,721 – 20,000,000 ops/sec.
**This PR**

```
Tasks: 100
Operations per task: 1000000
Total operations: 100000000
Total time: 5.70s
Throughput: 17531721 ops/sec
Average latency: 0.06 μs per operation
Entries remaining: 10
test common::ttl_map::tests::bench_lock_contention ... ok
```
**Other PR with no mutex**

```
=== TTLMap Lock Contention Benchmark ===
Tasks: 100
Operations per task: 1000000
Total operations: 100000000
Total time: 6.03s
Throughput: 16582541 ops/sec
Average latency: 0.06 μs per operation
Entries remaining: 10
test common::ttl_map::tests::bench_lock_contention ... ok
```
If I comment out all locking and the GC task, the baseline performance is at most 20M ops/sec as well.

```
=== TTLMap Lock Contention Benchmark ===
Tasks: 100
Operations per task: 1000000
Total operations: 100000000
Total time: 4.93s
Throughput: 20291279 ops/sec
Average latency: 0.05 μs per operation
Entries remaining: 10
test common::ttl_map::tests::bench_lock_contention ... ok
```
I'm not sure how to proceed. Either the benchmark needs to be better or locking the time wheel is not the bottleneck. The latter is possible because:

- The critical sections under `self.buckets.lock().await` are very small (a constant number of instructions).
- The dashmap mutex (which we hold while initializing entries) is probably a bigger source of contention because we hold an Entry's lock while initializing it.
  - This is okay because the first request initializes the key, so the other ones have to wait anyway and can read the cached `Value` once it's done.
  - This might be bad if the shards in the DashMap are large, introducing cross-key contention. Generally, this is an okay tradeoff and we defer this decision to the DashMap.
Please let me know your thoughts!
It's going to be a bit hard to create an accurate benchmark for this, but I think it's worth getting that right in order to compare approaches. I think there are some improvement points to the benchmark in order to make it more realistic:

- The `worker_threads` in the test should be as high as your machine allows you; with just the current `worker_threads = 2`, the chances of just two physical threads competing concurrently for the same lock are slim.
- `key_count` should probably be much higher, as in reality almost all keys are going to be different. I think even making all the keys random could work.
- `task_count` should be several orders of magnitude higher, something like (number of concurrent queries) * (number of stages per query) * (number of tasks per stage) / (number of available workers); in practice, this can easily go into the order of millions.
- In practice, each task is handled in its own `do_get` call, and tonic will spawn a tokio task for it, so it would be more realistic for each task in the bench to only perform 1 `get_or_init`.
- I'd remove any source of noise like string allocations (`format!`); I'd try to just benchmark a pure `get_or_init()` access.
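A rough shape of such a benchmark (a sketch under the suggestions above; the `TTLMap` constructor, the exact `get_or_init` signature, and the `rand` dev-dependency are assumptions, not the actual API in this PR):

```rust
use std::sync::Arc;
use std::time::Instant;

use super::TTLMap; // assuming this test lives next to the map in common::ttl_map

#[tokio::test(flavor = "multi_thread")] // defaults to one worker thread per core
async fn bench_realistic_contention() {
    // Assumed constructor; the real TTLMap takes a TTL/bucket configuration.
    let map = Arc::new(TTLMap::<u64, u64>::default());
    let task_count = 1_000_000;

    let start = Instant::now();
    let mut handles = Vec::with_capacity(task_count);
    for _ in 0..task_count {
        let map = Arc::clone(&map);
        // One tokio task per access, mirroring one `do_get` call per task,
        // with random keys and no string allocations.
        handles.push(tokio::spawn(async move {
            let key = rand::random::<u64>();
            map.get_or_init(key).await;
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
    println!("Total time: {:?}", start.elapsed());
}
```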
here's a draft approach #97
The result of running that is:

```
=== TTLMap Lock Contention Benchmark ===
Tasks: 100000
Total time: 88.52ms
Average latency: 425 μs per operation
Entries remaining: 0
DashMap Lock contention time: 12ms
Mutex Lock contention time: 42503ms
```
@gabotechs comments are on point and +1 to them. I will add that in addition to handling orphaned stages, we also need to handle removing the stage when we know that it is done being used. This way we do not have to wait for a timeout during normal operation. I suggest we do that in a separate PR, but wanted to note it here.
@robtandy Is there a way for the endpoint to know how many times it may be called? Ex. how many actors will read from it. I'm thinking we can evict a stage once it's called N times, however this won't account for retries.
@jayshrivastava , the
If this was a JS lib that needs to be bundled in a browser I think that would make a lot of sense, but being a Rust library where everything is going to get eventually compiled... I don't think it makes a difference unless there's a significant impact in compilation time or something perceptible by a user. Introducing dependencies should not make the project closer or farther as long as it's not perceptible from the outside. I think it could be worth exploring https://github.com/moka-rs/moka at least for getting a baseline for our benchmarks.
From the new bench, it looks like the method in #96 is significantly faster. Next steps

Mutex (extremely high variance btw, sometimes it's 15000 μs, sometimes it's 30000 μs)
Non-Mutex (#96)
Got Claude to use moka (#99), but it's not as fast as #96 👀 I think this is expected because moka has to do more accounting, like using LRU vs LFU. Using moka means we have less code to maintain, if that's a concern. We probably don't even need a wrapper.
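For reference, using moka directly without any wrapper could look roughly like this (a sketch; the TTL value and the key/value types are placeholders, not what #99 uses):

```rust
use std::time::Duration;

use moka::future::Cache;

async fn example() {
    // Entries expire a fixed duration after insertion.
    let cache: Cache<u64, String> = Cache::builder()
        .time_to_live(Duration::from_secs(30))
        .build();

    // `get_with` initializes the entry with the given future if the key is
    // absent, otherwise it returns a clone of the cached value.
    let value = cache.get_with(1, async { "stage state".to_string() }).await;
    assert_eq!(value, "stage state");
}
```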
This change adds a DashMap-like struct which has a background task to clean up entries that have outlived a configurable TTL. This data structure is added so it can be used in the `ArrowFlightEndpoint`, where it's possible for `ExecutionStages` to be orphaned due to errors. This change adds the ability to clean up tasks async.

The implementation is similar to https://github.com/moka-rs/moka, which also uses time wheels. Benchmarks show that this implementation has less overhead per operation than moka (see results #99).

```
=== TTLMap Moka Benchmark ===
Tasks: 100000
Total time: 162.53ms
Average latency: 45 μs per operation
Throughput: 615257.30 ops/sec

=== TTLMap Lock Contention Benchmark ===
Tasks: 100000
Total time: 137.07ms
Average latency: 985 ns per operation
Entries remaining: 0
DashMap Lock contention time: 21ms
Accounting time: 47ms
```

There's also an implementation in #92, which has the worst performance by far.

```
Tasks: 100000
Total time: 105.65ms
Average latency: 20453 μs per operation
Entries remaining: 0
DashMap Lock contention time: 23ms
Mutex Lock contention time: 2045251ms
```

Informs: #90
I've cleaned up #96 to make it production quality. If it makes sense, we can close this PR in favor of that one.
I'll address this in a separate PR and leave #90 open until that is completed.
Closing in favor of #96