-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New ScyllaDB key space partitioning #4049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New ScyllaDB key space partitioning #4049
Conversation
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
e2afc91 to
72b1567
Compare
768208f to
0b093e3
Compare
72b1567 to
7cfc4d3
Compare
7cfc4d3 to
0b8cf01
Compare
09d6c81 to
a8361a5
Compare
0b8cf01 to
1f38372
Compare
1f38372 to
07ca203
Compare
07ca203 to
314f92f
Compare
a8361a5 to
0ff7949
Compare
0ff7949 to
2bd4ab3
Compare
314f92f to
63fd134
Compare
2bd4ab3 to
a05e631
Compare
63fd134 to
ff4a945
Compare
| let config = ScyllaDbStore::new_test_config().await?; | ||
| let namespace = generate_test_namespace(); | ||
| let store = ScyllaDbStore::recreate_and_connect(&config, &namespace).await?; | ||
| // TODO(#4065): Remove this once we enforce exclusive mode for views. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we're going to remove the line you added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you’re right, I guess the tests lacking these calls are an orthogonal issue. I’ll remove the comment tomorrow
d9c9974 to
d04a148
Compare
b72852f to
c8b5e29
Compare
d04a148 to
d33ab0d
Compare
c8b5e29 to
36cfe7b
Compare
36cfe7b to
cef2f5d
Compare
d33ab0d to
1cf50f7
Compare
1cf50f7 to
5161f03
Compare
cef2f5d to
aca7d91
Compare
5161f03 to
a419c54
Compare
aca7d91 to
9367ee5
Compare
a419c54 to
0428ee1
Compare
ma2bd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
back to your queue as we iterate on the design
0428ee1 to
496ba58
Compare
a5ebfeb to
0a80832
Compare
496ba58 to
0fb57f0
Compare
0a80832 to
7be4b46
Compare
0fb57f0 to
f1df5d1
Compare
|
Closing this, as we went with a simpler approach |

Motivation
ScyllaDB hashes each partition key into a token. Each token range gets assigned to a shard. Each shard is pinned to a specific CPU by default. This means that having very few partitions is bad for ScyllaDB's performance.
Right now in the current scheme of things we have just one partition key, which is the
root_key. For views, which is our mutable data in the DB,root_keywill be set, and we'll be in "exclusive mode". For everything else (Certificates,ConfirmedBlocks,Blobs, etc), everything is on the same partition. For example when we're doing 1M TPS, we'll have several thousands of blocks being created every second, as well as certificates. This current scheme won't scale for those numbers.Proposal
New schema proposal: instead of having a
root_key, we'll have apartition_keyinstead. Thispartition_keywill have two modes: exclusive mode (mutable data) and non exclusive mode (immutable data). The former will work exactly how the previousroot_keyschema worked.For exclusive mode, the
partition_key's first byte will be0, indicating the mode. The rest of the bytes will be theroot_keythat was provided when callingopen_exclusive. Everything else should work as it previously did.For non exclusive mode, the
partition_key's first byte will be1. The rest of the partition key will be a prefix of the key of a predetermined length (through a configuration parameter). This mode we'll have some reservations:partition_key's prefix size)read_multi_values_internalandcontains_keys_internalwe currently group the keys bypartition_keyand execute one query perpartition_keyin parallel. This is done to keep the queries token aware across partitionsfind_keys_by_prefix_internalandfind_key_values_by_prefix_internal, if the prefix is smaller than thepartition_key's prefix size, we'll do a full table scan. This happens infrequently enough that we're willing to take the performance hit.Test Plan
CI + will benchmark this to check performance. Some tests had to be altered as they didn't respect the invariant that if we're using a
root_key, we should be in exclusive mode.Follow ups
There are several follow ups here:
CertificateandConfirmedBlockBaseKeys closer together, as well asBlobandBlobStatefind_keys_by_prefix_internalfor prefix deletes, and take the perf hit of the full table scan, as these seem to be currently infrequentViews (maybe onload) that they must always be on exclusive modeWritableKeyValueStoreshould havewrite_valueandwrite_multi_valuesmethods, analogous toReadableKeyValueStore. Batches should be used when atomicity is wanted, or when we want to save network requests to the DBRelease Plan