
Commit 56b6b0d

Use a new schema for the data storage in Linera. (#4814)
## Motivation

Databases commonly use a `partition_key`, which corresponds to `root_key` in our code. The partition key is hashed in order to spread the workload over nodes. An unfortunate feature of the existing schema is that blobs, blob states, events, and certificates all live in the same partition (the one corresponding to `&[]`). This causes performance problems: a common recommendation for schema design is to spread the partition key so that no single bin receives too much data.

Fixes #4807

## Proposal

The following proposal is implemented:

* For all base keys except `Event`, the root key is determined from the serialization of the key itself.
* For events, we want to access several events at once, so the root key is serialized from only the `ChainId` and `StreamId`.

This led to the introduction of a `fn root_key(&self)` on the `BaseKey` type (see the first sketch after this description). The function does not return errors, since serializing the types in question (`BlobId`, `CryptoHash`, `ChainId`) cannot fail.

Why this change is the right one:

* Databases place no limit on the number of partition keys. On the other hand, there is a limit on the amount of data stored under a single partition key, so concentrating all data on one partition key creates potential problems above 100 MB and may fail completely at 2 GB.
* We already form a root key from the `ChainId` for the application states, so we already accept having very many partition keys.

The `Batch` of `linera-storage` is replaced by a `MultiPartitionBatch`. It is unfortunate that we had the name collision with the `Batch` of `linera-views`.

This PR does the requested job of changing only `linera-storage`. However, it loses some parallelization in the `read_multi_values` / `contains_keys` operations. This is not irremediable (see the second sketch after this description):

* We can add a function `read_multi_root_values(_, root_keys: Vec<Vec<u8>>, key: Vec<u8>)` to `KeyValueDatabase`. It is possible to implement this efficiently in ScyllaDB, which is our main database target.
* We can add a `write_multi_partition_batch` to `KeyValueDatabase`. Note that the existing `write_batch` in `db_storage.rs` creates many futures; the right solution is likely to group the entries per partition. Of course, batch size is an issue, but it has to be addressed by measuring it, not by spreading entries over all partitions.
* It is somewhat unclear how these features could be implemented in combinators like `LruCaching`, `ValueSplitting`, and so on.

## Test Plan

The CI.

## Release Plan

Hopefully, to merge this into `main`. It is possible to write a migration tool that takes the existing storage of TestNet Conway and converts it to the new schema, but only if we really want to do that. Before that, it would be good to see whether scalability works as expected in ScyllaDB runs.

## Links

None.
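As a rough illustration of the scheme above, here is a minimal Rust sketch of what `BaseKey::root_key` could look like. The variant list, the field types, and the use of `bcs` serialization are assumptions made for illustration; the real definitions live in `linera-storage` / `linera-base` and may differ.

```rust
use serde::Serialize;

// Hypothetical stand-ins for the real Linera types.
#[derive(Serialize)]
struct BlobId([u8; 32]);
#[derive(Serialize)]
struct CryptoHash([u8; 32]);
#[derive(Serialize)]
struct ChainId([u8; 32]);
#[derive(Serialize)]
struct StreamId(Vec<u8>);

// Assumed shape of `BaseKey`; the real enum has more variants.
#[derive(Serialize)]
enum BaseKey {
    Blob(BlobId),
    BlobState(BlobId),
    Certificate(CryptoHash),
    // Assumed to carry an event index in addition to the stream coordinates.
    Event(ChainId, StreamId, u64),
}

impl BaseKey {
    /// The partition (root) key of this base key. Serializing `BlobId`,
    /// `CryptoHash`, and `ChainId` cannot fail, so no `Result` is needed.
    fn root_key(&self) -> Vec<u8> {
        match self {
            // All events of one (ChainId, StreamId) share a partition so
            // that several of them can be read in a single request.
            BaseKey::Event(chain_id, stream_id, _index) => {
                bcs::to_bytes(&(chain_id, stream_id)).expect("serialization cannot fail")
            }
            // Every other base key becomes its own partition: the full
            // serialization spreads the data over many partition keys.
            _ => bcs::to_bytes(self).expect("serialization cannot fail"),
        }
    }
}
```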
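And a hedged sketch of the follow-up APIs discussed in the proposal. The trait name, the exact signatures, and the shape of `MultiPartitionBatch` are assumptions; only the function names `read_multi_root_values` and `write_multi_partition_batch` come from this description.

```rust
use std::collections::BTreeMap;

/// Assumed shape of `MultiPartitionBatch`: write operations grouped by the
/// partition (root) key they touch, so a backend can issue one grouped
/// write per partition instead of one future per entry.
pub struct MultiPartitionBatch {
    pub operations: BTreeMap<Vec<u8>, Vec<WriteOperation>>,
}

pub enum WriteOperation {
    Put { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
}

/// Hypothetical extension of `KeyValueDatabase` restoring the lost
/// parallelism; not the actual linera-views trait.
#[async_trait::async_trait]
pub trait KeyValueDatabaseExt {
    type Error;

    /// Read the same `key` under several partition keys in one round trip.
    /// ScyllaDB could serve this with one query spanning the partitions.
    async fn read_multi_root_values(
        &self,
        root_keys: Vec<Vec<u8>>,
        key: Vec<u8>,
    ) -> Result<Vec<Option<Vec<u8>>>, Self::Error>;

    /// Write a batch spanning several partitions, grouped per partition.
    async fn write_multi_partition_batch(
        &self,
        batch: MultiPartitionBatch,
    ) -> Result<(), Self::Error>;
}
```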
1 parent: df074a1 · commit: 56b6b0d

1 file changed: +187 −179 lines

