# Storing the Cardano ledger state on disk: integration notes for a high-performance backend

Authors: Joris Dral, Wolfgang Jeltsch
Date: May 2025

## Sessions

Creating new empty tables or opening tables from snapshots requires a `Session`.
The session can be created using `openSession`, which has to be done in the
consensus layer. The session should be shared between all tables. Sharing
between a table and its duplicates, which are created using `duplicate`, is
automatic. Once the session is created, it can be stored in the `LedgerDB`.
When the `LedgerDB` is closed, all tables and the session should be closed.
Closing the session automatically closes all tables, but this is only
intended as backup functionality: ideally, the user closes all tables
manually.

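The intended ownership pattern can be sketched as follows. This is a minimal sketch with placeholder stubs: the `Session` and `Table` types, the operations on them, and `withLedgerTables` stand in for the real lsm-tree API and the consensus-layer code, which may differ in names and signatures.

```haskell
-- Minimal sketch of the intended ownership pattern. 'Session' and
-- 'Table' are placeholder stubs; in real code they come from the
-- lsm-tree API ('openSession', 'closeSession', table operations).
import Control.Exception (bracket)
import Data.IORef (IORef, newIORef, writeIORef)

newtype Session = Session (IORef Bool)  -- True while the session is open
newtype Table   = Table Session         -- each table references the shared session

openSession :: FilePath -> IO Session
openSession _dir = Session <$> newIORef True

closeSession :: Session -> IO ()
closeSession (Session open) = writeIORef open False

newTable :: Session -> IO Table
newTable = pure . Table

closeTable :: Table -> IO ()
closeTable _ = pure ()

-- One session for the whole lifetime of the LedgerDB: tables are closed
-- explicitly first; closing the session is only the backstop.
withLedgerTables :: FilePath -> (Session -> [Table] -> IO a) -> IO a
withLedgerTables dir action =
  bracket (openSession dir) closeSession $ \session -> do
    tables <- mapM (\_ -> newTable session) [1 :: Int, 2]
    result <- action session tables
    mapM_ closeTable tables  -- close all tables before the session
    pure result
```

The `bracket` ensures the session is closed even if the action throws, matching the "closing the session is only a backup" rule above.
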
## The compact index

The compact index is a memory-efficient data structure that maintains serialised
keys. Rather than storing full keys, it only stores the first 64 bits of each
key.

The compact index only works properly if, in most cases, it can determine the
order of two serialised keys by looking at their 64-bit prefixes. This is the
case, for example, when the keys are hashes: the probability that two hashes
have the same 64-bit prefix is $2^{-64}$ and thus very small. If the
hashes are 256 bits in size, then the compact index uses four times less memory
than storing the full keys would.

There is a backup mechanism in place for the case when the 64-bit prefixes of
keys are not sufficient to make a comparison. This backup mechanism is less
memory-efficient and less performant. That said, if the probability of prefix
clashes is very small, as in the example above, then in practice the backup
mechanism will never be used.

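As an illustration of the idea (not the actual lsm-tree internals), a comparison that consults the full keys only when the 64-bit prefixes tie could look like this, assuming keys are serialised as byte strings:

```haskell
-- Illustration only, not lsm-tree internals: order serialised keys by
-- their 64-bit prefixes, falling back to a full comparison on a tie.
import qualified Data.ByteString as BS

prefix64 :: BS.ByteString -> BS.ByteString
prefix64 = BS.take 8  -- the first 64 bits of a serialised key

compareKeys :: BS.ByteString -> BS.ByteString -> Ordering
compareKeys k1 k2 =
  case compare (prefix64 k1) (prefix64 k2) of
    EQ    -> compare k1 k2  -- rare backup path: compare the full keys
    order -> order          -- common path: the prefixes already decide
```

In the common case only the 8-byte prefixes are inspected; the expensive full-key comparison runs only on a prefix clash.
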
UTXO keys are *almost* uniformly distributed. Each UTXO key consists of a
32-byte hash and a 2-byte index. While the distribution of hashes is uniform,
the distribution of indexes is not, as indexes are counters that always start
at 0. A typical transaction has two inputs and two outputs and thus requires
storing two UTXO keys that have the same hash part, albeit not the same index
part. If we serialise UTXO keys naively, putting the hash part before the index
part, then the 64-bit prefixes will often not be sufficient to make comparisons
between keys. As a result, the backup mechanism will kick in far too often,
which will severely hamper performance.

The solution is to change the serialisation of UTXO keys such that the first
64 bits of a serialised key comprise the 2-byte index and just 48 bits of the
hash. This way, comparisons of keys with equal hashes will succeed, as the
indexes will be taken into account. On the other hand, it becomes more likely
that the covered bits of hashes are not enough to distinguish between different
hashes, but the probability of this should still be so low that the backup
mechanism will not kick in in practice.

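One possible layout, shown purely as an illustration (the exact byte order chosen in the consensus layer may differ, and `serialiseUtxoKey` is a hypothetical name): place 6 bytes of the hash and the big-endian 2-byte index in the first 8 bytes, followed by the remaining hash bytes.

```haskell
-- Illustrative layout (the actual consensus-layer encoding may differ):
-- the first 8 bytes are 6 hash bytes plus the 2-byte index, so keys that
-- share a hash already differ within their 64-bit prefixes.
import qualified Data.ByteString as BS
import Data.Word (Word16)

serialiseUtxoKey :: BS.ByteString  -- ^ 32-byte transaction hash
                 -> Word16         -- ^ 2-byte output index
                 -> BS.ByteString
serialiseUtxoKey hash ix =
  BS.concat
    [ BS.take 6 hash                        -- 48 bits of the hash
    , BS.pack [ fromIntegral (ix `div` 256)
              , fromIntegral (ix `mod` 256) -- index, big-endian
              ]
    , BS.drop 6 hash                        -- the remaining 26 hash bytes
    ]
```

With this layout, two keys that share a hash differ in bytes 6 and 7, so their 64-bit prefixes are enough to order them.
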
Importantly, range lookups and cursor reads return key–value pairs in the order
of their *serialised* keys. With the described change to UTXO key serialisation,
the ordering of serialised keys no longer matches the ordering of actual,
unserialised keys. This is fine for `lsm-tree`, for which any total ordering of
keys is as good as any other. However, the consensus layer will face situations
where a range lookup or a cursor read returns key–value pairs slightly out of
order. Currently, we do not expect this to cause problems.

## Snapshots

Snapshots currently require support for hard links. This means that on Windows
the library only works when using NTFS. Support for other file systems could be
added by providing an alternative snapshotting method, but such a method would
likely involve copying file contents, which is slower than hard-linking.

Creating a snapshot outside the session directory while still using hard links
should be possible as long as the snapshot directory is on the same disk volume
as the session directory, but this feature is currently not implemented.
Hard-linking across different volumes is generally not possible; therefore,
placing a snapshot on a volume that does not hold the associated session
directory requires a different snapshotting implementation, which would probably
also rely on copying file contents.

A copying snapshotting implementation would thus kill two birds with one stone,
removing both of the limitations just discussed.

## Value resolving

When instantiating the `ResolveValue` class, it is usually advisable to
implement `resolveValue` such that it works directly on the serialised values.
This is typically cheaper than having `resolveValue` deserialise the values,
compose them, and then serialise the result. For example, when the resolve
function is intended to work like `(+)`, `resolveValue` could add the raw bytes
of the serialised values and would likely achieve better performance this way.

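As a sketch of this idea (not the lsm-tree API itself, and with an assumed encoding): if values are 8-byte little-endian counters, their serialised representations can be added byte by byte with a carry, avoiding a decode/encode round trip. The function name `addSerialised` is hypothetical.

```haskell
-- Sketch only: adds two serialised little-endian Word64 counters on
-- their raw bytes, carrying between byte positions, so no full
-- deserialise/serialise round trip is needed.
import qualified Data.ByteString as BS
import Data.Word (Word8)

addSerialised :: BS.ByteString -> BS.ByteString -> BS.ByteString
addSerialised v1 v2 = BS.pack (go 0 (BS.unpack v1) (BS.unpack v2))
  where
    go :: Int -> [Word8] -> [Word8] -> [Word8]
    go _     []       []       = []  -- carry past 8 bytes wraps, like Word64
    go carry (x : xs) (y : ys) =
      let s = fromIntegral x + fromIntegral y + carry
      in  fromIntegral s : go (s `div` 256) xs ys
    go _ _ _ = error "addSerialised: values must have equal length"
```

A real `resolveValue` instance could use such a function on the raw value bytes instead of decoding both operands to `Word64` first.
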
## `io-classes` incompatibility

At the time of writing, various packages in the `cardano-node` stack depend on
`io-classes-1.5` and the 1.5 versions of its daughter packages, such as
`strict-stm`. For example, the build dependencies in `ouroboros-consensus.cabal`
contain the following:

* `io-classes ^>= 1.5`
* `strict-stm ^>= 1.5`

However, `lsm-tree` needs `io-classes-1.6` or `io-classes-1.7`, which leads to a
dependency conflict. One would hope that a package could have bounds loose
enough for it to be built with `io-classes-1.5`, `io-classes-1.6`, and
`io-classes-1.7`. Unfortunately, this is not the case, because, starting with
the `io-classes-1.6` release, daughter packages like `strict-stm` are
sublibraries of `io-classes`. For example, the build dependencies in
`lsm-tree.cabal` contain the following:

* `io-classes ^>= 1.6 || ^>= 1.7`
* `io-classes:strict-stm`

Sadly, there is currently no way to express both sets of build dependencies
within a single `build-depends` field, as Cabal’s support for conditional
expressions is not powerful enough for this.

We are aware that the `ouroboros-consensus` stack has not been updated to
`io-classes-1.7` due to a bug related to Nix. For more information, see
https://github.com/IntersectMBO/ouroboros-network/pull/4951. We advise fixing
this Nix-related bug rather than downgrading `lsm-tree`’s dependency on
`io-classes` to version 1.5.