The RedStuff encoding algorithm used in Walrus is an adaptation of the Twin-Code framework presented by Rashmi et al. 1.
The goal of the Walrus system is to provide a distributed storage infrastructure, where a decentralized set of entities—the storage nodes—collaborate to store and serve files ("blobs" of data). When it comes to storage properties, Walrus has a three key goals:
- To support extremely high availability and durability of the data;
- to have low storage overhead compared to full replication, i.e., we don't want to store each blob on every storage node;
- to gracefully support node failures, and in particular to allow for efficient node recovery (more on this later).
Given these requirements, one good option is to erasure encode the blobs across the storage nodes. At a high level, erasure encoding (or erasure
coding) allows to encode the data into
- Erasure coding allows to recover a blob even if
$N - k$ storage nodes fail, providing high availability and durability to the data; - the overall storage overhead is much smaller than for full replication, as for a blob of size
$S$ , the total storage used in the system is$S \cdot c$ instead of$S \cdot N$ , where$c \ll N$ is a small constant (4.5 in Walrus's case).
To achieve the third requirement, however, simple erasure coding is insufficient: A failed node that wants to reconstruct its part of the encoding
needs to first fetch at least
We provide here the essential background on the coding schemes used in RedStuff.
Erasure coding considers the problem of error correction in the case of bit erasures, i.e., some bits in the message are lost, as in the case of a
lossy channel. In general, an erasure code divides a blob (or message) of
Fountain codes are a class of erasure codes. The key property of fountain codes is that the encoding process is rateless, i.e., the encoder can
produce an arbitrary number of encoded parts on the fly, without knowing the total number of parts that will be produced. This is useful for the
RedStuff use case, as it allows us to specify the rate of the encoder (e.g., by encoding
RedStuff is based on the RaptorQ fountain code. RaptorQ is one of the fastest and most efficient fountain codes, and has the following properties:
- It is systematic, i.e., the first
$k$ symbols of the encoded message correspond to the original message; - it is a linear code, i.e., the encoding process is a linear transformation of the input symbols, or in other words, the encoded symbols are linear combinations of the input symbols;
- it is almost optimal, meaning that the
$k' ~ k$ . Specifically, the probability of decoding failure for$k' = k + H$ symbols received is$< 1/256^{H+1}$
An old idea in distributed storage is to use an erasure code to encode blobs of data across multiple storage nodes. Thus, by using
The Twin-Code Framework aims to solve this issue by allowing for efficient node recovery. We briefly describe here how the framework is used in RedStuff, and refer to the original paper for the specific details. The RedStuff encoding algorithm is an adaptation of the Twin-Code framework, which allows for efficient node recovery in erasure-coded storage systems.
We consider a scenario in which a blob of data is encoded and stored across
The RedStuff encoding and recovery process works as follows:
- First, in RedStuff, the data blob of size
$S$ is divided into symbols and arranged in a rectangular message matrix, of up to$N - 2f$ rows and$N - f$ columns of symbols. The number of rows ($n_R$ ) and columns ($n_C$ ) is fixed, and determines the symbol size$s$ as follows:
- Then, the columns and the rows of such message matrix are encoded separately with RaptorQ.
- The primary encoding, performed on columns, expands the
$n_R$ symbols of each column to$N$ symbols (as mentioned above, the rateless nature of RaptorQ allows us to choose the number of encoded symbols). - The secondary encoding, performed on rows, expands the
$n_C$ symbols of each row to$N$ symbols.
- The primary encoding, performed on columns, expands the
- Thus,
$n_R$ is also called the number of "primary source symbols", and$n_C$ the number of "secondary source symbols". Clearly, the primary encoding has rate$n_R / N$ , and the secondary encoding has rate$n_C / N$ . - The encoded rows and columns are then used to obtain primary and secondary slivers, which are distributed to shards and used for blob
reconstruction and sliver recovery:
- Primary slivers are the rows of the matrix of size
$N \times n_C$ obtained with the primary encoding of the message matrix. Each primary sliver is therefore composed of$n_C$ symbols. - Secondary slivers are the columns of the matrix of size
$n_R \times N$ obtained with the primary encoding of the message matrix. Each secondary sliver is therefore composed of$n_R$ symbols.
- Primary slivers are the rows of the matrix of size
- Each shard receives a primary and a secondary sliver, based on the shard number and the row and column numbers of the slivers (see the section on sliver-to-shard mapping for more details).
- The fundamental property we achieve with this construction, and thanks to the linearity of RaptorQ, is that encoding the primary slivers (as rows)
with the secondary encoding and the secondary slivers (as columns) with the primary encoding results in the same
$N \times N$ expanded message matrix. This property can then be used for lost sliver recovery:- If a shard is trying reconstruct a lost primary sliver, it can request
$N-f$ symbols from the encodings of the secondary slivers of other shards, because, as mentioned above, the primary encoding of secondary slivers results in the symbols for primary slivers. Then, since the secondary encoding has$n_C$ source symbols, and$n_C \leq N-2f$ , we can decode the original primary sliver from the obtained recovery symbols with high probability (see the later discussion on recovery probability). - The reconstruction of secondary slivers is identical, but inverting the role of primary and secondary slivers and encodings.
- If a shard is trying reconstruct a lost primary sliver, it can request
The following example concretely shows this process in action.
Consider a Walrus instance with
Then, the primary encoding acts on the columns of the matrix, expanding them such that each column is composed of 4 source symbols and 6 recovery
symbols (
Each of the rows of this column expansion is a primary sliver. E.g.,
Similarly, the secondary encoding on the rows of the matrix produces the expanded rows.
Each of the columns of this column expansion is a secondary sliver. E.g.,
The
Thanks to the linearity of RaptorQ, the expansion of
- the recovery secondary slivers (columns 5,6) with the primary encoding; and
- the recovery primary slivers (rows 3,4,5,6) with the secondary encoding;
result in the same set of symbols, which is essential for recovery. Therefore, these symbols can be represented as the lower-right quadrant of what we call the "fully expanded message matrix".
Note again that these symbols do not need to be stored on any node because they can always be recomputed by expanding either a primary or secondary
symbol. E.g,
- the secondary-encoding expansion of the 4th primary sliver:
$[r_{4,0}, r_{4,1}, r_{4,2}, r_{4,3}, r_{4,4}, \textcolor{blue}{r_{4,5}}, r_{4,6}]$ ; or - the primary-encoding expansion of the 5th secondary sliver:
$[r_{0,5}, r_{1,5}, r_{2,5}, r_{3,5}, \textcolor{blue}{r_{4,5}}, r_{5,5}, r_{6,5}]$ .
Consider now that shard 3 fails, losing its slivers, and needs to recover them. In the following, the symbols of the lost slivers are highlighted in red (recall that the lower quadrant is never stored).
To recover the primary sliver, the node needs to contact 5 other shards and request the recovery symbols for the 3rd primary slivers. Since the symbols of the sliver are recovery symbols, the shards need to encode their secondary slivers (here highlighted as columns) to obtain them. For example, shards 0, 1, 2, 4, 6 provide the symbols:
To recover the secondary sliver, the node needs to contact at 3 other shards to obtain recovery symbols. In this case however, the recovery symbols are already part of the primary slivers (here highlighted as rows) stored by the other shards, so no re-encoding is necessary. For example, shards 0, 1, 5 provide the recovery symbols:
Note that in this case, the symbols
The rectangular layout of the message matrix is an optimization for the Byzantine setting. When storing the blob, a client can only await
However, after this initial sharing phase, the honest shards share and reconstruct the missing slivers from each other. At a steady state, therefore,
we can always assume that
We explain here how the
- The
$N-2f$ honest shards receive the sliver pairs. - The remaining
$f$ honest shards are notified of the stored blob (e.g., through the chain), and start the process to recover their sliver pairs. - First, they recover their secondary slivers, as they can be decoded from
$N-2f$ recovery symbols. - Then, once all
$N-f$ honest shards have their secondary slivers, they can start recovering the primary slivers (which indeed require$N-f$ recovery symbols). - All honest shards have their sliver pairs.
Assume for simplicity that
Therefore, the storage overhead due to RedStuff encoding is about
The key modifications in RedStuff, compared to the original Twin-Code framework, are the following:
- RedStuff uses the RaptorQ fountain code for both the "Type 0" and "Type 1" encoding, as they are called in the paper. The rates are about
$(N-2f)/N$ and$(N-f)/N$ respectively. We later discuss more in detail the exact rates and the decoding probability. We call the Type 0 encoding the primary encoding, and the Type 1 encoding the secondary encoding. - The blob is not laid out in a square message matrix, but in a rectangular one. As we explain in the following, this is an optimization for the specific BFT setting we are considering.
- Both "Type 0" and "Type 1" encodings are stored on each shard. In our terminology these are called "slivers", and the two together form a "sliver pair".
So far we have assumed that sliver pair
This is important because we foresee that the systematic slivers will be the most frequently accessed, because they can be used to access the data without any decoding.
In brief, the mapping is done as follows: each encoded blob is assigned a 32 byte pseudo random blob ID. This ID is interpreted as an unsigned
big-endian integer, and its reminder modulo
As mentioned above, the reconstruction failure probability of the RaptorQ code is
We therefore set the following parameters in our encoding configuration:
-
$f$ , the maximum number of Byzantine shards is$\lfloor (N-1) / 3 \rfloor$ . - The safety limit for the encoding,
$\sigma$ , to ensure high reconstruction probability, is set as a function of$N$ (see table below). - The number of primary source symbols (equivalent to the number of symbols in a secondary sliver) is
$N - 2f -\sigma$ - The number of secondary source symbols (equivalent to the number of symbols in a primary sliver) is
$N - f -\sigma$
Currently,
| N shards from | N shards to (incl) | Prob. failure | |
|---|---|---|---|
| 0 | 15 | 0 | 0.00391 |
| 16 | 30 | 1 | 1.53e-05 |
| 31 | 45 | 2 | 5.96e-08 |
| 46 | 60 | 3 | 2.33e-10 |
| 61 | 75 | 4 | 9.09e-13 |
| 76 | inf | 5 | 3.55e-15 |
For example, then, we have the following settings:
| N shards | f | # primary | # secondary | |
|---|---|---|---|---|
| 7 | 2 | 0 | 3 | 5 |
| 10 | 3 | 0 | 4 | 7 |
| 31 | 10 | 2 | 9 | 19 |
| 100 | 33 | 5 | 29 | 62 |
| 300 | 99 | 5 | 97 | 196 |
| 1000 | 333 | 5 | 329 | 662 |
In RaptorQ, the size of a symbol is encoded as a 16-bit integer. Therefore, the maximum size of a blob that can be encoded is
Since the blob is encoded in the rectangular message matrix, the blob size is upperbound by source_symbols_primary * source_symbols_secondary * u16::MAX and lowerbound by source_symbols_primary * source_symbols_secondary. A few examples for the same configurations as above:
| N shards | Min blob size | Max blob size | Min encoded blob size | Max encoded blob size |
|---|---|---|---|---|
| 7 | 15.0 B | 983 KB | 56.0 B | 3.67 MB |
| 10 | 28.0 B | 1.83 MB | 110 B | 7.21 MB |
| 31 | 171 B | 11.2 MB | 868 B | 56.9 MB |
| 100 | 1.80 KB | 118 MB | 9.10 KB | 596 MB |
| 300 | 19.0 KB | 1.25 GB | 87.9 KB | 5.76 GB |
| 1000 | 218 KB | 14.3 GB | 991 KB | 64.9 GB |
Alongside the efficient encoding performed by RedStuff, we want to enable shards to authenticate that the slivers and encoding symbols they receive indeed belong to the blob they wanted. We briefly outline here how this is achieved.
For each sliver, primary or secondary, a Merkle tree is constructed.
Importantly, the tree is constructed over all the
Then, a Merkle tree over the sliver hashes is computed to obtain a blob hash. This is computed by concatenating primary and secondary sliver hashes
(computed as above) for each sliver pair, and then constructing the Merle tree over the concatenations (
Therefore, to prove that a sliver is indeed part of a blob, it is sufficient to provide the Merkle path to the root.
Finally, the encoding type tag (representing the RedStuff version, or alternative encoding), the length of the blob before the encoding, and the Merkle root of the tree over the slivers are hashed together to obtain the blob ID.
As seen above, each storage node needs to store the full metadata for the blob. The metadata consists of:
- A
32 BMerkle root hash for each primary and secondary sliver; - the
32 Bblob ID, computed as above; - the erasure code type (
1 B); - the length of the unencoded blob size (
8 B).
Of these, the hashes for the primary and secondary slivers can be a considerable overhead, if the number of shards is high. We show here the cumulative size of the hashes stored on the system, depending on the number of nodes and the number of shards.
| N shards | One node | floor(N/floor(log2(N)) nodes | N nodes |
|---|---|---|---|
| 7 | 448 B | 1.34 KB | 3.14 KB |
| 10 | 640 B | 1.92 KB | 6.40 KB |
| 31 | 1.98 KB | 13.9 KB | 61.5 KB |
| 100 | 6.40 KB | 102 KB | 640 KB |
| 300 | 19.2 KB | 710 KB | 5.76 MB |
| 1000 | 64.0 KB | 7.10 MB | 64.0 MB |
We see that the cumulative size of the hashes in the case of 1000 nodes (1 node per shard), is 64KB per node, or 64MB for a system of 1000 nodes.
However, recall that the number of shards is set and constant, while the number of nodes may vary (each node has one or more shards), potentially
lowering the overhead on the system. We therefore show here the ratio between the size of the hashes stored on the system to the minimum and maximum
blob sizes, for N=1000 shards and different number of nodes (1 node, floor(N/floor(log2(N))) = 111, 1000).
| N = 1000 | Total metadata size | Factor min blob | Factor max blob | Factor min encoded blob | Factor max encoded blob |
|---|---|---|---|---|---|
| Single node | 64.0 KB | 0.294 | 4.48e-06 | 0.0646 | 9.85e-07 |
| floor(N/floor(log2(N))) nodes | 7.10 MB | 32.6 | 0.000498 | 7.17 | 0.000109 |
| N nodes | 64.0 MB | 294 | 0.00448 | 64.6 | 0.000985 |
We see that for realistic node counts and small blob sizes, the total metadata overhead goes can be multiple times the size of the initial unencoded blob. However, for larger blob sizes, the overhead is negligible.
Footnotes
-
K. V. Rashmi, N. B. Shah and P. V . Kumar, "Enabling node repair in any erasure code for distributed storage," 2011 IEEE International Symposium on Information Theory Proceedings, St. Petersburg, Russia, 2011, pp. 1235-1239, doi: 10.1109/ISIT.2011.6033732. ↩

