# Riak Core Claim Fixes

This post is about two bug fixes to
[riak-core's](https://github.com/basho/riak_core) claim code.

The post will cover the purpose of claim, the inputs, and the desired
properties of the output. Then we'll look at the bugs themselves, and
how we've fixed them.

First though, a quick look at the ring. If you're already confident
about the ring, what it is, and what it does, skip down to
[Recent Work](#recent-work).

## Abstract and Concrete

As [Martin Sumner](https://twitter.com/masleeds) recently wrote in an
in-depth look at
[claim in Riak](https://github.com/infinityworks/riak_core/blob/mas-claimv2issues/docs/ring_claim.md),
the ring is something often illustrated, a useful metaphor, and
typically a vaguely grasped abstraction. But the ring is not just
conceptual, it's a concrete thing in Riak: a data structure that
describes a mapping of ownership as a routing table.

### The Key Space

Riak uses
[consistent hashing](https://en.wikipedia.org/wiki/Consistent_hashing)
to decide where to store replicas of data. The entire key space is the
range of integers from 0 to 2^160 - 1. The key space is divided into a
number of ranges of equal size. The number of ranges is configured as
the _ring size_ for a cluster, and the ring size is always a power of
two. If the ring size is 64, then there will be 64 equal-sized
_partitions_ of the key space (and so on for ring sizes of 8, 16, 32,
128, 512 etc.)

A hash function is used to turn any key into an integer in the key
space (e.g. `hash({"artist", "REM"}) ->
666603136322215606811169435479313437784925203632`). The result of the
hash function is an integer that falls inside one of the partitions,
or ranges, of the divided key space. The hash function is such that we
get an even distribution of values throughout the key space. So far,
so abstract.
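
Concretely, the key-to-partition mapping can be sketched in a few
lines of Erlang. This is not riak_core's implementation, just a
minimal sketch assuming a ring size of 64 and SHA-1 as the hash
function; the function name `partition_for/1` is made up for
illustration.

    %% Map an arbitrary Erlang term to the partition (0..63) its hash
    %% falls into, assuming a 2^160 key space split into 64 ranges.
    partition_for(Key) ->
        <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary(Key)),
        PartitionSize = (1 bsl 160) div 64,
        HashInt div PartitionSize.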

### Ownership

Each of these ranges/divisions/partitions of the ring represents a
portion of the keys Riak stores. In order to spread the work evenly
throughout a Riak cluster, each physical node _should_ own an equal
number of partitions.

Concretely the ring is a mapping of partitions of the key space to
physical nodes, denoting ownership. For each partition in the key
space there is an entry in the ring that points to a Riak node
name. An entry says, in effect:

    node1 owns the partition starting at
    1415829711164312202009819681693899175291684651008

For each partition that it owns, a node will start a process. These
processes are called _vnodes_, for virtual nodes.
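
To make that concrete, here is a sketch of a ring as a plain list of
`{PartitionIndex, Node}` pairs, with ownership handed out
round-robin. This is not the data structure riak_core actually
carries around; the function name `ring/2` and the node atoms are
illustrative.

    %% Build a toy ring: RingSize partitions of the 2^160 key space,
    %% owned in round-robin order by the given nodes.
    ring(RingSize, Nodes) ->
        Increment = (1 bsl 160) div RingSize,
        [{I * Increment, lists:nth((I rem length(Nodes)) + 1, Nodes)}
         || I <- lists:seq(0, RingSize - 1)].

    %% ring(16, [n1, n2, n3, n4, n5]) gives
    %% [{0, n1}, {Increment, n2}, {2 * Increment, n3}, ...]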

### A Routing Table

Since it maps out ownership, the ring is also a routing table. Any
node in a Riak cluster can run the hash function on a key, look up in
the ring which node owns the partition that key falls into, and route
requests to that node. For this to work, every node needs a copy of
the ring, and ideally, all copies would agree as to which nodes own
which keys. Riak has a gossip protocol for spreading the ring
throughout a cluster and ensuring all nodes share a consistent view of
the ring. Hopefully a future post can describe gossip in detail.
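
Reusing the toy ring above, routing is just "hash the key, then find
the last entry whose start index is not greater than the hash". A
sketch, with an illustrative function name:

    %% Find the node responsible for Key in a ring built by ring/2
    %% (a sorted list of {StartIndex, Node} pairs).
    owner(Key, Ring) ->
        <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary(Key)),
        {_Index, Node} = lists:last([E || {I, _} = E <- Ring, I =< HashInt]),
        Node.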

### Replicas, Preference Lists, Primary and Fallback

Riak is a replicated database. By default it stores three replicas of
every key/value; the number of replicas is called the `n-val`, meaning
there are `n` replicas, and by default `n=3`. As described above,
hashing the key gets a partition, a partition is owned by a node, and
the node runs a vnode process for that partition. The vnode stores
the key/value, but what of the other replicas? In order to place the
`n-1` remaining replicas Riak uses the ring. Ordering the partitions
low-to-high, Riak picks the nodes that own the next two partitions
following the one the key hashes to. This gives us a list of pairs
`[{partition(), node()}]` called the Preference List, or preflist for
short, which says exactly which processes on which physical nodes
should have copies of the data.
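
A sketch of building such a preflist from the toy ring above (again
illustrative, not riak_core's code): take the entry the key hashes
into plus the next `NVal - 1` entries, wrapping around the end of the
ring (more on the wrap-around below).

    %% Primary preflist: the key's home partition and the next
    %% NVal - 1 partitions around the ring, with their owners.
    preflist(Key, Ring, NVal) ->
        <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary(Key)),
        Pos = length([I || {I, _} <- Ring, I =< HashInt]) - 1,  % 0-based
        RingSize = length(Ring),
        [lists:nth(((Pos + Offset) rem RingSize) + 1, Ring)
         || Offset <- lists:seq(0, NVal - 1)].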
| 87 | + |
| 88 | +More specifically the preflist above is called a Primary |
| 89 | +Preflist. Riak is a fault tolerant database, and if some node in the |
| 90 | +Primary Preflist is not reachable to write or read a replica, Riak |
| 91 | +uses the ring again. For any node in the primary preflist that is down |
| 92 | +Riak will select the node for the next partition on the ring, and so |
| 93 | +on, until it comes to a node that is up. Any node that stores a |
| 94 | +replica but is not in the primary preflist is called a Fallback. A |
| 95 | +Fallback will start a vnode process for the primary partition it |
| 96 | +temporarily replaces and store a replica until such a time as the |
| 97 | +primary is contactable. |
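
A simplified sketch of that fallback selection, extending the
`preflist/3` sketch above. `UpNodes` is assumed to be the list of
currently reachable nodes, and the sketch assumes there are enough of
them to fill every slot; riak_core's real selection is more involved.

    %% Keep each primary partition, but hand it to the next reachable
    %% node further round the ring when its owner is down.
    preflist_with_fallbacks(Key, Ring, NVal, UpNodes) ->
        <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary(Key)),
        Pos = length([I || {I, _} <- Ring, I =< HashInt]) - 1,
        RingSize = length(Ring),
        Rotated = [lists:nth(((Pos + Off) rem RingSize) + 1, Ring)
                   || Off <- lists:seq(0, RingSize - 1)],
        {Primaries, Rest} = lists:split(NVal, Rotated),
        Fallbacks = [N || {_, N} <- Rest, lists:member(N, UpNodes)],
        substitute(Primaries, Fallbacks, UpNodes).

    substitute([], _Fallbacks, _UpNodes) ->
        [];
    substitute([{Idx, Node} | T], Fallbacks, UpNodes) ->
        case lists:member(Node, UpNodes) of
            true  -> [{Idx, Node} | substitute(T, Fallbacks, UpNodes)];
            false -> [{Idx, hd(Fallbacks)} | substitute(T, tl(Fallbacks), UpNodes)]
        end.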

Thinking about a key that hashes into one of the last two partitions
should clarify why the ring is called a ring. A key that hashes into
the last partition would have a preflist made up of that last
partition, followed by the first, and then the second partitions,
wrapping around the tail of the ring and back to the start.

We want at least `n` nodes in the ring if we want each replica to be
on its own node. If we want to tolerate failures and still have each
replica on its own node, we need `n + f` nodes where `f` is the number
of node failures we need to tolerate. Basho's recommendation was that
you want at least 5 nodes in a Riak cluster.

Knowing how Riak constructs preflists from the ring, and how it
selects primaries and fallbacks, is useful when we consider what
properties we might want from an algorithm that assigns partitions to
nodes. This algorithm is called Claim.

### Claim

Claim's job is to make the ring. As inputs it has the ring size (total
number of partitions), the list of nodes that make up the cluster, and
the current ring.

#### Spacing

Thinking about preflists, we want to distribute the partitions in such
a way that any preflist selected from the ring will lead to replicas
being stored on three distinct nodes. Ideally it should be able to
tolerate the failure of any node and still have every preflist contain
three distinct nodes. This spacing is called `target-n-val` and by
default it is set to four.
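
The spacing property can be checked mechanically: every window of
`target-n-val` consecutive partitions, wrapping around the end of the
ring, must be owned by `target-n-val` distinct nodes. A sketch, where
`Owners` is the list of owning nodes in ring order and the function
name is illustrative:

    %% True if every window of TargetN consecutive positions (with
    %% wrap-around) contains TargetN distinct nodes.
    meets_target_n(Owners, TargetN) ->
        Len = length(Owners),
        Wrapped = Owners ++ lists:sublist(Owners, TargetN - 1),
        lists:all(
          fun(Pos) ->
                  Window = lists:sublist(Wrapped, Pos, TargetN),
                  length(lists:usort(Window)) =:= TargetN
          end,
          lists:seq(1, Len)).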

#### Balance

Since the vnode processes are the actors that do the work of reading
and writing data, it would be best if the same number of processes ran
on each node, meaning that all nodes would perform pretty equally. No
node would be much faster than the others, and crucially no node would
be much slower. Clearly if the ring size is not divisible by the node
count we can't get exactly the same number of vnodes per node, but the
difference should be minimal.
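
Balance can be checked just as mechanically: the per-node vnode counts
should differ by at most one. Another illustrative sketch:

    %% True if no node owns more than one vnode more than any other.
    balanced(Owners) ->
        Counts = [length([N || N <- Owners, N =:= Node])
                  || Node <- lists:usort(Owners)],
        lists:max(Counts) - lists:min(Counts) =< 1.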

## Recent Work

### Tail Violations

One issue with Claim as it stands today is that it does not address
"tail violations." For example, given a ring size of 16 with 5 nodes
added at once, claim generates a ring where there are primary
preflists that do not have three distinct nodes. Claim assigns to each
node in order, partition one (`p1`) to node one (`n1`), `p2` to `n2`,
`p3` to `n3`, `p4` to `n4`, `p5` to `n5`, `p6` to `n1` and so on,
finishing with `p16` on `n1`. Both the preflists
`[{p15, n5}, {p16, n1}, {p1, n1}]` and
`[{p16, n1}, {p1, n1}, {p2, n2}]` violate the spacing requirements of
the ring. This can lead to data loss: you may expect there to be 3
distinct physical replicas of your data when in fact there are only 2,
which makes this an important bug to fix.

This bug was found by modifying the [quickcheck](http://www.quviq.com/products/) tests to add
multiple nodes to a cluster at once, and fixed by
[Ramen Sen](https://github.com/ramensen) of NHS Digital. The fix first
decides whether there are tail violations (whether `ring size rem
node count` is greater than zero but less than `target-n-val`) and
whether they can be solved given the number of nodes, target-n-val,
and ring size.
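
That first check is simple arithmetic; a sketch (illustrative names
again):

    %% A wrap-around tail violation is possible when the ring does not
    %% divide evenly among the nodes and the leftover tail is shorter
    %% than the target spacing, e.g. 16 rem 5 = 1 < 4.
    has_tail_violation(RingSize, NodeCount, TargetN) ->
        Tail = RingSize rem NodeCount,
        Tail > 0 andalso Tail < TargetN.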

Think of the ring as a set of sequences, where a sequence is:

    [n1, n2, n3, n4, n5]

Then a ring for 16 vnodes with 5 nodes has 3 full sequences and a
tail of `n1`:

    [n1, n2, n3, n4, n5]
    [n1, n2, n3, n4, n5]
    [n1, n2, n3, n4, n5]
    [n1]

Where tail violations can be solved, this is done by extending the
tail of the ring so that there is no wrap-around tail violation and
the tail preserves `target-n-val`:

    [n1, n2, n3, n4, n5]
    [n1, n2, n3, n4, n5]
    [n1, n2, n3, n4, n5]
    [n1, n2, n3, n4]

Each node added to the tail of the ring is then removed from one of
the previous complete sequences: the lowest added node from the
sequence just before the tail, the next from the sequence before that,
and so on working backwards.

    [n1, n2, n3, n5]
    [n1, n2, n4, n5]
    [n1, n3, n4, n5]
    [n1, n2, n3, n4]

This leaves a ring where all preflists, even with a single node
failure, have three distinct nodes. The fix is slightly more complex,
taking larger chunks of sequences when possible, but the above should
serve as illustration.

### Balanced Rings

If we have 4 nodes, a ring of size 32, a nicely balanced ring where
each node has 8 vnodes, and no tail violations, what happens when we
add a 5th node? Claim takes the existing ring and adds `n5` to it. It
calculates how many vnodes each node wants (6, since we can't have a
fraction of a vnode) and how many each existing node can give up (2
each). It then decides what the delta for each node is. Nodes 1
through 4 have a delta of 2, which means they can give up at most 2
vnodes. Node 5 has a delta of -6, meaning it wants 6 vnodes. Claim
then takes vnodes from the existing nodes and assigns them to the new
node in such a way that `target-n-val` is not violated. The bug is
that claim takes all deltas from each node until wants are
satisfied. It takes 2 from node 1, 2 from node 2, 2 from node 3, and
then stops. Now node 5 has all 6 wants met, and nodes 1 through 3 have
6 vnodes each, all good. Except that node 4 has 8 vnodes, 33.3% more
than any other node. Node 4 is going to be busy, and therefore slow,
and that means trouble.
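
The wants and deltas above are simple integer arithmetic; a sketch
(illustrative function name, taking the current per-node vnode
counts):

    %% Each node wants RingSize div NodeCount vnodes; its delta is
    %% what it owns now minus what it wants.
    deltas(RingSize, CurrentCounts) ->
        Want = RingSize div length(CurrentCounts),
        [{Node, Owned - Want} || {Node, Owned} <- CurrentCounts].

    %% deltas(32, [{n1,8}, {n2,8}, {n3,8}, {n4,8}, {n5,0}])
    %% -> [{n1,2}, {n2,2}, {n3,2}, {n4,2}, {n5,-6}]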

The fix for this issue is trivial. Rather than taking all the deltas
at once from a node, take a delta from each node in round-robin
fashion until the wants of the new node are met. This results in a
ring where two nodes have 7 vnodes and the rest have 6.
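
A sketch of that round-robin take (illustrative names; it assumes the
donors' deltas are enough to cover the newcomer's want):

    %% Take one vnode at a time from each donor in turn until Want
    %% vnodes have been found; returns a map of how many each donor
    %% gave up.
    round_robin_take(Donors, Want) ->
        take(Donors, Want, #{}).

    take(_Donors, 0, Given) ->
        Given;
    take([{_Node, 0} | Rest], Want, Given) ->
        take(Rest, Want, Given);
    take([{Node, Delta} | Rest], Want, Given) ->
        take(Rest ++ [{Node, Delta - 1}], Want - 1,
             maps:update_with(Node, fun(C) -> C + 1 end, 1, Given)).

    %% round_robin_take([{n1,2}, {n2,2}, {n3,2}, {n4,2}], 6)
    %% -> #{n1 => 2, n2 => 2, n3 => 1, n4 => 1}
    %% leaving two nodes with 7 vnodes and the rest with 6 once the
    %% new node takes its six.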

## Future Work

As well as fixing the two issues above and adding extra quickcheck
property tests to verify the fixes, there is work still to be
done. The existing tests don't call the same code paths as the
operator commands `riak-admin cluster join | plan | remove | etc`,
which means there may well be undetected edge cases and untested
code. There is no test for removing a node. To address this we've
started work on a more thorough quickcheck statem test that models the
full life cycle of the ring over many transitions of node adds and
removes. However, the ring, claim, and gossip code seems to have grown
organically and is spread over a large surface in riak_core. This work
will take some more time.

## Conclusion

These changes, though small, make claim safer and lead to better
behaved clusters. Though the tail violation issue could be detected if
a diligent operator followed
[best practices](https://www.tiot.jp/riak-docs/riak/kv/latest/setup/upgrading/checklist/#confirming-configuration-with-riaknostic)
and checked the output of the ring before moving to production, there
was no way to mitigate it.

These fixes will be released with the next version of Riak we create
for NHS-Digital, along with the new features from previous posts. If
you want to look at the code, or try it for yourself, you can find it
[here](https://github.com/ramensen/riak_core/tree/rdb/rs-claim-tail-violations).