Commit c9c924e

Fix claim tail violations and unbalanced rings (#913)

* Fix claim tail violations and unbalanced rings

This commit squashes together some fixes for riak_core claim. There is a full write-up in docs/claim-fixes.md. In brief:

In some situations riak_core_claim can produce rings that have tail violations. A tail violation is where some preflists (drawn from the tail of the ring) do not enforce the target-n-val property. One example is a ring using the old claim diag for 16 vnodes and 5 nodes. Such a ring would have a tail of 1: the last two preflists from that ring would be [5, 1, 1] and [1, 1, 2], both of which would place more than one replica on a single node. These "tail violations" are addressed.

A further issue with claim was that distributing existing claim to a new node could result in an unbalanced ring. An example would be adding a 5th node to a 32-vnode ring of 4 nodes: the distribution went from [8, 8, 8, 8] to [6, 6, 6, 8, 6]. Before assigning vnodes, claim calculates "deltas" as a measure of how many vnodes each node wants or should give up. Prior to the change, for this example the deltas were [2, 2, 2, 2, -6], which means "you can take 2 partitions from each of 4 nodes in order to provide 6 on the 5th". Claim would be greedy, take all 2 from each of the first 3 nodes, and stop. This fix balances the deltas before claim so that they always equal the wants of the claiming node(s). For the example the deltas become [2, 2, 1, 1, -6], ensuring that the final balance of vnodes will be [6, 6, 7, 7, 6]. These "unbalanced rings" are fixed.

NOTE: there may still be ways to get bad rings with "leave" commands, as this is still not tested.

* Write up recent claim work

* Fix specs for dialyzer

Some mild renaming/refactoring for better reading (I got confused while figuring out dialyzer errors and these names made more sense when I re-read it.)

1 parent 9e9e734 commit c9c924e

10 files changed: +964, -12 lines

docs/balanced.png (13.6 KB)
docs/claim-fixes.md

Lines changed: 254 additions & 0 deletions
# Riak Core Claim Fixes

This post is about two bug fixes to
[riak-core's](https://github.com/basho/riak_core) claim code.

The post will cover the purpose of claim, its inputs, and the desired
properties of its output. Then a look at the bugs themselves, and how
we've fixed them.

First though, a quick look at the ring. Those of you who are confident
about the ring, what it is, and what it does, can skip down to
[Recent Work](#recent-work).

## Abstract and Concrete

As [Martin Sumner](https://twitter.com/masleeds) recently wrote in an
in-depth look at
[claim in Riak](https://github.com/infinityworks/riak_core/blob/mas-claimv2issues/docs/ring_claim.md),
the ring is something often illustrated, a useful metaphor, and
typically a vaguely grasped abstraction. But the ring is not just
conceptual; it's a concrete thing in Riak: a data structure that
describes a mapping of ownership and doubles as a routing table.

### The Key Space

Riak uses
[consistent hashing](https://en.wikipedia.org/wiki/Consistent_hashing)
to decide where to store replicas of data. The integers from 0 up to
2^160 represent the entire key space. The key space is divided into a
number of ranges of equal size. The number of ranges is configured as
the _ring size_ for a cluster, and the ring size is always a power of
two. If the ring size is 64, then there will be 64 equal-sized
_partitions_ of the key space (and so on for ring sizes of 8, 16, 32,
128, 512, etc.)

A hash function is used to turn any key into an integer in the key
space (e.g. `hash({"artist", "REM"}) ->
666603136322215606811169435479313437784925203632`). The result of the
hash function is an integer that falls inside one of the partitions,
or ranges, of the divided key space. The hash function is such that we
get an even distribution of values throughout the key space. So far,
so abstract.
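
To make that concrete, here is a minimal sketch of the key-to-partition
mapping. It is illustrative only, not riak_core's actual code (the real
work is done by riak_core's `chash` module); the module name here is
made up, though Riak does use SHA-1 as its hash.

```erlang
%% Illustrative sketch only, not riak_core's implementation.
-module(keyspace_sketch).
-export([partition_index/2]).

-define(RINGTOP, (1 bsl 160)). %% the key space is [0, 2^160)

%% Hash a key and return which of the RingSize equal-sized
%% partitions of the key space the hash falls into (0..RingSize-1).
partition_index(Key, RingSize) ->
    <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary(Key)),
    HashInt div (?RINGTOP div RingSize).
```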

### Ownership

Each of these ranges/divisions/partitions of the ring represents a
portion of the keys Riak stores. In order to spread the work evenly
throughout a Riak cluster each physical node _should_ own an equal
number of partitions.

Concretely the ring is a mapping of partitions of the key space to
physical nodes, denoting ownership. For each partition in the key
space there is an entry in the ring that points to a Riak node
name. It says:

"riak@node1 owns partition
1415829711164312202009819681693899175291684651008"

For each partition that it owns, a node will start a process. These
processes are called _vnodes_, for virtual nodes.
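
As a sketch, assuming a deliberately simplified shape (riak_core's
real ring is a record carrying much more than ownership), the mapping
could be modeled as ordered pairs:

```erlang
%% Sketch: the ownership part of the ring as an ordered list of
%% {PartitionIndex, OwnerNode} pairs. The real riak_core ring
%% record also carries cluster membership, vclocks, and metadata.
-module(ring_sketch).
-export([fresh/2, owner/2]).

%% Build a ring of RingSize partitions, assigning owners
%% round-robin; the naive scheme discussed later in this post.
fresh(RingSize, Nodes) ->
    N = length(Nodes),
    [{I, lists:nth((I rem N) + 1, Nodes)}
     || I <- lists:seq(0, RingSize - 1)].

%% Answer "who owns partition Index?".
owner(Index, Ring) ->
    {Index, Node} = lists:keyfind(Index, 1, Ring),
    Node.
```

`ring_sketch:fresh(16, [n1, n2, n3, n4, n5])` builds exactly the
16-partition, 5-node ring used as an example later in this post.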

### A Routing Table

Since it maps out ownership, the ring is also a routing table. Any
node in a Riak cluster can run the hash function on a key, look up in
the ring which node owns the partition that key falls into, and route
requests to that node. For this to work, every node needs a copy of
the ring, and ideally, all copies would agree as to which nodes own
which keys. Riak has a gossip protocol for spreading the ring
throughout a cluster and ensuring all nodes share a consistent view of
the ring. Hopefully a future post can describe gossip in detail.
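
Routing is then just hash-and-look-up. A sketch using the hypothetical
helpers above (riak_core's real routing goes through its `chash` and
preflist machinery):

```erlang
%% Sketch: route a request for Key to the node that owns the
%% partition Key hashes into.
-module(router_sketch).
-export([route/2]).

route(Key, Ring) ->
    RingSize = length(Ring),
    Index = keyspace_sketch:partition_index(Key, RingSize),
    ring_sketch:owner(Index, Ring).
```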

### Replicas, Preference Lists, Primary and Fallback

Riak is a replicated database. By default it stores three replicas of
every key/value; the number of replicas is called the `n-val`, meaning
there are `n` replicas, and by default `n=3`. As described above,
hashing the key gets a partition, a partition is owned by a node, and
the node runs a vnode process for that partition. The vnode stores the
key/value, but what of the other replicas? In order to place the `n-1`
remaining replicas Riak uses the ring. Ordering the partitions
low-to-high, Riak picks the nodes that own the next two partitions
following the one the key hashes to. This gives us a list of pairs
`[{partition(), node()}]` called the Preference List, or preflist for
short, which says exactly which processes on which physical nodes
should have copies of the data.
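
A sketch of that selection, under the same simplified ring shape as
above (the real preflist code lives in riak_core's `chash` and
`riak_core_apl` modules): take the partition the key hashes into plus
the owners of the next `n - 1` partitions, wrapping past the tail.

```erlang
%% Sketch: a primary preflist is the key's partition plus the
%% next NVal - 1 partitions around the ring, wrapping at the tail.
-module(preflist_sketch).
-export([preflist/3]).

preflist(Key, NVal, Ring) ->
    RingSize = length(Ring),
    First = keyspace_sketch:partition_index(Key, RingSize),
    %% Ring is ordered, so partition I is element I + 1
    [lists:nth(((First + Offset) rem RingSize) + 1, Ring)
     || Offset <- lists:seq(0, NVal - 1)].
```

The result is the `[{partition(), node()}]` list described above.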

More specifically, the preflist above is called a Primary
Preflist. Riak is a fault-tolerant database, and if some node in the
Primary Preflist is not reachable to write or read a replica, Riak
uses the ring again. For any node in the primary preflist that is
down, Riak will select the node for the next partition on the ring,
and so on, until it comes to a node that is up. Any node that stores a
replica but is not in the primary preflist is called a Fallback. A
Fallback will start a vnode process for the primary partition it
temporarily replaces and store a replica until such time as the
primary is contactable.
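
A sketch of that substitution, hypothetical and simplified
(`riak_core_apl` implements the real rules, and this assumes enough
reachable nodes remain): walk the partitions after the primaries, in
ring order, and hand each down primary's partition to the next
reachable node.

```erlang
%% Sketch: an active preflist substitutes a fallback node, drawn
%% from further around the ring, for each unreachable primary.
-module(fallback_sketch).
-export([active_preflist/4]).

active_preflist(Key, NVal, Ring, UpNodes) ->
    RingSize = length(Ring),
    First = keyspace_sketch:partition_index(Key, RingSize),
    %% the whole ring, rotated so the key's partition comes first
    Rotated = [lists:nth(((First + Off) rem RingSize) + 1, Ring)
               || Off <- lists:seq(0, RingSize - 1)],
    {Primaries, Rest} = lists:split(NVal, Rotated),
    Fallbacks = [Node || {_, Node} <- Rest,
                         lists:member(Node, UpNodes)],
    fill(Primaries, Fallbacks, UpNodes).

fill([], _Fallbacks, _UpNodes) ->
    [];
fill([{P, N} | T], Fallbacks, UpNodes) ->
    case lists:member(N, UpNodes) of
        true ->
            [{P, N} | fill(T, Fallbacks, UpNodes)];
        false ->
            %% the down primary's partition is served, temporarily,
            %% by the next up node around the ring
            [Fb | FbRest] = Fallbacks,
            [{P, Fb} | fill(T, FbRest, UpNodes)]
    end.
```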

Thinking about a key that hashes into one of the last two partitions
should clarify why the ring is called a ring. A key that hashes into
the last partition would have a preflist made up of that last
partition, followed by the first, and then the second partitions,
wrapping around the tail of the ring and back to the start.

We want at least `n` nodes in the ring if we want each replica to be
on its own node. If we want to tolerate failures and still have each
replica on its own node we need `n + f` nodes, where `f` is the number
of node failures we need to tolerate. Basho's recommendation was that
you want at least 5 nodes in a Riak cluster.

Knowing how Riak constructs preflists from the ring, and how it
selects primaries and fallbacks, is useful when we consider what
properties we might want from an algorithm that assigns partitions to
nodes. This algorithm is called Claim.

### Claim

Claim's job is to make the ring. As inputs it has the ring size (total
number of partitions), the list of nodes that make up the cluster, and
the current ring.

#### Spacing

Thinking about preflists, we want to distribute the partitions in such
a way that any preflist selected from the ring will lead to replicas
being stored on three distinct nodes. Ideally the ring should be able
to tolerate the failure of any node and still have every preflist
contain three distinct nodes. This spacing is called `target-n-val`,
and by default it is set to four.
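
As a sketch of checking that property over the simplified ring shape
used above (hypothetical code, not the quickcheck property itself):
slide a window of `target-n-val` consecutive partitions around the
ring, wrap-around included, and require every window's owners to be
distinct.

```erlang
%% Sketch: true if every run of TargetN consecutive partitions,
%% including runs that wrap the tail, has TargetN distinct owners.
-module(spacing_sketch).
-export([meets_target_n/2]).

meets_target_n(Ring, TargetN) ->
    Owners = [Node || {_, Node} <- Ring],
    RingSize = length(Owners),
    %% append the first TargetN - 1 owners so windows that wrap
    %% the tail are checked too
    Extended = Owners ++ lists:sublist(Owners, TargetN - 1),
    lists:all(fun(I) ->
                      Window = lists:sublist(Extended, I, TargetN),
                      length(lists:usort(Window)) =:= TargetN
              end,
              lists:seq(1, RingSize)).
```

The 16-partition, 5-node ring discussed in the next section fails this
check: the windows that wrap the tail contain `n1` twice.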

#### Balance

Since the vnode processes are the actors that do the work of reading
and writing data, it would be best if the same number of processes ran
on each node, meaning that all nodes would perform pretty equally. No
node would be much faster than the others, and crucially no node would
be much slower. Clearly if the ring size is not divisible by the node
count we can't get exactly the same number of vnodes per node, but the
difference should be minimal.

## Recent Work

### Tail Violations

One issue with Claim as it stands today is that it does not address
"tail violations." For example, given a ring size of 16 with 5 nodes
added at once, claim generates a ring where there are primary
preflists that do not have three distinct nodes. Claim assigns to each
node in order: partition one (`p1`) to node one (`n1`), `p2` to `n2`,
`p3` to `n3`, `p4` to `n4`, `p5` to `n5`, `p6` to `n1`, and so on,
finishing with `p16` on `n1`. Both the preflists
`[{p15, n5}, {p16, n1}, {p1, n1}]` and
`[{p16, n1}, {p1, n1}, {p2, n2}]` violate the spacing requirements of
the ring. This can lead to data loss if you expect there to be 3
distinct physical replicas of your data but in fact there may only be
2, and as such is an important bug to fix.

This bug was found by modifying the
[quickcheck](http://www.quviq.com/products/) tests to add multiple
nodes to a cluster at once, and fixed by
[Ramen Sen](https://github.com/ramensen) of NHS Digital. The fix first
decides if there are tail violations (if `ring size rem node count` is
greater than zero but less than `target-n-val`) and if they can be
solved given the number of nodes, target-n-val, and ring size.
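
That first check can be sketched as a one-line predicate (hypothetical
name; the real decision in the fix also weighs whether the violation
is solvable):

```erlang
%% Sketch: a ring of repeated full sequences of all the nodes has
%% a wrap-around tail violation when the leftover tail is
%% non-empty but shorter than target-n-val.
-module(tail_sketch).
-export([has_tail_violation/3]).

has_tail_violation(RingSize, NodeCount, TargetN) ->
    Tail = RingSize rem NodeCount,
    Tail > 0 andalso Tail < TargetN.
```

For the example above, `tail_sketch:has_tail_violation(16, 5, 4)`
returns `true`.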

Think of the ring as a set of sequences, where a sequence is:

    [n1, n2, n3, n4, n5]

Then a ring for 16 vnodes with 5 nodes has 3 full sequences, and a
tail of `n1`:

    [n1, n2, n3, n4, n5]
    [n1, n2, n3, n4, n5]
    [n1, n2, n3, n4, n5]
    [n1]

In the cases where tail violations can be solved, they are solved by
extending the tail of the ring so that there is no wrap-around tail
violation and the tail preserves `target-n-val`:

    [n1, n2, n3, n4, n5]
    [n1, n2, n3, n4, n5]
    [n1, n2, n3, n4, n5]
    [n1, n2, n3, n4]

Each node added to the tail of the ring is then removed from a
previous complete sequence: the lowest added node from the sequence
just before the tail, the next from the sequence before that, and so
on back.

    [n1, n2, n3, n5]
    [n1, n2, n4, n5]
    [n1, n3, n4, n5]
    [n1, n2, n3, n4]

This leaves a ring where all preflists, even with a single node
failure, have three distinct nodes. The fix is slightly more complex,
taking larger chunks of sequences when possible, but the above should
serve as illustration.

### Balanced Rings

If we have 4 nodes, a ring of size 32, a nicely balanced ring where
each node has 8 vnodes, and no tail violations, what happens when we
add a 5th node? Claim takes the existing ring and adds `n5` to it. It
calculates how many vnodes each node wants (6, since we can't have a
fraction of a vnode) and how many each existing node can give up (2
each). It then decides what the delta for each node is. Nodes 1
through 4 have a delta of 2, which means they can give up at most 2
vnodes. Node 5 has a delta of -6, meaning it wants 6 vnodes. Claim
then takes vnodes from the existing nodes and assigns them to the new
node in such a way that `target-n-val` is not violated. The bug is
that claim takes all deltas from each node until wants are satisfied.
It takes 2 from node1, 2 from node2, 2 from node3, and then stops. Now
node5 has all 6 wants met, and nodes 1 through 3 have 6 vnodes each,
all good. Except that node4 has 8 vnodes, 33.3% more than any other
node. Node4 is going to be busy, and therefore slow, and thus trouble.

![Unbalanced Ring](unbalanced.png "Unbalanced Ring")

The fix for this issue is trivial. Rather than taking all the deltas
at once from a node, instead take one delta from each node in
round-robin fashion until the wants of the new node are met. This
results in a ring where two nodes have 7 vnodes and the rest have 6.

![Balanced Ring](balanced.png "Balanced Ring")
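
A sketch of the round-robin taking (hypothetical shapes and names; the
actual fix is in riak_core_claim's delta handling): given each giver's
positive delta and the joiner's wants, take one vnode per giver per
pass until the wants are met.

```erlang
%% Sketch: decide how many vnodes to take from each giving node,
%% one per node per pass, rather than draining nodes in turn.
-module(delta_sketch).
-export([take_round_robin/2]).

%% Givers is [{Node, Delta}] with Delta > 0; Wants is the number
%% of vnodes the joining node needs. Returns [{Node, Taken}].
take_round_robin(Givers, Wants) ->
    pass(Givers, [{Node, 0} || {Node, _} <- Givers], Wants).

pass(_Givers, Taken, 0) ->
    Taken;
pass(Givers, Taken, Wants) ->
    {Taken2, Wants2} =
        lists:foldl(
          fun({Node, Delta}, {Acc, W}) when W > 0 ->
                  {Node, SoFar} = lists:keyfind(Node, 1, Acc),
                  case SoFar < Delta of
                      true ->
                          {lists:keyreplace(Node, 1, Acc,
                                            {Node, SoFar + 1}),
                           W - 1};
                      false ->
                          {Acc, W}
                  end;
             (_Giver, Acc) ->
                  Acc
          end,
          {Taken, Wants}, Givers),
    case Wants2 =:= Wants of
        true -> Taken2; %% no giver has spare delta left; stop
        false -> pass(Givers, Taken2, Wants2)
    end.
```

`delta_sketch:take_round_robin([{n1, 2}, {n2, 2}, {n3, 2}, {n4, 2}], 6)`
gives `[{n1, 2}, {n2, 2}, {n3, 1}, {n4, 1}]`, the `[6, 6, 7, 7, 6]`
outcome pictured above.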

## Future Work

As well as fixing the two issues above and adding extra quickcheck
property tests to verify the fixes, there is work still to be
done. The existing tests don't call the same code paths as the
operator commands `riak-admin cluster join | plan | remove | etc`,
which means there may well be edges undetected and code untested.
There is no test for removing a node. To address this we've started
work on a more thorough quickcheck statem test that models the full
life cycle of the ring over many transitions of node adds and
removes. However, the ring, claim, and gossip code seems to have grown
organically and is spread over a large surface in riak_core. This work
will take some more time.

## Conclusion

These changes, though small, make claim safer and lead to better
behaved clusters. Though the tail violation issue could be detected if
a diligent operator followed
[best practices](https://www.tiot.jp/riak-docs/riak/kv/latest/setup/upgrading/checklist/#confirming-configuration-with-riaknostic)
and checked the output of the ring before moving to production, there
was no way to mitigate the issues.

These fixes will be released with the next version of Riak we create
for NHS Digital, along with the new features from previous posts. If
you want to look at the code, or try it for yourself, you can find it
[here](https://github.com/ramensen/riak_core/tree/rdb/rs-claim-tail-violations).

docs/unbalanced.png (14.7 KB)
