Make pick_server/1 randomly elect the leader #10435

ariel-anieli · 2024-01-29T05:22:53Z

ariel-anieli
Jan 29, 2024

This discussion follows from PR 10420, in which I proposed that:

pick_server/1 randomly elects the leader;
set_timer/2 uses pick_server/1.

The PR answers a question asked in rabbit_fifo_client.erl. From the test I ran, it is possible to elect the leader randomly:

it adds fairness to the election;
the median consumer latency is the same.

I used a three-node cluster on a single machine (4 GiB, 10 quorum queues, nodes joined up by RAM):

$ curl --silent -u guest:guest http://localhost:15672/api/queues | jq -c '.[] | {name: .name, leader: .leader, node: .node}'                                              
{"name":"qq1","leader":"r1@localhost","node":"r1@localhost"}                                                                                                              
{"name":"qq10","leader":"r1@localhost","node":"r1@localhost"}                                                                                                             
{"name":"qq2","leader":"r1@localhost","node":"r1@localhost"}                                                                                                              
{"name":"qq3","leader":"r1@localhost","node":"r1@localhost"}                                                                                                              
{"name":"qq4","leader":"r1@localhost","node":"r1@localhost"}                                                                                                              
{"name":"qq5","leader":"r1@localhost","node":"r1@localhost"}                                                                                                              
{"name":"qq6","leader":"r1@localhost","node":"r1@localhost"}                                                                                                              
{"name":"qq7","leader":"r1@localhost","node":"r1@localhost"}                                                                                                              
{"name":"qq8","leader":"r1@localhost","node":"r1@localhost"}                                                                                                              
{"name":"qq9","leader":"r1@localhost","node":"r1@localhost"}                                                                                                              
                                                                                                                                                                          
$ curl --silent -u guest:guest http://localhost:15672/api/overview/ | jq -c '.listeners[] | select(.protocol=="amqp") | {node: .node, port: .port}'                       
{"node":"r3@localhost","port":5672}                                                                                                                                       
{"node":"r2@localhost","port":5772}                                                                                                                                       
{"node":"r1@localhost","port":5872}

$ curl --silent -u guest:guest http://localhost:15672/api/nodes/r1\@localhost?memory=true | jq -c '. | {total: .memory.total, name: .name}'                               
{"total":{"erlang":408039008,"rss":328749056,"allocated":412835840},"name":"r1@localhost"}                                                                                
                                                                                                                                                                          
$ curl --silent -u guest:guest http://localhost:15672/api/nodes/r2\@localhost?memory=true | jq -c '. | {total: .memory.total, name: .name}'                               
{"total":{"erlang":521714056,"rss":435261440,"allocated":530722816},"name":"r2@localhost"}                                                                                
                                                                                                                                                                          
$ curl --silent -u guest:guest http://localhost:15672/api/nodes/r3\@localhost?memory=true | jq -c '. | {total: .memory.total, name: .name}'                               
{"total":{"erlang":500559392,"rss":462385152,"allocated":511901696},"name":"r3@localhost"}

$ virsh dumpxml fedora | xmllint --xpath '//memory | //vcpu' --format -
<memory unit="KiB">4194304</memory>
<vcpu placement="static">4</vcpu>
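
To see how leaders are spread across nodes at a glance, the queue listing above can be tallied per node. The following is a minimal sketch, not part of the original test: `count_leaders` is a hypothetical helper name, and it assumes the input is the compact one-object-per-line output of the jq filter shown above.

```shell
#!/bin/sh
# Count how many queues each node leads.
# Input: compact JSON objects, one per line, as printed by the jq filter above.
# count_leaders is a hypothetical helper, not a RabbitMQ tool.
count_leaders() {
    sed -n 's/.*"leader":"\([^"]*\)".*/\1/p' |
        sort | uniq -c |
        awk '{ print $2, $1 }'   # normalise to "<node> <count>"
}

# Usage with three of the lines shown above:
printf '%s\n' \
    '{"name":"qq1","leader":"r1@localhost","node":"r1@localhost"}' \
    '{"name":"qq10","leader":"r1@localhost","node":"r1@localhost"}' \
    '{"name":"qq2","leader":"r1@localhost","node":"r1@localhost"}' |
    count_leaders
# prints: r1@localhost 3
```

Fed the full listing, it would report all ten queues under r1@localhost.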

I compared the performance of First server chosen as Leader:

$ git log --oneline -n1 
0e63d3b753 (HEAD -> main, origin/main, origin/HEAD) Merge pull request #10417 from rabbitmq/rabbitmq-server-10415-mk-alternative

$ seq 1 10 | sed -e 's/^/qq/' |
                    tr '\n' ',' | 
                    sed -e 's/,$//' | 
                    xargs -I {} sh -c "java -jar perf-test-2.20.0.jar --time 30 -x10 -y10 --quorum-queue --queue {} -mf compact" \
                    -H amqp://localhost:5672,amqp://localhost:5772,amqp://localhost:5872                                                                                                    

sending rate avg: 5396 msg/s                                                                                                                                              
receiving rate avg: 173 msg/s                                                                                                                                             
consumer latency min/median/75th/95th/99th 1298181/22887924/33100222/33999158/35198394 µs                                                                                 

$ seq 1 10 | sed -e 's/^/qq/' |
                    tr '\n' ',' |
                    sed -e 's/,$//' |
                    xargs -I {} sh -c "java -jar perf-test-2.20.0.jar --time 30 -x10 -y10 --quorum-queue --queue {} -mf compact"                                                                                                                                                                         
                                                                                                                                                                          
sending rate avg: 3645 msg/s                                                                                                                                              
receiving rate avg: 170 msg/s                                                                                                                                             
consumer latency min/median/75th/95th/99th 1101557/29192418/32186958/35192632/36194390 µs

And Leader chosen randomly:

$ seq 1 10 | sed -e 's/^/qq/' |
                    tr '\n' ',' |
                    sed -e 's/,$//' |
                    xargs -I {} sh -c "java -jar perf-test-2.20.0.jar --time 30 -x10 -y10 --quorum-queue --queue {} -mf compact"

sending rate avg: 6424 msg/s                                                                                                                                              
receiving rate avg: 204 msg/s                                                                                                                                             
consumer latency min/median/75th/95th/99th 2099061/22300892/28798536/33006971/34597958 µs                                                                                 

$ seq 1 10 | sed -e 's/^/qq/' |
                    tr '\n' ',' | 
                    sed -e 's/,$//' | 
                    xargs -I {} sh -c "java -jar perf-test-2.20.0.jar --time 30 -x10 -y10 --quorum-queue --queue {} -mf compact" \
                    -H amqp://localhost:5672,amqp://localhost:5772,amqp://localhost:5872                                                                                                    

sending rate avg: 5302 msg/s                                                                                                                                              
receiving rate avg: 138 msg/s                                                                                                                                             
consumer latency min/median/75th/95th/99th 1495440/29609103/31798521/34199375/35696068 µs
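
The latency figures above are in microseconds, which makes the runs hard to compare by eye. As a rough sketch (the helper name `median_latency_seconds` is made up, and the line layout is assumed to match the summaries above), the median can be pulled out and printed in seconds:

```shell
#!/bin/sh
# Extract the median consumer latency from PerfTest summary lines on stdin
# and print it in seconds. median_latency_seconds is a hypothetical helper;
# the layout "consumer latency <labels> <values> µs" is assumed from the runs above.
median_latency_seconds() {
    awk '/^consumer latency/ {
        # The values are the second-to-last field, e.g. "1298181/22887924/...".
        n = split($(NF - 1), v, "/")
        if (n >= 2) printf "%.1f\n", v[2] / 1000000
    }'
}

# Usage with the two "first server" runs reported above:
printf '%s\n' \
    'consumer latency min/median/75th/95th/99th 1298181/22887924/33100222/33999158/35198394 µs' \
    'consumer latency min/median/75th/95th/99th 1101557/29192418/32186958/35192632/36194390 µs' |
    median_latency_seconds
# prints:
# 22.9
# 29.2
```

The two "random" runs come out at 22.3 and 29.6 seconds, so the medians of the two variants differ by well under a second in both pairs of runs.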

In Leader chosen randomly, set_timer/2 relies on pick_server/1 instead of a case statement, which makes it the eighth function to rely on pick_server/1:

% sed -ne '/^[a-z]/{s/pick/&/; t; h}; /pick/{x;G;p}' deps/rabbit/src/rabbit_fifo_client.erl                                                                               
enqueue(QName, Correlation, Msg,                                                                                                                                          
    ServerId = pick_server(State0),                                                                                                                                       
dequeue(QueueName, ConsumerTag, Settlement,                                                                                                                               
    ServerId = pick_server(State0),                                                                                                                                       
settle(ConsumerTag, [_|_] = MsgIds, #state{slow = false} = State0) ->                                                                                                     
    ServerId = pick_server(State0),                                                                                                                                       
return(ConsumerTag, [_|_] = MsgIds, #state{slow = false} = State0) ->                                                                                                     
    ServerId = pick_server(State0),                                                                                                                                       
discard(ConsumerTag, [_|_] = MsgIds, #state{slow = false} = State0) ->                                                                                                    
    ServerId = pick_server(State0),                                                                                                                                       
credit(ConsumerTag, Credit, Drain,                                                                                                                                        
    ServerId = pick_server(State0),                                                                                                                                       
handle_ra_event(QName, From, {applied, Seqs},                                                                                                                             
            ServerId = pick_server(State2),                                                                                                                               
set_timer(QName, State) ->                                                                                                                                                
    Leader = pick_server(State),

michaelklishin · 2024-01-29T05:37:50Z

michaelklishin
Jan 29, 2024
Maintainer

Leader election in Raft is way more than picking a node from the list. I don't think it will make much practical difference to the election process fairness.

But it does introduce a random list element pick for seemingly every inbound message, outbound message, and message acknowledgement. Every single one, for every single user.

1 reply

ariel-anieli Jan 29, 2024
Author

I see. And is this randomness a benefit?

Answer selected by ariel-anieli

michaelklishin · 2024-01-29T05:43:21Z

michaelklishin
Jan 29, 2024
Maintainer

The rates in this example are so low, I suspect you must be running these on the same machine as PerfTest?

With 3 nodes, 10 publishers and 10 consumers, even the most powerful consumer hardware CPUs like Apple's M3 Ultra and 12-16 core x86-64 chips would likely be maxed out. Our tests are usually performed using external clusters, so nodes do not have to compete with each other and PerfTest kernel threads.

For example, post one, two.

5 replies

ariel-anieli Jan 29, 2024
Author

Yes, PerfTest runs on the same machine. Under these conditions, is the test relevant?
Thanks for the links; I will look at them.

michaelklishin Jan 29, 2024
Maintainer

I'd say it is not: all nodes compete for the same resources, and you run the test for just 30s according to the example flags.

Our tests use dedicated [to tests] clusters and usually run for hours, with all key metrics collected by Prometheus and Grafana.

ariel-anieli Jan 29, 2024
Author

Okay. Let me sum up the two points of the proposal:

Benefit of the change.
Relevance of the test.

On Point One, you have not answered me. 🙂 On Point Two, I can't reproduce your tests: I don't have the means for that.

Also on Point Two: if the tests are run on a single machine, what should I change for them to be relevant?

michaelklishin Jan 29, 2024
Maintainer

It's OK that you cannot reproduce our exact methodology. But that means that we will have to evaluate the effects of this change ourselves, and I cannot guess how long that might take.

Maybe @mkuratczyk would get interested in this PR specifically :)

This idea may have legs if we limit it to just the parts where a node has to choose, e.g., which peer it will vote for. Otherwise, the Raft leader election process involves more than one node and I wouldn't expect much of a difference. It's also not particularly easy to test for fairness of elections because… well, it's testing for a random variable. Even folks who have gone through grad school do not necessarily get it right.
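
To illustrate why checking a random pick for fairness is a statistical question, here is a small sketch that simulates uniform picks among three servers and compares the counts with the uniform expectation. It models only the pick itself, not RabbitMQ or Ra; `simulate_picks`, the seed, and the sample size are assumptions made up for the illustration.

```shell
#!/bin/sh
# Simulate N uniform random picks among three servers and report the counts
# plus a chi-square statistic against the uniform expectation.
# Hypothetical helper: this models the pick only, not RabbitMQ or Ra.
simulate_picks() {
    awk -v n="$1" -v seed="$2" 'BEGIN {
        srand(seed)
        # rand() is in [0, 1), so int(rand() * 3) + 1 is 1, 2 or 3.
        for (i = 0; i < n; i++) count[int(rand() * 3) + 1]++
        expected = n / 3
        for (s = 1; s <= 3; s++) {
            chi2 += (count[s] - expected) ^ 2 / expected
            printf "server%d %d\n", s, count[s]
        }
        printf "chi2 %.2f\n", chi2
    }'
}

# Usage: 3000 picks with an arbitrary seed.
simulate_picks 3000 42
```

With two degrees of freedom, a chi-square value well above roughly 5.99 (the 5% critical value) would hint at a skewed pick, but a single run proves little either way, and the exact counts depend on the awk implementation's generator.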

ariel-anieli Jan 29, 2024
Author

🙂 I would then approach the problem the old way: exhaust all cases. Here are the three alternatives:

Do you see any benefit in trying them all?

kjnilsson · 2024-01-29T09:50:59Z

kjnilsson
Jan 29, 2024
Maintainer

pick_server/1 does not elect a leader; it just chooses which member of a Ra cluster to address first, which will typically result in the actual leader being discovered. I cannot see how a random selection would be better than any other in terms of throughput, although in the tests above, which are short, the initial redirect overhead may contribute.
Both publishing and consuming first issue a synchronous command that discovers the leader, after which the client always uses the known leader.

The best approach now, when the leader is not known, is to use ra_leaderboard to see whether the leader for the queue is already known on the local node. If it is not, it can fall back to random or first selection; I doubt it matters.

Like this:

diff --git a/deps/rabbit/src/rabbit_fifo_client.erl b/deps/rabbit/src/rabbit_fifo_client.erl
index d6a32aa976..d2060f5ffd 100644
--- a/deps/rabbit/src/rabbit_fifo_client.erl
+++ b/deps/rabbit/src/rabbit_fifo_client.erl
@@ -813,8 +813,13 @@ get_missing_deliveries(State, From, To, ConsumerTag) ->
 
 pick_server(#state{leader = undefined,
                    cfg = #cfg{servers = [N | _]}}) ->
-    %% TODO: pick random rather that first?
-    N;
+    case ra_leaderboard:lookup_leader(N) of
+        undefined ->
+            %% TODO: pick random rather that first?
+            N;
+        Leader ->
+            Leader
+    end;
 pick_server(#state{leader = Leader}) ->
     Leader.

5 replies

ariel-anieli Jan 29, 2024
Author

Thanks for this explanation, @kjnilsson: here are two other tests. That makes four cases, @michaelklishin:

pick first
pick random
leaderboard then pick first
leaderboard then pick random.

And, from the tests I ran, the lowest median latency is with a random pick.

Leaderboard then pick first

seq 1 10 | sed -e 's/^/qq/' |
                 tr '\n' ',' |
                 sed -e 's/,$//' |
                 xargs -I {} sh -c "java -jar perf-test-2.20.0.jar --time 30 -x10 -y10 --quorum-queue --queue {} -mf compact" \
                 -H amqp://localhost:5672,amqp://localhost:5772,amqp://localhost:5872                                                                                                    

sending rate avg: 4187 msg/s                                                                                                                                              
receiving rate avg: 52 msg/s                                                                                                                                              
consumer latency min/median/75th/95th/99th 992021/28497292/31210699/33590278/34193784 µs                                                                                  

seq 1 10 | sed -e 's/^/qq/' |
                 tr '\n' ',' |
                 sed -e 's/,$//' |
                 xargs -I {} sh -c "java -jar perf-test-2.20.0.jar --time 30 -x10 -y10 --quorum-queue --queue {} -mf compact"                                                                                                                                                                         

sending rate avg: 3437 msg/s                                                                                                                                              
receiving rate avg: 230 msg/s                                                                                                                                             
consumer latency min/median/75th/95th/99th 1494029/23996603/29505019/33303811/35199211 µs

Leaderboard then pick random

seq 1 10 | sed -e 's/^/qq/' |
                 tr '\n' ',' |
                 sed -e 's/,$//' |
                 xargs -I {} sh -c "java -jar perf-test-2.20.0.jar --time 30 -x10 -y10 --quorum-queue --queue {} -mf compact" \
                 -H amqp://localhost:5672,amqp://localhost:5772,amqp://localhost:5872                                                                                                    

sending rate avg: 2804 msg/s                                                                                                                                              
receiving rate avg: 110 msg/s                                                                                                                                             
consumer latency min/median/75th/95th/99th 2104593/26098287/31600947/36591748/37403899 µs                                                                                 

seq 1 10 | sed -e 's/^/qq/' |
                 tr '\n' ',' |
                 sed -e 's/,$//' |
                 xargs -I {} sh -c "java -jar perf-test-2.20.0.jar --time 30 -x10 -y10 --quorum-queue --queue {} -mf compact"                                                                                                                                                                         

sending rate avg: 6295 msg/s                                                                                                                                              
receiving rate avg: 108 msg/s                                                                                                                                             
consumer latency min/median/75th/95th/99th 1292255/31290919/33790676/36383364/36781647 µs

michaelklishin Jan 29, 2024
Maintainer

Is this problem worth solving at all? Has anyone complained about it? How do we measure success?

It sounds like a solution in search of a problem to me.

ariel-anieli Jan 30, 2024
Author

On your questions, @michaelklishin: I don't know. No one. I can't tell.
🙂 As I said before, I came across this question in the code, hence my proposal.
If it isn't a problem, then it's all fine.

michaelklishin Jan 30, 2024
Maintainer

It's not a problem we see mentioned, and it is really difficult to measure the outcome. Thank you for taking the time to contribute but I feel this may end up being a waste of time for you and our team alike :(

ariel-anieli Jan 30, 2024
Author

🙂 All good then; thanks for the feedback!
