Commit 2be9c2f

Fix race in repl_rt_heartbeat due to short timeout
One timeout in the repl_rt_heartbeat test was slightly too short, so the test could occasionally fail spuriously when the timings lined up just right. This PR bumps up the timeout, which should prevent this from happening again. I would really like to do a proper fix for this, which would use intercepts or something similar to confirm that the actual timeout is being hit in the code, but we don't really have time for that, and a half fix is better than no fix I suppose.
1 parent 0fc3f77 commit 2be9c2f

1 file changed: +8 -2 lines changed

tests/repl_rt_heartbeat.erl

Lines changed: 8 additions & 2 deletions
@@ -65,8 +65,14 @@ confirm() ->
     suspend_heartbeat_messages(LeaderA),
 
     %% sleep longer than the HB timeout interval to force re-connection;
-    %% and give it time to restart the RT connection. Wait an extra 2 seconds.
-    timer:sleep(timer:seconds(?HB_TIMEOUT) + 2000),
+    %% and give it time to restart the RT connection.
+    %% Since it's possible we may disable heartbeats right after a heartbeat has been fired,
+    %% it can take up to 2*?HB_TIMEOUT seconds to detect a missed heartbeat. The extra second
+    %% is to avoid rare race conditions due to the timeouts lining up exactly. Not the prettiest
+    %% solution, but it failed so rarely at 2*HB_TIMEOUT, that this should be good enough
+    %% in practice, and it beats having to write a bunch of fancy intercepts to verify that
+    %% the timeout has been hit internally.
+    timer:sleep(timer:seconds(?HB_TIMEOUT*2) + 1000),
 
     %% Verify that RT connection has restarted by noting that it's Pid has changed
     RTConnPid2 = get_rt_conn_pid(LeaderA),
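
The timing argument in the new comment can be checked with a little arithmetic. The sketch below is illustrative only and is not part of the commit: the module and function names are hypothetical, and HbTimeoutSecs stands in for ?HB_TIMEOUT, whose value is not shown in this diff. It compares the old and new sleep durations against the 2*?HB_TIMEOUT worst case described in the comment.

```erlang
-module(hb_sleep_sketch).
-export([old_sleep_ms/1, new_sleep_ms/1,
         worst_case_detect_ms/1, covers_worst_case/2]).

%% Hypothetical helpers, not part of tests/repl_rt_heartbeat.erl.
%% HbTimeoutSecs is an assumed stand-in for the ?HB_TIMEOUT macro.

%% Sleep used before this commit: one timeout interval plus 2 seconds.
old_sleep_ms(HbTimeoutSecs) ->
    timer:seconds(HbTimeoutSecs) + 2000.

%% Sleep used after this commit: two timeout intervals plus 1 second.
new_sleep_ms(HbTimeoutSecs) ->
    timer:seconds(HbTimeoutSecs * 2) + 1000.

%% Worst case from the code comment: heartbeats are suspended right after
%% one has fired, so detection can take up to two timeout intervals.
worst_case_detect_ms(HbTimeoutSecs) ->
    2 * timer:seconds(HbTimeoutSecs).

%% True when the sleep is strictly longer than the worst-case detection time.
covers_worst_case(SleepMs, HbTimeoutSecs) ->
    SleepMs > worst_case_detect_ms(HbTimeoutSecs).
```

For an assumed timeout of 3 seconds, hb_sleep_sketch:covers_worst_case(hb_sleep_sketch:old_sleep_ms(3), 3) returns false (5000 ms against a 6000 ms worst case), while the same call with new_sleep_ms(3) returns true (7000 ms). More generally, the old sleep only exceeds the worst case when the timeout is under 2 seconds, whereas the new one always exceeds it by 1 second.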
