
Commit cba23ed

Timeout and retry broadcast after 3 seconds (#2154)
Client-side timeout of grpc to `put` messages to the etcd cluster. Blocking without a timeout on this is the only explanation we could find to see the `pending-broadcast` queue fill up.

---

* [x] CHANGELOG updated
* [x] Documentation update not needed
* [x] Haddocks update not needed
* [x] No new TODOs introduced
1 parent 1af2fb7 commit cba23ed

2 files changed: +17 -3 lines changed
CHANGELOG.md

Lines changed: 4 additions & 0 deletions

@@ -25,6 +25,10 @@ changes.
 - Fix an internal persistent queue blocking after restart when it reached
   capacity.
 
+- Timeout and retry broadcast of network messages after 3 seconds in case the
+  `etcd` grpc server is not responsive. This should avoid build-up on the
+  outbound persistent queue.
+
 - Handle failing lease keep alive in network component and avoid bursts in
   heartbeating.

hydra-node/src/Hydra/Network/Etcd.hs

Lines changed: 13 additions & 3 deletions

@@ -82,12 +82,17 @@ import Hydra.Network (
 import Hydra.Node.EmbedTH (embedExecutable)
 import Network.GRPC.Client (
   Address (..),
+  CallParams (..),
   ConnParams (..),
   Connection,
   ReconnectPolicy (..),
   ReconnectTo (ReconnectToOriginal),
   Server (..),
+  Timeout (..),
+  TimeoutUnit (..),
+  TimeoutValue (..),
   rpc,
+  rpcWith,
   withConnection,
  )
 import Network.GRPC.Client.StreamType.IO (biDiStreaming, nonStreaming)
@@ -323,8 +328,8 @@ checkVersion tracer conn ourVersion NetworkCallback{onConnectivity} = do
 
 -- | Broadcast messages from a queue to the etcd cluster.
 --
--- TODO: retrying on failure even needed?
--- Retries on failure to 'putMessage' in case we are on a minority cluster.
+-- Retries on failure to 'putMessage' in case we are on a minority cluster or
+-- when the grpc call timeouts.
 broadcastMessages ::
   (ToCBOR msg, Eq msg) =>
   Tracer IO EtcdLog ->
@@ -353,8 +358,13 @@ putMessage ::
   msg ->
   IO ()
 putMessage conn ourHost msg =
-  void $ nonStreaming conn (rpc @(Protobuf KV "put")) req
+  void $ nonStreaming conn (rpcWith @(Protobuf KV "put") callParams) req
  where
+  -- NOTE: Timeout puts after 3 seconds. This is not tested, but we saw the
+  -- 'pending-broadcast' queue fill up and suspect that 'put' requests in
+  -- 'broadcastMessages' were just not served and stay pending forever.
+  callParams = def{callTimeout = Just . Timeout Second $ TimeoutValue 3}
+
   req =
     defMessage
       & #key .~ key
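
The timeout-and-retry behaviour described in the commit message can be illustrated in isolation. The following is only a minimal sketch under stated assumptions, not the hydra-node implementation: it uses `System.Timeout.timeout` from `base` rather than the gRPC call timeout set through `callParams` in the diff, and `retryOnTimeout` and `flakyPut` are hypothetical names introduced for illustration. The retry logic of `broadcastMessages` itself is not part of this diff, so the loop below is an assumption about its general shape.

```haskell
import Control.Concurrent (threadDelay)
import Data.IORef (IORef, atomicModifyIORef', newIORef)
import System.Timeout (timeout)

-- | Hypothetical helper: run an action with a time limit (in microseconds)
-- and retry until it completes in time. Returns the result together with
-- the number of attempts that were needed.
retryOnTimeout :: Int -> IO a -> IO (a, Int)
retryOnTimeout limitMicros action = go 1
 where
  go attempt = do
    result <- timeout limitMicros action
    case result of
      Just x -> pure (x, attempt)
      -- The action did not finish in time: it was cancelled, so try again.
      Nothing -> go (attempt + 1)

-- | Hypothetical stand-in for an unresponsive server: the first call hangs
-- for 2 seconds, every later call returns immediately.
flakyPut :: IORef Int -> IO String
flakyPut calls = do
  n <- atomicModifyIORef' calls (\c -> (c + 1, c + 1))
  if n == 1
    then threadDelay 2000000 >> pure "late"
    else pure "ok"

main :: IO ()
main = do
  calls <- newIORef (0 :: Int)
  -- Use a 0.1 second limit here so the example finishes quickly; the
  -- commit uses 3 seconds for the real 'put' call.
  (res, attempts) <- retryOnTimeout 100000 (flakyPut calls)
  putStrLn (res <> " after " <> show attempts <> " attempts")
```

Without the time limit, the first call would hold up the caller for its full duration, which is the kind of blocking the commit suspects behind the `pending-broadcast` queue filling up. An unbounded retry loop like this one would spin forever against a permanently unavailable server, so a real caller may want a delay or an attempt limit between retries.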
