KAFKA-19905: Fix tight reconnection loop during shutdown

fvaleri · fvaleri · commit 5bae582ee8d8 · 2025-11-25T16:39:03.000+01:00
This patch fixes a tight broker-to-controller reconnection loop that can occur during shutdown.

1. Nodes 1 and 2 (brokers) request a controlled shutdown
2. Controller grants the shutdown
3. Controller itself shuts down (RaftManager shutdown)
4. Nodes 1 and 2 keep trying to send heartbeats to the now-dead controller
5. They get stuck in this reconnection loop because the NodeToControllerRequestThread is still running and has not been shut down properly

The reconnection loop lasts 5 minutes, which is the shutdown timeout hard-coded in the KafkaBroker trait.

This is what I see in the logs of another test run for one of the brokers:

    SIGTERM received: 14:39:46,282
    Actual shutdown completed: 14:44:46,385
    Time elapsed: 5 minutes and 0.103 seconds (approximately 5 minutes)

I acknowledge that this is unlikely to happen with brokers running on different machines, but it is not so unlikely when running tests locally on a single physical machine.
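
The one-line change in the diff appears to flip the request thread's interruptibility flag from false to true. The sketch below is a simplified model of why that matters, not Kafka's actual code: `SketchWorker`, `ShutdownSketch`, and `stopsWithin` are hypothetical names, and the 60-second sleep merely stands in for a blocked wait against an unreachable controller. It assumes a shutdown path that sets a flag and interrupts the thread only when the thread is marked interruptible.

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for a shutdownable worker thread; not Kafka's actual class.
class SketchWorker(name: String, isInterruptible: Boolean) extends Thread(name) {
  @volatile private var running = true
  private val done = new CountDownLatch(1)

  override def run(): Unit = {
    try {
      while (running) {
        // Simulates a long blocking wait, e.g. waiting on a controller that is gone.
        Thread.sleep(60000L)
      }
    } catch {
      case _: InterruptedException => // interrupted by initiateShutdown(); exit the loop
    } finally {
      done.countDown()
    }
  }

  // Clear the running flag and, only if allowed, interrupt the blocked thread.
  def initiateShutdown(): Unit = {
    running = false
    if (isInterruptible) interrupt()
  }

  // Returns true if the thread finished within the timeout.
  def awaitShutdown(timeoutMs: Long): Boolean =
    done.await(timeoutMs, TimeUnit.MILLISECONDS)
}

object ShutdownSketch {
  // Starts a worker, requests shutdown, and reports whether it stopped within timeoutMs.
  def stopsWithin(isInterruptible: Boolean, timeoutMs: Long): Boolean = {
    val worker = new SketchWorker(s"sketch-$isInterruptible", isInterruptible)
    worker.setDaemon(true) // do not keep the JVM alive if the worker never exits
    worker.start()
    Thread.sleep(100L) // give the worker time to enter its blocking wait
    worker.initiateShutdown()
    worker.awaitShutdown(timeoutMs)
  }
}
```

In this model, `ShutdownSketch.stopsWithin(isInterruptible = true, timeoutMs = 2000L)` returns true because the interrupt wakes the sleeping thread, while the non-interruptible variant only notices the cleared flag after its blocking call returns, so a caller waiting on it runs into its own timeout instead.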

Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
diff --git a/core/src/main/scala/kafka/server/NodeToControllerChannelManager.scala b/core/src/main/scala/kafka/server/NodeToControllerChannelManager.scala
@@ -220,7 +220,7 @@ class NodeToControllerRequestThread(
   initialNetworkClient,
   Math.min(Int.MaxValue, Math.min(config.controllerSocketTimeoutMs, retryTimeoutMs)).toInt,
   time,
-  false
+  true
 ) with Logging {
 
   this.logIdent = logPrefix