Replies: 1 comment 1 reply
-
I'm not entirely sure what happened or why from your description. But from the operator logs it looks like it also cannot connect to ZooKeeper. So I guess either there is some general networking issue, or your ZooKeeper cluster has somehow fallen apart. The first one would probably not impact just Kafka - so you can check whether other pods have working networking etc. (or you would maybe know if other apps had problems as well). For the second - you can try to exec into the ZooKeeper pods and use the ZooKeeper shell (there is a script in the Kafka
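A minimal sketch of that check follows; the my-cluster prefix, the pod name, and the local plaintext port 12181 are assumptions (they can differ between Strimzi versions), and the script path assumes the layout of the Strimzi Kafka image:

# Exec into one of the ZooKeeper pods and use the ZooKeeper shell shipped in the
# Kafka image to check whether the ensemble answers and which brokers are registered.
kubectl exec -it my-cluster-zookeeper-0 -- ./bin/zookeeper-shell.sh localhost:12181 ls /brokers/ids

# An empty broker list (or a connection error) after the node restart would point
# at ZooKeeper itself rather than at the network.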
-
Hi,
Earlier this week we saw a strange failure in our Prod K8s cluster (v1.18.9-eks): our Strimzi Kafka cluster was unable to connect to ZooKeeper at startup:
2021-04-12 16:47:43,488 ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) [main]
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
        at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:262)
        at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
        at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:119)
        at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1865)
        at kafka.server.KafkaServer.createZkClient$1(KafkaServer.scala:419)
        at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:444)
        at kafka.server.KafkaServer.startup(KafkaServer.scala:222)
        at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44)
        at kafka.Kafka$.main(Kafka.scala:82)
        at kafka.Kafka.main(Kafka.scala)
2021-04-12 16:47:43,489 INFO shutting down (kafka.server.KafkaServer) [main]
We noticed the issue began happening after our node was abruptly restarted.
The logs from Kafka cluster pod 1 also follow the chain of events described above:
kafka_cluster_1_prd.log
We also captured the ZooKeeper and Strimzi operator logs during the downtime:
prod_zk_logs.txt
prod_strimzi_op_logs.txt
Our primary suspicion is an unrelated network-policy issue, triggered by the node flip, that affected connectivity to ZooKeeper.
However, I've seen several issues related to ZooKeeper connectivity reported in the past - could something like that be affecting our current version of Strimzi/Kafka as well?
Appreciate the feedback.
Versions:
Strimzi 0.20.0, Kafka 2.6.0, K8s v1.18.9-eks
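For reference, a quick way to check whether a NetworkPolicy is blocking the Kafka-to-ZooKeeper path is to test plain TCP reachability from a broker pod; the pod and service names below assume Strimzi's default naming for a cluster called my-cluster and would need to be adjusted for the actual deployment:

# List the NetworkPolicies in the namespace where the cluster runs
kubectl get networkpolicy

# Test TCP reachability of the ZooKeeper client service from a broker pod
# (this only checks that the port is reachable; 2181 speaks TLS in Strimzi 0.20)
kubectl exec -it my-cluster-kafka-0 -- bash -c 'timeout 3 bash -c "echo > /dev/tcp/my-cluster-zookeeper-client/2181" && echo reachable || echo unreachable'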