Replies: 1 comment 1 reply
-
I'm not entirely sure what happened or why from your description. But from the operator logs it looks like it also cannot connect to ZooKeeper. So I guess either there is some general networking issue, or your ZooKeeper cluster has somehow fallen apart. The first one would probably not impact just Kafka - so you can check whether other pods have working networking etc. (or you would maybe know if other apps had problems as well). For the second - you can try to exec into the ZooKeeper pods and use the ZooKeeper shell (there is a script in the Kafka
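A minimal sketch of that check follows; the my-cluster prefix, the pod name, and the local plaintext port 12181 are assumptions (they can differ between Strimzi versions), and the script path assumes the layout of the Strimzi Kafka image:

# Exec into one of the ZooKeeper pods and use the ZooKeeper shell shipped in the
# Kafka image to check whether the ensemble answers and which brokers are registered.
kubectl exec -it my-cluster-zookeeper-0 -- ./bin/zookeeper-shell.sh localhost:12181 ls /brokers/ids

# An empty broker list (or a connection error) after the node restart would point
# at ZooKeeper itself rather than at the network.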
-
Hi,
Earlier this week we saw a strange failure in our Prod K8s cluster (v1.18.9-eks): our Strimzi Kafka cluster was unable to connect to ZooKeeper at startup:
2021-04-12 16:47:43,488 ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) [main]
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
        at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:262)
        at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
        at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:119)
        at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1865)
        at kafka.server.KafkaServer.createZkClient$1(KafkaServer.scala:419)
        at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:444)
        at kafka.server.KafkaServer.startup(KafkaServer.scala:222)
        at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44)
        at kafka.Kafka$.main(Kafka.scala:82)
        at kafka.Kafka.main(Kafka.scala)
2021-04-12 16:47:43,489 INFO shutting down (kafka.server.KafkaServer) [main]
We noticed the issue began happening after our node was abruptly restarted.
The logs from Kafka cluster pod 1 also follow the chain of events described above:
kafka_cluster_1_prd.log
We also captured the ZooKeeper and Strimzi operator logs during the downtime:
prod_zk_logs.txt
prod_strimzi_op_logs.txt
Our primary suspicion is an unrelated network-policy issue, triggered by the node flip, that affected connectivity to ZooKeeper.
However, I've seen several issues related to ZooKeeper connectivity reported in the past - could something like that be affecting our current version of Strimzi/Kafka as well?
Appreciate the feedback.
Versions:
Strimzi 0.20.0, Kafka 2.6.0, K8s v1.18.9-eks
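For reference, a quick way to check whether a NetworkPolicy is blocking the Kafka-to-ZooKeeper path is to test plain TCP reachability from a broker pod; the pod and service names below assume Strimzi's default naming for a cluster called my-cluster and would need to be adjusted for the actual deployment:

# List the NetworkPolicies in the namespace where the cluster runs
kubectl get networkpolicy

# Test TCP reachability of the ZooKeeper client service from a broker pod
# (this only checks that the port is reachable; 2181 speaks TLS in Strimzi 0.20)
kubectl exec -it my-cluster-kafka-0 -- bash -c 'timeout 3 bash -c "echo > /dev/tcp/my-cluster-zookeeper-client/2181" && echo reachable || echo unreachable'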