-
It is quite hard to comment on the issue without the full log. As described in the FAQ in our docs, the Failed to acquire lock within 10000ms error does not necessarily mean anything bad, so it is hard to say what it means in your case without seeing the full log. Also, you seem to be using a fairly old Strimzi version - many things have been fixed since 0.22.
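For background, the message suggests a per-cluster lock is taken before each reconciliation so that two reconciliations of the same cluster never run in parallel. Roughly this pattern (a minimal sketch using the generic Vert.x shared-data API, not Strimzi's actual code; names and timeout are taken from the log message):
```java
import io.vertx.core.Vertx;

public class ReconcileLockSketch {
    // Timeout matching the "within 10000ms" in the warning; illustrative only
    private static final long LOCK_TIMEOUT_MS = 10_000L;

    static void reconcile(Vertx vertx, String namespace, String clusterName) {
        // One named lock per cluster, e.g. "lock::some-namespace::Kafka::some-kafka-cluster"
        String lockName = "lock::" + namespace + "::Kafka::" + clusterName;

        vertx.sharedData().getLockWithTimeout(lockName, LOCK_TIMEOUT_MS, res -> {
            if (res.succeeded()) {
                try {
                    // ... run the actual reconciliation ...
                } finally {
                    // The lock is released only when this reconciliation finishes
                    res.result().release();
                }
            } else {
                // If a previous reconciliation of the same cluster is still running
                // (or stuck), the lock cannot be acquired and the operator logs
                // "Failed to acquire lock lock::...::Kafka::... within 10000ms."
            }
        });
    }
}
```
So the warning usually just means an earlier reconciliation of the same cluster had not finished when the next one was triggered; whether that is harmless or a symptom of a stuck reconciliation is exactly what the full log would show.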
-
Hi, I'm using Strimzi 0.22.1 on the OpenShift 4.5 platform.
I noticed that sometimes the Cluster Operator gets "stuck", and only restarting the operator temporarily solves the problem.
This warning appears in the Cluster Operator logs for all the Kafka clusters in several namespaces, recurring many times:
2022-03-28 08:24:21 WARN AbstractOperator:247 - Reconciliation #30(timer) Kafka(some-namespace/some-kafka-cluster): Failed to acquire lock lock::some-namespace::Kafka::some-kafka-cluster within 10000ms.
It appears that threads of the operator are "stuck"; shouldn't the operator know how to deal with this problem and release the lock?
There are warnings that might be related:
io.vertx.core.impl.BlockedThreadChecker
WARNING: Thread Thread[vert.x-eventloop-thread-0, 5, main] =Thread[vert.x-eventloop-thread-0, 5, main] has been blocked for 7609 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
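As far as I understand, this is Vert.x's generic blocked-thread checker: it fires whenever code keeps running on an event-loop thread for longer than the 2000 ms limit. A minimal standalone reproduction of the same warning (plain Vert.x, nothing Strimzi-specific, class name is mine):
```java
import io.vertx.core.Vertx;

public class BlockedEventLoopDemo {
    public static void main(String[] args) throws InterruptedException {
        // Default max event-loop execute time is 2000 ms, matching the warning above
        Vertx vertx = Vertx.vertx();

        vertx.runOnContext(v -> {
            try {
                // Blocking an event-loop thread for longer than the limit makes
                // BlockedThreadChecker log "Thread blocked" with a stack trace
                Thread.sleep(5_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread.sleep(10_000);
        vertx.close();
    }
}
```
So in the operator's case, something running on vert.x-eventloop-thread-0 did not return for about 7.6 seconds.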
Other common errors:
ERROR AbstractOperator:276 - Reconciliation #374(timer) Kafka(some-namespace/some-kafka-cluster): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod my-kafka-cluster-0 is currently not rollable
WARN WatcherWebSocketListener:102 - Exec Failure
java.net.SocketTimeoutException: sent ping but didn't receive pong within 30000ms (after 4 successful ping/pongs)
...
...
caused by:
javax.net.ssl.SSLException: Socket closed
...
caused by:
java.net.SocketException: Socket closed
SEVERE: Unhandled exception
io.fabric8.kubernetes.client.KubernetesClientException: Operation [get] for kind: [Kafka] with name: [my-kafka-cluster] in namespace: [my-namespace] failed
...
...
caused by:
java.net.ConnectException: Failed to connect to {myKubernetesApiIP}
...
caused by:
java.net.ConnectException: Connection refused (Connection refused)
The KubernetesClientException also occurs with other failed [get] operations, for example on Secrets and other resources.
It seems like the Kubernetes API does not respond, but as far as I know only Strimzi produces this error.
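To illustrate what I mean, a minimal sketch with a plain fabric8 client (not Strimzi's code; the namespace and secret name are made up): any get against an unreachable API server fails with a KubernetesClientException whose cause chain ends in the underlying connection error, so the exception itself only says that the connection to the API server failed, not which component is at fault.
```java
import io.fabric8.kubernetes.api.model.Secret;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientException;

public class ApiGetSketch {
    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            // Hypothetical namespace and secret name, just to show the call shape
            Secret secret = client.secrets()
                    .inNamespace("my-namespace")
                    .withName("my-secret")
                    .get();
            System.out.println("Got: " + (secret == null ? "nothing" : secret.getMetadata().getName()));
        } catch (KubernetesClientException e) {
            // When the API server is unreachable, the cause chain ends in e.g.
            // java.net.ConnectException: Connection refused
            System.err.println("Operation [get] failed: " + e.getMessage() + ", cause: " + e.getCause());
        }
    }
}
```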
Does anyone know what might be the cause of this issue?
Thanks