-
Changes to a Kafka CRD (specifically the `config:` section) are currently not propagated to the underlying Kafka pods. The Strimzi Operator logs show many entries like:

```
2021-08-11 13:27:38 DEBUG AbstractOperator:390 - Reconciliation #41422(timer) Kafka(pipelines/kafka-cluster): Try to acquire lock lock::pipelines::Kafka::kafka-cluster
...
2021-08-11 13:27:48 DEBUG AbstractOperator:420 - Reconciliation #41422(timer) Kafka(pipelines/kafka-cluster): Failed to acquire lock lock::pipelines::Kafka::kafka-cluster within 10000ms.
```

suggesting that the lock couldn't be acquired. However, at the same time, locks for operations on the KafkaConnect CRD seemed to work:

```
2021-08-11 13:27:38 DEBUG AbstractOperator:390 - Reconciliation #41423(timer) KafkaConnect(pipelines/kafka-connect-cluster-2): Try to acquire lock lock::pipelines::KafkaConnect::kafka-connect-cluster-2
2021-08-11 13:27:38 DEBUG AbstractOperator:393 - Reconciliation #41423(timer) KafkaConnect(pipelines/kafka-connect-cluster-2): Lock lock::pipelines::KafkaConnect::kafka-connect-cluster-2 acquired
...
2021-08-11 13:27:38 DEBUG AbstractOperator:410 - Reconciliation #41423(timer) KafkaConnect(pipelines/kafka-connect-cluster-2): Lock lock::pipelines::KafkaConnect::kafka-connect-cluster-2 released
```

This persisted even after restarting the Strimzi Operator. Should I just try restarting a few times, or are there other approaches? Since #3844 suggested this might be an operator/cloud-provider failure, I thought I'd start this as a discussion, but I can create an issue if that's more appropriate.

Using Strimzi 0.24.0, Kafka 2.8.0, and Kubernetes v1.20.7 via Kubespray on vSphere VMs.
Replies: 1 comment
-
Every Strimzi custom resource you create is reconciled periodically (every 2 minutes by default) and whenever it is updated. For each resource, only one reconciliation can run at a time; if several ran in parallel, they might fight with each other. So there is a lock which makes sure only one of them is running. When a new reconciliation should start, it tries to get the lock, and if the lock is not available it just ends with this log message instead of waiting longer, because it knows another reconciliation will soon try again anyway.

The lock is per custom resource, so you can see a KafkaConnect resource reconcile fine while the Kafka reconciliation does not get the lock.

So this message on its own doesn't mean much: it just means that another reconciliation is running at that point. It might not mean anything bad. For example, the operator could be waiting for a pod to roll or for something to be created (storage or load balancers sometimes take longer to create); these things can easily take more than 10 seconds. If it is happening too often, it might indicate some issue. The way to check is to look at the whole log.
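The behaviour described above (try the per-resource lock, give up after a timeout, rely on the next periodic run) can be sketched roughly like this. This is an illustrative sketch, not Strimzi's actual implementation; the class name, timeout value, and helper method are made up, and only the lock-key format is copied from the log messages:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class LockSketch {
    // One lock per custom resource, keyed like the log messages:
    // "lock::<namespace>::<kind>::<name>"
    static final Map<String, Semaphore> locks = new ConcurrentHashMap<>();

    // Try to take the resource's lock; if it is busy, skip this
    // reconciliation instead of queuing, because the next periodic
    // reconciliation will try again anyway.
    static boolean reconcile(String key) throws InterruptedException {
        Semaphore lock = locks.computeIfAbsent(key, k -> new Semaphore(1));
        if (!lock.tryAcquire(100, TimeUnit.MILLISECONDS)) {
            System.out.println("Failed to acquire " + key + " within timeout");
            return false;  // skipped; another reconciliation holds the lock
        }
        try {
            // ... reconcile the resource here ...
            return true;
        } finally {
            lock.release();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String kafka = "lock::pipelines::Kafka::kafka-cluster";
        String connect = "lock::pipelines::KafkaConnect::kafka-connect-cluster-2";
        // Simulate a long-running Kafka reconciliation holding its lock:
        locks.computeIfAbsent(kafka, k -> new Semaphore(1)).acquire();
        // The Kafka reconciliation is skipped while KafkaConnect proceeds,
        // because each resource has its own independent lock:
        System.out.println(reconcile(kafka));    // false
        System.out.println(reconcile(connect));  // true
    }
}
```

The point of the sketch is the skip-instead-of-queue design: a failed `tryAcquire` is harmless by itself, since the timer guarantees another attempt shortly.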