Replies: 3 comments 3 replies
First of all, Strimzi 0.49.0 does support Kafka 4.1.0, so you do not need to use 4.1.1 instead of it. You can choose on your own which version to use. Second, for the third issue of it rolling all pods: that is something we can look into. Please provide the full logs from the operator as well as the full custom resources and the steps to reproduce it.
Sorry, just updated the text. Please read again ;)
Ok, sorry, my first thought was wrong and the logs brought light into the dark. The operator behaved correctly and only one pod from the 3-node cluster was re-scheduled with the new image tag. However, through some "external" behaviour an additional pod from the cluster was killed / terminated. After that pod terminated, the operator re-created it (which is correct) with the new pod template. Because the new image tag was not yet available in the registry, the cluster went offline ... I'm not sure if it is intended that the operator creates the pod with the new template, or if it should wait for the first pod to be healthy again (i.e. create the pod with the old template). The critical logs are here:

1763963753260 2025-11-24T05:55:53.260Z "message" : "[AdminClient clientId=adminclient-154] Connection to node 2 (cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2.cluster-fra1-dev1-kafka-kafka-brokers.strimzi-kafka.svc/10.244.0.248:9091) could not be established. Node may not be available.",
1763963753172 2025-11-24T05:55:53.172Z "message" : "[AdminClient clientId=adminclient-154] Connection to node 2 (cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2.cluster-fra1-dev1-kafka-kafka-brokers.strimzi-kafka.svc/10.244.0.248:9091) could not be established. Node may not be available.",
1763963753110 2025-11-24T05:55:53.110Z "message" : "[AdminClient clientId=adminclient-154] Connection to node 2 (cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2.cluster-fra1-dev1-kafka-kafka-brokers.strimzi-kafka.svc/10.244.0.248:9091) could not be established. Node may not be available.",
1763963753058 2025-11-24T05:55:53.058Z "message" : "[AdminClient clientId=adminclient-153] Connection to node -3 (cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2.cluster-fra1-dev1-kafka-kafka-brokers.strimzi-kafka.svc.cluster.local/10.244.0.248:9090) could not be established. Node may not be available.",
1763963752792 2025-11-24T05:55:52.792Z "message" : "Reconciliation #4160(timer) Kafka(strimzi-kafka/cluster-fra1-dev1-kafka): Pod cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2 is not ready. We will check if KafkaRoller can do anything about it.",
1763963752784 2025-11-24T05:55:52.784Z "message" : "Reconciliation #4160(timer) Kafka(strimzi-kafka/cluster-fra1-dev1-kafka): Error waiting for pod strimzi-kafka/cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2 to become ready: io.strimzi.operator.common.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2 in namespace strimzi-kafka to be ready",
1763963752784 2025-11-24T05:55:52.784Z "message" : "Reconciliation #4160(timer) Kafka(strimzi-kafka/cluster-fra1-dev1-kafka): Exceeded timeout of 300000ms while waiting for Pods resource cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2 in namespace strimzi-kafka to be ready",
1763963302983 2025-11-24T05:48:22.983Z "message" : "Reconciliation #4143(timer) Kafka(strimzi-kafka/cluster-fra1-dev1-kafka): Will temporarily skip verifying pod cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2/2 is up-to-date due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2 cannot be updated right now., retrying after at least 250ms",
1763960543014 2025-11-24T05:02:23.014Z "message" : "Reconciliation #4084(timer) Kafka(strimzi-kafka/cluster-fra1-dev1-kafka): Will temporarily skip verifying pod cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2/2 is up-to-date due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2 cannot be updated right now., retrying after at least 250ms",
1763959943003 2025-11-24T04:52:23.003Z "message" : "Reconciliation #4072(timer) Kafka(strimzi-kafka/cluster-fra1-dev1-kafka): Will temporarily skip verifying pod cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2/2 is up-to-date due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Pod cluster-fra1-dev1-kafka-cluster-fra1-dev1-kafka-nodepool-2 cannot be updated right now., retrying after at least 250ms",

You can see that the operator behaves correctly and reports that the pod cannot be updated right now. Since I'm not sure whether this is expected behaviour, please just give me short feedback and then I will close the discussion. And sorry for the confusion at first ... I was on the wrong track :/
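For readability (this is just a small helper on my side, not part of the operator or the thread): the numeric prefixes on the log lines above are epoch-millisecond timestamps, and a short Python sketch can convert them back to the UTC timestamps shown next to them:

```python
from datetime import datetime, timezone

def log_ts_to_utc(epoch_ms: int) -> str:
    """Convert an epoch-millisecond log prefix to an ISO-8601 UTC string."""
    secs, ms = divmod(epoch_ms, 1000)          # integer math avoids float rounding
    dt = datetime.fromtimestamp(secs, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{ms:03d}Z"

# Prefix of the "Exceeded timeout of 300000ms" line above:
print(log_ts_to_utc(1763963752784))  # 2025-11-24T05:55:52.784Z
```

This confirms the prefixes and the human-readable timestamps in the log lines agree.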
Bug Description
Hi,
we upgraded the Strimzi Kafka operator from v0.48 to v0.49. The operator started to update the Kafka node pool containers, but since the new Docker image tag was not available, the pod went into ImagePullBackOff. Up to this point, this is expected behaviour. But now the strange part starts: the operator did not stop the rollout after the first pod got stuck in ImagePullBackOff, but instead tried to roll out the new version to a second node. Since we run a 3-node cluster, the cluster went offline. In my opinion this should never happen. The operator should just stop and wait until the first pod is available / healthy again.
We use a private image registry and the new image tag was not synced yet.
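One possible mitigation, purely an assumption on our side and not something confirmed in this thread: Strimzi's Kafka custom resource allows overriding the broker image via spec.kafka.image, so a tag that is known to already exist in the private registry could be pinned explicitly until the mirror has synced. A minimal sketch (the registry path is made up; the cluster name is taken from the logs):

```yaml
# Hypothetical workaround: pin the broker image to a tag that is already
# present in the private registry so the roll cannot hit ImagePullBackOff.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: cluster-fra1-dev1-kafka
spec:
  kafka:
    version: 4.1.0
    image: registry.example.internal/strimzi/kafka:0.49.0-kafka-4.1.0  # made-up path
```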
config:
spec:
  kafka:
    version: 4.1.0
    metadataVersion: 4.1-IV1
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2

Steps to reproduce
Expected behavior
Strimzi version
0.49
Kubernetes version
1.33.1
Installation method
YAML Files
Infrastructure
DigitalOcean
Configuration files and logs
No response
Additional context
No response