Description
Hi Team,
I recently encountered a situation where an on-prem production kubeadm cluster was upgraded over a weekend, and when everything was done, one pod was stuck in an ImagePullBackOff state.
That is when I got involved. Looking at the CachedImages, I found that the image in question was in an error state: kuik was trying to pull it from the remote registry, but the image no longer existed there.
However, we know for a fact that one pod was still using that image:tag. Below are some timelines, for your quick reference:
- As per Kubernetes events, the cachedImage was deleted on 30th December. Below are the relevant events:
{"component":"cachedimage-controller","count":"1","createdAt":"2025-12-30 07:54:17 +0000 UTC","eventType":"kubernetes-event","host":"","kind":"CachedImage","lastSeenAt":"2025-12-30 07:54:17 +0000 UTC","message":"Image <repository>/ord-handling:v1.7.18 has expired, deleting it","name":"<repository>-ord-handling-v1.7.18","namespace":"default","reason":"Expiring","type":"Normal"}
{"component":"cachedimage-controller","count":"1","createdAt":"2025-12-30 07:54:17 +0000 UTC","eventType":"kubernetes-event","host":"","kind":"CachedImage","lastSeenAt":"2025-12-30 07:54:17 +0000 UTC","message":"Image <repository>/ord-handling:v1.7.18 successfully expired","name":"<repository>-ord-handling-v1.7.18","namespace":"default","reason":"Expired","type":"Normal"}
{"component":"cachedimage-controller","count":"1","createdAt":"2025-12-30 07:54:17 +0000 UTC","eventType":"kubernetes-event","host":"","kind":"CachedImage","lastSeenAt":"2025-12-30 07:54:17 +0000 UTC","message":"Removing image <repository>/ord-handling:v1.7.18 from cache","name":"<repository>-ord-handling-v1.7.18","namespace":"default","reason":"CleaningUp","type":"Normal"}
{"component":"cachedimage-controller","count":"1","createdAt":"2025-12-30 07:54:17 +0000 UTC","eventType":"kubernetes-event","host":"","kind":"CachedImage","lastSeenAt":"2025-12-30 07:54:17 +0000 UTC","message":"Image <repository>/ord-handling:v1.7.18 successfully removed from cache","name":"<repository>-ord-handling-v1.7.18","namespace":"default","reason":"CleanedUp","type":"Normal"}
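To frame my question: my understanding (which may be wrong) is that a CachedImage can be marked so that it is never garbage-collected. Assuming the `spec.retain` field I have seen referenced for kuik's CachedImage CRD, something like the following sketch might have prevented the expiry shown in the events above (the image name is the same placeholder as in the events):

```yaml
# Hedged sketch, not verified against our cluster: assumes kuik's
# CachedImage CRD exposes a spec.retain flag that prevents the
# expiry/clean-up flow shown in the events above.
apiVersion: kuik.enix.io/v1alpha1
kind: CachedImage
metadata:
  name: <repository>-ord-handling-v1.7.18
spec:
  sourceImage: <repository>/ord-handling:v1.7.18
  retain: true   # keep the cached image even when no pod currently references it
```

If `retain` is indeed the intended mechanism here, I would be happy to confirm whether setting it would have avoided this incident.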
- The Kubernetes upgrade was performed on 17th January. The nodes were drained and upgraded during this window, and when everything was done, the pod could not start because the image was present neither in kuik's local registry nor in the remote registry. Below are the events relating to this:
{"component":"kubelet","count":"1","createdAt":"2026-01-17 08:01:07 +0000 UTC","eventType":"kubernetes-event","host":"<NODE>","kind":"Pod","lastSeenAt":"2026-01-17 08:01:07 +0000 UTC","message":"Pulling image \"localhost:7439/<repository>/ord-handling:v1.7.18\"","name":"ord-handling-6fbf599777-24cdd","namespace":"<NS>","reason":"Pulling","type":"Normal"}
{"component":"kubelet","count":"1","createdAt":"2026-01-17 08:01:30 +0000 UTC","eventType":"kubernetes-event","host":"<NODE>","kind":"Pod","lastSeenAt":"2026-01-17 08:01:30 +0000 UTC","message":"Failed to pull image \"localhost:7439/<repository>/ord-handling:v1.7.18\": rpc error: code = NotFound desc = failed to pull and unpack image \"localhost:7439/<repository>/ord-handling:v1.7.18\": failed to resolve reference \"localhost:7439/<repository>/ord-handling:v1.7.18\": localhost:7439/<repository>/ord-handling:v1.7.18: not found","name":"ord-handling-6fbf599777-24cdd","namespace":"vip-app","reason":"Failed","type":"Warning"}
Is it possible for an image:tag that is still in use by a pod to be deleted from the cache? Any thoughts on how I can prevent this from happening again?
Please note that we are currently running helm chart version 1.13.1.
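For completeness, one mitigation I am considering is raising the expiry delay chart-wide. This sketch assumes the chart exposes a `cachedImagesExpiryDelay` value (in days) controlling when unused cached images are garbage-collected; the value shown is only an example:

```yaml
# Hedged sketch of a Helm values override, assuming the
# kube-image-keeper chart exposes cachedImagesExpiryDelay (days)
# to control garbage collection of unused cached images.
cachedImagesExpiryDelay: 90   # example value, not a recommendation
```

If there is a more idiomatic way to pin specific images instead of extending the delay globally, I would prefer that.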
Please let me know in case I can provide any additional information.