Image pull backoff until Spegel restart #1159

@Starttoaster

Description

Spegel version

v0.6.0

Kubernetes distribution

EKS

Kubernetes version

v1.34.2

CNI

AWS VPC CNI

Describe the bug

I was debugging an ImagePullBackOff issue in one of my clusters today. I had upgraded the nodes from EKS v1.33 to v1.34 the day before, which I suspect is related. A Pod was stuck in ImagePullBackOff, and I noticed the event said the context was canceled:

Failed to pull image "ghcr.io/$IMAGE_NAME:sha-b90d62f67b7d3c7cf27fb91cc59ed546c4795630": rpc error: code = Canceled desc = failed to pull and unpack image "ghcr.io/$IMAGE_NAME:sha-b90d62f67b7d3c7cf27fb91cc59ed546c4795630": context canceled

I ruled out an auth issue. I half suspected the image registry itself was having problems, but when I inspected the Spegel logs I noticed several entries like the following:

{
  "time": "2026-01-07T18:12:10.831144603Z",
  "level": "ERROR",
  "source": {
    "function": "github.com/spegel-org/spegel/pkg/registry.(*Registry).mirrorHandler.func2",
    "file": "github.com/spegel-org/spegel/pkg/registry/registry.go",
    "line": 316
  },
  "msg": "request to mirror failed, retrying with next",
  "ref": "sha256:663b447ab1f2fdb605fc608e89eb65159a16a646b120c6fa76aa19c8e1df1bb6",
  "path": "/v2/$IMAGE_NAME/blobs/sha256:663b447ab1f2fdb605fc608e89eb65159a16a646b120c6fa76aa19c8e1df1bb6",
  "attempt": 1,
  "mirror": "10.101.77.184:5000",
  "err": "Get \"http://10.101.77.184:5000/v2/$IMAGE_NAME/blobs/sha256:663b447ab1f2fdb605fc608e89eb65159a16a646b120c6fa76aa19c8e1df1bb6?ns=ghcr.io\": dial tcp 10.101.77.184:5000: i/o timeout"
}

It seemed like Spegel was having trouble reaching its peers, so my obvious first step was to restart the Spegel DaemonSet. That fixed it right away: the Pod started almost immediately. The incident left me with some questions and concerns, however:

  • I would have expected a Spegel mirror issue to result in containerd bypassing the Spegel mirror and pulling the image directly from the upstream registry. Should that not be the case?
  • I'm not sure why Spegel was failing to reach its mirrors when the Spegel containers themselves were reporting healthy. I would have hoped for Spegel to enter an unhealthy state and be restarted via a liveness probe, but the readiness probes apparently never reported unhealthy, and there is no liveness probe to restart Spegel if it ends up in an unrecoverable state. As far as I can tell, that means the only way to recover was to manually restart the DaemonSet.
  • Unfortunately, I wasn't scraping Spegel's metrics endpoint at the time. Were there any metrics I could have been watching to catch issues like this? My first guess is the libp2p_swarm_dial_errors_total time series.
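For context on the first point, this is roughly the shape of the hosts.toml mirror configuration that containerd uses (the file path, mirror address, and port here are illustrative, not necessarily what Spegel wrote on my nodes). My understanding is that with a `server` line present, containerd should fall back to the upstream registry after the listed mirror hosts fail:

```toml
# /etc/containerd/certs.d/ghcr.io/hosts.toml  (illustrative example)
# Upstream registry containerd should fall back to if the mirror fails.
server = "https://ghcr.io"

# Local Spegel mirror endpoint (address/port are hypothetical here).
[host."http://127.0.0.1:30020"]
  capabilities = ["pull", "resolve"]
```

If that fallback is supposed to happen, I'm unclear why the pull instead failed with "context canceled".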
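On the second point, something like the following probe addition is what I had in mind for the DaemonSet, sketched in standard Kubernetes YAML (the path and port name are hypothetical; I don't know what health endpoint, if any, Spegel exposes for this):

```yaml
# Hypothetical liveness probe for the Spegel container, so an
# unrecoverable instance gets restarted instead of staying Ready.
livenessProbe:
  httpGet:
    path: /healthz    # assumed endpoint, not confirmed in Spegel
    port: registry    # assumed named container port
  periodSeconds: 30
  failureThreshold: 5
```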
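On the metrics question, had I been scraping the endpoint, a Prometheus expression along these lines is what I would have tried as an early-warning signal (assuming libp2p_swarm_dial_errors_total is a counter exposed by go-libp2p; the threshold and window are arbitrary):

```promql
# Sustained peer dial failures on any Spegel pod (illustrative threshold).
sum by (pod) (rate(libp2p_swarm_dial_errors_total[5m])) > 1
```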

Metadata

Labels: bug