Image pull backoff until Spegel restart #1159

@Starttoaster

Description

Spegel version

v0.6.0

Kubernetes distribution

EKS

Kubernetes version

v1.34.2

CNI

AWS VPC CNI

Describe the bug

I was debugging an ImagePullBackOff issue in one of my clusters today. I had upgraded the nodes from EKS v1.33 to v1.34 the day before, which I suspect is related. A Pod was stuck in ImagePullBackOff, and I noticed the event said the context was canceled:

Failed to pull image "ghcr.io/$IMAGE_NAME:sha-b90d62f67b7d3c7cf27fb91cc59ed546c4795630": rpc error: code = Canceled desc = failed to pull and unpack image "ghcr.io/$IMAGE_NAME:sha-b90d62f67b7d3c7cf27fb91cc59ed546c4795630": context canceled

I ruled out an auth issue. I half suspected the image registry itself was having problems, but when I inspected the Spegel logs I noticed several entries like the following:

{
  "time": "2026-01-07T18:12:10.831144603Z",
  "level": "ERROR",
  "source": {
    "function": "github.com/spegel-org/spegel/pkg/registry.(*Registry).mirrorHandler.func2",
    "file": "github.com/spegel-org/spegel/pkg/registry/registry.go",
    "line": 316
  },
  "msg": "request to mirror failed, retrying with next",
  "ref": "sha256:663b447ab1f2fdb605fc608e89eb65159a16a646b120c6fa76aa19c8e1df1bb6",
  "path": "/v2/$IMAGE_NAME/blobs/sha256:663b447ab1f2fdb605fc608e89eb65159a16a646b120c6fa76aa19c8e1df1bb6",
  "attempt": 1,
  "mirror": "10.101.77.184:5000",
  "err": "Get \"http://10.101.77.184:5000/v2/$IMAGE_NAME/blobs/sha256:663b447ab1f2fdb605fc608e89eb65159a16a646b120c6fa76aa19c8e1df1bb6?ns=ghcr.io\": dial tcp 10.101.77.184:5000: i/o timeout"
}

It seemed like Spegel was having trouble reaching its peers, so my obvious first step was to restart the Spegel DaemonSet. That fixed it right away: the Pod started almost immediately. The incident left me with some questions and concerns, however:

  • I would have expected a Spegel mirror issue to result in containerd bypassing the Spegel mirror and pulling the image directly from the upstream registry. Should that not be the case?
  • I'm not sure why Spegel was failing to reach its mirrors when the Spegel containers themselves were reporting healthy. I would have hoped for Spegel to enter an unhealthy state and be restarted via a liveness probe, but the readiness probes apparently never reported unhealthy, and there is no liveness probe to restart Spegel if it ends up in an unrecoverable state. As far as I can tell, that means the only way to recover was to manually restart the DaemonSet.
  • Unfortunately, I wasn't scraping Spegel's metrics endpoint at the time. Were there any metrics I could have been watching to catch issues like this? My first guess is the libp2p_swarm_dial_errors_total time series.
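For context on the first point, this is roughly the shape of the hosts.toml mirror configuration that containerd uses (the file path, mirror address, and port here are illustrative, not necessarily what Spegel wrote on my nodes). My understanding is that with a `server` line present, containerd should fall back to the upstream registry after the listed mirror hosts fail:

```toml
# /etc/containerd/certs.d/ghcr.io/hosts.toml  (illustrative example)
# Upstream registry containerd should fall back to if the mirror fails.
server = "https://ghcr.io"

# Local Spegel mirror endpoint (address/port are hypothetical here).
[host."http://127.0.0.1:30020"]
  capabilities = ["pull", "resolve"]
```

If that fallback is supposed to happen, I'm unclear why the pull instead failed with "context canceled".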
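On the second point, something like the following probe addition is what I had in mind for the DaemonSet, sketched in standard Kubernetes YAML (the path and port name are hypothetical; I don't know what health endpoint, if any, Spegel exposes for this):

```yaml
# Hypothetical liveness probe for the Spegel container, so an
# unrecoverable instance gets restarted instead of staying Ready.
livenessProbe:
  httpGet:
    path: /healthz    # assumed endpoint, not confirmed in Spegel
    port: registry    # assumed named container port
  periodSeconds: 30
  failureThreshold: 5
```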
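On the metrics question, had I been scraping the endpoint, a Prometheus expression along these lines is what I would have tried as an early-warning signal (assuming libp2p_swarm_dial_errors_total is a counter exposed by go-libp2p; the threshold and window are arbitrary):

```promql
# Sustained peer dial failures on any Spegel pod (illustrative threshold).
sum by (pod) (rate(libp2p_swarm_dial_errors_total[5m])) > 1
```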

Metadata

Labels: bug