Spegel version
v0.6.0
Kubernetes distribution
EKS
Kubernetes version
v1.34.2
CNI
AWS VPC CNI
Describe the bug
I was debugging an ImagePullBackOff issue in one of my clusters today. I had upgraded the nodes from EKS v1.33 to v1.34 just yesterday, which I'm guessing is related. A Pod was stuck in ImagePullBackOff, and I noticed the event said context canceled:
```
Failed to pull image "ghcr.io/$IMAGE_NAME:sha-b90d62f67b7d3c7cf27fb91cc59ed546c4795630": rpc error: code = Canceled desc = failed to pull and unpack image "ghcr.io/$IMAGE_NAME:sha-b90d62f67b7d3c7cf27fb91cc59ed546c4795630": context canceled
```
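For context, this is roughly how I was watching the failing Pod; the pod name and namespace below are placeholders, not the real ones:

```sh
# Placeholders for the affected Pod; the describe output contained the
# "context canceled" pull event quoted above.
kubectl -n $NAMESPACE describe pod $POD_NAME
kubectl -n $NAMESPACE get events --sort-by=.lastTimestamp | grep -i pull
```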
I ruled out an auth issue. I half thought it might just be the image registry having problems, but then I inspected the Spegel logs and noticed several entries like the following:
```json
{
  "time": "2026-01-07T18:12:10.831144603Z",
  "level": "ERROR",
  "source": {
    "function": "github.com/spegel-org/spegel/pkg/registry.(*Registry).mirrorHandler.func2",
    "file": "github.com/spegel-org/spegel/pkg/registry/registry.go",
    "line": 316
  },
  "msg": "request to mirror failed, retrying with next",
  "ref": "sha256:663b447ab1f2fdb605fc608e89eb65159a16a646b120c6fa76aa19c8e1df1bb6",
  "path": "/v2/$IMAGE_NAME/blobs/sha256:663b447ab1f2fdb605fc608e89eb65159a16a646b120c6fa76aa19c8e1df1bb6",
  "attempt": 1,
  "mirror": "10.101.77.184:5000",
  "err": "Get \"http://10.101.77.184:5000/v2/$IMAGE_NAME/blobs/sha256:663b447ab1f2fdb605fc608e89eb65159a16a646b120c6fa76aa19c8e1df1bb6?ns=ghcr.io\": dial tcp 10.101.77.184:5000: i/o timeout"
}
```
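Something like this should reproduce the dial timeout by hand (I didn't think to try it at the time); it assumes a shell on another node in the cluster, with the peer IP, port, and digest taken from the log entry above, and $IMAGE_NAME still a placeholder:

```sh
# The same blob request Spegel's mirror handler makes; a healthy peer
# should return the blob (or a 404), an affected one should hang until
# the timeout fires.
curl -v --max-time 5 \
  "http://10.101.77.184:5000/v2/$IMAGE_NAME/blobs/sha256:663b447ab1f2fdb605fc608e89eb65159a16a646b120c6fa76aa19c8e1df1bb6?ns=ghcr.io"
```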
It seemed like Spegel was having trouble reaching its peers, so my obvious first step was to restart the Spegel DaemonSet. That fixed it right away; the Pod started up almost immediately.
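For reference, the restart was roughly the following; the namespace and DaemonSet name are assumptions based on a default Helm install, not necessarily what other setups use:

```sh
# Roll every Spegel pod and wait for the rollout to settle; the "spegel"
# namespace and DaemonSet name are assumptions (default Helm chart values).
kubectl -n spegel rollout restart daemonset spegel
kubectl -n spegel rollout status daemonset spegel
```

The incident left me with some questions/concerns, however: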
- I would have expected a Spegel mirror issue to result in containerd bypassing the Spegel mirror and pulling the image directly from the upstream registry (see the hosts.toml sketch after this list). Should that not be the case?
- I'm not sure why Spegel was failing to reach its mirrors when the Spegel containers themselves were reporting healthy. I'd have hoped for Spegel to enter an unhealthy state and be restarted by a liveness probe, but the readiness probes never went unhealthy, and there is no liveness probe to restart Spegel when it gets into an unrecoverable state. As far as I can tell, the only way to recover was to manually restart the DaemonSet.
- Unfortunately I wasn't scraping Spegel's metrics endpoint during the incident. Are there any metrics I could have been watching to catch issues like this? My first guess is the libp2p_swarm_dial_errors_total time series (see the spot check below).
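On the first point: my understanding (an assumption on my part, not verified against Spegel's code) is that containerd walks the hosts in hosts.toml in order and only falls back to the upstream server after every mirror has failed, so repeated per-mirror i/o timeouts could exhaust the pull's context before the fallback ever happens, which would explain the context canceled error. This is the kind of mirror config I mean; the file path and local mirror address are assumptions, not copied from my nodes:

```sh
# Hypothetical containerd mirror config for ghcr.io; the path and the
# local mirror address are assumptions, not taken from this cluster.
cat /etc/containerd/certs.d/ghcr.io/hosts.toml
# server = "https://ghcr.io"
#
# [host."http://127.0.0.1:30020"]
#   capabilities = ["pull", "resolve"]
```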
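On the metrics point, this is the spot check I'd now run; the metrics port is an assumption taken from typical chart defaults, and the metric itself comes from libp2p's standard instrumentation rather than anything Spegel-specific:

```sh
# Forward a Spegel pod's metrics port (9090 is an assumption) and grep for
# libp2p dial errors; a climbing counter would have flagged this earlier.
kubectl -n spegel port-forward $SPEGEL_POD 9090:9090 &
curl -s http://127.0.0.1:9090/metrics | grep libp2p_swarm_dial_errors_total
kill %1
```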