## Bug Description
On MicroK8s 1.35 with containerd 2.1.3, pulling images from certain OCI-compliant registries hangs indefinitely with no error output. The TCP connection is established and TLS completes successfully, but no data flows — containerd simply stalls waiting for a response that never arrives.
## Environment
- MicroK8s: 1.35
- containerd: 2.1.3
- Nodes: multi-node cluster (control plane + worker nodes)
- Issue appears on worker nodes pulling from external registries
## Symptoms

- `crictl pull` or `microk8s ctr images pull` hangs indefinitely with no error
- The OCI index resolves successfully ("already exists"), but platform manifest fetches stay stuck at "waiting"
- `ss` confirms the TCP connection to the registry is established
- `curl` against the same registry endpoint works correctly, returning proper HTTP responses
- No timeout, no error: just silence in the containerd logs
- Kubernetes Pods using an image from the affected registry stay Pending forever. Jobs hit the same problem: their init containers never complete, leading to `context deadline exceeded` in Helm pre-install hooks
## Root Cause

containerd 2.1 introduced a multipart layer fetch feature that sends `Range: bytes=0-N` HTTP headers to enable parallel downloads. Some registries respond with HTTP 200 (full content) rather than 206 Partial Content when they do not support, or choose to ignore, range requests.

containerd 2.1.3 does not handle this case: the fetch goroutines hang indefinitely waiting for a partial-content response that will never come. This is tracked upstream as containerd/containerd#11864.
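The failure mode can be sketched in a few lines of Python (this is illustrative only, not containerd's actual code): a client sends a `Range` header, the server ignores it and replies 200 with the full body, and the client must detect the ignored range rather than wait for partial-content semantics that will never apply.

```python
# Toy demonstration of a registry that ignores Range headers.
# All names here are illustrative; nothing is taken from containerd.
import http.server
import threading
import urllib.request

class IgnoresRange(http.server.BaseHTTPRequestHandler):
    """Server that always returns the full content with HTTP 200."""
    BODY = b"0123456789" * 10  # a 100-byte "layer"

    def do_GET(self):
        # A compliant server would answer 206 with a Content-Range
        # header; this one ignores the Range header entirely.
        self.send_response(200)
        self.send_header("Content-Length", str(len(self.BODY)))
        self.end_headers()
        self.wfile.write(self.BODY)

    def log_message(self, *args):  # silence request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), IgnoresRange)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

req = urllib.request.Request(
    f"http://127.0.0.1:{port}/blob", headers={"Range": "bytes=0-9"}
)
with urllib.request.urlopen(req) as resp:
    data = resp.read()
    # Check whether the range was honored; if not, fall back to
    # consuming the full body instead of stalling.
    range_honored = resp.status == 206

print(range_honored, len(data))  # -> False 100

server.shutdown()
```

This is essentially what the third upstream fix adds: an explicit check for a 200-where-206-was-expected response, so the client can fall back instead of hanging.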
Three fixes in the upstream release/2.1 branch are relevant:
| Upstream commit | Description | First included in |
|---|---|---|
| 34a1cb1dd | Deadlock: semaphore not released on error in `dockerFetcher.open()` | v2.1.4 |
| add2dcf86 | Fetcher doesn't always close response body and call `Release()` | v2.1.4 |
| ca3de4fe7 | Range-get request ignored by registry not surfaced as `errContentRangeIgnored` | v2.1.6 |
The third fix (`ca3de4fe7`) is the most directly relevant: it ensures that when a registry ignores the `Range` header and returns a full 200 response, containerd detects this and falls back gracefully rather than hanging.
## Workaround (confirmed working)

Create a per-host config for the affected registry under `$SNAP_DATA/args/certs.d/`:

```toml
# /var/snap/microk8s/current/args/certs.d/<registry-hostname>/hosts.toml
server = "https://<registry-hostname>"

[host."https://<registry-hostname>"]
  capabilities = ["pull", "resolve"]
  dial_timeout = "30s"
```

Then restart containerd:

```shell
sudo snap restart microk8s.daemon-containerd
```
## Suggested Fix

Bump the containerd version in `build-scripts/components/containerd/version.sh` from v2.1.3 to v2.1.6 (released 2025-12-17):

```diff
-echo "v2.1.3"
+echo "v2.1.6"
```

No patch changes are needed. The existing `patches/v2.1.3/` directory is automatically selected by the version selector in `build-scripts/print-patches-for.py` for any target version ≥ v2.1.3, and the sideload patch applies cleanly to v2.1.6 (it only adds new files, with no conflicts).
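The patch-selection behavior described above can be sketched as follows (this is a simplified, assumed model of what `build-scripts/print-patches-for.py` does, not its actual source): pick the newest `patches/<version>/` directory whose version does not exceed the target.

```python
# Simplified, assumed model of the patch-directory selector:
# choose the newest patches/<version> directory <= the target version.

def parse(version):
    """Turn 'v2.1.6' into a comparable tuple (2, 1, 6)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))

def select_patch_dir(target, patch_dirs):
    candidates = [d for d in patch_dirs if parse(d) <= parse(target)]
    return max(candidates, key=parse, default=None)

# Hypothetical layout where only patches/v2.1.3 exists:
print(select_patch_dir("v2.1.6", ["v2.1.3"]))  # -> v2.1.3
```

Under this model, bumping the target to v2.1.6 keeps resolving to the `patches/v2.1.3/` directory, which is why no patch changes are required.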
v2.1.6 also includes an update to the vendored `golang.org/x/net/http2` transport (196 lines changed), which may further improve HTTP/2 reliability with various registries.
## References

- Upstream issue: image pull hang in containerd 2.1.0 (containerd/containerd#11864)
- Fix PR (semaphore deadlock): fix(dockerFetcher): resolve deadlock issue in dockerFetcher open (containerd/containerd#12126)
- Fix (range header ignored): commit `ca3de4fe7` on `release/2.1`
- containerd v2.1.6 release: https://github.com/containerd/containerd/releases/tag/v2.1.6