Sporadic "404 page not found" on docker push to local K3s Registry on ARM64 VM (UTM/QEMU) - I/O Bottleneck? #12716
AndrejSchefer started this conversation in General
Replies: 1 comment
-
I would probably ask this question in the Traefik project, not here. I believe this indicates that your pod is crashing or failing health checks and is not eligible to receive requests, so the client is simply sent a 404 from the default backend.
-
Hi everyone,
I'm reaching out for help with a persistent issue I've been troubleshooting for days. I'm running a local K3s cluster and trying to deploy a private Docker Registry (`registry:2` image) for my home network using a non-TLS setup.

The Problem

`docker login registry.local` works perfectly every time. However, a `docker push registry.local/my/image` fails sporadically (about 9 out of 10 times) with the error `unknown: 404 page not found`. During these failures, the Traefik Ingress logs show a corresponding `subset not found for ic-docker-registry/docker-registry-without-tls` error. Occasionally, a push of a very small image succeeds.

Notably, an identical setup on two public x86 servers using TLS certificates from Let's Encrypt works flawlessly. This issue only occurs in my local, non-TLS environment.
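For context on the non-TLS side: since the registry is served without certificates, the Docker client has to trust it explicitly. This is presumably already in place here (`docker login` succeeds), but for completeness, the usual `/etc/docker/daemon.json` entry looks like this:

```json
{
  "insecure-registries": ["registry.local"]
}
```

followed by a restart of the Docker daemon on the client machine.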
My Environment

- Registry: `registry:2` image, deployed to the `ic-docker-registry` namespace.

What I've Already Tried (The Troubleshooting Journey)
I've systematically debugged this from the application layer down to the infrastructure. Here’s what I’ve done:
1. Verified Ingress Configuration (`IngressRoute`)

I initially suspected my Traefik rule was too restrictive.

- Original rule: `match: Host('registry.local') && PathPrefix('/v2/')`
- Relaxed to: `match: Host('registry.local')` to ensure all traffic for the host is forwarded.

2. Verified Service and Endpoints
I confirmed the Kubernetes service was correctly pointing to the running pod.

- `kubectl describe service ...` showed the service had a valid endpoint pointing to the pod's IP address.

3. Isolated Storage as the Cause (NFS -> emptyDir)
My initial setup used a `PersistentVolumeClaim` backed by an NFS share. I suspected slow network storage was the issue.

- The system logs (`journalctl`) on the worker node did, in fact, show `NFS: server not responding` errors during pushes.
- I switched to an `emptyDir` volume instead, completely removing the network storage dependency.

4. Added and Tuned Resource Limits & Health Probes
My next theory was that the pod was crashing under load (e.g., an OOM kill).

- I added `resources` (requests and limits) to the container, progressively increasing the memory limit up to 2Gi. I also configured a `livenessProbe` and `readinessProbe` with very tolerant settings (longer timeouts, higher failure thresholds) to prevent Kubernetes from marking the pod as `NotReady` too quickly.
- `describe pod` showed a stable, running pod with 0 restarts, but the push still failed.

5. Ruled out Inter-Node Networking (`nodeSelector`)
nodeSelector)To eliminate potential issues with the CNI network (Flannel) between nodes, I used a
nodeSelectorto pin the registry pod to a specific worker (k3s-worker1).The Final Diagnosis: Kubelet Logs Reveal the Truth
After exhausting all application-level configurations, I monitored the live `kubelet` logs on the worker node (`k3s-worker1`) during a failed push attempt. This provided the definitive evidence:

`Readiness probe failed: Get "http://<POD-IP>:5000/v2/": context deadline exceeded`

This confirms the exact failure chain:

- `docker push` starts, creating a high I/O and CPU load on the virtual machine.
- Under this load, the registry pod's readiness probe times out. Kubernetes marks the pod as `NotReady`, and immediately removes it from the service's endpoints.
- Traefik, receiving the next request from `docker push`, finds no available backend pods for the service and correctly returns a `404 page not found`. The actual push requests never even reach the pod's logs because Traefik stops them first.

The problem is not the Kubernetes configuration, but the performance of the underlying virtualized infrastructure. Even after increasing the VM's RAM to 8GB, the issue persisted, pointing directly to an I/O performance bottleneck.
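For reference, a readiness probe with the kind of tolerant settings described in step 4 might look like this (values are illustrative, not the exact ones I used):

```yaml
readinessProbe:
  httpGet:
    path: /v2/
    port: 5000
  periodSeconds: 15
  timeoutSeconds: 10   # default is 1s; the kubelet log shows this deadline being exceeded
  failureThreshold: 6  # tolerate ~90s of slow responses before NotReady
```

Loosening the probe only masks the symptom, though; the probe timing out is evidence of the I/O stall, not its cause.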
My question to the community:
Has anyone else experienced similar I/O performance bottlenecks with intensive workloads (like a Docker Registry) in a K3s environment running inside UTM/QEMU on ARM64? Are there known performance issues or specific virtual disk configurations (caching, drivers, etc.) in UTM that can improve I/O throughput for this kind of server workload?
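To make the question concrete, these are the kinds of disk options I mean, sketched as a plain QEMU invocation (UTM generates something like this internally and exposes only some of it in its UI; this is not a known-good recipe):

```
# Hypothetical disk configuration; cache/aio/discard are the knobs in question.
qemu-system-aarch64 \
  ... \
  -drive file=k3s-worker1.qcow2,if=virtio,format=qcow2,cache=writeback,aio=threads,discard=unmap
```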
Thanks for any insights you can provide!