Description
After rebooting an edge device running Kubesolo as a systemd service, I’m seeing a boot-time race condition where the Portainer agent starts at the same time as the CoreDNS pod. If CoreDNS takes a bit longer to become ready (often because upstream network isn’t available yet), the Portainer agent fails its “phone home” attempts and ends up in CrashLoopBackOff. Once it reaches CrashLoopBackOff, it won’t recover automatically (it doesn’t keep retrying indefinitely) unless manually restarted.
This is problematic for edge deployments where power loss / cold boot is common and upstream connectivity (e.g., via 4G/5G modem) may come up 30–60s after the device boots.
Expected behavior
On boot, components that depend on cluster DNS / upstream networking (e.g., Portainer agent) should start robustly and recover automatically when networking/CoreDNS becomes ready—without requiring manual intervention.
Actual behavior
On boot with delayed network/CoreDNS readiness:
- CoreDNS starts but is not ready yet (readiness/health checks block traffic).
- Portainer agent starts concurrently, tries to contact its endpoint a few times.
- If DNS/upstream isn’t ready quickly enough, Portainer agent transitions to CrashLoopBackOff.
- It does not automatically recover once CoreDNS/network becomes ready.
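For comparison when judging whether the agent is truly stuck: upstream Kubernetes documents that the kubelet restarts a crashed container with an exponential backoff (10s, 20s, 40s, ...) capped at five minutes. A minimal sketch of that documented schedule follows. Whether Kubesolo v1.1.2 uses these defaults is an assumption, and the function names are illustrative only.

```python
from itertools import accumulate


def backoff_delays(restarts, base=10, cap=300):
    """Delay in seconds before each restart: base, 2*base, 4*base, ... capped at `cap`.

    base=10 and cap=300 mirror the upstream Kubernetes documentation;
    they are assumed, not confirmed, for Kubesolo.
    """
    return [min(base * 2 ** i, cap) for i in range(restarts)]


def cumulative_wait(restarts, base=10, cap=300):
    """Total backoff time accumulated after each restart (ignores container run time)."""
    return list(accumulate(backoff_delays(restarts, base, cap)))


if __name__ == "__main__":
    rows = zip(backoff_delays(8), cumulative_wait(8))
    for n, (delay, total) in enumerate(rows, start=1):
        print(f"restart {n}: wait {delay}s (cumulative {total}s)")
```

Under those assumed defaults the gap between attempts grows to five minutes after a handful of failures, so a pod can look permanently down for several minutes at a time even while it is still being retried.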
Reproduction steps
- Ensure Kubesolo is installed and configured to start on boot via systemd (this setup has been stable during normal operation).
- Power down the edge rack completely (device + external 4G/5G modem).
- Power everything back on at the same time.
- Observe pods during startup:
  - coredns starts but stays unready briefly
  - portainer-agent starts and then goes to CrashLoopBackOff before CoreDNS finishes becoming ready
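The observation step can be captured as text rather than a screenshot. The following is a minimal sketch, assuming kubectl is on the PATH and pointed at the Kubesolo cluster. It parses the output of `kubectl get pods -A -o json` and reports each container's readiness and waiting reason. The helper names are hypothetical.

```python
import json
import subprocess


def summarize_pods(pod_list):
    """Return (namespace, pod name, ready, waiting reason) per container.

    `pod_list` is the parsed JSON from `kubectl get pods -A -o json`.
    The waiting reason is None when the container is not in a waiting state.
    """
    rows = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            rows.append((
                meta.get("namespace"),
                meta.get("name"),
                cs.get("ready", False),
                waiting.get("reason"),
            ))
    return rows


def snapshot():
    """Run kubectl once and summarize all pods in all namespaces."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-A", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return summarize_pods(json.loads(out))
```

Calling `snapshot()` every few seconds during boot and logging the rows with a timestamp would show when coredns flips to ready relative to the moment portainer-agent first reports CrashLoopBackOff.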
Why this happens in my environment (context)
During cold boot, the 4G/5G modem takes ~1 minute to become usable. Until upstream network is available, CoreDNS cannot fully function to serve DNS requests inside the cluster. The Portainer agent appears to depend on DNS/upstream connectivity during its startup “phone home” and fails hard if it can’t resolve/reach the endpoint quickly.
Evidence / logs
- Screenshot attached showing:
  - kube-system/coredns running/transitioning readiness
  - portainer/portainer-agent cycling into Error and then CrashLoopBackOff
- Note: I ran "k get pods" without -A for the screenshot.
Environment
- OS: self-built Debian GNU/Linux 13 (trixie) image, based on ISAR
- Kubesolo version: v1.1.2
- Deployment type: edge device, cold boot with external 4G/5G modem (delayed WAN availability)
Impact
After power loss/cold boot, Portainer agent may remain down indefinitely unless manually restarted, which reduces reliability for unattended edge deployments.
Questions / request
Can Kubesolo/packaging be adjusted to make the Portainer agent more resilient on boot (e.g., restart policy/backoff behavior, init/wait-for-dns, or ordering constraints)?
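To make the init/wait-for-dns option concrete, here is a minimal sketch of a gate that blocks until a hostname resolves or a deadline passes. It is an illustration under assumptions, not a description of how Kubesolo or the Portainer agent works today. The function name, the default timeout and the hostname in the usage note are placeholders.

```python
import socket
import time


def wait_for_dns(hostname, timeout=180.0, interval=5.0,
                 resolve=socket.getaddrinfo,
                 sleep=time.sleep, clock=time.monotonic):
    """Block until `hostname` resolves or `timeout` seconds have passed.

    Returns True once resolution succeeds and False on timeout.
    `resolve`, `sleep` and `clock` are injectable so the loop can be
    tested without real DNS or real waiting.
    """
    deadline = clock() + timeout
    while True:
        try:
            resolve(hostname, None)
            return True
        except OSError:  # socket.gaierror is a subclass of OSError
            if clock() >= deadline:
                return False
            sleep(interval)
```

An init step or entrypoint wrapper could call `wait_for_dns("portainer.example.invalid")` (placeholder hostname) and start the agent only after it returns True. A timeout comfortably longer than the ~1 minute modem start-up would be needed. Whether such a step fits into the Kubesolo packaging is exactly the open question.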
Additional info
Happy to brainstorm and/or help, test, build etc. 😄