Portainer agent enters CrashLoopBackOff after reboot if CoreDNS/network not ready yet (race on boot) #134

@benoitschipper

Description
After rebooting an edge device running Kubesolo as a systemd service, I’m seeing a boot-time race condition where the Portainer agent starts at the same time as the CoreDNS pod. If CoreDNS takes a bit longer to become ready (often because upstream network isn’t available yet), the Portainer agent fails its “phone home” attempts and ends up in CrashLoopBackOff. Once it reaches CrashLoopBackOff, it won’t recover automatically (it doesn’t keep retrying indefinitely) unless manually restarted.

This is problematic for edge deployments where power loss / cold boot is common and upstream connectivity (e.g., via 4G/5G modem) may come up 30–60s after the device boots.

Expected behavior
On boot, components that depend on cluster DNS / upstream networking (e.g., Portainer agent) should start robustly and recover automatically when networking/CoreDNS becomes ready—without requiring manual intervention.

Actual behavior
On boot with delayed network/CoreDNS readiness:

  • CoreDNS starts but is not ready yet (readiness/health checks block traffic).
  • Portainer agent starts concurrently, tries to contact its endpoint a few times.
  • If DNS/upstream isn’t ready quickly enough, Portainer agent transitions to CrashLoopBackOff.
  • It does not automatically recover once CoreDNS/network becomes ready.

Reproduction steps

  1. Ensure Kubesolo is installed and configured to start on boot via systemd (this setup has been stable during normal operation).
  2. Power down the edge rack completely (device + external 4G/5G modem).
  3. Power everything back on at the same time.
  4. Observe pods during startup:
    • coredns starts but stays unready briefly
    • portainer-agent starts and then goes to CrashLoopBackOff before CoreDNS finishes becoming ready

Why this happens in my environment (context)
During cold boot, the 4G/5G modem takes ~1 minute to become usable. Until the upstream network is available, CoreDNS cannot fully serve DNS requests inside the cluster. The Portainer agent appears to depend on DNS/upstream connectivity during its startup “phone home” and fails hard if it can’t resolve or reach the endpoint quickly.

Evidence / logs

  • Screenshot attached showing:
    • kube-system/coredns running/transitioning readiness
    • portainer/portainer-agent cycling into Error and then CrashLoopBackOff
Image

lol I did "k get pods" without -A

Environment

  • OS: self-built Debian GNU/Linux 13 (trixie) image, built with ISAR
  • Kubesolo version: v1.1.2
  • Deployment type: edge device, cold boot with external 4G/5G modem (delayed WAN availability)

Impact
After power loss/cold boot, Portainer agent may remain down indefinitely unless manually restarted, which reduces reliability for unattended edge deployments.

Questions / request
Can Kubesolo/packaging be adjusted to make the Portainer agent more resilient on boot (e.g., restart policy/backoff behavior, init/wait-for-dns, or ordering constraints)?
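
To make the "wait-for-dns" idea concrete, here is a minimal sketch of what such a gate could look like: retry name resolution with a capped exponential backoff until a deadline. This is only an illustration under my own assumptions, not how Kubesolo or the Portainer agent work today. The function name `wait_for_dns`, its parameters, and the default timings are all hypothetical.

```python
import socket
import time
from typing import Callable


def wait_for_dns(
    hostname: str,
    timeout_s: float = 120.0,        # assumed budget; a cold-booting modem may need ~60s
    initial_delay_s: float = 1.0,    # assumed first retry delay
    max_delay_s: float = 10.0,       # assumed cap on the retry delay
    resolve: Callable[[str], object] = socket.gethostbyname,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once `hostname` resolves, False if `timeout_s` elapses first."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        try:
            resolve(hostname)
            return True
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError,
            # so a DNS server that is not ready yet lands here.
            pass
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline; double the delay up to the cap.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

Something like this could run as an init step before the agent starts, called with whatever hostname the agent needs to reach, and exit non-zero when it returns False. The `resolve`, `sleep`, and `clock` parameters exist only so the loop can be exercised without real network access or real waiting.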

Additional info
Happy to brainstorm and/or help, test, build etc. 😄
