Portainer agent enters CrashLoopBackOff after reboot if CoreDNS/network not ready yet (race on boot) #134

**Description**  
After rebooting an edge device running Kubesolo as a `systemd` service, I’m seeing a boot-time race condition: the Portainer agent starts at the same time as the CoreDNS pod. If CoreDNS takes a bit longer to become ready (often because the upstream network isn’t available yet), the Portainer agent fails its “phone home” attempts and ends up in `CrashLoopBackOff`. Once it reaches `CrashLoopBackOff`, it doesn’t recover on its own in my setup (it doesn’t appear to keep retrying indefinitely) and has to be restarted manually.

This is problematic for edge deployments where power loss / cold boot is common and upstream connectivity (e.g., via 4G/5G modem) may come up 30–60s after the device boots.

**Expected behavior**  
On boot, components that depend on cluster DNS / upstream networking (e.g., Portainer agent) should start robustly and recover automatically once networking/CoreDNS becomes ready, without requiring manual intervention.

**Actual behavior**  
On boot with delayed network/CoreDNS readiness:

- CoreDNS starts but is not ready yet (readiness/health checks block traffic).
- Portainer agent starts concurrently and tries to contact its endpoint a few times.
- If DNS/upstream isn’t ready quickly enough, Portainer agent transitions to `CrashLoopBackOff`.
- It does not automatically recover once CoreDNS/network becomes ready.
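
For reference when comparing the timings: as far as I understand the upstream Kubernetes documentation, the kubelet restarts a crashing container with an exponential back-off that starts at 10s, doubles on each restart, and is capped at 300s. The sketch below only prints that schedule so it can be set against the 30–60s WAN delay. The function name is made up for illustration, and the constants are the documented upstream defaults, which I have not verified against Kubesolo itself.

```python
def crashloop_delays(restarts, base_s=10, cap_s=300):
    """Return the back-off delay in seconds before each of the first `restarts` restarts.

    base_s and cap_s are assumed upstream kubelet defaults, not values
    confirmed for Kubesolo.
    """
    return [min(base_s * 2 ** i, cap_s) for i in range(restarts)]


print(crashloop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

If these defaults apply, the first three delays add up to 70s, so a modem that needs about a minute can plausibly use up several restart attempts before DNS works.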

**Reproduction steps**

1. Ensure Kubesolo is installed and configured to start on boot via `systemd` (this setup has been stable during normal operation).
2. Power down the edge rack completely (device + external 4G/5G modem).
3. Power everything back on at the same time.
4. Observe pods during startup:
   - `coredns` starts but stays unready briefly
   - `portainer-agent` starts and then goes to `CrashLoopBackOff` before CoreDNS finishes becoming ready

**Why this happens in my environment (context)**  
During cold boot, the 4G/5G modem takes ~1 minute to become usable. Until the upstream network is available, CoreDNS cannot fully serve DNS requests inside the cluster. The Portainer agent appears to depend on DNS/upstream connectivity during its startup “phone home” and fails hard if it can’t resolve/reach the endpoint quickly.

**Evidence / logs**

- Screenshot attached showing:
  - `kube-system/coredns` running/transitioning readiness
  - `portainer/portainer-agent` cycling into `Error` and then `CrashLoopBackOff`

<img width="1902" height="618" alt="Image" src="https://github.com/user-attachments/assets/7a6d7cff-3666-4601-9f50-b58428ef3e48" />

(Note: I ran `k get pods` without `-A` for this screenshot.)

**Environment**

- OS: self-built Debian GNU/Linux 13 (trixie) distro, based on ISAR
- Kubesolo version: `v1.1.2`
- Deployment type: edge device, cold boot with external 4G/5G modem (delayed WAN availability)

**Impact**  
After power loss/cold boot, Portainer agent may remain down indefinitely unless manually restarted, which reduces reliability for unattended edge deployments.

**Questions / request**  
Can Kubesolo/packaging be adjusted to make the Portainer agent more resilient on boot (e.g., restart policy/backoff behavior, init/wait-for-dns, or ordering constraints)? 
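
To make the "init/wait-for-dns" idea concrete, here is a rough sketch of the kind of check I have in mind: poll DNS until a name resolves or a deadline passes, and only then let the agent start. This is just an illustration, not how Kubesolo or the Portainer agent work today. The function name, the timeout, and the interval are all made up, and the hostname to probe would have to be whatever the agent actually needs to resolve.

```python
import socket
import time


def wait_for_dns(hostname, timeout_s=120.0, interval_s=2.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep,
                 clock=time.monotonic):
    """Poll until `hostname` resolves; return True on success, False on timeout.

    timeout_s and interval_s are illustrative values, not tuned defaults.
    resolve/sleep/clock are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            resolve(hostname, None)  # raises socket.gaierror if resolution fails
            return True
        except socket.gaierror:
            if clock() >= deadline:
                return False
            sleep(interval_s)  # fixed pause between attempts
```

An init step could call something like `wait_for_dns("kubernetes.default.svc.cluster.local")` (assuming that name is a reasonable probe for cluster DNS) and exit non-zero on `False`, so the agent container only starts once DNS answers.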

**Additional info**  
Happy to brainstorm and/or help, test, build etc. 😄 
