fix(containerd): prevent silent network failures from leaving containers unreachable#202

Merged
sheepbox8646 merged 3 commits into main from fix/containerd-network-fallback on Mar 7, 2026

HoneyBBQ commented Mar 7, 2026

Summary

  • Root cause: CNI network setup failures for containers were silently swallowed at 6 points in the call chain, leaving containers in a "running but unreachable" ghost state: gRPC calls fail with "no IP for bot" while the system reports everything as healthy.
  • Additional root cause: After a Docker container restart, the cni0 bridge lingers with a zeroed MAC (00:00:00:00:00:00), causing the CNI bridge plugin to fail with "could not set bridge's mac: invalid argument". This prevents all MCP containers from obtaining network connectivity.
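
To illustrate the first root cause, here is a minimal sketch of turning "no usable IP" into an explicit error instead of an empty value. The names `firstUsableIP` and `ErrNoUsableIP` are hypothetical and not taken from this PR, and the definition of "usable" (not unspecified, loopback, or link-local) is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// ErrNoUsableIP is a hypothetical sentinel error used only in this sketch.
var ErrNoUsableIP = errors.New("CNI returned no usable IP")

// firstUsableIP returns the first address that parses and is not
// unspecified, loopback, or link-local. An empty or unusable list is
// reported as an error so the caller cannot mistake it for success.
func firstUsableIP(containerID string, addrs []string) (net.IP, error) {
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil {
			// Also accept CIDR notation such as "10.88.0.5/16".
			parsed, _, err := net.ParseCIDR(a)
			if err != nil {
				continue
			}
			ip = parsed
		}
		if ip.IsUnspecified() || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
			continue
		}
		return ip, nil
	}
	return nil, fmt.Errorf("container %s: %w", containerID, ErrNoUsableIP)
}
```

A caller can then check the result with `errors.Is(err, ErrNoUsableIP)` and stop or roll back the container rather than reporting it as running.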

Changes

| File | Fix |
| --- | --- |
| internal/containerd/network.go | Return error when CNI yields no usable IP; detect stale bridge MAC errors and auto-delete cni0 before retrying |
| internal/mcp/manager.go | Manager.Start() rolls back container when IP is empty; recoverContainerIP retries up to 2× with backoff |
| internal/handlers/containerd.go | Extract setupNetworkOrFail with retry, propagate errors; ReconcileContainers no longer falsely reports "healthy" |
| internal/mcp/mcpclient/client.go | gRPC pool evicts connections stuck in Connecting state for >30s |
| devenv/server-entrypoint.sh | Delete stale cni0 bridge and flush IPAM state before starting containerd |
| docker/server-entrypoint.sh | Same cleanup for production entrypoint |

Test plan

  • Start a bot container — verify it gets an IP and gRPC works
  • docker compose restart server — verify MCP containers recover network on reconcile
  • Simulate CNI failure (remove /etc/cni/net.d config) — verify Start() returns error and container is stopped
  • Verify gRPC pool reconnects after container IP changes (stop + start)

HoneyBBQ added 3 commits March 7, 2026 17:22
…ers unreachable

Container network setup failures were silently swallowed at multiple
points in the call chain, leaving containers in a "running but
unreachable" ghost state. This patch closes every silent-failure path:

- setupCNINetwork: return error when CNI yields no usable IP
- Manager.Start: roll back container when IP is empty instead of
  returning success
- ensureContainerAndTask: extract setupNetworkOrFail with 1 retry,
  propagate error to callers
- ReconcileContainers: stop reporting "healthy" when network setup fails
- recoverContainerIP: retry up to 2 times with backoff for transient
  CNI failures (IPAM lock contention, etc.)
- gRPC Pool: evict connections stuck in Connecting state for >30s

After a Docker container restart, the cni0 bridge interface can linger
with a zeroed MAC (00:00:00:00:00:00) and DOWN state. The CNI bridge
plugin then fails with "could not set bridge's mac: invalid argument",
making all MCP containers unreachable.

Two-layer fix:
- Entrypoint: delete cni0 and flush IPAM state before starting containerd
- Go: detect bridge MAC errors in setupCNINetwork and auto-delete cni0
  before retrying, as defense-in-depth for runtime recovery
HoneyBBQ requested a review from sheepbox8646 March 7, 2026 09:40
sheepbox8646 merged commit abbb14c into main Mar 7, 2026
14 checks passed
HoneyBBQ added a commit that referenced this pull request Mar 7, 2026
…ers unreachable (#202)

* fix(containerd): prevent silent network failures from leaving containers unreachable

* fix(containerd): clean stale cni0 bridge on startup to prevent MAC error

* fix(containerd): use exec.CommandContext to satisfy noctx linter
sheepbox8646 pushed a commit that referenced this pull request Mar 7, 2026
* fix(containerd): prevent silent network failures from leaving containers unreachable (#202)

* fix(mcp): propagate network errors from replaceContainerSnapshot

Network setup failure after snapshot replace (rollback/commit) was
silently swallowed — the container would start but remain unreachable
via gRPC. Return the error so callers (CreateSnapshot, RollbackVersion,
etc.) surface the failure instead of reporting success.