fix(containerd): prevent silent network failures from leaving containers unreachable by HoneyBBQ · Pull Request #202 · memohai/Memoh

HoneyBBQ · 2026-03-07T09:23:41Z

Summary

Root cause: Container CNI network setup failures were silently swallowed at 6 points in the call chain, leaving containers in a "running but unreachable" ghost state: gRPC calls fail with `no IP for bot` while the system reports everything healthy.
Additional root cause: After a Docker container restart, the `cni0` bridge lingers with a zeroed MAC (`00:00:00:00:00:00`), causing the CNI bridge plugin to fail with `could not set bridge's mac: invalid argument`. This prevents all MCP containers from obtaining network connectivity.
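
To make the first root cause concrete, here is a minimal sketch of the "no usable IP is an error" rule. It is not the code from `internal/containerd/network.go`: the name `firstUsableIP`, the sentinel `ErrNoUsableIP`, and the choice to reject loopback and unspecified addresses are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"net"
)

// ErrNoUsableIP is a hypothetical sentinel; the PR does not name its error.
var ErrNoUsableIP = errors.New("cni: network setup yielded no usable IP")

// firstUsableIP returns the first candidate that parses as an IP address and
// is neither unspecified (0.0.0.0, ::) nor loopback. It returns ErrNoUsableIP
// when no candidate qualifies, so a caller cannot mistake "no IP" for success.
func firstUsableIP(candidates []string) (string, error) {
	for _, c := range candidates {
		ip := net.ParseIP(c)
		if ip == nil || ip.IsUnspecified() || ip.IsLoopback() {
			continue
		}
		return ip.String(), nil
	}
	return "", ErrNoUsableIP
}
```

The point of the sketch is the final `return`: an empty result is reported as an error instead of an empty string with a nil error, which is the shape of bug the summary describes.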

Changes

| File | Fix |
| --- | --- |
| `internal/containerd/network.go` | Return error when CNI yields no usable IP; detect stale bridge MAC errors and auto-delete `cni0` before retrying |
| `internal/mcp/manager.go` | `Manager.Start()` rolls back container when IP is empty; `recoverContainerIP` retries up to 2× with backoff |
| `internal/handlers/containerd.go` | Extract `setupNetworkOrFail` with retry, propagate errors; `ReconcileContainers` no longer falsely reports "healthy" |
| `internal/mcp/mcpclient/client.go` | gRPC pool evicts connections stuck in `Connecting` state for >30s |
| `devenv/server-entrypoint.sh` | Delete stale `cni0` bridge and flush IPAM state before starting containerd |
| `docker/server-entrypoint.sh` | Same cleanup for production entrypoint |
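
The gRPC pool change listed for `internal/mcp/mcpclient/client.go` can be sketched with the standard library alone. This is an illustration, not the PR's code: `Pool`, `pooledConn`, `connState`, and `EvictStuck` are hypothetical names, and a real pool would presumably read the state from the gRPC client connection (for example via `ClientConn.GetState()`) rather than track it by hand.

```go
package main

import (
	"sync"
	"time"
)

// connState is a hypothetical stand-in for the gRPC connectivity state.
type connState int

const (
	stateConnecting connState = iota
	stateReady
)

// stuckThreshold mirrors the ">30s" limit stated in the PR.
const stuckThreshold = 30 * time.Second

// pooledConn records a connection's state and when it entered that state.
type pooledConn struct {
	state connState
	since time.Time
}

// Pool maps a dial target to its cached connection (hypothetical layout).
type Pool struct {
	mu    sync.Mutex
	conns map[string]*pooledConn
}

// isStuck reports whether c has been Connecting for longer than the threshold.
func isStuck(c *pooledConn, now time.Time) bool {
	return c.state == stateConnecting && now.Sub(c.since) > stuckThreshold
}

// EvictStuck removes every stuck connection and returns the evicted targets,
// so the next request dials fresh (possibly against a new container IP).
func (p *Pool) EvictStuck(now time.Time) []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	var evicted []string
	for target, c := range p.conns {
		if isStuck(c, now) {
			delete(p.conns, target)
			evicted = append(evicted, target)
		}
	}
	return evicted
}
```

Passing `now` in as a parameter keeps the eviction rule deterministic and easy to test; whether the actual pool evicts on lookup or on a timer is not stated in the PR.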

Test plan

- Start a bot container: verify it gets an IP and gRPC works
- `docker compose restart server`: verify MCP containers recover network on reconcile
- Simulate CNI failure (remove the `/etc/cni/net.d` config): verify `Start()` returns an error and the container is stopped
- Verify the gRPC pool reconnects after the container IP changes (stop + start)
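
The third check in the test plan (a failed network setup must stop the container) corresponds to a start-then-roll-back pattern. The sketch below is an assumption about its general shape, not `Manager.Start()` itself: `hooks`, `startWithRollback`, and `ErrNoIP` are hypothetical names, and the three callbacks stand in for whatever containerd calls the real code makes.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoIP is a hypothetical sentinel for "setup succeeded but returned no IP".
var ErrNoIP = errors.New("container started without an IP")

// hooks is a hypothetical stand-in for the containerd-facing operations.
type hooks struct {
	startTask    func(id string) error
	setupNetwork func(id string) (string, error)
	stopTask     func(id string) error
}

// startWithRollback starts the task, sets up networking, and stops the task
// again if networking fails or yields an empty IP, so the container is never
// left running but unreachable.
func startWithRollback(h hooks, id string) (string, error) {
	if err := h.startTask(id); err != nil {
		return "", fmt.Errorf("start task %s: %w", id, err)
	}
	ip, err := h.setupNetwork(id)
	if err == nil && ip == "" {
		err = ErrNoIP
	}
	if err != nil {
		if stopErr := h.stopTask(id); stopErr != nil {
			return "", fmt.Errorf("network setup for %s: %w (rollback also failed: %v)", id, err, stopErr)
		}
		return "", fmt.Errorf("network setup for %s: %w", id, err)
	}
	return ip, nil
}
```

With callbacks injected this way, the "simulate CNI failure" step can be exercised in a unit test by having `setupNetwork` return an empty string and asserting that `stopTask` ran.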

…ers unreachable

Container network setup failures were silently swallowed at multiple points in the call chain, leaving containers in a "running but unreachable" ghost state. This patch closes every silent-failure path:

- setupCNINetwork: return error when CNI yields no usable IP
- Manager.Start: roll back container when IP is empty instead of returning success
- ensureContainerAndTask: extract setupNetworkOrFail with 1 retry, propagate error to callers
- ReconcileContainers: stop reporting "healthy" when network setup fails
- recoverContainerIP: retry up to 2 times with backoff for transient CNI failures (IPAM lock contention, etc.)
- gRPC Pool: evict connections stuck in Connecting state for >30s
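
The bounded retry described for recoverContainerIP might look like the sketch below. It assumes "retry up to 2 times" means one initial attempt plus at most two retries, and it assumes a doubling delay; the commit message specifies neither. `recoverIP`, `errNoIP`, and the `lookup` callback are hypothetical names.

```go
package main

import (
	"errors"
	"time"
)

// errNoIP is a hypothetical sentinel reusing the "no IP for bot" wording.
var errNoIP = errors.New("no IP for bot")

// recoverIP calls lookup once and then retries up to maxRetries more times,
// sleeping base, 2*base, 4*base, ... between attempts. An empty IP with a nil
// error counts as a failure. The last error is returned if all attempts fail.
func recoverIP(maxRetries int, base time.Duration, lookup func() (string, error)) (string, error) {
	var lastErr error
	delay := base
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if attempt > 0 {
			time.Sleep(delay)
			delay *= 2
		}
		ip, err := lookup()
		if err == nil && ip == "" {
			err = errNoIP
		}
		if err == nil {
			return ip, nil
		}
		lastErr = err
	}
	return "", lastErr
}
```

A production version would likely also accept a `context.Context` so that a shutdown can interrupt the sleep; that is omitted here to keep the sketch short.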

After a Docker container restart, the cni0 bridge interface can linger with a zeroed MAC (00:00:00:00:00:00) and DOWN state. The CNI bridge plugin then fails with "could not set bridge's mac: invalid argument", making all MCP containers unreachable.

Two-layer fix:

- Entrypoint: delete cni0 and flush IPAM state before starting containerd
- Go: detect bridge MAC errors in setupCNINetwork and auto-delete cni0 before retrying, as defense-in-depth for runtime recovery
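
The Go half of the two-layer fix can be sketched as "classify the error, delete the bridge, retry once". This is a guess at the structure, not the PR's implementation: matching on the message substring, shelling out to `ip link delete`, and the names `isStaleBridgeMAC`, `deleteCNI0`, and `setupWithBridgeRecovery` are all assumptions. `errStaleBridgeMAC` exists only to carry the message quoted above in examples and tests.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

const bridgeName = "cni0"

// errStaleBridgeMAC reproduces the quoted plugin message for illustration.
var errStaleBridgeMAC = errors.New("could not set bridge's mac: invalid argument")

// isStaleBridgeMAC reports whether err looks like the stale-bridge failure.
func isStaleBridgeMAC(err error) bool {
	return err != nil && strings.Contains(err.Error(), "could not set bridge's mac")
}

// deleteCNI0 removes the bridge with iproute2; it needs root and the ip binary.
func deleteCNI0(ctx context.Context) error {
	out, err := exec.CommandContext(ctx, "ip", "link", "delete", bridgeName).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ip link delete %s: %w: %s", bridgeName, err, strings.TrimSpace(string(out)))
	}
	return nil
}

// setupWithBridgeRecovery runs setup; on a stale-MAC error it deletes the
// bridge and runs setup exactly once more. Other errors are returned as-is.
func setupWithBridgeRecovery(setup, deleteBridge func() error) error {
	err := setup()
	if err == nil || !isStaleBridgeMAC(err) {
		return err
	}
	if delErr := deleteBridge(); delErr != nil {
		return fmt.Errorf("delete stale bridge: %w (original error: %v)", delErr, err)
	}
	return setup()
}
```

A caller would pass `func() error { return deleteCNI0(ctx) }` as `deleteBridge`. Injecting the deletion step keeps the retry logic testable without root privileges.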

* fix(containerd): prevent silent network failures from leaving containers unreachable (#202)
  * fix(containerd): prevent silent network failures from leaving containers unreachable
  * fix(containerd): clean stale cni0 bridge on startup to prevent MAC error
  * fix(containerd): use exec.CommandContext to satisfy noctx linter
* fix(mcp): propagate network errors from replaceContainerSnapshot

  Network setup failure after snapshot replace (rollback/commit) was silently swallowed: the container would start but remain unreachable via gRPC. Return the error so callers (CreateSnapshot, RollbackVersion, etc.) surface the failure instead of reporting success.

HoneyBBQ added 3 commits March 7, 2026 17:22

fix(containerd): use exec.CommandContext to satisfy noctx linter

15ac5fa

HoneyBBQ requested a review from sheepbox8646 March 7, 2026 09:40

sheepbox8646 approved these changes Mar 7, 2026

View reviewed changes

sheepbox8646 merged commit abbb14c into main Mar 7, 2026
14 checks passed

HoneyBBQ mentioned this pull request Mar 7, 2026

fix(mcp): propagate network errors from replaceContainerSnapshot #204

Closed

2 tasks

HoneyBBQ mentioned this pull request Mar 7, 2026

fix(containerd): backport network fallback fixes to v0.4 #205

Merged

3 tasks

sheepbox8646 merged 3 commits into main from fix/containerd-network-fallback
