@aritrbas

Problem

In dual-stack or IPv6-enabled clusters, the agent can crash when it attempts to announce or withdraw a BGP path for an IPv6 address, but the node does not have a corresponding IPv6 address configured in HostMetadata.

level=error msg="error making path to announce: no ip6 address for node" component=routing
level=warning msg="Tomb function errored with error making path to announce: no ip6 address for node"
level=error msg="tomb Dying error making path to announce: no ip6 address for node"
level=fatal msg="GrpcServer Server returned grpc: the server has been stopped" component=cni

RCA

  • common.MakePath() is called to construct a BGP path for announcement/withdrawal.
  • When the required node IP is missing (nodeIPv6 == nil), MakePath() returns an error.
  • The error propagates through routing_server.announceLocalAddress() or prefix_watcher.WatchPrefix().
  • The Go() wrapper passes the error to tomb, which enters the Dying state, signaling all goroutines to stop.
  • The agent crashes, disrupting all network connectivity for pods on the node.
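
The failure chain above can be reproduced in miniature. The sketch below is illustrative only: `miniTomb` and `runAgent` are hypothetical, standard-library-only stand-ins written for this example, not the agent's actual tomb or startup code. It shows how a single goroutine returning an error moves the shared supervisor into a dying state and takes an unrelated goroutine (here playing the role of the CNI server) down with it.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// miniTomb is a hypothetical stand-in for the tomb that supervises the
// agent's goroutines. It is not the real library.
type miniTomb struct {
	wg    sync.WaitGroup
	once  sync.Once
	dying chan struct{}
	err   error
}

func newMiniTomb() *miniTomb {
	return &miniTomb{dying: make(chan struct{})}
}

// Go runs f in a goroutine. The first non-nil error returned by any f
// closes the dying channel, which tells every other goroutine to stop.
func (t *miniTomb) Go(f func() error) {
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		if err := f(); err != nil {
			t.once.Do(func() {
				t.err = err
				close(t.dying)
			})
		}
	}()
}

// Wait blocks until all goroutines have returned and reports the first error.
func (t *miniTomb) Wait() error {
	t.wg.Wait()
	return t.err
}

// runAgent models the crash: the routing watcher fails, so the CNI server,
// which did nothing wrong, is told to stop as well.
func runAgent() (cniStopped bool, err error) {
	t := newMiniTomb()
	stopped := make(chan struct{})

	// Routing watcher: returns the error seen in the logs.
	t.Go(func() error {
		return errors.New("error making path to announce: no ip6 address for node")
	})

	// CNI server: healthy, but it shuts down once the tomb is dying.
	t.Go(func() error {
		<-t.dying
		close(stopped)
		return nil
	})

	err = t.Wait()
	select {
	case <-stopped:
		cniStopped = true
	default:
	}
	return cniStopped, err
}

func main() {
	cniStopped, err := runAgent()
	fmt.Println("tomb error:", err)
	fmt.Println("CNI server stopped:", cniStopped)
}
```

If the routing watcher returned nil instead, the dying channel would never be closed by it, which is the behavior the fix relies on.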

Impact

  • agent crash - all networking functionality is lost.
  • pod connectivity disruption - other pods on the node lose network access.
  • cluster instability - affected node becomes unavailable for other workloads.

Solution

Implement graceful degradation by treating missing node IP as a non-fatal, recoverable condition.
A missing node IP is a transient or configuration condition; it does not mean the routing component is broken. Returning an error would stop the routing/prefix watchers and cause the tomb to kill the whole agent.
By returning early instead, the routing/prefix watchers keep running and the skipped paths can be restored later. The logic keeps its state in memory (localAddressMap), so once the node IP becomes available, RestoreLocalAddresses() will re-announce them.
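
A minimal sketch of the recovery path described above, under stated assumptions: `localAddressMap`, `announceLocalAddress` and `RestoreLocalAddresses` are names taken from this description, but the `routingServer` struct, its fields, the `announced` map and the string-typed node IP are hypothetical simplifications made for the example. The point is only that the desired address is recorded before the node IP check, so a later restore can replay it.

```go
package main

import "fmt"

// routingServer is a hypothetical, trimmed-down model of the routing server.
type routingServer struct {
	nodeIPv6        string          // "" means HostMetadata has no IPv6 address yet
	localAddressMap map[string]bool // desired local addresses, kept in memory
	announced       map[string]bool // stand-in for paths actually sent to BGP
}

func newRoutingServer() *routingServer {
	return &routingServer{
		localAddressMap: map[string]bool{},
		announced:       map[string]bool{},
	}
}

// announceLocalAddress records the address first, then skips the BGP
// announcement (returning nil) if the node has no IPv6 address.
func (s *routingServer) announceLocalAddress(addr string) error {
	s.localAddressMap[addr] = true
	if s.nodeIPv6 == "" {
		fmt.Printf("warning: no ip6 address for node, skipping announce of %s\n", addr)
		return nil
	}
	s.announced[addr] = true
	return nil
}

// RestoreLocalAddresses replays every remembered address, so anything
// skipped earlier is announced once the node IP is available.
func (s *routingServer) RestoreLocalAddresses() error {
	for addr := range s.localAddressMap {
		if err := s.announceLocalAddress(addr); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	s := newRoutingServer()
	_ = s.announceLocalAddress("fd00::10/128")
	fmt.Println("announced before node IP is set:", len(s.announced))

	s.nodeIPv6 = "fd00::1" // node IP shows up later
	_ = s.RestoreLocalAddresses()
	fmt.Println("announced after restore:", len(s.announced))
}
```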

In dual-stack or IPv6-enabled clusters, the agent can crash when it attempts
to announce or withdraw a BGP path for an IPv6 address, but the node does
not have a corresponding IPv6 address configured in HostMetadata.

Before this change, common.MakePath() returned a generic error ('no ip6
address for node'). That error was wrapped by the routing_server and
propagated back to tomb, causing the routing watcher to stop and the
main process to tear down (ending in a fatal gRPC server error).

Changes:
- Added sentinel errors ErrNoNodeIPv4 and ErrNoNodeIPv6 in common.go
- Added helper function IsMissingNodeIP() to detect these specific errors
- Updated MakePath() to return sentinel errors (including for SRv6 next-hop)
- Updated routing_server and prefix_watcher to treat missing-node-IP as a
  non-fatal condition: log a warning that the announce/withdraw is skipped
  and return nil so tomb does not enter the Dying state
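
The pattern in the list above could look roughly like the following. This is a sketch, not the code from this change: `ErrNoNodeIPv4`, `ErrNoNodeIPv6` and `IsMissingNodeIP` are the names listed above, but the IPv4 error message, the simplified `makePath` (a stand-in for `common.MakePath()` that only picks a next hop), the `announce` caller and the `mustCIDR` helper are assumptions made for illustration. SRv6 handling is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net"
)

// Sentinel errors. The IPv6 message matches the log line in this PR;
// the IPv4 message is an assumption.
var (
	ErrNoNodeIPv4 = errors.New("no ip4 address for node")
	ErrNoNodeIPv6 = errors.New("no ip6 address for node")
)

// IsMissingNodeIP reports whether err is, or wraps, one of the sentinels.
func IsMissingNodeIP(err error) bool {
	return errors.Is(err, ErrNoNodeIPv4) || errors.Is(err, ErrNoNodeIPv6)
}

// makePath is a hypothetical stand-in for common.MakePath(): it selects
// the node IP matching the prefix family and fails with a sentinel error
// when that IP is missing.
func makePath(prefix *net.IPNet, nodeIPv4, nodeIPv6 net.IP) (string, error) {
	if prefix.IP.To4() != nil {
		if nodeIPv4 == nil {
			return "", ErrNoNodeIPv4
		}
		return fmt.Sprintf("%s via %s", prefix, nodeIPv4), nil
	}
	if nodeIPv6 == nil {
		return "", ErrNoNodeIPv6
	}
	return fmt.Sprintf("%s via %s", prefix, nodeIPv6), nil
}

// announce models the caller: a missing node IP is logged and swallowed,
// while any other error is still wrapped and returned.
func announce(prefix *net.IPNet, nodeIPv4, nodeIPv6 net.IP) error {
	path, err := makePath(prefix, nodeIPv4, nodeIPv6)
	if err != nil {
		if IsMissingNodeIP(err) {
			log.Printf("warning: skipping announce of %s: %v", prefix, err)
			return nil // non-fatal: the supervising tomb keeps running
		}
		return fmt.Errorf("error making path to announce: %w", err)
	}
	log.Printf("announcing %s", path)
	return nil
}

// mustCIDR is a small helper for the example; it panics on invalid input.
func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

func main() {
	nodeIPv4 := net.ParseIP("10.0.0.1")
	// IPv6 prefix on a node with no IPv6 address: skipped, no error.
	fmt.Println(announce(mustCIDR("fd00::10/128"), nodeIPv4, nil))
	// IPv4 prefix with an IPv4 node address: announced.
	fmt.Println(announce(mustCIDR("10.1.0.0/24"), nodeIPv4, nil))
}
```

Using `errors.Is` rather than comparing error strings means the check still works after a caller wraps the sentinel with `%w`.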

This prevents the agent from crashing and gives operators a clear warning
log instead.

Signed-off-by: Aritra Basu <[email protected]>
@aritrbas aritrbas self-assigned this Jan 30, 2026