Description
Describe the bug
Controller.GetFQDNCache() iterates over fqdnController.dnsEntryCache without acquiring fqdnSelectorMutex, while other code paths (such as DNS response handling and FQDN selector updates) modify the same map under that mutex. Because Go maps are not safe for concurrent read/write access, this can result in a fatal runtime error and terminate the antrea-agent process.
To Reproduce
- Deploy an AntreaClusterNetworkPolicy with an FQDN rule (for example: matchName: "*.example.com").
- Ensure selected Pods generate DNS traffic so that DNS responses are being processed by the agent.
- While DNS responses are being handled, run:
antctl get fqdn-cache
or query the /fqdncache API endpoint.
- Under concurrent DNS activity, the agent may crash with:
fatal error: concurrent map read and map write
This can occur during normal cluster operation when FQDN policies are active and the FQDN cache is queried.
Expected
Access to dnsEntryCache should be consistently synchronized using fqdnSelectorMutex, preventing concurrent read/write on the underlying map.
Actual behavior
GetFQDNCache() directly ranges over dnsEntryCache without holding fqdnSelectorMutex, while other functions such as onDNSResponse() and cleanupFQDNSelectorItem() modify the same map under lock. This introduces a concurrent map access path and can crash the antrea-agent process.
Versions:
- Antrea version: current main branch (observed in latest code)
- Kubernetes version: any
- Container runtime: any
- Linux kernel version: any
- OVS kernel module: any
This issue is code-level and not environment-specific.
Additional context
All other accessors of dnsEntryCache use fqdnSelectorMutex, but GetFQDNCache() does not. Since this method is invoked from the agent API handler, it can execute concurrently with DNS response processing goroutines, introducing a crash path under normal usage.