
Conversation


@paulyufan2 paulyufan2 commented Oct 8, 2025

Reason for Change:

Customers occasionally saw CNS crash due to CNS zap logging in RequestIPConfigsHandler(), which defers service.publishIPStateMetrics() → every request ends by signaling the recorder:

fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x1 addr=0x11 pc=0xb92bb6]  // Windows 
...
go.uber.org/zap/internal/stacktrace.Capture
go.uber.org/zap.(*Logger).check
go.uber.org/zap.(*Logger).Info
github.com/.../cns/restserver.(*asyncMetricsRecorder).record
github.com/.../cns/restserver.(*HTTPRestService).publishIPStateMetrics.func1

publishIPStateMetrics() spins up go recorder.run() (once) and sends on recorder.sig.
run() calls record(), which logs the pool watermarks via logger.Printf("Allocated IPs: %d, ...", ...).

The crash stack shows the process dying inside zap while capturing a stack trace: zap's stacktrace capture fires on an Info log immediately after the handler returns, causing a Windows access violation. On Windows, stacktrace.Capture() can crash under load, so disabling it on this path should remove the crash.
CNS zap's stacktrace.Capture() uses runtime.Callers, and we have seen rare access violations in that path when stacktrace capture is aggressive. As a result, capturing stacktraces on hot Info paths can crash the process in this environment.
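A minimal sketch of the mitigation: gate zap's stacktrace capture behind a level threshold so hot Info logs never enter stacktrace.Capture. This assumes the existing platformCore from cnslogger.go; the threshold shown here (ErrorLevel vs. the WarnLevel used in the diff) is one of the options discussed below.

```go
// Assumes: import "go.uber.org/zap" and "go.uber.org/zap/zapcore";
// platformCore is the zapcore.Core already built in cnslogger.go.
zapLogger := zap.New(
	platformCore,
	zap.AddCaller(),
	// Attach stacktraces only at this level and above, so hot Info logs
	// (like the pool-watermark log in record()) skip stacktrace.Capture.
	zap.AddStacktrace(zapcore.ErrorLevel),
)
```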

In CNS metrics.go, the sig is unbuffered:
recorder.sig = make(chan struct{})

The send uses a non-blocking select, which drops the signal if run() is not parked on the receive. An unbuffered send succeeds only if a receiver is already waiting: if run() is not blocked on <-a.sig, the default branch runs and the signal is silently dropped. Making the channel buffered with size 1 gives us:

  1. No drops while the worker is busy: if run() is in record(), a new event queues instead of being dropped.
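The drop-vs-queue behavior above can be shown with a small self-contained example; trySignal is a hypothetical helper standing in for the non-blocking send around recorder.sig.

```go
package main

import "fmt"

// trySignal performs the non-blocking send pattern used around recorder.sig:
// it succeeds only if the channel can accept the value right now.
func trySignal(sig chan struct{}) bool {
	select {
	case sig <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	// Unbuffered: no receiver is parked, so the send is dropped.
	unbuffered := make(chan struct{})
	fmt.Println(trySignal(unbuffered)) // false: signal silently dropped

	// Buffered, size 1: one event queues even while the worker is busy.
	buffered := make(chan struct{}, 1)
	fmt.Println(trySignal(buffered)) // true: queued for the worker
	fmt.Println(trySignal(buffered)) // false: buffer full, signal dropped
}
```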

Issue Fixed:

Requirements:

Notes:

@paulyufan2 paulyufan2 requested a review from a team as a code owner October 8, 2025 14:23
@paulyufan2 paulyufan2 added the cns Related to CNS. label Oct 8, 2025
Copilot AI review requested due to automatic review settings October 8, 2025 14:23
Copilot AI left a comment


Pull Request Overview

This PR fixes a CNS crash issue caused by aggressive stacktrace capture in zap logging during metrics recording. The crash occurs when zap attempts to capture stacktraces on Info-level logs in hot paths, leading to Windows access violations.

  • Changed unbuffered channel to buffered channel (size 1) to prevent signal drops during metrics recording
  • Added stacktrace capture only for WarnLevel and above to reduce aggressive stacktrace capture on Info logs

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
cns/restserver/metrics.go Changed channel from unbuffered to buffered (size 1) to prevent signal drops
cns/logger/cnslogger.go Added stacktrace capture configuration to limit it to WarnLevel and above


@paulyufan2 paulyufan2 requested a review from rbtr October 8, 2025 14:24

@rbtr rbtr left a comment


these changes seem superficial, so I'm confused about what the underlying issue is here. from the description it sounds like sometimes on Windows getting a stacktrace panics? but if we're getting a stacktrace I expect we're in a catastrophic failure already?
I don't think buffering a channel is going to help resolve that.

zapLogger := zap.New(
platformCore,
zap.AddCaller(),
zap.AddStacktrace(zapcore.WarnLevel),
Collaborator

why Warn? imo we don't need stacks until Error+

recorder.once.Do(func() {
recorder.podIPConfigSrc = service.PodIPConfigStates
recorder.sig = make(chan struct{})
recorder.sig = make(chan struct{}, 1)
Collaborator

disagree with this change, unless this can't be called more than twice concurrently (and if that was the case, why twice instead of once and leave this unbuffered?). this smells and indicates some other structural problem.
all it does is delay when you will get a blocking send by one more event. maybe sig should only have non-blocking sends?
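A coalescing variant along the lines suggested here — keep the channel at size 1 and make every send non-blocking, so bursts collapse into at most one pending event and senders never block — might look like the following sketch. It is not the CNS code: recorder, sig, and record mirror the names in the diff, while signal and runs are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// recorder sketches a coalescing worker: one goroutine drains a size-1
// signal channel, and producers use non-blocking sends.
type recorder struct {
	once sync.Once
	sig  chan struct{}
	wg   sync.WaitGroup
	runs int
}

func (r *recorder) record() { r.runs++ }

// signal starts the worker once, then does a non-blocking send. While the
// worker is mid-record, at most one extra event queues; further signals
// coalesce into it instead of blocking or piling up.
func (r *recorder) signal() {
	r.once.Do(func() {
		r.sig = make(chan struct{}, 1)
		r.wg.Add(1)
		go func() {
			defer r.wg.Done()
			for range r.sig {
				r.record()
			}
		}()
	})
	select {
	case r.sig <- struct{}{}:
	default: // buffer full: this event coalesces with the pending one
	}
}

func main() {
	r := &recorder{}
	for i := 0; i < 100; i++ {
		r.signal()
	}
	close(r.sig) // worker drains any pending event, then exits
	r.wg.Wait()
	fmt.Println(r.runs >= 1 && r.runs <= 100) // true: bursts coalesced
}
```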

@paulyufan2 paulyufan2 closed this Oct 9, 2025