
Conversation


@paulyufan2 paulyufan2 commented Oct 8, 2025

Reason for Change:

Customers occasionally saw CNS crash due to CNS zap logging in RequestIPConfigsHandler(), which defers service.publishIPStateMetrics() → every request ends by signaling the recorder:

fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x1 addr=0x11 pc=0xb92bb6]  // Windows 
...
go.uber.org/zap/internal/stacktrace.Capture
go.uber.org/zap.(*Logger).check
go.uber.org/zap.(*Logger).Info
github.com/.../cns/restserver.(*asyncMetricsRecorder).record
github.com/.../cns/restserver.(*HTTPRestService).publishIPStateMetrics.func1

publishIPStateMetrics() spins up go recorder.run() (once) and sends on recorder.sig.
run() calls record(), which logs the pool watermarks via logger.Printf("Allocated IPs: %d, ...", ...).

The crash stack shows the process dying inside zap while capturing a stack trace: zap's stacktrace capture fires on an Info log immediately after the handler returns, causing a Windows access violation. On Windows, stacktrace.Capture() can crash under load, so disabling it on this path should remove the crash.
CNS zap's stacktrace.Capture() uses runtime.Callers, and we have seen rare access violations in that path when stacktrace capture is aggressive. As a result, capturing stacktraces on hot Info paths can crash the process in this environment.
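A minimal sketch of the mitigation: gate zap's stacktrace capture behind a level threshold so hot Info logs never enter stacktrace.Capture. This assumes the existing platformCore from cnslogger.go; the threshold shown here (ErrorLevel vs. the WarnLevel used in the diff) is one of the options discussed below.

```go
// Assumes: import "go.uber.org/zap" and "go.uber.org/zap/zapcore";
// platformCore is the zapcore.Core already built in cnslogger.go.
zapLogger := zap.New(
	platformCore,
	zap.AddCaller(),
	// Attach stacktraces only at this level and above, so hot Info logs
	// (like the pool-watermark log in record()) skip stacktrace.Capture.
	zap.AddStacktrace(zapcore.ErrorLevel),
)
```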

In CNS metrics.go, the sig is unbuffered:
recorder.sig = make(chan struct{})

The send uses a non-blocking select, which drops the signal if run() is not parked on the receive. An unbuffered send succeeds only if a receiver is already waiting: if run() is not blocked on <-a.sig, the default branch runs and the signal is silently dropped. Making the channel buffered with size 1 gives us:

  1. No drops while the worker is busy: if run() is in record(), a new event queues instead of being dropped.
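The drop-vs-queue behavior above can be shown with a small self-contained example; trySignal is a hypothetical helper standing in for the non-blocking send around recorder.sig.

```go
package main

import "fmt"

// trySignal performs the non-blocking send pattern used around recorder.sig:
// it succeeds only if the channel can accept the value right now.
func trySignal(sig chan struct{}) bool {
	select {
	case sig <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	// Unbuffered: no receiver is parked, so the send is dropped.
	unbuffered := make(chan struct{})
	fmt.Println(trySignal(unbuffered)) // false: signal silently dropped

	// Buffered, size 1: one event queues even while the worker is busy.
	buffered := make(chan struct{}, 1)
	fmt.Println(trySignal(buffered)) // true: queued for the worker
	fmt.Println(trySignal(buffered)) // false: buffer full, signal dropped
}
```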

Issue Fixed:

Requirements:

Notes:

@paulyufan2 paulyufan2 requested a review from a team as a code owner October 8, 2025 14:23
@paulyufan2 paulyufan2 added the cns Related to CNS. label Oct 8, 2025
Copilot AI review requested due to automatic review settings October 8, 2025 14:23
Copilot AI left a comment


Pull Request Overview

This PR fixes a CNS crash issue caused by aggressive stacktrace capture in zap logging during metrics recording. The crash occurs when zap attempts to capture stacktraces on Info-level logs in hot paths, leading to Windows access violations.

  • Changed unbuffered channel to buffered channel (size 1) to prevent signal drops during metrics recording
  • Added stacktrace capture only for WarnLevel and above to reduce aggressive stacktrace capture on Info logs

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
cns/restserver/metrics.go Changed channel from unbuffered to buffered (size 1) to prevent signal drops
cns/logger/cnslogger.go Added stacktrace capture configuration to limit it to WarnLevel and above


@paulyufan2 paulyufan2 requested a review from rbtr October 8, 2025 14:24

@rbtr rbtr left a comment


these changes seem superficial, so I'm confused about what the underlying issue is here. from the description it sounds like sometimes on Windows getting a stacktrace panics? but if we're getting a stacktrace I expect we're in a catastrophic failure already?
I don't think buffering a channel is going to help resolve that.

zapLogger := zap.New(
platformCore,
zap.AddCaller(),
zap.AddStacktrace(zapcore.WarnLevel),
Collaborator

why Warn? imo we don't need stacks until Error+

recorder.once.Do(func() {
recorder.podIPConfigSrc = service.PodIPConfigStates
recorder.sig = make(chan struct{})
recorder.sig = make(chan struct{}, 1)
Collaborator

disagree with this change, unless this can't be called more than twice concurrently (and if that was the case, why twice instead of once and leave this unbuffered?). this smells and indicates some other structural problem.
all it does is delay when you will get a blocking send by one more event. maybe sig should only have non-blocking sends?
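A coalescing variant along the lines suggested here — keep the channel at size 1 and make every send non-blocking, so bursts collapse into at most one pending event and senders never block — might look like the following sketch. It is not the CNS code: recorder, sig, and record mirror the names in the diff, while signal and runs are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// recorder sketches a coalescing worker: one goroutine drains a size-1
// signal channel, and producers use non-blocking sends.
type recorder struct {
	once sync.Once
	sig  chan struct{}
	wg   sync.WaitGroup
	runs int
}

func (r *recorder) record() { r.runs++ }

// signal starts the worker once, then does a non-blocking send. While the
// worker is mid-record, at most one extra event queues; further signals
// coalesce into it instead of blocking or piling up.
func (r *recorder) signal() {
	r.once.Do(func() {
		r.sig = make(chan struct{}, 1)
		r.wg.Add(1)
		go func() {
			defer r.wg.Done()
			for range r.sig {
				r.record()
			}
		}()
	})
	select {
	case r.sig <- struct{}{}:
	default: // buffer full: this event coalesces with the pending one
	}
}

func main() {
	r := &recorder{}
	for i := 0; i < 100; i++ {
		r.signal()
	}
	close(r.sig) // worker drains any pending event, then exits
	r.wg.Wait()
	fmt.Println(r.runs >= 1 && r.runs <= 100) // true: bursts coalesced
}
```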

@paulyufan2 paulyufan2 closed this Oct 9, 2025