-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Description
Go version
go version go1.23.4 linux/amd64
Output of go env
in your module/workspace:
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/root/go/pkg/mod'
GONOPROXY='git.garena.com'
GONOSUMDB='git.garena.com'
GOOS='linux'
GOPATH='/root/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.4'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/root/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build892259413=/tmp/go-build -gno-record-gcc-switches'
What did you do?
We discovered that CPU profiling causes severe scheduler delays when combined with deep context chains, making profiling unusable in production environments with real network operations.
Reproduction steps:
- Download reproduction program:
# Save the attached go_profiling_issue_repro.go file
go_profiling_issue_repro.go.zip
-
Run with CPU profiling + ZooKeeper (demonstrates severe issue):
go run go_profiling_issue_repro.go -maxprocs=4 -enable-zk=true -enable-profiling=true -context-depth=10000000
-
Run without CPU profiling (normal behavior):
go run go_profiling_issue_repro.go -maxprocs=4 -enable-zk=true -enable-profiling=false -context-depth=10000000
Environment:
- Go version: go1.21.6 darwin/arm64
- OS: macOS (also reproduced on Linux)
- Hardware: Apple M1/M2, x86_64
- GOMAXPROCS: 4
Test Configuration:
- Context depth: 10,000,000 layers
- GOMAXPROCS: 4
- Real ZooKeeper network operations
- Periodic CPU profiling (10s every 1min)
- Ticker-based time-sensitive operations (50ms intervals)
Reproduction program code:
https://go.dev/play/p/1sgZodXCKvW?v=goprev
What did you see happen?
Program startup and initialization:
=== Go Context Scheduler Issue Reproduction ===
Go Version: go1.21.6
GOMAXPROCS: 4 (was: 14)
Context Depth: 10000000, Ticker Interval: 50ms
ZooKeeper: 127.0.0.1:2181 (enabled: true)
CPU Profiling: true
π§ Creating 10000000-layer context chain...
β
Deep context chain created
π Starting background operations...
π Starting periodic CPU profiling: 10s every 1min
π Connecting to ZooKeeper at 127.0.0.1:2181...
π Started 10s CPU profiling session: repro_cpu_1758700953.prof
Severe performance degradation observed:
π ZK operation slow: 511.066667ms (scheduler delay: -32.968416ms, tick: 11, op: 22)
π ZK operation slow: 528.069875ms (scheduler delay: 58.083375ms, tick: 10, op: 23)
β οΈ Ticker delayed by 83.60925ms (count: 20)
π ZK operation slow: 639.356875ms (scheduler delay: 2.306149875s, tick: 28, op: 64)
π ZK operation slow: 545.889459ms (scheduler delay: 61.617583ms, tick: 29, op: 65)
π ZK operation slow: 766.755709ms (scheduler delay: -9.641959ms, tick: 37, op: 82)
π Baseline: 31 ticks in 5 seconds
Key observations:
-
Extreme Scheduler Delays:
- Normal operations: Expected ~50ms intervals
- During CPU profiling: 2.306149875s scheduler delay (46x slower)
- ZK operations taking 500-800ms instead of expected <50ms
-
Network Operation Impact:
- ZooKeeper operations severely affected during profiling
- Connection timeouts and delays cascade through the system
- Real network I/O becomes unreliable during profiling sessions
-
Ticker Performance Collapse:
- Expected: 50ms ticker intervals
- Observed: 83ms+ delays (67% slower)
- Baseline shows only 31 ticks in 5 seconds (should be ~100 ticks)
-
Context Chain Amplification:
- 10M-layer context chains amplify the profiling overhead
- Each
context.WithTimeout
operation becomes extremely expensive - Stack expansion during profiling creates performance bottlenecks
What did you expect to see?
Expected behavior:
- CPU profiling should have minimal impact on Go Runtime Scheduler Latency
Root cause hypothesis:
Deep context chains require significant stack space during operations like context.WithTimeout
. CPU profiling appears to interfere with goroutine scheduling during stack expansion, particularly when combined with network I/O operations.
Business impact:
If there's a deep context chain, user cannot use CPU profiling for performance debugging in production enviorenment.
This significantly impacts the observability of Go applications in production environments where deep context chains are standard practice for request tracing, authentication, and resource management.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status