fix: go unwinding stops at systemstacks#1313
fix: go unwinding stops at systemstacks#1313wehzzz wants to merge 9 commits intoopen-telemetry:mainfrom
Conversation
Hope KubeCon went well! I ended up getting nerd-sniped by this problem and wanted to dig into it since the previous PR was closed. No worries at all if you already have a fix in the works based on #1279. I currently have something working, but if you think your approach is the better way to tackle this, we can absolutely close this PR. Just let me know how you'd like to proceed! |
dfb9247 to
6961a92
Compare
fabled
left a comment
There was a problem hiding this comment.
Awesome stuff! Added first round of comments and questions.
| // Although systemstack is declared with $0 frame size, Go's linker injects | ||
| // a frame pointer prologue (PUSH RBP + MOVQ RSP, RBP) for all non-NOFRAME | ||
| // functions that contain a CALL instruction. | ||
| // https://github.com/golang/go/blob/affadc7997466dfacad5b9a3dc90ee5e7a7b6085/src/cmd/internal/obj/x86/obj6.go#L637 |
There was a problem hiding this comment.
Could we then get away by using frame pointer unwinding?
There was a problem hiding this comment.
Good catch. Frame pointers are enabled for all following go versions on Linux:
AMD64
| Version | Condition to PUSH RBP | Source | Prologue on Linux ? |
|---|---|---|---|
| Go 1.13 | Framepointer_enabled && !NoFrame && !(frameless leaf) |
obj6.go#L623, default=1, true for amd64+linux | Oui |
| Go 1.17 | !NoFrame && !(frameless nosplit) && !(frameless leaf) |
obj6.go#L593 | Oui |
| Go 1.25 | !NoFrame && !(frameless leaf) |
obj6.go#L622 | Oui |
ARM64
| Version | Condition to save R29 | Source | Prologue on Linux ? |
|---|---|---|---|
| Go 1.13 | Framepointer_enabled(goos, goarch) - true pour arm64 && linux |
obj7.go#L641, Framepointer_enabled | Oui |
| Go 1.17 | Inconditionnel (quand frame > 0) | obj7.go#L617 | Oui |
| Go 1.25 | Inconditionnel (small frame path) | obj7.go#L719 | Oui |
I will dig into this and move systemstack to UnwindInfoFramePointer
| // synthetic marker for Go's stack scanner and scheduler, not a real return address. | ||
| // | ||
| // https://github.com/golang/go/blob/917949cc1d16c652cb09ba369718f45e5d814d8f/src/runtime/asm_amd64.s#L886 | ||
| GoLabelsOffsets *go_offs = go_get_go_offsets(); |
There was a problem hiding this comment.
go_get_go_offsets is used in many places, and this is inside a rolled loop. I'm wondering if it'd make sense to read this data in the native unwinder beginning to the PerCPURecord? The Go plugins also could use it and avoid the lookup later on.
There was a problem hiding this comment.
We may want to copy GoLabelsOffsets into PerCPURecord to avoid reading the same values again. I'm not sure what would be the best place for the first read - collect_trace seems to initialize the struct, so it should work there, but from my understanding it would mean calling this for every non-Go process, adding one bpf_read for each of them (at the moment it's only for go processes).
Not sure if we want to do this here or in a follow-up.
| // the unwinder crosses back to the goroutine stack using the goroutine's saved | ||
| // context from g.sched (gobuf). | ||
| "runtime.systemstack": &sdtypes.UnwindInfoGoSystemstack, | ||
| "runtime.mcall": &sdtypes.UnwindInfoGoMcall, |
There was a problem hiding this comment.
These add the special unwind command for the whole function. Does it make sense to apply this only to the portion to which applies? Or does the command support correctly unwinding all locations of the corresponding functions?
There was a problem hiding this comment.
Since systemstack now relies on frame pointer unwinding, it's fine for it.
In the case of mcall, it's a bit more nuanced. We rely on gobuf sched values for unwinding, however we might want to special-case the first part of the function.
TEXT runtime·mcall<ABIInternal>(SB), NOSPLIT, $0-8
#ifdef GOEXPERIMENT_runtimesecret
CMPL g_secret(R14), $0
JEQ nosecret
CALL ·secretEraseRegistersMcall(SB)
nosecret:
#endif
MOVQ AX, DX // DX = fn
// Save state in g->sched. The caller's SP and PC are restored by gogo to
// resume execution in the caller's frame (implicit return). The caller's BP
// is also restored to support frame pointer unwinding.
MOVQ SP, BX // hide (SP) reads from vet
MOVQ 8(BX), BX // caller's PC
MOVQ BX, (g_sched+gobuf_pc)(R14)
LEAQ fn+0(FP), BX // caller's SP
MOVQ BX, (g_sched+gobuf_sp)(R14)
// Get the caller's frame pointer by dereferencing BP. Storing BP as it is
// can cause a frame pointer cycle, see CL 476235.
MOVQ (BP), BX // caller's BP
MOVQ BX, (g_sched+gobuf_bp)(R14)
If we sample in the above code, we have a valid curg, however the sched values are not yet fully populated. The unwinding will stop cleanly (via the !saved_sp || !saved_pc check), but here we could instead rely on frame pointer unwinding since BP is still valid.
If we sample in the code below, go_resolve_mcall_goroutine handles both the case where we sample before dropg (using m.curg) and the case where we sample after (falling back to *(g0.sched.sp - 8)). There should be no issue as sched is fully populated at that point.
// switch to m->g0 & its stack, call fn
MOVQ g_m(R14), BX
MOVQ m_g0(BX), SI // SI = g.m.g0
CMPQ SI, R14 // if g == m->g0 call badmcall
JNE goodm
JMP runtime·badmcall(SB)
goodm:
MOVQ R14, AX // AX (and arg 0) = g
MOVQ SI, R14 // g = g.m.g0
get_tls(CX) // Set G in TLS
MOVQ R14, g(CX)
MOVQ (g_sched+gobuf_sp)(R14), SP // sp = g0.sched.sp
MOVQ $0, BP // clear frame pointer, as caller may execute on another M
PUSHQ AX // open up space for fn's arg spill slot
MOVQ 0(DX), R12
CALL R12 // fn(g)
// The Windows native stack unwinder incorrectly classifies the next instruction
// as part of the function epilogue, producing a wrong call stack.
// Add a NOP to work around this issue. See go.dev/issue/67007.
BYTE $0x90
POPQ AX
JMP runtime·badmcall2(SB)
RET
Unwinding Through Go Stack Switches
Reference: #1275, Based on @florianl's work #1279.
systemstack
Locating the user goroutine
During systemstack,
m.curgalways points to the frozen user goroutine (systemstack does not calldropg). The profiler readscurg.sched.spwhich points into the user stack where the frame pointer prologue saved FP and LR/RA.gobuf.pccontainssystemstack_switch+8(a synthetic UNDEF marker for Go's stack scanner) and is intentionally ignored.AMD64 recovery
The Go linker injects
PUSH RBP; MOV RSP, RBPinto systemstack (obj6.go#L637).gosave_systemstack_switchthen savesgobuf.sp = LEAQ 8(SP), which skips the useless return address fromCALL gosaveand points to the saved RBP:ARM64 recovery
Go's ARM64 assembler (
cmd/internal/obj/arm64, functionpreprocessin obj7.go#L691) injects aMOVD.Wprologue instead of the standardSTP. UnlikeSTP X29, X30which places R29 at the lower address and LR at the higher address, theMOVD.Wsequence places LR at SP+0 and R29 below SP at SP-8.Verified by disassembling
runtime.systemstack(go tool objdump -s 'runtime\.systemstack' ./binary):The profiler reads:
mcall
Locating the user goroutine
mcall's callee (
park_m,goexit0, etc.) callsdropg()which setsm.curg = nilandg.m = nil. In most samples,m.curgis nil. The profiler falls back to reading the goroutine pointer from the g0 stack at*(g0.sched.sp - 8).AMD64: mcall does
PUSHQ AX(old_g) before calling fn. This writes old_g tog0.sched.sp - 8deterministically. This is hand-written assembly, stable across all Go versions.Verified by disassembling
runtime.mcall(go tool objdump -s 'runtime\.mcall' ./binary):ARM64: mcall passes old_g in R0 (no push). The goroutine pointer is at
g0.sched.sp - 8only if fn spills R0 to its ABIInternal arg area. This depends on the compiler's spill decisions, not hand-written assembly. Stable for the current ABI; a future ABI change could alter the spill offset. The DWARFDW_OP_fbreg 8should be used as the source of truth.Verified by disassembling
runtime.mcallon ARM64:fn's prologue then spills R0 to
entry_SP + 8 = g0.sched.sp - 16 + 8 = g0.sched.sp - 8. This is confirmed by both disassembly and DWARF.park_mspills (go tool objdump -s 'runtime\.park_m' ./binary):DWARF confirms (
dwarfdump -i -S match=runtime.park_m -Wc ./binary):ARM64: functions that don't spill
goexit0does NOT spill - passes R0 directly togdestroy(go tool objdump -s 'runtime\.goexit0' ./binary):DWARF confirms - location list only contains
DW_OP_reg0, noDW_OP_fbreg:All mcall callees can be found with:
Each callee's spill behavior can be verified with DWARF:
If the
gpparameter's location list containsDW_OP_fbreg 8, it spills tog0.sched.sp - 8. If it only containsDW_OP_reg0followed byend-of-list, it does not spill.When the profiler cannot resolve the user goroutine through mcall (non-spilling function on ARM64, or goroutine already rescheduled on another M), it stops unwinding at the mcall frame (
*stop = true) without crossing to the user stack. The g0 frames (park_m, schedule, findRunnable, etc.) are still captured. This is the same behavior as before this change (UNWIND_COMMAND_STOP). There is no regression.park_mDW_OP_fbreg 8preemptParkDW_OP_fbreg 8exitsyscall0DW_OP_fbreg 8goyield_mDW_OP_fbreg 8goschedguarded_mDW_OP_fbreg 8goexit0DW_OP_reg0onlygosched_mDW_OP_reg0onlygopreempt_mStale goroutine detection
The goroutine at the g0 slot may have been rescheduled on another M since the mcall. Its gobuf could then contain values from systemstack on that other thread (observed in testing:
systemstack_switchframes from a different M). The profiler validatescandidate.m == nil(parked, gobuf reliable) vscandidate.m != nil(rescheduled, STOP).Recovery
Unlike systemstack, mcall saves the caller's actual registers into gobuf (not a synthetic marker):
systemstack test
amd64arm64mcall test
amd64arm64