fix: go unwinding stops at systemstacks #1313

Open
wehzzz wants to merge 9 commits into open-telemetry:main from wehzzz:fix-go-unwinding-stops-at-systemstacks

Conversation

@wehzzz
Contributor

@wehzzz wehzzz commented Apr 1, 2026

Unwinding Through Go Stack Switches

Reference: #1275. Based on @florianl's work in #1279.

systemstack

Locating the user goroutine

During systemstack, m.curg always points to the frozen user goroutine (systemstack does not call dropg). The profiler reads curg.sched.sp which points into the user stack where the frame pointer prologue saved FP and LR/RA.

gobuf.pc contains systemstack_switch+8 (a synthetic UNDEF marker for Go's stack scanner) and is intentionally ignored.

AMD64 recovery

The Go linker injects PUSH RBP; MOV RSP, RBP into systemstack (obj6.go#L637). gosave_systemstack_switch then stores the result of LEAQ 8(SP) into gobuf.sp, which skips the unneeded return address pushed by the CALL to gosave_systemstack_switch and points at the saved RBP:

caller_fp = *(sched_sp);       // saved RBP
caller_pc = *(sched_sp + 8);   // return address
caller_sp = sched_sp + 16;

ARM64 recovery

Go's ARM64 assembler (cmd/internal/obj/arm64, function preprocess in obj7.go#L691) injects a MOVD.W prologue instead of the standard STP. Unlike STP X29, X30 which places R29 at the lower address and LR at the higher address, the MOVD.W sequence places LR at SP+0 and R29 below SP at SP-8.

Verified by disassembling runtime.systemstack (go tool objdump -s 'runtime\.systemstack' ./binary):

TEXT runtime.systemstack.abi0(SB)
  asm_arm64.s:255  0x83d70  f81f0ffe  MOVD.W R30, -16(RSP)    // *(SP+0) = LR
  asm_arm64.s:255  0x83d74  f81f83fd  MOVD R29, -8(RSP)       // *(SP-8) = R29
  asm_arm64.s:255  0x83d78  d10023fd  SUB $8, RSP, R29

The profiler reads:

caller_pc = *(sched_sp);       // LR at SP+0
caller_fp = *(sched_sp - 8);   // R29 at SP-8
caller_sp = sched_sp + 16;
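
The two layouts above can be compared side by side in a small host-side sketch. This is not the PR's eBPF code: `fake_mem_t`, `read_u64`, `caller_frame_t` and both `*_systemstack_caller` functions are hypothetical names introduced for illustration, and the flat memory model stands in for whatever probe-read helper the real unwinder uses. The only facts taken from the description are the offsets relative to `curg.sched.sp`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical flat model of the sampled process's stack memory:
// words[i] holds the 8-byte value stored at address base + 8*i.
typedef struct {
    uint64_t base;
    const uint64_t *words;
    size_t count;
} fake_mem_t;

// Register state of the frame that called systemstack.
typedef struct {
    uint64_t pc, sp, fp;
} caller_frame_t;

// Reads one aligned 8-byte word; fails if addr is outside the modeled range.
static bool read_u64(const fake_mem_t *mem, uint64_t addr, uint64_t *out) {
    assert(mem != NULL && out != NULL);
    if (addr < mem->base || (addr - mem->base) % 8 != 0) {
        return false;
    }
    uint64_t idx = (addr - mem->base) / 8;
    if (idx >= mem->count) {
        return false;
    }
    *out = mem->words[idx];
    return true;
}

// AMD64: sched_sp points at the saved RBP; the return address sits above it.
static bool amd64_systemstack_caller(const fake_mem_t *mem, uint64_t sched_sp,
                                     caller_frame_t *caller) {
    if (!read_u64(mem, sched_sp, &caller->fp)) {      // saved RBP
        return false;
    }
    if (!read_u64(mem, sched_sp + 8, &caller->pc)) {  // return address
        return false;
    }
    caller->sp = sched_sp + 16;
    return true;
}

// ARM64: LR is stored at sched_sp, R29 one word below it.
static bool arm64_systemstack_caller(const fake_mem_t *mem, uint64_t sched_sp,
                                     caller_frame_t *caller) {
    if (sched_sp < 8) {
        return false;  // avoid wrapping below address zero
    }
    if (!read_u64(mem, sched_sp, &caller->pc)) {      // LR at SP+0
        return false;
    }
    if (!read_u64(mem, sched_sp - 8, &caller->fp)) {  // R29 at SP-8
        return false;
    }
    caller->sp = sched_sp + 16;
    return true;
}
```

Both functions report failure instead of guessing when a read falls outside the modeled range, which mirrors the general idea of stopping the unwind rather than emitting a bogus frame.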

mcall

Locating the user goroutine

mcall's callee (park_m, goexit0, etc.) calls dropg() which sets m.curg = nil and g.m = nil. In most samples, m.curg is nil. The profiler falls back to reading the goroutine pointer from the g0 stack at *(g0.sched.sp - 8).

AMD64: mcall does PUSHQ AX (old_g) before calling fn. This writes old_g to g0.sched.sp - 8 deterministically. This is hand-written assembly, stable across all Go versions.

Verified by disassembling runtime.mcall (go tool objdump -s 'runtime\.mcall' ./binary):

TEXT runtime.mcall(SB)
  asm_amd64.s:450  0x46f514  MOVQ R14, AX                  // AX = old_g
  asm_amd64.s:451  0x46f517  MOVQ SI, R14                  // R14 = g0
  asm_amd64.s:453  0x46f51a  MOVQ R14, FS:0xfffffff8       // TLS = g0
  asm_amd64.s:454  0x46f523  MOVQ 0x38(R14), SP            // SP = g0.sched.sp
  asm_amd64.s:455  0x46f527  MOVQ $0x0, BP
  asm_amd64.s:456  0x46f52e  PUSHQ AX                      // *(g0.sched.sp - 8) = old_g
  asm_amd64.s:458  0x46f532  CALL R12                      // fn(old_g)

ARM64: mcall passes old_g in R0 (no push). The goroutine pointer is at g0.sched.sp - 8 only if fn spills R0 to its ABIInternal arg area. This depends on the compiler's spill decisions, not on hand-written assembly. It is stable for the current ABI, but a future ABI change could alter the spill offset. The DWARF location (DW_OP_fbreg 8) should be treated as the source of truth.

Verified by disassembling runtime.mcall on ARM64:

TEXT runtime.mcall(SB)
  asm_arm64.s:234  0x83d24  MOVD 56(R28), R0              // R0 = g0.sched.sp
  asm_arm64.s:235  0x83d28  MOVD R0, RSP                  // RSP = g0.sched.sp
  asm_arm64.s:236  0x83d2c  MOVD ZR, R29
  asm_arm64.s:237  0x83d30  MOVD R3, R0                   // R0 = old_g
  asm_arm64.s:238  0x83d34  MOVD ZR, -16(RSP) 
  asm_arm64.s:239  0x83d38  SUB $16, RSP, RSP             // allocate 16-byte arg space
  asm_arm64.s:241  0x83d40  CALL (R4)                     // fn(old_g)

fn's prologue then spills R0 to entry_SP + 8 = g0.sched.sp - 16 + 8 = g0.sched.sp - 8. This is confirmed by both disassembly and DWARF.

park_m spills (go tool objdump -s 'runtime\.park_m' ./binary):

TEXT runtime.park_m(SB)
  proc.go:4229  0x55f6c  MOVD.W R30, -96(RSP)             // frame = 96 bytes
  proc.go:4229  0x55f70  MOVD R29, -8(RSP)
  proc.go:4229  0x55f74  SUB $8, RSP, R29
  proc.go:4229  0x55f78  MOVD R0, 104(RSP)                // spill gp: RSP+104 = CFA+8 = g0.sched.sp - 8

DWARF confirms (dwarfdump -i -S match=runtime.park_m -Wc ./binary):

DW_TAG_formal_parameter
  DW_AT_name   gp
  DW_AT_location:
    [0x55f60, 0x55fa0): DW_OP_reg0          <- gp in R0 (before spill)
    [0x55fa0, 0x56230): DW_OP_fbreg 8       <- gp at CFA+8 = g0.sched.sp - 8
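
The offset arithmetic behind the park_m spill can be checked with a few lines of C. This is only a worked example under the assumptions stated in the description (mcall reserves 16 bytes below g0.sched.sp, and the callee's prologue lowers SP by its frame size); `MCALL_ARG_SPACE` and `mcall_callee_spill_addr` are hypothetical names, not identifiers from the PR.

```c
#include <assert.h>
#include <stdint.h>

// Bytes mcall reserves below g0.sched.sp before the CALL (SUB $16, RSP, RSP).
#define MCALL_ARG_SPACE 16u

// Address written by `MOVD R0, spill_off(RSP)` in an mcall callee on ARM64.
// frame_size is the callee's frame (96 for park_m in the listing above) and
// spill_off is the offset used by the spill instruction (104 for park_m).
static uint64_t mcall_callee_spill_addr(uint64_t g0_sched_sp,
                                        uint64_t frame_size,
                                        uint64_t spill_off) {
    assert(g0_sched_sp >= MCALL_ARG_SPACE + frame_size);
    uint64_t entry_sp = g0_sched_sp - MCALL_ARG_SPACE;  // SP on entry to fn (the CFA)
    uint64_t body_sp = entry_sp - frame_size;           // SP after MOVD.W R30, -frame(RSP)
    return body_sp + spill_off;                         // where R0 (gp) lands
}
```

With the park_m values (frame 96, offset 104) the result is g0.sched.sp - 16 - 96 + 104, that is g0.sched.sp - 8, matching both the disassembly comment and DW_OP_fbreg 8.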

ARM64: functions that don't spill

goexit0 does NOT spill: it passes R0 directly to gdestroy (go tool objdump -s 'runtime\.goexit0' ./binary):

TEXT runtime.goexit0(SB)
  proc.go:4447  0x56b7c  MOVD.W R30, -32(RSP) 
  proc.go:4447  0x56b80  MOVD R29, -8(RSP)
  proc.go:4447  0x56b84  SUB $8, RSP, R29
  proc.go:4448  0x56b88  CALL runtime.gdestroy(SB)         // R0 passed directly, no spill

DWARF confirms this: the location list contains only DW_OP_reg0, with no DW_OP_fbreg:

DW_TAG_formal_parameter
  DW_AT_name   gp
  DW_AT_location:
    [0x56b70, 0x56b8c): DW_OP_reg0          <- gp stays in R0, never spilled
    end-of-list

All mcall callees can be found with:

grep -n 'mcall(' /usr/local/go/src/runtime/*.go | grep -v '//' | grep -v 'func mcall'

Each callee's spill behavior can be verified with DWARF:

dwarfdump -i -S match=runtime.<callee> -Wc ./binary

If the gp parameter's location list contains DW_OP_fbreg 8, it spills to g0.sched.sp - 8. If it only contains DW_OP_reg0 followed by end-of-list, it does not spill.

When the profiler cannot resolve the user goroutine through mcall (non-spilling function on ARM64, or goroutine already rescheduled on another M), it stops unwinding at the mcall frame (*stop = true) without crossing to the user stack. The g0 frames (park_m, schedule, findRunnable, etc.) are still captured. This is the same behavior as before this change (UNWIND_COMMAND_STOP). There is no regression.

| Callee | Spills R0? | DWARF |
| --- | --- | --- |
| park_m | Yes | DW_OP_fbreg 8 |
| preemptPark | Yes | DW_OP_fbreg 8 |
| exitsyscall0 | Yes | DW_OP_fbreg 8 |
| goyield_m | Yes | DW_OP_fbreg 8 |
| goschedguarded_m | Yes | DW_OP_fbreg 8 |
| goexit0 | No | DW_OP_reg0 only |
| gosched_m | No | DW_OP_reg0 only |
| gopreempt_m | No | DWARF abstract (inlined), verified via objdump |

Stale goroutine detection

The goroutine at the g0 slot may have been rescheduled on another M since the mcall. Its gobuf could then contain values from systemstack on that other thread (observed in testing: systemstack_switch frames from a different M). The profiler therefore checks candidate.m: if it is nil, the goroutine is parked and its gobuf is reliable; if it is non-nil, the goroutine has been rescheduled and the profiler stops (STOP).
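
The lookup order and the staleness check can be summarized in a short sketch. The types `sim_g_t` and `sim_m_t` and the function `sketch_resolve_mcall_g` are hypothetical stand-ins, heavily simplified from runtime.g and runtime.m; the real unwinder reads these fields from the target process with offsets, and its actual code may be structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical, heavily simplified views of runtime.g and runtime.m.
typedef struct sim_m sim_m_t;
typedef struct sim_g {
    sim_m_t *m;  // non-NULL while the goroutine is attached to an M
} sim_g_t;
struct sim_m {
    sim_g_t *curg;  // cleared by dropg()
};

// Returns the user goroutine to unwind into, or NULL when the unwinder
// should stop at the mcall frame. g0_slot models the pointer read from
// *(g0.sched.sp - 8); it may be NULL or stale.
static sim_g_t *sketch_resolve_mcall_g(const sim_m_t *m, sim_g_t *g0_slot) {
    assert(m != NULL);
    if (m->curg != NULL) {
        return m->curg;  // sampled before dropg(): curg is still valid
    }
    if (g0_slot == NULL) {
        return NULL;     // nothing usable in the slot: stop
    }
    if (g0_slot->m != NULL) {
        return NULL;     // rescheduled on some M: gobuf is unreliable, stop
    }
    return g0_slot;      // parked goroutine: gobuf is reliable
}
```

Returning NULL in every uncertain case keeps the behavior equal to the previous one (the g0 frames are still reported, the user stack is simply not crossed).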

Recovery

Unlike systemstack, mcall saves the caller's actual registers into gobuf (not a synthetic marker):

state->pc = *(curg + sched_pc);    // gobuf.pc = real return address
state->sp = *(curg + sched_sp);    // gobuf.sp = real SP
state->fp = *(curg + sched_bp);    // gobuf.bp = real FP
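
A minimal sketch of this recovery step, including the guard against a gobuf that mcall has not finished populating, could look as follows. `sim_gobuf_t`, `sim_unwind_state_t` and `sketch_recover_from_gobuf` are hypothetical names; the field set is reduced to what the three assignments above need, and the real code reads the values from the target process rather than from a local struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical subset of runtime.gobuf: only the fields used here.
typedef struct {
    uint64_t sp;
    uint64_t pc;
    uint64_t bp;
} sim_gobuf_t;

// Hypothetical subset of the unwinder's register state.
typedef struct {
    uint64_t pc, sp, fp;
} sim_unwind_state_t;

// Copies the saved caller context into the unwind state. Returns false,
// leaving state untouched, when sp or pc is still zero, which is the case
// if the sample lands before mcall has finished writing g.sched.
static bool sketch_recover_from_gobuf(const sim_gobuf_t *sched,
                                      sim_unwind_state_t *state) {
    assert(sched != NULL && state != NULL);
    if (sched->sp == 0 || sched->pc == 0) {
        return false;  // not populated yet: stop unwinding
    }
    state->pc = sched->pc;  // real return address
    state->sp = sched->sp;  // real SP
    state->fp = sched->bp;  // real FP
    return true;
}
```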

systemstack test

amd64
image

arm64
image

mcall test

amd64
image

arm64
image

@florianl
Member

florianl commented Apr 1, 2026

I also do have something in progress, based on #1279, but need to land #1310 first. During KubeCon last week, I didn't find time to work on it.

@wehzzz
Contributor Author

wehzzz commented Apr 2, 2026

> I also do have something in progress, based on #1279, but need to land #1310 first. During KubeCon last week, I didn't find time to work on it.

Hope KubeCon went well!

I ended up getting nerd-sniped by this problem and wanted to dig into it since the previous PR was closed. No worries at all if you already have a fix in the works based on #1279. I currently have something working, but if you think your approach is the better way to tackle this, we can absolutely close this PR. Just let me know how you'd like to proceed!

@wehzzz wehzzz changed the title from "[WIP] fix: go unwinding stops at systemstacks" to "fix: go unwinding stops at systemstacks" Apr 2, 2026
@wehzzz wehzzz marked this pull request as ready for review April 8, 2026 08:01
@wehzzz wehzzz requested review from a team as code owners April 8, 2026 08:01
Contributor

@fabled fabled left a comment

Awesome stuff! Added first round of comments and questions.

Comment on lines +545 to +548
// Although systemstack is declared with $0 frame size, Go's linker injects
// a frame pointer prologue (PUSH RBP + MOVQ RSP, RBP) for all non-NOFRAME
// functions that contain a CALL instruction.
// https://github.com/golang/go/blob/affadc7997466dfacad5b9a3dc90ee5e7a7b6085/src/cmd/internal/obj/x86/obj6.go#L637
Contributor

Could we then get away with using frame pointer unwinding?

Contributor Author

@wehzzz wehzzz Apr 8, 2026

Good catch. Frame pointers are enabled on Linux for all of the following Go versions:

AMD64

| Version | Condition to PUSH RBP | Source | Prologue on Linux? |
| --- | --- | --- | --- |
| Go 1.13 | Framepointer_enabled && !NoFrame && !(frameless leaf) | obj6.go#L623, default=1, true for amd64+linux | Yes |
| Go 1.17 | !NoFrame && !(frameless nosplit) && !(frameless leaf) | obj6.go#L593 | Yes |
| Go 1.25 | !NoFrame && !(frameless leaf) | obj6.go#L622 | Yes |

ARM64

| Version | Condition to save R29 | Source | Prologue on Linux? |
| --- | --- | --- | --- |
| Go 1.13 | Framepointer_enabled(goos, goarch) - true for arm64 && linux | obj7.go#L641, Framepointer_enabled | Yes |
| Go 1.17 | Unconditional (when frame > 0) | obj7.go#L617 | Yes |
| Go 1.25 | Unconditional (small frame path) | obj7.go#L719 | Yes |

I will dig into this and move systemstack to UnwindInfoFramePointer.

// synthetic marker for Go's stack scanner and scheduler, not a real return address.
//
// https://github.com/golang/go/blob/917949cc1d16c652cb09ba369718f45e5d814d8f/src/runtime/asm_amd64.s#L886
GoLabelsOffsets *go_offs = go_get_go_offsets();
Contributor

go_get_go_offsets is used in many places, and this call is inside a rolled loop. I'm wondering if it would make sense to read this data into the PerCPURecord at the beginning of the native unwinder. The Go plugins could also use it and avoid the lookup later on.

Contributor Author

We may want to copy GoLabelsOffsets into PerCPURecord to avoid reading the same values again. I'm not sure what the best place for the first read would be. collect_trace seems to initialize the struct, so it should work there, but as I understand it that would mean doing the read for every non-Go process, adding one bpf_read for each of them (at the moment it only happens for Go processes).

Not sure if we want to do this here or in a follow-up.

// the unwinder crosses back to the goroutine stack using the goroutine's saved
// context from g.sched (gobuf).
"runtime.systemstack": &sdtypes.UnwindInfoGoSystemstack,
"runtime.mcall": &sdtypes.UnwindInfoGoMcall,
Contributor

These add the special unwind command for the whole function. Does it make sense to apply this only to the portion to which it applies? Or does the command correctly support unwinding from all locations in the corresponding functions?

Contributor Author

Since systemstack now relies on frame pointer unwinding, it's fine for it.

In the case of mcall, it's a bit more nuanced. We rely on gobuf sched values for unwinding, however we might want to special-case the first part of the function.

TEXT runtime·mcall<ABIInternal>(SB), NOSPLIT, $0-8
#ifdef GOEXPERIMENT_runtimesecret
	CMPL	g_secret(R14), $0
	JEQ	nosecret
	CALL	·secretEraseRegistersMcall(SB)
nosecret:
#endif

	MOVQ	AX, DX	// DX = fn

	// Save state in g->sched. The caller's SP and PC are restored by gogo to
	// resume execution in the caller's frame (implicit return). The caller's BP
	// is also restored to support frame pointer unwinding.
	MOVQ	SP, BX	// hide (SP) reads from vet
	MOVQ	8(BX), BX	// caller's PC
	MOVQ	BX, (g_sched+gobuf_pc)(R14)
	LEAQ	fn+0(FP), BX	// caller's SP
	MOVQ	BX, (g_sched+gobuf_sp)(R14)
	// Get the caller's frame pointer by dereferencing BP. Storing BP as it is
	// can cause a frame pointer cycle, see CL 476235.
	MOVQ	(BP), BX // caller's BP
	MOVQ	BX, (g_sched+gobuf_bp)(R14)

If we sample in the above code, we have a valid curg, however the sched values are not yet fully populated. The unwinding will stop cleanly (via the !saved_sp || !saved_pc check), but here we could instead rely on frame pointer unwinding since BP is still valid.

If we sample in the code below, go_resolve_mcall_goroutine handles both the case where we sample before dropg (using m.curg) and the case where we sample after (falling back to *(g0.sched.sp - 8)). There should be no issue as sched is fully populated at that point.

	// switch to m->g0 & its stack, call fn
	MOVQ	g_m(R14), BX
	MOVQ	m_g0(BX), SI	// SI = g.m.g0
	CMPQ	SI, R14	// if g == m->g0 call badmcall
	JNE	goodm
	JMP	runtime·badmcall(SB)
goodm:
	MOVQ	R14, AX		// AX (and arg 0) = g
	MOVQ	SI, R14		// g = g.m.g0
	get_tls(CX)		// Set G in TLS
	MOVQ	R14, g(CX)
	MOVQ	(g_sched+gobuf_sp)(R14), SP	// sp = g0.sched.sp
	MOVQ	$0, BP	// clear frame pointer, as caller may execute on another M
	PUSHQ	AX	// open up space for fn's arg spill slot
	MOVQ	0(DX), R12
	CALL	R12		// fn(g)
	// The Windows native stack unwinder incorrectly classifies the next instruction
	// as part of the function epilogue, producing a wrong call stack.
	// Add a NOP to work around this issue. See go.dev/issue/67007.
	BYTE	$0x90
	POPQ	AX
	JMP	runtime·badmcall2(SB)
	RET

@wehzzz wehzzz requested a review from fabled April 10, 2026 11:20