fix: go unwinding stops at systemstacks by wehzzz · Pull Request #1313 · open-telemetry/opentelemetry-ebpf-profiler

wehzzz · 2026-04-01T16:04:23Z

Unwinding Through Go Stack Switches

Reference: #1275. Based on @florianl's work in #1279.

systemstack

Locating the user goroutine

During systemstack, m.curg always points to the frozen user goroutine (systemstack does not call dropg). The profiler reads curg.sched.sp which points into the user stack where the frame pointer prologue saved FP and LR/RA.

gobuf.pc contains systemstack_switch+8 (a synthetic UNDEF marker for Go's stack scanner) and is intentionally ignored.

AMD64 recovery

The Go linker injects PUSH RBP; MOV RSP, RBP into systemstack (obj6.go#L637). gosave_systemstack_switch then saves gobuf.sp = LEAQ 8(SP), which skips the useless return address from CALL gosave and points to the saved RBP:

caller_fp = *(sched_sp);       // saved RBP
caller_pc = *(sched_sp + 8);   // return address
caller_sp = sched_sp + 16;
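
The three reads above can be exercised in user space as a rough sketch. The `mem_view` type and `read_u64` helper are hypothetical stand-ins for whatever user-memory read helper the eBPF code uses, and `recover_amd64` is an illustrative name, not a function from this PR.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical view over a slice of user stack memory (assumption, not PR code).
typedef struct {
  uint64_t base;          // address of words[0]
  const uint64_t *words;  // consecutive 8-byte stack slots
  size_t count;           // number of slots
} mem_view;

// Reads the 8-byte word at addr; fails if addr is unaligned or out of range.
static bool read_u64(const mem_view *m, uint64_t addr, uint64_t *out) {
  if (addr < m->base || (addr - m->base) % 8 != 0) return false;
  size_t idx = (size_t)((addr - m->base) / 8);
  if (idx >= m->count) return false;
  *out = m->words[idx];
  return true;
}

typedef struct { uint64_t pc, sp, fp; } frame_regs;

// AMD64: sched_sp points at the saved RBP; the return address sits 8 bytes above.
static bool recover_amd64(const mem_view *m, uint64_t sched_sp, frame_regs *out) {
  if (!read_u64(m, sched_sp, &out->fp)) return false;      // saved RBP
  if (!read_u64(m, sched_sp + 8, &out->pc)) return false;  // return address
  out->sp = sched_sp + 16;                                 // caller's SP
  return true;
}
```

Any failed read aborts the recovery, which mirrors the conservative "stop unwinding" behaviour described elsewhere in this PR.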

ARM64 recovery

Go's ARM64 assembler (cmd/internal/obj/arm64, function preprocess in obj7.go#L691) injects a MOVD.W prologue instead of the standard STP. Unlike STP X29, X30 which places R29 at the lower address and LR at the higher address, the MOVD.W sequence places LR at SP+0 and R29 below SP at SP-8.

Verified by disassembling runtime.systemstack (go tool objdump -s 'runtime\.systemstack' ./binary):

TEXT runtime.systemstack.abi0(SB)
  asm_arm64.s:255  0x83d70  f81f0ffe  MOVD.W R30, -16(RSP)    // *(SP+0) = LR
  asm_arm64.s:255  0x83d74  f81f83fd  MOVD R29, -8(RSP)       // *(SP-8) = R29
  asm_arm64.s:255  0x83d78  d10023fd  SUB $8, RSP, R29

The profiler reads:

caller_pc = *(sched_sp);       // LR at SP+0
caller_fp = *(sched_sp - 8);   // R29 at SP-8
caller_sp = sched_sp + 16;
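
For comparison with AMD64, here is a minimal user-space sketch of the ARM64 layout, where LR sits at `sched_sp` and R29 one slot below it. As before, `mem_view`, `read_u64` and `recover_arm64` are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical view over a slice of user stack memory (assumption, not PR code).
typedef struct {
  uint64_t base;          // address of words[0]
  const uint64_t *words;  // consecutive 8-byte stack slots
  size_t count;           // number of slots
} mem_view;

// Reads the 8-byte word at addr; fails if addr is unaligned or out of range.
static bool read_u64(const mem_view *m, uint64_t addr, uint64_t *out) {
  if (addr < m->base || (addr - m->base) % 8 != 0) return false;
  size_t idx = (size_t)((addr - m->base) / 8);
  if (idx >= m->count) return false;
  *out = m->words[idx];
  return true;
}

typedef struct { uint64_t pc, sp, fp; } frame_regs;

// ARM64 (Go's MOVD.W prologue): LR at sched_sp, R29 at sched_sp - 8.
static bool recover_arm64(const mem_view *m, uint64_t sched_sp, frame_regs *out) {
  if (sched_sp < 8) return false;                          // avoid wrap-around
  if (!read_u64(m, sched_sp, &out->pc)) return false;      // LR at SP+0
  if (!read_u64(m, sched_sp - 8, &out->fp)) return false;  // R29 at SP-8
  out->sp = sched_sp + 16;                                 // caller's SP
  return true;
}
```

The only difference from the AMD64 variant is which slot holds the frame pointer and which holds the return address.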

mcall

Locating the user goroutine

mcall's callee (park_m, goexit0, etc.) calls dropg() which sets m.curg = nil and g.m = nil. In most samples, m.curg is nil. The profiler falls back to reading the goroutine pointer from the g0 stack at *(g0.sched.sp - 8).

AMD64: mcall does PUSHQ AX (old_g) before calling fn. This writes old_g to g0.sched.sp - 8 deterministically. This is hand-written assembly, stable across all Go versions.

Verified by disassembling runtime.mcall (go tool objdump -s 'runtime\.mcall' ./binary):

TEXT runtime.mcall(SB)
  asm_amd64.s:450  0x46f514  MOVQ R14, AX                  // AX = old_g
  asm_amd64.s:451  0x46f517  MOVQ SI, R14                  // R14 = g0
  asm_amd64.s:453  0x46f51a  MOVQ R14, FS:0xfffffff8       // TLS = g0
  asm_amd64.s:454  0x46f523  MOVQ 0x38(R14), SP            // SP = g0.sched.sp
  asm_amd64.s:455  0x46f527  MOVQ $0x0, BP
  asm_amd64.s:456  0x46f52e  PUSHQ AX                      // *(g0.sched.sp - 8) = old_g
  asm_amd64.s:458  0x46f532  CALL R12                      // fn(old_g)
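
To make the effect of `PUSHQ AX` concrete, the following toy model pushes a goroutine pointer onto a small simulated g0 stack and then reads it back from `g0.sched.sp - 8`. The `toy_stack` type and all function names are assumptions made for this sketch; they do not exist in the profiler.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 8

// Toy g0 stack occupying [base, base + 8 * STACK_WORDS); illustrative only.
typedef struct {
  uint64_t base;
  uint64_t words[STACK_WORDS];
  uint64_t sp;  // current stack pointer
} toy_stack;

// Maps an address to a slot index; fails if unaligned or outside the stack.
static bool slot_index(const toy_stack *s, uint64_t addr, size_t *idx) {
  if (addr < s->base || (addr - s->base) % 8 != 0) return false;
  *idx = (size_t)((addr - s->base) / 8);
  return *idx < STACK_WORDS;
}

// Models PUSHQ: decrement SP by 8, then store the value at the new SP.
static bool pushq(toy_stack *s, uint64_t value) {
  size_t idx;
  if (s->sp < 8 || !slot_index(s, s->sp - 8, &idx)) return false;
  s->sp -= 8;
  s->words[idx] = value;
  return true;
}

// Models the profiler's fallback read of old_g at g0.sched.sp - 8.
static bool read_old_g(const toy_stack *s, uint64_t g0_sched_sp, uint64_t *old_g) {
  size_t idx;
  if (g0_sched_sp < 8 || !slot_index(s, g0_sched_sp - 8, &idx)) return false;
  *old_g = s->words[idx];
  return true;
}
```

Because mcall loads SP from `g0.sched.sp` immediately before the push, the slot address depends only on `g0.sched.sp`, not on anything the callee does later.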

ARM64: mcall passes old_g in R0 (no push). The goroutine pointer is at g0.sched.sp - 8 only if fn spills R0 to its ABIInternal arg area. This depends on the compiler's spill decisions, not hand-written assembly. Stable for the current ABI; a future ABI change could alter the spill offset. The DWARF DW_OP_fbreg 8 should be used as the source of truth.

Verified by disassembling runtime.mcall on ARM64:

TEXT runtime.mcall(SB)
  asm_arm64.s:234  0x83d24  MOVD 56(R28), R0              // R0 = g0.sched.sp
  asm_arm64.s:235  0x83d28  MOVD R0, RSP                  // RSP = g0.sched.sp
  asm_arm64.s:236  0x83d2c  MOVD ZR, R29
  asm_arm64.s:237  0x83d30  MOVD R3, R0                   // R0 = old_g
  asm_arm64.s:238  0x83d34  MOVD ZR, -16(RSP) 
  asm_arm64.s:239  0x83d38  SUB $16, RSP, RSP             // allocate 16-byte arg space
  asm_arm64.s:241  0x83d40  CALL (R4)                     // fn(old_g)

fn's prologue then spills R0 to entry_SP + 8 = g0.sched.sp - 16 + 8 = g0.sched.sp - 8. This is confirmed by both disassembly and DWARF.

park_m spills (go tool objdump -s 'runtime\.park_m' ./binary):

TEXT runtime.park_m(SB)
  proc.go:4229  0x55f6c  MOVD.W R30, -96(RSP)             // frame = 96 bytes
  proc.go:4229  0x55f70  MOVD R29, -8(RSP)
  proc.go:4229  0x55f74  SUB $8, RSP, R29
  proc.go:4229  0x55f78  MOVD R0, 104(RSP)                // spill gp: RSP+104 = CFA+8 = g0.sched.sp - 8

DWARF confirms (dwarfdump -i -S match=runtime.park_m -Wc ./binary):

DW_TAG_formal_parameter
  DW_AT_name   gp
  DW_AT_location:
    [0x55f60, 0x55fa0): DW_OP_reg0          <- gp in R0 (before spill)
    [0x55fa0, 0x56230): DW_OP_fbreg 8       <- gp at CFA+8 = g0.sched.sp - 8
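
The offset arithmetic for park_m can be checked with a few lines of C. This is only a worked calculation under the numbers shown in the disassembly above (16 bytes of argument space, a 96-byte frame, a spill at `104(RSP)`); the function names are hypothetical.

```c
#include <stdint.h>

// SP on entry to fn: mcall executes SUB $16, RSP before the call.
static uint64_t fn_entry_sp(uint64_t g0_sched_sp) {
  return g0_sched_sp - 16;
}

// Address of the gp spill slot: fn's prologue lowers RSP by frame_size,
// then stores R0 at spill_off(RSP).
static uint64_t spill_addr(uint64_t g0_sched_sp, uint64_t frame_size,
                           uint64_t spill_off) {
  uint64_t rsp_in_body = fn_entry_sp(g0_sched_sp) - frame_size;
  return rsp_in_body + spill_off;
}
```

With `frame_size = 96` and `spill_off = 104`, the result is `g0_sched_sp - 16 - 96 + 104 = g0_sched_sp - 8`, matching the slot the profiler reads.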

ARM64: functions that don't spill

goexit0 does NOT spill - passes R0 directly to gdestroy (go tool objdump -s 'runtime\.goexit0' ./binary):

TEXT runtime.goexit0(SB)
  proc.go:4447  0x56b7c  MOVD.W R30, -32(RSP) 
  proc.go:4447  0x56b80  MOVD R29, -8(RSP)
  proc.go:4447  0x56b84  SUB $8, RSP, R29
  proc.go:4448  0x56b88  CALL runtime.gdestroy(SB)         // R0 passed directly, no spill

DWARF confirms - location list only contains DW_OP_reg0, no DW_OP_fbreg:

DW_TAG_formal_parameter
  DW_AT_name   gp
  DW_AT_location:
    [0x56b70, 0x56b8c): DW_OP_reg0          <- gp stays in R0, never spilled
    end-of-list

All mcall callees can be found with:

grep -n 'mcall(' /usr/local/go/src/runtime/*.go | grep -v '//' | grep -v 'func mcall'

Each callee's spill behavior can be verified with DWARF:

dwarfdump -i -S match=runtime.<callee> -Wc ./binary

If the gp parameter's location list contains DW_OP_fbreg 8, it spills to g0.sched.sp - 8. If it only contains DW_OP_reg0 followed by end-of-list, it does not spill.
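
The rule above is simple enough to express as a small classifier. The sketch below assumes the location list has already been extracted into an array of operation strings (for example by post-processing dwarfdump output); that representation and the function name are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Returns true if any entry in gp's location list is exactly "DW_OP_fbreg 8",
// i.e. the callee spills gp to g0.sched.sp - 8 at some point.
static bool spills_to_g0_slot(const char *const *ops, size_t n) {
  for (size_t i = 0; i < n; i++) {
    if (strcmp(ops[i], "DW_OP_fbreg 8") == 0) return true;
  }
  return false;
}
```

A list such as `{"DW_OP_reg0", "DW_OP_fbreg 8"}` (park_m) is classified as spilling, while `{"DW_OP_reg0"}` (goexit0) is not.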

When the profiler cannot resolve the user goroutine through mcall (non-spilling function on ARM64, or goroutine already rescheduled on another M), it stops unwinding at the mcall frame (*stop = true) without crossing to the user stack. The g0 frames (park_m, schedule, findRunnable, etc.) are still captured. This is the same behavior as before this change (UNWIND_COMMAND_STOP). There is no regression.

Callee	Spills R0?	DWARF
`park_m`	Yes	`DW_OP_fbreg 8`
`preemptPark`	Yes	`DW_OP_fbreg 8`
`exitsyscall0`	Yes	`DW_OP_fbreg 8`
`goyield_m`	Yes	`DW_OP_fbreg 8`
`goschedguarded_m`	Yes	`DW_OP_fbreg 8`
`goexit0`	No	`DW_OP_reg0` only
`gosched_m`	No	`DW_OP_reg0` only
`gopreempt_m`	No	DWARF abstract (inlined), verified via objdump

Stale goroutine detection

The goroutine at the g0 slot may have been rescheduled on another M since the mcall. Its gobuf could then contain values from systemstack on that other thread (observed in testing: systemstack_switch frames from a different M). The profiler therefore checks candidate.m: if it is nil, the goroutine is still parked and its gobuf is reliable; if it is non-nil, the goroutine has been rescheduled and the profiler stops unwinding (STOP).
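
A minimal sketch of that check, using a toy goroutine struct. The `toy_g` layout and the `resolve_result` enum are hypothetical; the real code reads `g.m` through an offset from the target process's memory.

```c
#include <stddef.h>
#include <stdint.h>

// Toy goroutine: only the fields this check needs (illustrative layout).
typedef struct {
  uint64_t m;         // owning M, 0 when the goroutine is parked
  uint64_t sched_pc;  // gobuf.pc
  uint64_t sched_sp;  // gobuf.sp
  uint64_t sched_bp;  // gobuf.bp
} toy_g;

typedef enum { RESOLVE_OK, RESOLVE_STOP } resolve_result;

// Accept the candidate only while it is parked (m == nil); otherwise its
// gobuf may describe a stack switch on another thread, so stop unwinding.
static resolve_result validate_candidate(const toy_g *candidate) {
  if (candidate == NULL) return RESOLVE_STOP;
  return candidate->m == 0 ? RESOLVE_OK : RESOLVE_STOP;
}
```

Stopping on a non-nil `m` trades a truncated stack for never reporting frames that belong to a different thread.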

Recovery

Unlike systemstack, mcall saves the caller's actual registers into gobuf (not a synthetic marker):

state->pc = *(curg + sched_pc);    // gobuf.pc = real return address
state->sp = *(curg + sched_sp);    // gobuf.sp = real SP
state->fp = *(curg + sched_bp);    // gobuf.bp = real FP
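
As a rough user-space illustration of these three loads, the sketch below treats the goroutine as a byte buffer and reads the fields at configurable offsets. The `toy_offsets` struct, the helper names and the offsets used when calling it are assumptions; the real offsets are resolved per Go version.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical offsets of gobuf fields inside the g struct.
typedef struct {
  uint32_t sched_sp;
  uint32_t sched_pc;
  uint32_t sched_bp;
} toy_offsets;

typedef struct { uint64_t pc, sp, fp; } unwind_state;

// Bounds-checked 8-byte load from a copy of the goroutine's memory.
static bool load_u64(const uint8_t *g, size_t g_size, uint32_t off, uint64_t *out) {
  if ((size_t)off + sizeof(*out) > g_size) return false;
  memcpy(out, g + off, sizeof(*out));
  return true;
}

// Restores PC, SP and FP from the goroutine's saved scheduling context.
static bool recover_from_gobuf(const uint8_t *g, size_t g_size,
                               const toy_offsets *o, unwind_state *st) {
  uint64_t pc, sp, fp;
  if (!load_u64(g, g_size, o->sched_pc, &pc)) return false;
  if (!load_u64(g, g_size, o->sched_sp, &sp)) return false;
  if (!load_u64(g, g_size, o->sched_bp, &fp)) return false;
  st->pc = pc;  // gobuf.pc = real return address
  st->sp = sp;  // gobuf.sp = real SP
  st->fp = fp;  // gobuf.bp = real FP
  return true;
}
```

The state is only updated once all three loads succeed, so a partial read never leaves a half-restored register set.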

systemstack test

amd64

arm64

mcall test

amd64

arm64

florianl · 2026-04-01T16:07:09Z

I also do have something in progress, based on #1279, but need to land #1310 first. During KubeCon last week, I didn't find time to work on it.

wehzzz · 2026-04-02T08:20:37Z

> I also do have something in progress, based on #1279, but need to land #1310 first. During KubeCon last week, I didn't find time to work on it.

Hope KubeCon went well!

I ended up getting nerd-sniped by this problem and wanted to dig into it since the previous PR was closed. No worries at all if you already have a fix in the works based on #1279. I currently have something working, but if you think your approach is the better way to tackle this, we can absolutely close this PR. Just let me know how you'd like to proceed!

fabled

Awesome stuff! Added first round of comments and questions.

fabled · 2026-04-08T12:23:52Z

support/ebpf/native_stack_trace.ebpf.c

+      // Although systemstack is declared with $0 frame size, Go's linker injects
+      // a frame pointer prologue (PUSH RBP + MOVQ RSP, RBP) for all non-NOFRAME
+      // functions that contain a CALL instruction.
+      // https://github.com/golang/go/blob/affadc7997466dfacad5b9a3dc90ee5e7a7b6085/src/cmd/internal/obj/x86/obj6.go#L637


Could we then get away by using frame pointer unwinding?

wehzzz · 2026-04-08

Good catch. Frame pointers are enabled on Linux for all of the following Go versions:

AMD64

Version	Condition to PUSH RBP	Source	Prologue on Linux?
Go 1.13	`Framepointer_enabled && !NoFrame && !(frameless leaf)`	obj6.go#L623, default=1, true for amd64+linux	Yes
Go 1.17	`!NoFrame && !(frameless nosplit) && !(frameless leaf)`	obj6.go#L593	Yes
Go 1.25	`!NoFrame && !(frameless leaf)`	obj6.go#L622	Yes

ARM64

Version	Condition to save R29	Source	Prologue on Linux?
Go 1.13	`Framepointer_enabled(goos, goarch)` - true for `arm64 && linux`	obj7.go#L641, Framepointer_enabled	Yes
Go 1.17	Unconditional (when frame > 0)	obj7.go#L617	Yes
Go 1.25	Unconditional (small frame path)	obj7.go#L719	Yes

I will dig into this and move systemstack to UnwindInfoFramePointer
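
For readers unfamiliar with the strategy, generic AMD64 frame-pointer unwinding looks roughly like the sketch below: each frame stores the caller's frame pointer at `fp` and the return address at `fp + 8`. This is a user-space illustration with hypothetical helpers (`mem_view`, `read_u64`, `walk_frame_pointers`), not the profiler's UnwindInfoFramePointer implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical view over a slice of user stack memory (assumption, not PR code).
typedef struct {
  uint64_t base;          // address of words[0]
  const uint64_t *words;  // consecutive 8-byte stack slots
  size_t count;           // number of slots
} mem_view;

// Reads the 8-byte word at addr; fails if addr is unaligned or out of range.
static bool read_u64(const mem_view *m, uint64_t addr, uint64_t *out) {
  if (addr < m->base || (addr - m->base) % 8 != 0) return false;
  size_t idx = (size_t)((addr - m->base) / 8);
  if (idx >= m->count) return false;
  *out = m->words[idx];
  return true;
}

// Follows the saved-FP chain and collects up to max return addresses.
static size_t walk_frame_pointers(const mem_view *m, uint64_t fp,
                                  uint64_t *pcs, size_t max) {
  size_t n = 0;
  while (n < max && fp != 0) {
    uint64_t next_fp, ret;
    if (!read_u64(m, fp, &next_fp) || !read_u64(m, fp + 8, &ret)) break;
    pcs[n++] = ret;
    // Stacks grow down, so a caller's frame must sit at a higher address;
    // anything else (including 0) ends the walk and prevents cycles.
    if (next_fp <= fp) break;
    fp = next_fp;
  }
  return n;
}
```

The monotonicity check is a common safeguard against corrupted or cyclic frame-pointer chains; the `max` bound keeps the loop verifier-friendly in an eBPF setting.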


fabled · 2026-04-08T12:26:03Z

support/ebpf/native_stack_trace.ebpf.c

+      // synthetic marker for Go's stack scanner and scheduler, not a real return address.
+      //
+      // https://github.com/golang/go/blob/917949cc1d16c652cb09ba369718f45e5d814d8f/src/runtime/asm_amd64.s#L886
+      GoLabelsOffsets *go_offs = go_get_go_offsets();


go_get_go_offsets is used in many places, and this is inside a rolled loop. I'm wondering if it'd make sense to read this data into the PerCPURecord at the beginning of the native unwinder? The Go plugins could then also use it and avoid the lookup later on.

wehzzz · 2026-04-09

We may want to copy GoLabelsOffsets into PerCPURecord to avoid reading the same values again. I'm not sure what the best place for the first read would be: collect_trace seems to initialize the struct, so it should work there, but from my understanding that would mean calling it for every non-Go process, adding one bpf_read for each of them (at the moment it only happens for Go processes).

Not sure if we want to do this here or in a follow-up.
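
One possible shape for such a cache, sketched in plain C: the record carries a validity flag that is cleared once per trace, and the lookup happens lazily on first use, so non-Go processes would never trigger it. Everything here (`toy_per_cpu_record`, `toy_go_offsets`, the field names, the fake lookup keyed on pid 42) is hypothetical and only illustrates the idea discussed in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative subset of per-process Go offsets; field names are assumptions.
typedef struct {
  uint32_t m_offset;
  uint32_t curg;
} toy_go_offsets;

// Hypothetical per-CPU scratch record with a cached copy of the offsets.
typedef struct {
  bool go_offs_valid;
  toy_go_offsets go_offs;
} toy_per_cpu_record;

static int toy_lookup_calls;  // counts simulated map lookups

// Stand-in for the map lookup; only pid 42 is treated as a Go process here.
static bool toy_lookup_go_offsets(uint32_t pid, toy_go_offsets *out) {
  toy_lookup_calls++;
  if (pid != 42) return false;
  out->m_offset = 48;
  out->curg = 192;
  return true;
}

// Called once per trace, before unwinding starts.
static void toy_record_reset(toy_per_cpu_record *rec) {
  rec->go_offs_valid = false;
}

// Returns the cached offsets, performing the lookup only on first use.
static const toy_go_offsets *toy_get_go_offsets(toy_per_cpu_record *rec,
                                                uint32_t pid) {
  if (!rec->go_offs_valid) {
    if (!toy_lookup_go_offsets(pid, &rec->go_offs)) return NULL;
    rec->go_offs_valid = true;
  }
  return &rec->go_offs;
}
```

The lazy variant sidesteps the question of where to place the first read: resetting the flag in the trace initialisation costs one store, and the lookup itself is only paid by code paths that actually need Go offsets.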


fabled · 2026-04-08T13:58:30Z

nativeunwind/elfunwindinfo/elfgopclntab.go

+	// the unwinder crosses back to the goroutine stack using the goroutine's saved
+	// context from g.sched (gobuf).
+	"runtime.systemstack": &sdtypes.UnwindInfoGoSystemstack,
+	"runtime.mcall":       &sdtypes.UnwindInfoGoMcall,


These add the special unwind command for the whole function. Does it make sense to apply this only to the portion to which it applies? Or does the command correctly support unwinding from all locations in the corresponding functions?

wehzzz · 2026-04-09

Since systemstack now relies on frame pointer unwinding, it's fine for it.

In the case of mcall, it's a bit more nuanced. We rely on gobuf sched values for unwinding, however we might want to special-case the first part of the function.

TEXT runtime·mcall<ABIInternal>(SB), NOSPLIT, $0-8
#ifdef GOEXPERIMENT_runtimesecret
  CMPL g_secret(R14), $0
  JEQ nosecret
  CALL ·secretEraseRegistersMcall(SB)
nosecret:
#endif
  MOVQ AX, DX                        // DX = fn

  // Save state in g->sched. The caller's SP and PC are restored by gogo to
  // resume execution in the caller's frame (implicit return). The caller's BP
  // is also restored to support frame pointer unwinding.
  MOVQ SP, BX                        // hide (SP) reads from vet
  MOVQ 8(BX), BX                     // caller's PC
  MOVQ BX, (g_sched+gobuf_pc)(R14)
  LEAQ fn+0(FP), BX                  // caller's SP
  MOVQ BX, (g_sched+gobuf_sp)(R14)
  // Get the caller's frame pointer by dereferencing BP. Storing BP as it is
  // can cause a frame pointer cycle, see CL 476235.
  MOVQ (BP), BX                      // caller's BP
  MOVQ BX, (g_sched+gobuf_bp)(R14)

If we sample in the above code, we have a valid curg, however the sched values are not yet fully populated. The unwinding will stop cleanly (via the !saved_sp || !saved_pc check), but here we could instead rely on frame pointer unwinding since BP is still valid.
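
The alternative suggested here could be sketched as a small decision function: use the gobuf when it is populated, otherwise fall back to frame pointers while BP is still valid, otherwise stop. This is only an illustration of the idea, not code from the PR; the enum and function names are hypothetical, and it assumes unpopulated sched fields read as zero, as the `!saved_sp || !saved_pc` check implies.

```c
#include <stdint.h>

typedef enum {
  STRATEGY_GOBUF,          // sched fully populated: cross via gobuf
  STRATEGY_FRAME_POINTER,  // early in mcall: BP still describes the caller
  STRATEGY_STOP,           // nothing usable: stop unwinding cleanly
} mcall_strategy;

// Picks how to continue unwinding from a sample taken inside mcall.
static mcall_strategy pick_mcall_strategy(uint64_t saved_sp, uint64_t saved_pc,
                                          uint64_t bp) {
  if (saved_sp != 0 && saved_pc != 0) return STRATEGY_GOBUF;
  if (bp != 0) return STRATEGY_FRAME_POINTER;
  return STRATEGY_STOP;
}
```

Once mcall has switched to g0 it clears BP (`MOVQ $0, BP`), so the frame-pointer branch can only trigger in the early part of the function, which is the window this comment is about.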

If we sample in the code below, go_resolve_mcall_goroutine handles both the case where we sample before dropg (using m.curg) and the case where we sample after (falling back to *(g0.sched.sp - 8)). There should be no issue as sched is fully populated at that point.

  // switch to m->g0 & its stack, call fn
  MOVQ g_m(R14), BX
  MOVQ m_g0(BX), SI                  // SI = g.m.g0
  CMPQ SI, R14                       // if g == m->g0 call badmcall
  JNE goodm
  JMP runtime·badmcall(SB)
goodm:
  MOVQ R14, AX                       // AX (and arg 0) = g
  MOVQ SI, R14                       // g = g.m.g0
  get_tls(CX)                        // Set G in TLS
  MOVQ R14, g(CX)
  MOVQ (g_sched+gobuf_sp)(R14), SP   // sp = g0.sched.sp
  MOVQ $0, BP                        // clear frame pointer, as caller may execute on another M
  PUSHQ AX                           // open up space for fn's arg spill slot
  MOVQ 0(DX), R12
  CALL R12                           // fn(g)
  // The Windows native stack unwinder incorrectly classifies the next instruction
  // as part of the function epilogue, producing a wrong call stack.
  // Add a NOP to work around this issue. See go.dev/issue/67007.
  BYTE $0x90
  POPQ AX
  JMP runtime·badmcall2(SB)
  RET

wehzzz changed the title ~~[WIP] fix: go unwinding stops at systemstacks~~ fix: go unwinding stops at systemstacks Apr 2, 2026

wehzzz marked this pull request as ready for review April 8, 2026 08:01

wehzzz requested review from a team as code owners April 8, 2026 08:01

wehzzz added 8 commits April 8, 2026 08:10

feat: add support for systemstack unwinding through sp & bp

bc9e03d

fix: wrong memory mapping

8215d62

recompile binary

88cb801

feat: add mcall support

2161f2c

chore: lint

d4da8f2

fix: remove mcall support

54b803d

chore: make ebpf

5db0218

feat: add mcall support and fix systemstack arm64 register order

6961a92

wehzzz force-pushed the fix-go-unwinding-stops-at-systemstacks branch from dfb9247 to 6961a92 Compare April 8, 2026 08:16

florianl mentioned this pull request Apr 8, 2026

ebpf: support Go cgo stack unwinding through goroutine stack #1331

Open

fabled reviewed Apr 8, 2026


alban mentioned this pull request Apr 8, 2026

[PoC][wip] ustack: user stack with otel-ebpf-profiler used as a library inspektor-gadget/inspektor-gadget#4925

Draft

fix: use fp strategy for systemstack unwinding; apply review

759f10b

wehzzz requested a review from fabled April 10, 2026 11:20

wehzzz wants to merge 9 commits into open-telemetry:main from wehzzz:fix-go-unwinding-stops-at-systemstacks

3 participants
