Combine python and native unwinder into single loop#1288

Open
gnurizen wants to merge 4 commits intoopen-telemetry:mainfrom
parca-dev:python-native-hybrid

Conversation


@gnurizen gnurizen commented Mar 26, 2026

Combine python and native unwinder into single loop
Python programs, especially pytorch workloads, can exhaust the tail
call limit by switching between the python and native unwinders more
than 29 times. This happens because of eval/delegation patterns where
one python frame is decorated with a couple of native frames.

In order to unwind these stacks successfully, fold the native unwinder
into the python unwinder so that at each frame either a python or a
native frame can be unwound.

Replace the separate walk_python_stack inner loop and outer
transition loop with a single switch-in-loop structure using
step_python and step_native helper functions. This reduces
tail call usage from one per batch to one per loop budget
exhaustion (PYTHON_NATIVE_LOOP_ITERS=9 iterations).

Move native unwinder map externs (exe_id_to_*_stack_deltas,
stack_delta_page_to_info, unwind_info_array) out of the
TESTING_COREDUMP guard in extmaps.h so python_tracer.ebpf.c
can include native_stack_trace.h.

The Python loop iteration count is now a ro_vars entry so it can be
set low by default and raised when debug_prints is disabled, which
allows for much bigger stacks.

@gnurizen gnurizen changed the title python native hybrid Combine python and native unwinder into single loop Mar 26, 2026
@gnurizen gnurizen force-pushed the python-native-hybrid branch 3 times, most recently from a83b6d6 to 365d706 on March 26, 2026 23:40
@gnurizen gnurizen marked this pull request as ready for review March 27, 2026 00:59
@gnurizen gnurizen requested review from a team as code owners March 27, 2026 00:59
@gnurizen
Contributor Author

@fabled @florianl tagging you guys for review consideration; no hurry, just want to make sure this gets on the appropriate radars. Thanks!

On arm64, rt_regs[34] consumes 272 bytes of the 512-byte BPF stack.
When unwind_one_frame is inlined into interpreter unwinders, this
exceeds the stack limit. Move rt_regs into the PerCPURecord scratch
union which is already 1024+ bytes and unused during signal frame
handling.
@fabled
Contributor

fabled commented Apr 8, 2026

Seems ruby has the same issue. See #1335.

I wonder if something more elaborate could be done. Or is it better to bundle the native unwinder with the interpreters that need it, due to mixing native/HLL frames every few frames?

gnurizen added 3 commits April 8, 2026 14:56
This is a prep-the-patient PR to make room for a hybrid python/native
unwinder that we found necessary to unwind large pytorch stacks that
go back and forth between python and native more times than the tail
call limit will allow.

This change is pure code motion and changes nothing functionally.
Python programs, especially pytorch workloads, can exhaust the tail
call limit by switching between the python and native unwinders more
than 29 times. This happens because of eval/delegation patterns where
one python frame is decorated with a couple of native frames.

In order to unwind these stacks successfully, fold the native unwinder
into the python unwinder so that at each frame either a python or a
native frame can be unwound.

Replace the separate walk_python_stack inner loop and outer
transition loop with a single switch-in-loop structure using
step_python and step_native helper functions. This reduces
tail call usage from one per batch to one per loop budget
exhaustion (PYTHON_NATIVE_LOOP_ITERS=9 iterations).

Move native unwinder map externs (exe_id_to_*_stack_deltas,
stack_delta_page_to_info, unwind_info_array) out of the
TESTING_COREDUMP guard in extmaps.h so python_tracer.ebpf.c
can include native_stack_trace.h.

The Python loop iteration count is now a ro_vars entry so it can be
set low by default and raised when debug_prints is disabled, which
allows for much bigger stacks.
@gnurizen gnurizen force-pushed the python-native-hybrid branch from 2489c65 to a4809c1 on April 8, 2026 19:07
@gnurizen
Contributor Author

gnurizen commented Apr 8, 2026

Rebased onto PR #1286. Yeah, I'd love to get @dalehamel's thoughts on the applicability of this approach to the Ruby situation.

@dalehamel
Contributor

Seems ruby has the same issue. See #1335.

I wonder if something more elaborate could be done. Or is it better to bundle the native unwinder with the interpreters that need it, due to mixing native/HLL frames every few frames?

Yes, especially in production we see this problem. With yjit, the problem is masked by the fact that the jit is only the leaf frame, and we don't run with jit frame pointers in production for performance reasons.

However, the ruby unwinder is already quite instruction-heavy, as we need to do the complex CME resolution for each ruby frame, so it might be hard to get all frames if we add the native unwinder into it too. In reality we mostly care about the actual 'ruby' stack state (where it shouldn't really matter whether a frame is jit- or interpreter-backed, as it might be either with zjit) plus the jit leaf state the majority of the time, and that's been fine for our purposes.

If we could manage to actually continue the native unwinding without exhausting tail calls, that would certainly be the best of both worlds, but I wouldn't say it's the highest priority.
