Skip to content

Conversation

user202729
Copy link
Contributor

@user202729 user202729 commented Aug 30, 2025

Follow-up to #40613.

Save the following into b.sage

for i in range(10):
	print("start", flush=True)
	try:
		alarm(0.1); ignore = libgap([1000]*10000); cancel_alarm()
	except:
		print("exception", flush=True)
	else:
		print("no exception", flush=True)

then run sage b.sage. It should segmentation fault soon.


The cause

As is well known, if Cython generates code of the following form

GAP_Enter(); // or sig_on()
__pyx_t_8 = <something>; // (A)
<something that raises GAP error here> // (B)
__pyx_t_8 = 0;
__Pyx_XDECREF(__pyx_t_8);

then the __Pyx_XDECREF(__pyx_t_8) gets optimized to if (__pyx_t_8 != 0) SEGMENTATION_FAULT;,
and if GAP error is raised, we get null pointer dereference.


Before #40613,

  • user KeyboardInterrupt or AlarmInterrupt jumps to sig_on(),
  • GAP error, if sig_GAP_Enter() is used, first jumps to GAP_Enter(), then a sig_error() call jumps to sig_on() to raise a Python exception,
  • GAP error, if GAP_Enter() is used, jumps to GAP_Enter(), which then raises a Python exception.

After #40613, in all cases, the outermost GAP_Enter() is jumped to.


Debug guide

There are 3 parts of debugging:

  • how to find the variable name __pyx_t_8,
  • how to find the call (B),
  • how to find the assignment (A).

Finding the variable name is easy: it suffices to run sage --gdb then run the script.
When it segmentation fault, look at the instruction that causes the segmentation fault.
In this case, it is

<__pyx_f_4sage_4libs_3gap_7element_make_gap_list+1272>   cmp    QWORD PTR [rbp-0x90],0x0
<__pyx_f_4sage_4libs_3gap_7element_make_gap_list+1280>   je     0x7fa8f7f0449a <__pyx_f_4sage_4libs_3gap_7element_make_gap_list+1306>
<__pyx_f_4sage_4libs_3gap_7element_make_gap_list+1282>   xor    edx,edx
<__pyx_f_4sage_4libs_3gap_7element_make_gap_list+1284>   mov    rax,QWORD PTR [rdx]

therefore the variable is [rbp-0x90].

How to map back from assembly to .c file

  • go to compile_commands.json
  • look for the compile command used to generate meson-generated_src_sage_libs_gap_element.pyx.c.o from gap/element.pyx.c
  • change .o to .s then add -S -fverbose-asm -masm=intel and recompile, then read the .s file generated

alternatively I believe adding -g at appropriate places also give you this.

How to find the call

Break at either __pyx_f_4sage_4libs_3gap_4util_gap_interrupt_asap, GAP_THROW, or __longjmp_chk.
Then CYSIGNALS_CRASH_QUIET=1 rr record it, rr replay -e and reverse-continue from the point of segmentation fault.

Relevant backtrace:

#0  0x00007fa94be53060 in __pyx_f_4sage_4libs_3gap_4util_gap_interrupt_asap ()
#1  <signal handler called>
⋮
#40 READ_ALL_COMMANDS (instream=<optimized out>, echo=<optimized out>, capture=0x100000005f70, 
#41 0x00007fa94be5d398 in __pyx_f_4sage_4libs_3gap_4util_gap_eval ()
#42 0x00007fa8f6fee3b5 in __pyx_pw_4sage_4libs_3gap_6libgap_3Gap_7eval ()
#43 0x00007fa95159b803 in __pyx_pw_4sage_9structure_11sage_object_10SageObject_49_libgap_ ()
⋮
#48 0x00007fa8f7f04266 in __pyx_f_4sage_4libs_3gap_7element_make_gap_list ()
#49 0x00007fa8f6ff0e2d in __pyx_pw_4sage_4libs_3gap_6libgap_3Gap_3_element_constructor_ ()
#50 0x00007fa9509abf86 in __pyx_f_4sage_9structure_11coerce_maps_24DefaultConvertMap_unique__call_

run frame 48 and see the assembly is

# .../libs/gap/element.pyx.c:8996:           __pyx_t_5 = __Pyx_PyObject_FastCall(__pyx_t_8, __pyx_callargs+1-__pyx_t_10, 1+__pyx_t_10);
	call	__Pyx_PyObject_FastCallDict.constprop.0	#
# .../mamba/envs/sage-dev/include/python3.11/object.h:601:     if (op != _Py_NULL) {
	cmp	QWORD PTR -152[rbp], 0	# %sfp,
# .../libs/gap/element.pyx.c:8996:           __pyx_t_5 = __Pyx_PyObject_FastCall(__pyx_t_8, __pyx_callargs+1-__pyx_t_10, 1+__pyx_t_10);

which correspond to (look at element.pyx.c:8996 and scroll a few lines up)

        /* "sage/libs/gap/element.pyx":58
 *         for i, x in enumerate(sage_list):
 *             if not isinstance(x, GapElement):
 *                 elem = <GapElement>libgap(x)             # <<<<<<<<<<<<<<
 *             else:
 *                 elem = <GapElement>x
 */

We determine this is the call (B).

How to find the assignment instruction

We want to find (A).

Assume we're in rr, run continue then reverse-stepi a few times until the cmp instruction is executed, then run

(rr) p $rbp-0x90
$28 = (void *) 0x7ffd08a57d10
(rr) watch *(void**)0x7ffd08a57d10
Hardware watchpoint 12: *(void**)0x7ffd08a57d10

where the argument to watch is copied from the output of p.

Then reverse-continue. Whichever instruction last set the memory address is the instruction (A).

        /* "sage/libs/gap/element.pyx":58
 *         for i, x in enumerate(sage_list):
 *             if not isinstance(x, GapElement):
 *                 elem = <GapElement>libgap(x)             # <<<<<<<<<<<<<<
 *             else:
 *                 elem = <GapElement>x
 */
        __Pyx_GetModuleGlobalName(__pyx_t_8, __pyx_n_s_libgap); if (unlikely(!__pyx_t_8)) __PYX_ERR(0, 58, __pyx_L4_error)
        __Pyx_GOTREF(__pyx_t_8);

Turns out it's the __Pyx_GetModuleGlobalName (to fetch the libgap Python global variable).

Conclusion

Don't access any Python object between GAP_Enter ... GAP_Leave. Not even accessing the Python global variable libgap.

Side note, if I add cdef libgap right before the from ... import libgap line, then this particular cause is eliminated, but it still segmentation fault because of the enumerate() plus some complex combination of compiler optimization.

This fix creates a temporarily variable, but I think the overhead of creating a Python tuple is small.

We're still accessing gap_elements[i], which is a Python API call (!!!), but as it happens, GAP_AssList never throws, so we're safe.


📝 Checklist

  • The title is concise and informative.
  • The description explains in detail what this PR is about.
  • I have linked a relevant issue or discussion.
  • I have created tests covering the changes.
  • I have updated the documentation and checked the documentation preview.

⌛ Dependencies

Copy link

Documentation preview for this PR (built with commit 7eeab93; changes) is ready! 🎉
This preview will update shortly after each push to this PR.

@orlitzky
Copy link
Contributor

Argh, I noticed this while working on #40585 but it's not what was causing that reproducible segfault so I eventually reverted my changes and forgot about it.

I think you'll find the same issue in make_gap_matrix(). Maybe that's what caused a problem in matrix_gap.pyx?

@orlitzky
Copy link
Contributor

The CI failures are #40715, you'll probably have to pad the max_wait_after_interrupt to get them passing reliably.

@user202729
Copy link
Contributor Author

I suspect the timeout to be more likely #40556 . Anyway…

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants