Fix SIGUSR1/SIGUSR2 handler clobbering breaking JVM processes #160

Open
charford wants to merge 1 commit into Project-HAMi:main from charford:fix-sigusr-signal-handler-chaining
Conversation

@charford charford commented Mar 17, 2026

Problem

libvgpu.so registers signal handlers for SIGUSR1 and SIGUSR2 using signal(), which overwrites any previously installed handlers without saving them. This causes JVM processes to crash with SIGSEGV in Monitor::wait() because the JVM uses SIGUSR1/SIGUSR2 internally for GC safepoints and thread management.
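For illustration, the clobbering pattern can be reproduced in a few lines of C. This is a minimal sketch, not the libvgpu.so code: `runtime_handler`, `library_handler`, and `demo_clobber` are hypothetical names standing in for a runtime's handler and the library's stub.

```c
#include <assert.h>
#include <signal.h>

/* Flags set by the two handlers; sig_atomic_t keeps the writes signal-safe. */
static volatile sig_atomic_t runtime_handler_ran = 0;
static volatile sig_atomic_t library_handler_ran = 0;

/* Hypothetical handler installed first by a runtime (e.g. a JVM). */
static void runtime_handler(int signo) { (void)signo; runtime_handler_ran = 1; }

/* Hypothetical handler installed later by an injected library. */
static void library_handler(int signo) { (void)signo; library_handler_ran = 1; }

/* Returns 1 if only the library handler ran, i.e. the first one was lost. */
static int demo_clobber(void)
{
    void (*prev)(int);

    prev = signal(SIGUSR1, runtime_handler);
    assert(prev != SIG_ERR);

    /* The second registration replaces the first. signal() does return the
     * old handler, but nothing keeps it here, so it can never be called. */
    prev = signal(SIGUSR1, library_handler);
    assert(prev != SIG_ERR);

    raise(SIGUSR1);
    return library_handler_ran && !runtime_handler_ran;
}
```

After the second `signal()` call the runtime never sees SIGUSR1 again, which is the situation described above.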

Observed on HAMi volcano-vgpu nodes (hami-core mode) when running PyTorch jobs with a JVM component. The crash occurs at startup, before CUDA initializes.

Additional risks from the current implementation:

  • libvgpu.so intercepts dlsym(), which the JVM also uses for native library loading
  • The ENSURE_RUNNING() spin loop can cause a deadlock if a Java thread holding a JVM monitor gets suspended

Fix

Replace signal() calls in init_proc_slot_withlock() with sigaction(), saving the previous handlers into static struct sigaction variables and chaining to them in sig_swap_stub / sig_restore_stub. This ensures other runtimes (JVM, Python, etc.) can still process SIGUSR1/SIGUSR2 correctly.
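The save-and-chain approach can be sketched roughly as follows. This is an assumption-laden illustration rather than the actual diff: `prev_usr1`, `usr1_stub`, `install_chained_usr1`, `earlier_handler`, and `demo_chain` are hypothetical names, only SIGUSR1 is shown, and the real stubs (`sig_swap_stub` / `sig_restore_stub`) do library-specific work that is replaced here by setting a flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Previous SIGUSR1 disposition, saved at install time (hypothetical name). */
static struct sigaction prev_usr1;

/* Flags used only by the demo below. */
static volatile sig_atomic_t ours_ran = 0;
static volatile sig_atomic_t prev_ran = 0;

/* Forward the signal to whatever handler was installed before ours.
 * SIG_DFL and SIG_IGN are not callable, so they are skipped. */
static void chain_to_previous(const struct sigaction *prev, int signo,
                              siginfo_t *info, void *ucontext)
{
    if (prev->sa_flags & SA_SIGINFO) {
        if (prev->sa_sigaction != NULL)
            prev->sa_sigaction(signo, info, ucontext);
    } else if (prev->sa_handler != SIG_DFL && prev->sa_handler != SIG_IGN) {
        prev->sa_handler(signo);
    }
}

/* Stand-in for the library stub: do our own work, then chain. */
static void usr1_stub(int signo, siginfo_t *info, void *ucontext)
{
    ours_ran = 1;
    chain_to_previous(&prev_usr1, signo, info, ucontext);
}

/* Install usr1_stub and let sigaction() store the old disposition. */
static int install_chained_usr1(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = usr1_stub;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, &prev_usr1);
}

/* Hypothetical handler that some other runtime registered earlier. */
static void earlier_handler(int signo) { (void)signo; prev_ran = 1; }

/* Returns 1 if both the new stub and the earlier handler ran. */
static int demo_chain(void)
{
    struct sigaction old;
    memset(&old, 0, sizeof(old));
    old.sa_handler = earlier_handler;
    sigemptyset(&old.sa_mask);

    int rc = sigaction(SIGUSR1, &old, NULL);
    assert(rc == 0);
    rc = install_chained_usr1();
    assert(rc == 0);

    raise(SIGUSR1);
    return ours_ran && prev_ran;
}
```

One design choice in this sketch: a previous disposition of SIG_DFL is not re-raised, so the signal is swallowed in that case just as it was with the plain `signal()` registration. Whether the actual change behaves the same way is not stated in this description.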

Changes

  • src/multiprocess/multiprocess_memory_limit.c: Replace signal() with sigaction(), chain to previous handlers

Testing

  • Verified JVM workloads no longer crash with SIGSEGV in Monitor::wait() when running under libvgpu.so injection on HAMi volcano-vgpu nodes (prod-ltx1-k8s-1, prod-ltx1-k8s-2)

Closes #161

Contributor

hami-robot bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: charford
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

hami-robot bot commented Mar 17, 2026

Welcome @charford! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉

@hami-robot hami-robot bot added the size/S label Mar 17, 2026

signal() overwrote SIGUSR1 and SIGUSR2 without saving previous handlers,
causing JVM crashes (SIGSEGV in Monitor::wait) since the JVM uses these
signals internally for GC safepoints and thread management.

Replace signal() with sigaction() to save the old handlers, and chain
to them in the stubs so other runtimes (JVM, etc.) can still process
the signals correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Casey Harford <casey@caseyharford.com>
@charford charford force-pushed the fix-sigusr-signal-handler-chaining branch from 3413af0 to bf1e307 Compare March 17, 2026 21:51
@charford charford marked this pull request as ready for review March 18, 2026 17:06
Development

Successfully merging this pull request may close these issues.

SIGUSR1/SIGUSR2 handler clobbering breaking JVM processes