Fix SIGUSR1/SIGUSR2 handler clobbering breaking JVM processes by charford · Pull Request #160 · Project-HAMi/HAMi-core

charford · 2026-03-17T21:25:33Z

Problem

libvgpu.so registers signal handlers for SIGUSR1 and SIGUSR2 using signal(), which overwrites any previously installed handlers without saving them. This causes JVM processes to crash with SIGSEGV in Monitor::wait() because the JVM uses SIGUSR1/SIGUSR2 internally for GC safepoints and thread management.
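The clobbering described above can be reproduced in isolation. The following is a minimal sketch, not HAMi-core code: `runtime_handler` stands in for a handler that a runtime such as the JVM installed first, `library_handler` stands in for the one libvgpu.so registers, and `demo_signal_clobber` is a hypothetical helper name chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical counters: each handler only records that it ran. */
static volatile sig_atomic_t runtime_calls = 0;
static volatile sig_atomic_t library_calls = 0;

/* Stand-in for a handler installed earlier by a runtime (e.g. the JVM). */
static void runtime_handler(int signo) { (void)signo; runtime_calls++; }

/* Stand-in for the handler a preloaded library installs later. */
static void library_handler(int signo) { (void)signo; library_calls++; }

/* Returns 1 if the runtime's handler was silently replaced,
 * 0 if it still ran, and -1 if the setup itself failed. */
int demo_signal_clobber(void) {
    struct sigaction sa;

    runtime_calls = 0;
    library_calls = 0;

    /* The runtime registers its handler first. */
    sa.sa_handler = runtime_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) != 0) {
        return -1;
    }

    /* The library then calls signal() and discards the return value,
     * so nothing remembers the runtime's handler. */
    if (signal(SIGUSR1, library_handler) == SIG_ERR) {
        return -1;
    }

    /* Deliver the signal to the calling thread: only the library's handler runs. */
    raise(SIGUSR1);

    return runtime_calls == 0 && library_calls == 1;
}
```

Calling `demo_signal_clobber()` from a test program reports whether the first handler was lost. In a real process the runtime never learns that its handler is gone, which is why the failure surfaces far from the call that caused it.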

Observed on HAMi volcano-vgpu nodes (hami-core mode) when running PyTorch jobs with a JVM component — the crash occurs at startup before CUDA initializes.

Additional risks from the current implementation:

- libvgpu.so intercepts dlsym(), which the JVM also uses for native library loading
- The ENSURE_RUNNING() spin loop can cause a deadlock if a Java thread holding a JVM monitor gets suspended

Fix

Replace signal() calls in init_proc_slot_withlock() with sigaction(), saving the previous handlers into static struct sigaction variables and chaining to them in sig_swap_stub / sig_restore_stub. This ensures other runtimes (JVM, Python, etc.) can still process SIGUSR1/SIGUSR2 correctly.
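The save-and-chain pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the diff from this PR: `prev_usr1`, `chain_to_previous`, `lib_sigusr1_stub`, `install_chained_sigusr1`, and `demo_chaining` are hypothetical names, and the library's own work is reduced to incrementing a counter. The real change applies the same idea to both SIGUSR1 and SIGUSR2 in sig_swap_stub / sig_restore_stub.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Disposition that was in place before the library installed its own. */
static struct sigaction prev_usr1;

static volatile sig_atomic_t lib_calls = 0;   /* library handler invocations */
static volatile sig_atomic_t prev_calls = 0;  /* previous handler invocations */

/* Forward the signal to the saved handler, honoring its calling convention. */
static void chain_to_previous(const struct sigaction *prev, int signo,
                              siginfo_t *info, void *ctx) {
    if (prev->sa_flags & SA_SIGINFO) {
        if (prev->sa_sigaction != NULL) {
            prev->sa_sigaction(signo, info, ctx);
        }
    } else if (prev->sa_handler != SIG_DFL && prev->sa_handler != SIG_IGN) {
        /* SIG_DFL and SIG_IGN are not callable; this sketch skips them. */
        prev->sa_handler(signo);
    }
}

/* Library handler: do the library's work, then chain. */
static void lib_sigusr1_stub(int signo, siginfo_t *info, void *ctx) {
    lib_calls++;  /* placeholder for the library's real work */
    chain_to_previous(&prev_usr1, signo, info, ctx);
}

/* Install the library handler and save the previous one. Returns 0 on success. */
int install_chained_sigusr1(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = lib_sigusr1_stub;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    return sigaction(SIGUSR1, &sa, &prev_usr1);
}

/* Stand-in for a handler that a runtime registered before the library loaded. */
static void runtime_handler(int signo) { (void)signo; prev_calls++; }

/* Returns 1 if both handlers ran for one signal, 0 if not, -1 on setup failure. */
int demo_chaining(void) {
    struct sigaction rt;

    lib_calls = 0;
    prev_calls = 0;

    memset(&rt, 0, sizeof rt);
    rt.sa_handler = runtime_handler;
    sigemptyset(&rt.sa_mask);
    if (sigaction(SIGUSR1, &rt, NULL) != 0) {
        return -1;
    }
    if (install_chained_sigusr1() != 0) {
        return -1;
    }

    raise(SIGUSR1);
    return lib_calls == 1 && prev_calls == 1;
}
```

Checking `SA_SIGINFO` before dereferencing matters because `sa_handler` and `sa_sigaction` may share storage. Chaining only helps when the other runtime installs its handler before the library does. If the runtime registers later, it replaces the library's handler instead, and whether it chains back is up to that runtime.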

Changes

src/multiprocess/multiprocess_memory_limit.c: Replace signal() with sigaction(), chain to previous handlers

Testing

Verified that JVM workloads no longer crash with SIGSEGV in Monitor::wait() when running under libvgpu.so injection on HAMi volcano-vgpu nodes (prod-ltx1-k8s-1, prod-ltx1-k8s-2).

Closes #161

hami-robot · 2026-03-17T21:25:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: charford
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hami-robot · 2026-03-17T21:25:43Z

Welcome @charford! It looks like this is your first PR to Project-HAMi/HAMi-core 🎉

signal() overwrote SIGUSR1 and SIGUSR2 without saving previous handlers, causing JVM crashes (SIGSEGV in Monitor::wait) since the JVM uses these signals internally for GC safepoints and thread management. Replace signal() with sigaction() to save the old handlers, and chain to them in the stubs so other runtimes (JVM, etc.) can still process the signals correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Casey Harford <casey@caseyharford.com>

hami-robot bot added the do-not-merge/work-in-progress and dco-signoff: no labels Mar 17, 2026

hami-robot bot requested review from archlitchi and chaunceyjiang March 17, 2026 21:25

hami-robot bot added the size/S label Mar 17, 2026

charford force-pushed the fix-sigusr-signal-handler-chaining branch from 3413af0 to bf1e307 March 17, 2026 21:51

hami-robot bot added dco-signoff: yes and removed dco-signoff: no labels Mar 17, 2026

charford marked this pull request as ready for review March 18, 2026 17:06

hami-robot bot removed the do-not-merge/work-in-progress label Mar 18, 2026

charford wants to merge 1 commit into Project-HAMi:main from charford:fix-sigusr-signal-handler-chaining
