SIGUSR1/SIGUSR2 handler clobbering breaking JVM processes #161

@charford

Problem

libvgpu.so registers signal handlers for SIGUSR1 and SIGUSR2 using signal(), which overwrites any previously installed handlers without saving them. This causes JVM processes to crash with SIGSEGV in Monitor::wait() because the JVM uses SIGUSR1/SIGUSR2 internally for GC safepoints and thread management.

Observed on HAMi volcano-vgpu nodes (hami-core mode) when running PyTorch jobs with a JVM component — the crash occurs at startup before CUDA initializes.
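
One possible direction for a fix, sketched under stated assumptions: install the handlers with sigaction() instead of signal(), keep the previous disposition, and forward the signal to it after the library's own work is done. Every identifier below (vgpu_install_handlers, vgpu_signal_handler, vgpu_signal_count, app_handler) is hypothetical and not taken from the libvgpu.so source. app_handler only stands in for a handler that the host process, such as a JVM, installed earlier.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout: nothing here is taken from libvgpu.so. */
static struct sigaction prev_usr1;  /* disposition found for SIGUSR1 at install time */
static struct sigaction prev_usr2;  /* disposition found for SIGUSR2 at install time */
static volatile sig_atomic_t vgpu_signal_count;  /* placeholder for the library's own work */
static volatile sig_atomic_t app_handler_calls;

/* Stand-in for a handler the host process (e.g. a JVM) installed earlier. */
static void app_handler(int signo)
{
    (void)signo;
    app_handler_calls++;
}

/* Forward the signal to whatever was installed before the library loaded. */
static void chain_to_previous(const struct sigaction *prev, int signo,
                              siginfo_t *info, void *ctx)
{
    if (prev->sa_flags & SA_SIGINFO) {
        if (prev->sa_sigaction != NULL)
            prev->sa_sigaction(signo, info, ctx);
    } else if (prev->sa_handler != SIG_DFL && prev->sa_handler != SIG_IGN) {
        prev->sa_handler(signo);
    }
    /* SIG_DFL and SIG_IGN are deliberately not emulated in this sketch. */
}

static void vgpu_signal_handler(int signo, siginfo_t *info, void *ctx)
{
    vgpu_signal_count++;  /* the library's own handling would go here */
    chain_to_previous(signo == SIGUSR1 ? &prev_usr1 : &prev_usr2,
                      signo, info, ctx);
}

/* Install handlers for SIGUSR1/SIGUSR2 while saving the old dispositions.
 * Returns 0 on success, -1 on failure (with SIGUSR1 restored). */
int vgpu_install_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = vgpu_signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_RESTART;

    if (sigaction(SIGUSR1, &sa, &prev_usr1) != 0)
        return -1;
    if (sigaction(SIGUSR2, &sa, &prev_usr2) != 0) {
        sigaction(SIGUSR1, &prev_usr1, NULL);  /* roll back the first install */
        return -1;
    }
    return 0;
}
```

If the host process registers app_handler for SIGUSR1 before vgpu_install_handlers() runs, a later raise(SIGUSR1) should reach both handlers in this sketch. Chaining only protects handlers that already exist when the library installs its own. A JVM that installs its handlers afterwards would still replace the library's, so avoiding SIGUSR1/SIGUSR2 altogether may be the more robust option.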

Additional risks from the current implementation:

  • libvgpu.so intercepts dlsym(), which the JVM also uses for native library loading
  • The ENSURE_RUNNING() spin loop can cause a deadlock if a Java thread holding a JVM monitor gets suspended
