[Bug]: Mooncake Store Exit Cause Host Crash #1447

@whybeyoung

Bug Report

When undeploying (uninstalling) the Mooncake store, some nodes may experience host freezes and a sharp surge in memory usage.

I captured a crash dump:

PID: 960054   TASK: ffff8cd109059b80  CPU: 106  COMMAND: "python"
 #0 [ffffa5d59a524d58] machine_kexec at ffffffff8d061ab0
 #1 [ffffa5d59a524da0] __crash_kexec at ffffffff8d1ad06a
 #2 [ffffa5d59a524e60] panic at ffffffff8da96bd8
 #3 [ffffa5d59a524ed8] watchdog_timer_fn.cold at ffffffff8daa113f
 #4 [ffffa5d59a524f08] __run_hrtimer at ffffffff8d18b99c
 #5 [ffffa5d59a524f40] __hrtimer_run_queues at ffffffff8d18bb41
 #6 [ffffa5d59a524f78] hrtimer_interrupt at ffffffff8d18c040
 #7 [ffffa5d59a524fd8] __sysvec_apic_timer_interrupt at ffffffff8d05877c
 #8 [ffffa5d59a524ff0] asm_call_sysvec_on_stack at ffffffff8dc010ff
--- <IRQ stack> ---
    [exception RIP: no symbolic reference]
    RIP: ffffa5d5b134bb00  RSP: fffff9a83d2ba280  RFLAGS: fffff9a83d2ba280
    RAX: ffff8cb27fffaec0  RBX: fffff9a835d16a08  RCX: fffff9a80323ce88
    RDX: ffffffffffffffff  RSI: ffffffff8d317a5e  RDI: 0000000000000010
    RBP: 00000000060c8f3a   R8: ffffa5d5b134bab8   R9: ffffa5d5b134bab8
    R10: ffff8cb27fffad80  R11: fffff9a835d16a08  R12: 000000000003b820
    R13: 0000000000000297  R14: fffff9a83d2ba280  R15: 0000000000000000
    ORIG_RAX: 0000000000000297  CS: 0018  SS: 0001
bt: WARNING: possibly bogus exception frame
#10 [ffffa5d5b134bb18] unpin_user_pages_dirty_lock at ffffffff8d2e81bb
#11 [ffffa5d5b134bb48] __ib_umem_release at ffffffffc030271a [ib_core]
#12 [ffffa5d5b134bb80] native_ib_umem_release at ffffffffc0302c5e [ib_core]
#13 [ffffa5d5b134bb90] mlx5_ib_dereg_mr at ffffffffc5a0b61c [mlx5_ib]
#14 [ffffa5d5b134bbf0] ib_dereg_mr_user at ffffffffc02d8770 [ib_core]
#15 [ffffa5d5b134bc20] destroy_hw_idr_uobject at ffffffffc03926eb [ib_uverbs]
#16 [ffffa5d5b134bc40] uverbs_destroy_uobject at ffffffffc0392d04 [ib_uverbs]
#17 [ffffa5d5b134bc70] __uverbs_cleanup_ufile at ffffffffc0392f20 [ib_uverbs]
#18 [ffffa5d5b134bd18] uverbs_destroy_ufile_hw at ffffffffc03936a8 [ib_uverbs]
#19 [ffffa5d5b134bd48] ib_uverbs_close at ffffffffc038a91f [ib_uverbs]
#20 [ffffa5d5b134bd60] __fput at ffffffff8d39fc53
#21 [ffffa5d5b134bd90] task_work_run at ffffffff8d10bee9
#22 [ffffa5d5b134bdb0] do_exit at ffffffff8d0eec05
#23 [ffffa5d5b134bde0] do_group_exit at ffffffff8d0eee83
#24 [ffffa5d5b134be08] get_signal at ffffffff8d0fd541
#25 [ffffa5d5b134be88] arch_do_signal_or_restart at ffffffff8d0230fc
#26 [ffffa5d5b134bf10] exit_to_user_mode_loop at ffffffff8d185153
#27 [ffffa5d5b134bf30] exit_to_user_mode_prepare at ffffffff8d18522e
#28 [ffffa5d5b134bf40] syscall_exit_to_user_mode at ffffffff8dadf462
#29 [ffffa5d5b134bf50] entry_SYSCALL_64_after_hwframe at ffffffff8dc00099
    RIP: 00007f5a94491117  RSP: 00007f56cf7fdb90  RFLAGS: 00000246
    RAX: fffffffffffffe00  RBX: 0000000000000000  RCX: 00007f5a94491117
    RDX: 0000000000000000  RSI: 0000000000000189  RDI: 00005641ac2387c8
    RBP: 00005641ac2387a0   R8: 0000000000000000   R9: 00000000ffffffff
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 00005641ac2387c8
    ORIG_RAX: 00000000000000ca  CS: 0033  SS: 002b

This indicates that:
👉 The Python process is releasing its RDMA Memory Regions (MRs) as it exits.
👉 It stalls in the MR teardown path of the mlx5 NIC driver (mlx5_ib_dereg_mr → __ib_umem_release → unpin_user_pages_dirty_lock).
👉 The CPU goes unscheduled for a prolonged period → soft lockup → watchdog panic.
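
For anyone comparing dumps from other nodes, here is a rough sketch of how the reading above could be checked mechanically. It only parses the text printed by `bt` in the crash utility; the helper names (`parse_frames`, `summarize`) and the sample excerpt are my own illustration, not part of Mooncake or of any existing tool, and the symbol checks are assumptions based on the frames in this one trace.

```python
import re

# Matches frames as printed by `bt` in the crash utility, for example:
# "#13 [ffffa5d5b134bb90] mlx5_ib_dereg_mr at ffffffffc5a0b61c [mlx5_ib]"
FRAME_RE = re.compile(
    r"^\s*#(?P<idx>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<sym>\S+)"
    r"\s+at\s+(?P<addr>[0-9a-f]+)(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def parse_frames(bt_text):
    """Return (index, symbol, module-or-None) for every frame line found."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m["idx"]), m["sym"], m["mod"]))
    return frames


def summarize(frames):
    """Flag the features of the trace that the analysis above relies on."""
    symbols = [sym for _, sym, _ in frames]
    return {
        # panic raised from the watchdog hrtimer callback
        "watchdog_panic": "panic" in symbols
        and any(s.startswith("watchdog_timer_fn") for s in symbols),
        # an MR deregistration is on the stack
        "mr_teardown": any("dereg_mr" in s for s in symbols),
        # the kernel was unpinning the user pages backing the MR
        "unpinning_pages": any(s.startswith("unpin_user_pages") for s in symbols),
        # all of this happened while the process was exiting
        "during_exit": "do_exit" in symbols,
        "modules": sorted({mod for _, _, mod in frames if mod}),
    }


# Excerpt of the backtrace from this issue (frames copied verbatim).
SAMPLE = """\
 #2 [ffffa5d59a524e60] panic at ffffffff8da96bd8
 #3 [ffffa5d59a524ed8] watchdog_timer_fn.cold at ffffffff8daa113f
#10 [ffffa5d5b134bb18] unpin_user_pages_dirty_lock at ffffffff8d2e81bb
#11 [ffffa5d5b134bb48] __ib_umem_release at ffffffffc030271a [ib_core]
#13 [ffffa5d5b134bb90] mlx5_ib_dereg_mr at ffffffffc5a0b61c [mlx5_ib]
#19 [ffffa5d5b134bd48] ib_uverbs_close at ffffffffc038a91f [ib_uverbs]
#22 [ffffa5d5b134bdb0] do_exit at ffffffff8d0eec05
"""

if __name__ == "__main__":
    print(summarize(parse_frames(SAMPLE)))
```

If every flag is true on another node's dump, that node most likely hit the same path. This is a triage aid only; it does not explain why the unpin step takes so long.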
