Bug Report
When undeploying the Mooncake Store installation, some nodes may experience host freezes and a sharp surge in memory usage.
I captured a crash dump:
PID: 960054 TASK: ffff8cd109059b80 CPU: 106 COMMAND: "python"
#0 [ffffa5d59a524d58] machine_kexec at ffffffff8d061ab0
#1 [ffffa5d59a524da0] __crash_kexec at ffffffff8d1ad06a
#2 [ffffa5d59a524e60] panic at ffffffff8da96bd8
#3 [ffffa5d59a524ed8] watchdog_timer_fn.cold at ffffffff8daa113f
#4 [ffffa5d59a524f08] __run_hrtimer at ffffffff8d18b99c
#5 [ffffa5d59a524f40] __hrtimer_run_queues at ffffffff8d18bb41
#6 [ffffa5d59a524f78] hrtimer_interrupt at ffffffff8d18c040
#7 [ffffa5d59a524fd8] __sysvec_apic_timer_interrupt at ffffffff8d05877c
#8 [ffffa5d59a524ff0] asm_call_sysvec_on_stack at ffffffff8dc010ff
--- <IRQ stack> ---
[exception RIP: no symbolic reference]
RIP: ffffa5d5b134bb00 RSP: fffff9a83d2ba280 RFLAGS: fffff9a83d2ba280
RAX: ffff8cb27fffaec0 RBX: fffff9a835d16a08 RCX: fffff9a80323ce88
RDX: ffffffffffffffff RSI: ffffffff8d317a5e RDI: 0000000000000010
RBP: 00000000060c8f3a R8: ffffa5d5b134bab8 R9: ffffa5d5b134bab8
R10: ffff8cb27fffad80 R11: fffff9a835d16a08 R12: 000000000003b820
R13: 0000000000000297 R14: fffff9a83d2ba280 R15: 0000000000000000
ORIG_RAX: 0000000000000297 CS: 0018 SS: 0001
bt: WARNING: possibly bogus exception frame
#10 [ffffa5d5b134bb18] unpin_user_pages_dirty_lock at ffffffff8d2e81bb
#11 [ffffa5d5b134bb48] __ib_umem_release at ffffffffc030271a [ib_core]
#12 [ffffa5d5b134bb80] native_ib_umem_release at ffffffffc0302c5e [ib_core]
#13 [ffffa5d5b134bb90] mlx5_ib_dereg_mr at ffffffffc5a0b61c [mlx5_ib]
#14 [ffffa5d5b134bbf0] ib_dereg_mr_user at ffffffffc02d8770 [ib_core]
#15 [ffffa5d5b134bc20] destroy_hw_idr_uobject at ffffffffc03926eb [ib_uverbs]
#16 [ffffa5d5b134bc40] uverbs_destroy_uobject at ffffffffc0392d04 [ib_uverbs]
#17 [ffffa5d5b134bc70] __uverbs_cleanup_ufile at ffffffffc0392f20 [ib_uverbs]
#18 [ffffa5d5b134bd18] uverbs_destroy_ufile_hw at ffffffffc03936a8 [ib_uverbs]
#19 [ffffa5d5b134bd48] ib_uverbs_close at ffffffffc038a91f [ib_uverbs]
#20 [ffffa5d5b134bd60] __fput at ffffffff8d39fc53
#21 [ffffa5d5b134bd90] task_work_run at ffffffff8d10bee9
#22 [ffffa5d5b134bdb0] do_exit at ffffffff8d0eec05
#23 [ffffa5d5b134bde0] do_group_exit at ffffffff8d0eee83
#24 [ffffa5d5b134be08] get_signal at ffffffff8d0fd541
#25 [ffffa5d5b134be88] arch_do_signal_or_restart at ffffffff8d0230fc
#26 [ffffa5d5b134bf10] exit_to_user_mode_loop at ffffffff8d185153
#27 [ffffa5d5b134bf30] exit_to_user_mode_prepare at ffffffff8d18522e
#28 [ffffa5d5b134bf40] syscall_exit_to_user_mode at ffffffff8dadf462
#29 [ffffa5d5b134bf50] entry_SYSCALL_64_after_hwframe at ffffffff8dc00099
RIP: 00007f5a94491117 RSP: 00007f56cf7fdb90 RFLAGS: 00000246
RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f5a94491117
RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005641ac2387c8
RBP: 00005641ac2387a0 R8: 0000000000000000 R9: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 00005641ac2387c8
ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b
The backtrace indicates that:
- The Python process is releasing its RDMA Memory Region (MR) on exit.
- It stalls in the MR teardown path of the mlx5 network card driver.
- The CPU is not rescheduled for a prolonged period → soft lockup → watchdog panic.
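To check whether other crash dumps show the same signature, the three observations above can be tested mechanically. The following is a minimal sketch, assuming the `bt` frame format shown in this report; `parse_frames`, `classify`, and `SAMPLE` are hypothetical names introduced for illustration and are not part of crash or Mooncake.

```python
import re

# Matches crash "bt" frame lines such as:
#   #13 [ffffa5d5b134bb90] mlx5_ib_dereg_mr at ffffffffc5a0b61c [mlx5_ib]
# The trailing "[module]" part is optional (core kernel frames omit it).
FRAME_RE = re.compile(
    r"^#(?P<idx>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<func>\S+)\s+at\s+"
    r"(?P<addr>[0-9a-f]+)(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_frames(bt_text):
    """Return a list of (frame index, function, module or None) tuples."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((int(m["idx"]), m["func"], m["module"]))
    return frames


def classify(frames):
    """Flag the three conditions described in the report (heuristic only)."""
    funcs = {func for _, func, _ in frames}
    return {
        # The task is on its exit path.
        "process_exiting": "do_exit" in funcs,
        # The mlx5 MR deregistration is releasing pinned user memory.
        "mr_teardown": "mlx5_ib_dereg_mr" in funcs
        and "__ib_umem_release" in funcs,
        # The lockup watchdog timer escalated to a panic.
        "watchdog_panic": "panic" in funcs
        and any(f.startswith("watchdog_timer_fn") for f in funcs),
    }


# A few frames copied from the backtrace in this report.
SAMPLE = """\
#2 [ffffa5d59a524e60] panic at ffffffff8da96bd8
#3 [ffffa5d59a524ed8] watchdog_timer_fn.cold at ffffffff8daa113f
#10 [ffffa5d5b134bb18] unpin_user_pages_dirty_lock at ffffffff8d2e81bb
#11 [ffffa5d5b134bb48] __ib_umem_release at ffffffffc030271a [ib_core]
#13 [ffffa5d5b134bb90] mlx5_ib_dereg_mr at ffffffffc5a0b61c [mlx5_ib]
#22 [ffffa5d5b134bdb0] do_exit at ffffffff8d0eec05
"""

if __name__ == "__main__":
    print(classify(parse_frames(SAMPLE)))
```

Matching on function names alone is a heuristic: it only confirms that these frames appear in the trace, not how long the task spent in each of them.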
Before submitting...
- Ensure you searched for relevant issues and read the [documentation]