Fix std::terminate in ibverbs destructors on systems without RDMA hardware #500

Merged: meta-codesync[bot] merged 1 commit into main from d4l3k/fix_ibv on Mar 20, 2026
d4l3k (Member) commented Mar 17, 2026

Summary

  • On CI runners (ubuntu-latest with -DUSE_IBVERBS=ON), rdma-core userspace providers make ibv_get_device_list() return devices even without real RDMA hardware. The ibverbs Device constructor succeeds, but ibv_create_qp() fails in the Pair constructor, which throws EnforceNotMet. During stack unwinding, ~Pair() and ~Device() call GLOO_ENFORCE, which throws again. Because destructors are implicitly noexcept since C++11, the exception escaping the destructor causes std::terminate().
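
The failure mode above can be reproduced in miniature without any RDMA code. The following is a minimal sketch, not the change made in this PR: `releaseResource`, `enforceOk`, `SafeHandle`, and `unwindWithFailingCleanup` are hypothetical names invented for illustration. It shows that a destructor is implicitly noexcept, and that a failing cleanup call is survivable during unwinding as long as the exception is caught inside the destructor instead of escaping it.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <type_traits>

// Hypothetical stand-in for a cleanup call that reports failure with a
// nonzero return code.
inline int releaseResource(bool fail) { return fail ? -1 : 0; }

// Hypothetical enforce-style helper: throws when the call failed.
inline void enforceOk(int rv) {
  if (rv != 0) {
    throw std::runtime_error("cleanup failed");
  }
}

// A handle whose destructor may hit a failing cleanup call. The exception
// is caught inside the destructor, so nothing escapes it.
class SafeHandle {
 public:
  SafeHandle(bool failOnClose, int* errorCount)
      : failOnClose_(failOnClose), errorCount_(errorCount) {}

  ~SafeHandle() {
    try {
      enforceOk(releaseResource(failOnClose_));
    } catch (const std::exception& e) {
      // Log and count the error instead of letting it propagate.
      std::fprintf(stderr, "ignoring cleanup error: %s\n", e.what());
      ++*errorCount_;
    }
  }

 private:
  bool failOnClose_;
  int* errorCount_;
};

// A user-provided destructor without an exception specification is
// noexcept by default since C++11.
static_assert(std::is_nothrow_destructible<SafeHandle>::value,
              "destructor is implicitly noexcept");

// Simulates a failure that starts stack unwinding while a SafeHandle is
// alive. Returns true if the original exception reached the handler.
inline bool unwindWithFailingCleanup(int* errorCount) {
  try {
    SafeHandle handle(/*failOnClose=*/true, errorCount);
    throw std::runtime_error("simulated queue pair creation failure");
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```

If the `try`/`catch` inside `~SafeHandle()` were removed, the same call would end in std::terminate(), because the exception would leave a noexcept function.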

Fixes the crash at AllgatherRing/AllgatherTest.VarNumPointer/360 seen in every CI run: https://github.com/pytorch/gloo/actions/runs/22975489184/job/66702898253

See also #497 which identified the same crash but addressed it differently.

Test plan

  • Full test suite passes locally (3058 passed, 1061 skipped, 0 failed)
  • CI passes on ubuntu-latest with -DUSE_IBVERBS=ON -DUSE_LIBUV=ON -DUSE_TCP_OPENSSL_LINK=ON

@meta-cla meta-cla bot added the CLA Signed label Mar 17, 2026
@d4l3k d4l3k force-pushed the d4l3k/fix_ibv branch 2 times, most recently from 7de27ba to 7b0254f Compare March 19, 2026 01:22
@d4l3k d4l3k marked this pull request as ready for review March 19, 2026 17:53
meta-codesync bot commented Mar 19, 2026

@d4l3k has imported this pull request. If you are a Meta employee, you can view this in D97331908.

…vice

On CI runners without real RDMA hardware, rdma-core software providers
let ibv_open_device/ibv_alloc_pd/ibv_create_comp_channel succeed but
ibv_create_qp fails with EINVAL. Creating a gloo Device starts a
background thread; after fork() in TransportMultiProcTest the thread
handle is invalid, causing SIGSEGV (exit 139) in Device::~Device.

Fix: probe ibverbs capability using raw APIs (through ibv_create_qp)
in the test's createDevice() before constructing a gloo Device. If QP
creation fails, mark IBVERBS as unavailable and return nullptr.

Also moves GTEST_SKIP() out of worker threads to avoid concurrent
calls racing on GTest internals (exit 134), adds a SIGSEGV backtrace
handler for test debugging, and builds with RelWithDebInfo.
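
One way to keep skip decisions off worker threads is to have the workers set a shared flag and let the main test thread act on it after joining. This is a hedged sketch of that general pattern, not the code from this commit; `runWorkersAndCheckSkip` is a hypothetical helper and the worker signature is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Runs fn(rank, skip) on n threads and reports whether any worker asked
// for a skip. Workers only set the flag; they never call into the test
// framework themselves.
inline bool runWorkersAndCheckSkip(
    int n, const std::function<void(int, std::atomic<bool>&)>& fn) {
  std::atomic<bool> skipRequested{false};
  std::vector<std::thread> threads;
  threads.reserve(n);
  for (int rank = 0; rank < n; ++rank) {
    threads.emplace_back(
        [&fn, &skipRequested, rank] { fn(rank, skipRequested); });
  }
  // Join every worker before reading the flag on the calling thread.
  for (auto& t : threads) {
    t.join();
  }
  return skipRequested.load();
}
```

In a GoogleTest body, the main thread would then issue a single `GTEST_SKIP()` when the helper returns true, so the macro is never invoked concurrently.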
@dolpm dolpm self-requested a review March 20, 2026 00:05
@meta-codesync meta-codesync bot merged commit 6f4c667 into main Mar 20, 2026
19 of 20 checks passed