Commit d47dc93
Fix NCCLX UnitTest
Summary:
The mocks do not return a valid CUDA device count, so the code was relying on
an uninitialized value, at the mercy of the compiler. Recently the code started
to crash with SIGFPE (integer divide by zero) when it evaluates `rank_ % device_count` with `device_count` being 0.
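To make the failure mode concrete, here is a minimal sketch of a guarded device-index computation. `pickDeviceIndex` is a hypothetical helper written for illustration; it is not the code in TorchCommNCCLXBootstrap.cpp, and only the `rank_ % device_count` expression comes from this commit.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the device-selection step (an assumption,
// not the actual bootstrap code).
// Returns the device index for `rank`, or std::nullopt when no device is
// visible, instead of evaluating `rank % 0`. Integer modulo by zero is
// undefined behavior in C++; on x86-64 Linux it typically raises SIGFPE.
std::optional<int> pickDeviceIndex(int rank, int deviceCount) {
  assert(rank >= 0 && "rank is expected to be non-negative");
  if (deviceCount <= 0) {
    return std::nullopt;
  }
  return rank % deviceCount;
}
```

A caller can then turn the empty result into a regular error path rather than a process abort.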
Changes:
- Explicitly initialize `device_count` to 0
- Extend the mock to return `1` by default for `getDeviceCount`
- Fix the UT to set up mock behaviors by default
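The changes above can be sketched with a hand-written fake in place of the real mock. Everything here is an assumption made for illustration: `DeviceApi`, `FakeDeviceApi`, `queryDeviceCount`, and the out-parameter signature are hypothetical, and the real `CudaApi::getDeviceCount` may look different.

```cpp
#include <cassert>

// Hypothetical interface; the real CudaApi signature may differ.
struct DeviceApi {
  virtual ~DeviceApi() = default;
  // Writes the device count to *count and returns 0 on success.
  virtual int getDeviceCount(int* count) = 0;
};

// Hand-written fake that reports one device unless a test overrides it.
struct FakeDeviceApi : DeviceApi {
  int reportedCount = 1;
  bool writeCount = true;  // false imitates a mock with no behavior set up

  int getDeviceCount(int* count) override {
    assert(count != nullptr);
    if (writeCount) {
      *count = reportedCount;
    }
    return 0;
  }
};

// Queries the API with an explicitly initialized count, so a fake that
// never writes the out-parameter yields 0 instead of an indeterminate value.
int queryDeviceCount(DeviceApi& api) {
  int deviceCount = 0;
  api.getDeviceCount(&deviceCount);
  return deviceCount;
}
```

With the default of one device, a test that does not care about the device count needs no extra setup, while a test that does care can override `reportedCount`.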
Test Failure - https://www.internalfb.com/intern/test/562950190647332
```
[ RUN ] TorchCommNCCLXTest.InitializationFailsWithInvalidDeviceId
I1026 04:38:23.001531 1483368 TorchCommNCCLXBootstrap.cpp:43] [TC] TORCHCOMM_NCCLX_BOOTSTRAP_UNIQUEID_EXCHANGE_METHOD not set, defaulting to auto
*** Aborted at 1761478703 (Unix time, try 'date -d 1761478703') ***
*** Signal 8 (SIGFPE) (0x3170a51f) received by PID 1483368 (pthread TID 0x7fd11c3e3000) (linux TID 1483368) (code: integer divide by zero), stack trace: ***
@ 000000000ba032ff folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
./fbcode/folly/debugging/symbolizer/SignalHandler.cpp:528
@ 000000000004455f (unknown)
/home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/libc_sigaction.c:8
-> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c
@ 000000003170a51f torch::comms::TorchCommNCCLXBootstrap::TorchCommNCCLXBootstrap(c10::intrusive_ptr<c10d::Store, c10::detail::intrusive_target_default_null_type<c10d::Store> >, c10::Device, std::shared_ptr<torch::comms::NcclxApi>, std::shared_ptr<torch::comms::CudaApi>, std::chrono::duration<long, std::ratio<1l, 1000l> >)
./fbcode/comms/torchcomms/ncclx/TorchCommNCCLXBootstrap.cpp:63
@ 00000000316fd281 torch::comms::TorchCommNCCLX::init(c10::Device, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::comms::CommOptions const&)
./fbcode/comms/torchcomms/ncclx/TorchCommNCCLX.cpp:74
@ 000000000b94fb02 torch::comms::test::TorchCommNCCLXTest_InitializationFailsWithInvalidDeviceId_Test::TestBody()
./fbcode/comms/torchcomms/ncclx/tests/unit/cpp/TorchCommNCCLXTest.cpp:155
@ 000000000b9a586f testing::Test::Run()
fbsource/src/gtest.cc:2751
@ 000000000b9a6cde testing::TestInfo::Run()
fbsource/src/gtest.cc:2897
@ 000000000b9a80f4 testing::TestSuite::Run()
fbsource/src/gtest.cc:3075
@ 000000000b9ba72f testing::internal::UnitTestImpl::RunAllTests()
fbsource/src/gtest.cc:6066
@ 000000000b9b980f testing::UnitTest::Run()
fbsource/src/gtest.cc:5606
@ 000000000b9c977d main
fbsource/gtest/gtest.h:2337
@ 000000000002c656 __libc_start_call_main
/home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/nptl/libc_start_call_main.h:58
-> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/x86/libc-start.c
@ 000000000002c717 __libc_start_main
/home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../csu/libc-start.c:409
-> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/x86/libc-start.c
@ 000000000b944220 _start
/home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/x86_64/start.S:116
```
Reviewed By: pavanbalaji
Differential Revision: D85489376
fbshipit-source-id: 993377d73b05c91cb0df4219e4f8f4e9b78de36b
1 parent ea81c2e commit d47dc93
File tree
3 files changed, +16 -4 lines changed
- comms/torchcomms/ncclx
- tests/unit/cpp
- mocks