Commit 30e3a31
rename to test_multidevice_tutorial (#5756)
**Two minor changes:**
1. Rename from `tutorial_multidevice` to `test_multidevice_tutorial`
2. Noticed 2 test failures on a local node with 1 GPU. Revised to skip
these two tests when only 1 GPU is available. Do we know why CI didn’t catch
this issue? I suspect it is because CI consistently runs this test on
nodes with more than one GPU. @xwang233
**After revision:** (on a node with 1 GPU)
```
[ SKIPPED ] 2 tests, listed below:
[ SKIPPED ] MultiDeviceTutorial.SimplePipelining
[ SKIPPED ] MultiDeviceTutorial.HostIrKernekPipelining
```
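For reference, the skip condition can be sketched as below. This is only an illustration, not the code in this commit: `FakeCommunicator` and `skipReason` are hypothetical names, and the comparison simply mirrors the `requested_n_gpus <= communicator_->size()` check from the failure message. In a GoogleTest body, the returned reason would typically be passed to `GTEST_SKIP()`.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the real communicator; only size() matters here.
struct FakeCommunicator {
  int64_t num_devices;
  int64_t size() const { return num_devices; }
};

// Returns a skip reason when the test requests more GPUs than are available,
// or std::nullopt when the test can run.
// Assumption: mirrors the "requested_n_gpus <= communicator_->size()" check
// reported in the original failure.
std::optional<std::string> skipReason(
    const FakeCommunicator& comm,
    int64_t requested_n_gpus) {
  if (requested_n_gpus <= comm.size()) {
    return std::nullopt;
  }
  return "Need " + std::to_string(requested_n_gpus) + " GPUs, but only " +
      std::to_string(comm.size()) + " available";
}
```

Inside a test, a guard along the lines of `if (auto reason = skipReason(comm, 2)) { GTEST_SKIP() << *reason; }` would report the test as skipped rather than failed on a single-GPU node.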
**Original errors:**
```
[ RUN ] MultiDeviceTutorial.SimplePipelining
unknown file: Failure
C++ exception with description "Expected (requested_n_gpus)<=(communicator_->size()) . Found 2 vs 1.
Exception raised from validate at /opt/pytorch/nvfuser/csrc/host_ir/evaluator.cpp:134 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x110 (0xbbbd9d5e8530 in ./test_tutorial_multidevice)
frame #1: <unknown function> + 0x6e8e58 (0xbbbd9d988e58 in ./test_tutorial_multidevice)
frame #2: <unknown function> + 0x6ea55c (0xbbbd9d98a55c in ./test_tutorial_multidevice)
frame #3: <unknown function> + 0x971fb8 (0xbbbd9dc11fb8 in ./test_tutorial_multidevice)
frame #4: <unknown function> + 0xddcb84 (0xbbbd9e07cb84 in ./test_tutorial_multidevice)
frame #5: <unknown function> + 0xe2f6a0 (0xbbbd9e0cf6a0 in ./test_tutorial_multidevice)
frame #6: <unknown function> + 0xe15a94 (0xbbbd9e0b5a94 in ./test_tutorial_multidevice)
frame #7: <unknown function> + 0xe15f88 (0xbbbd9e0b5f88 in ./test_tutorial_multidevice)
frame #8: <unknown function> + 0xe16584 (0xbbbd9e0b6584 in ./test_tutorial_multidevice)
frame #9: <unknown function> + 0xe23830 (0xbbbd9e0c3830 in ./test_tutorial_multidevice)
frame #10: <unknown function> + 0xe16760 (0xbbbd9e0b6760 in ./test_tutorial_multidevice)
frame #11: <unknown function> + 0x351104 (0xbbbd9d5f1104 in ./test_tutorial_multidevice)
frame #12: <unknown function> + 0x284c4 (0xfc5a4eb684c4 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #13: __libc_start_main + 0x98 (0xfc5a4eb68598 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #14: <unknown function> + 0x36bd70 (0xbbbd9d60bd70 in ./test_tutorial_multidevice)
" thrown in the test body.
To reproduce: NVFUSER_TEST_RANDOM_SEED=1767623470 NVFUSER_TEST_ATEN_RANDOM_SEED=0 test_nvfuser --gtest_filter='MultiDeviceTutorial.SimplePipelining'
[ FAILED ] MultiDeviceTutorial.SimplePipelining (0 ms)
```
```
[ RUN ] MultiDeviceTutorial.HostIrKernekPipelining
[gb-nvl-118-compute03:111994:0:111994] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid: 111994) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2cc) [0xfc5a14ff19fc]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x31bac) [0xfc5a14ff1bac]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x31ed8) [0xfc5a14ff1ed8]
3 linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xfc5a7edc0968]
4 ./test_tutorial_multidevice(+0x6f38b4) [0xbbbd9d9938b4]
5 ./test_tutorial_multidevice(+0xde1ab4) [0xbbbd9e081ab4]
6 ./test_tutorial_multidevice(+0xe2f6a0) [0xbbbd9e0cf6a0]
7 ./test_tutorial_multidevice(+0xe15a94) [0xbbbd9e0b5a94]
8 ./test_tutorial_multidevice(+0xe15f88) [0xbbbd9e0b5f88]
9 ./test_tutorial_multidevice(+0xe16584) [0xbbbd9e0b6584]
10 ./test_tutorial_multidevice(+0xe23830) [0xbbbd9e0c3830]
11 ./test_tutorial_multidevice(+0xe16760) [0xbbbd9e0b6760]
12 ./test_tutorial_multidevice(+0x351104) [0xbbbd9d5f1104]
13 /usr/lib/aarch64-linux-gnu/libc.so.6(+0x284c4) [0xfc5a4eb684c4]
14 /usr/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xfc5a4eb68598]
15 ./test_tutorial_multidevice(+0x36bd70) [0xbbbd9d60bd70]
```
---------
Co-authored-by: Jingyue Wu <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
1 parent 1bbd375 commit 30e3a31
3 files changed in tests/cpp: +7 -6 lines changed