
Conversation

@liqiangxl (Collaborator) commented Jan 5, 2026

Two minor changes:

  1. Rename from tutorial_multidevice to test_multidevice_tutorial.
  2. Noticed 2 test failures on a local node with 1 GPU. Revised to skip these two tests if there is only 1 GPU. Do we know why CI didn't catch this issue? I'm wondering if CI consistently runs this test on nodes with more than one GPU. @xwang233

After revision (on a node with 1 GPU):

[  SKIPPED ] 2 tests, listed below:
[  SKIPPED ] MultiDeviceTutorial.SimplePipelining
[  SKIPPED ] MultiDeviceTutorial.HostIrKernekPipelining

Original errors:

[ RUN      ] MultiDeviceTutorial.SimplePipelining
unknown file: Failure
C++ exception with description "Expected (requested_n_gpus)<=(communicator_->size()) . Found 2 vs 1. 
Exception raised from validate at /opt/pytorch/nvfuser/csrc/host_ir/evaluator.cpp:134 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x110 (0xbbbd9d5e8530 in ./test_tutorial_multidevice)
frame #1: <unknown function> + 0x6e8e58 (0xbbbd9d988e58 in ./test_tutorial_multidevice)
frame #2: <unknown function> + 0x6ea55c (0xbbbd9d98a55c in ./test_tutorial_multidevice)
frame #3: <unknown function> + 0x971fb8 (0xbbbd9dc11fb8 in ./test_tutorial_multidevice)
frame #4: <unknown function> + 0xddcb84 (0xbbbd9e07cb84 in ./test_tutorial_multidevice)
frame #5: <unknown function> + 0xe2f6a0 (0xbbbd9e0cf6a0 in ./test_tutorial_multidevice)
frame #6: <unknown function> + 0xe15a94 (0xbbbd9e0b5a94 in ./test_tutorial_multidevice)
frame #7: <unknown function> + 0xe15f88 (0xbbbd9e0b5f88 in ./test_tutorial_multidevice)
frame #8: <unknown function> + 0xe16584 (0xbbbd9e0b6584 in ./test_tutorial_multidevice)
frame #9: <unknown function> + 0xe23830 (0xbbbd9e0c3830 in ./test_tutorial_multidevice)
frame #10: <unknown function> + 0xe16760 (0xbbbd9e0b6760 in ./test_tutorial_multidevice)
frame #11: <unknown function> + 0x351104 (0xbbbd9d5f1104 in ./test_tutorial_multidevice)
frame #12: <unknown function> + 0x284c4 (0xfc5a4eb684c4 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #13: __libc_start_main + 0x98 (0xfc5a4eb68598 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #14: <unknown function> + 0x36bd70 (0xbbbd9d60bd70 in ./test_tutorial_multidevice)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1767623470 NVFUSER_TEST_ATEN_RANDOM_SEED=0 test_nvfuser --gtest_filter='MultiDeviceTutorial.SimplePipelining'
[  FAILED  ] MultiDeviceTutorial.SimplePipelining (0 ms)
[ RUN      ] MultiDeviceTutorial.HostIrKernekPipelining
[gb-nvl-118-compute03:111994:0:111994] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid: 111994) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2cc) [0xfc5a14ff19fc]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x31bac) [0xfc5a14ff1bac]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x31ed8) [0xfc5a14ff1ed8]
 3  linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xfc5a7edc0968]
 4  ./test_tutorial_multidevice(+0x6f38b4) [0xbbbd9d9938b4]
 5  ./test_tutorial_multidevice(+0xde1ab4) [0xbbbd9e081ab4]
 6  ./test_tutorial_multidevice(+0xe2f6a0) [0xbbbd9e0cf6a0]
 7  ./test_tutorial_multidevice(+0xe15a94) [0xbbbd9e0b5a94]
 8  ./test_tutorial_multidevice(+0xe15f88) [0xbbbd9e0b5f88]
 9  ./test_tutorial_multidevice(+0xe16584) [0xbbbd9e0b6584]
10  ./test_tutorial_multidevice(+0xe23830) [0xbbbd9e0c3830]
11  ./test_tutorial_multidevice(+0xe16760) [0xbbbd9e0b6760]
12  ./test_tutorial_multidevice(+0x351104) [0xbbbd9d5f1104]
13  /usr/lib/aarch64-linux-gnu/libc.so.6(+0x284c4) [0xfc5a4eb684c4]
14  /usr/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xfc5a4eb68598]
15  ./test_tutorial_multidevice(+0x36bd70) [0xbbbd9d60bd70]

@greptile-apps (Contributor) bot commented Jan 5, 2026

Greptile Summary

Renames the test binary from tutorial_multidevice to test_multidevice_tutorial for consistency with naming conventions.

Key Changes:

  • Renamed test binary in CMakeLists.txt and manual_ci.sh
  • Updated test name references in documentation and skip messages
  • Fixed test name typo: "Kernek" → "Kernel"
  • Added SKIP_IF_NOT_ENOUGH_DEVICES check to SimplePipelining test to properly skip when only 1 GPU is available
  • Disabled HostIrKernelPipelining test entirely with DISABLED_ prefix

Critical Issue:
The HostIrKernelPipelining test is being disabled rather than fixed. The test segfaults with 1 GPU, but inspection shows it passes communicator=nullptr to HostIrEvaluator, meaning it shouldn't require multiple GPUs. Disabling the test hides a real bug in the HostIR pipeline logic that needs investigation rather than suppression.

Confidence Score: 2/5

  • This PR has one critical issue that masks a real bug by disabling a test
  • The renaming changes are safe and straightforward, but disabling the HostIrKernelPipelining test instead of fixing the underlying segfault hides a real bug. The test doesn't require multiple GPUs (passes communicator=nullptr), so the segfault indicates a genuine issue in the HostIR execution logic that should be investigated.
  • Pay attention to tests/cpp/test_multidevice_tutorial.cpp - the disabled test needs proper investigation and fix

Important Files Changed

tests/cpp/test_multidevice_tutorial.cpp: Added GPU count check for SimplePipelining; disabled HostIrKernelPipelining test entirely instead of fixing the underlying issue

Sequence Diagram

sequenceDiagram
    participant Test as Test Runner
    participant SimplePipelining as SimplePipelining Test
    participant HostIrKernelPipelining as HostIrKernelPipelining Test
    participant Macro as SKIP_IF_NOT_ENOUGH_DEVICES
    participant Executor as MultiDeviceExecutor
    
    Note over Test: Single GPU Environment
    
    Test->>SimplePipelining: Run test
    SimplePipelining->>Macro: Check device count against fusion requirements
    alt Has 2+ GPUs
        Macro->>SimplePipelining: Continue
        SimplePipelining->>Executor: Create MultiDeviceExecutor
        Executor->>SimplePipelining: Execute test
    else Only 1 GPU
        Macro->>SimplePipelining: GTEST_SKIP
        SimplePipelining->>Test: Test skipped
    end
    
    Test->>HostIrKernelPipelining: Run test
    Note over HostIrKernelPipelining: DISABLED_ prefix prevents execution
    HostIrKernelPipelining->>Test: Test disabled (not run)
    Note right of HostIrKernelPipelining: Underlying segfault bug<br/>remains unaddressed

@liqiangxl (Collaborator, Author):

!test

@github-actions bot commented Jan 5, 2026

Review updated until commit 9386247

Auto-merge Status

❌ Internal CI is finished (nvfuser-ci status not found)
✅ No failed checks
❌ PR is mergeable (blocked)
ℹ️ PR mergeable_state: blocked

Description

  • Rename test binary from tutorial_multidevice to test_multidevice_tutorial for consistency

  • Add GPU availability check to skip SimplePipelining test on single-GPU systems

  • Disable HostIrKernelPipelining test and fix typo in test name

  • Update build configuration and CI scripts to use new test binary name

Changes walkthrough

Relevant files:

Enhancement: tests/cpp/test_multidevice_tutorial.cpp (+4/-3)
Update test names and add GPU availability checks

  • Rename command references from tutorial_multidevice to test_multidevice_tutorial
  • Add SKIP_IF_NOT_ENOUGH_DEVICES(fusion) to SimplePipelining test
  • Disable HostIrKernelPipelining test and fix typo in name

Configuration changes: manual_ci.sh (+1/-1)
Update CI script with new test binary name

  • Update MPI_TESTS array to use test_multidevice_tutorial instead of tutorial_multidevice

Configuration changes: CMakeLists.txt (+2/-2)
Update build configuration for renamed test

  • Change add_test_without_main target from tutorial_multidevice to test_multidevice_tutorial
  • Update TEST_BINARIES list to use new test name
PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Test naming consistency

The PR renames the test from 'HostIrKernekPipelining' to 'DISABLED_HostIrKernelPipelining' (line 996), correcting the 'Kernek' typo to 'Kernel' while also disabling the test.

TEST_F(MultiDeviceTutorial, DISABLED_HostIrKernelPipelining) {

Test skipping logic

The SKIP_IF_NOT_ENOUGH_DEVICES macro is added at line 314 for the SimplePipelining test, but the PR description mentions two tests failing on single-GPU nodes. Verify that HostIrKernelPipelining (now disabled) and any other affected tests have appropriate device checks.

SKIP_IF_NOT_ENOUGH_DEVICES(fusion);

@xwang233 (Collaborator) commented Jan 5, 2026

For binary tests, we only run binaries whose names start with test_.

@liqiangxl requested a review from wujingyue January 5, 2026 18:43

// To do so, we will be using new Host IRs: Stream (a Val), SetStream, ForLoop.
TEST_F(MultiDeviceTutorial, HostIrKernekPipelining) {
if (communicator_->size() < 2) {
GTEST_SKIP() << "Need at least 2 devices to run this test";

Collaborator:

Can you instead call

const auto num_devices = communicator_->size();

right before runWithInput? That gives max coverage -- the HostIrContainer creation doesn't need multiple GPUs.

Collaborator (Author):

Moved the check to right before runWithInput.

Collaborator:

My message must have landed poorly -- I meant the SKIP_IF_NOT_ENOUGH_DEVICES macro:

#define SKIP_IF_NOT_ENOUGH_DEVICES(fusion) \

// Stages.
TEST_F(MultiDeviceTutorial, SimplePipelining) {
if (communicator_->size() < 2) {
GTEST_SKIP() << "Need at least 2 devices to run this test";

Collaborator:

ditto

Collaborator (Author):

Can't move it to right before runWithInput. In this case, constructing HostIrEvaluator calls HostIrEvaluator::validate(), which hits NVF_CHECK_LE(requested_n_gpus, communicator_->size()) with requested_n_gpus = 2.

Collaborator:

Then move it before HostIrEvaluator is constructed.

The reason I'm proposing this change is to avoid coupling. If/when the input fusion is changed to require 4 GPUs, SKIP_IF_NOT_ENOUGH_DEVICES will work just fine without any extra change.

@wujingyue (Collaborator):

> 2. Do we know why CI didn't catch this issue?

http://nv/e-0

I don't think CI runs multidevice_tutorial at all. I think jit.xml should be changed to run all test_multidevice_*

@liqiangxl (Collaborator, Author):

> I don't think CI runs multidevice_tutorial at all. I think jit.xml should be changed to run all test_multidevice_*

Good point. I renamed it to test_multidevice_tutorial.


@greptile-apps bot left a comment

3 files reviewed, 1 comment

Comment on lines 1128 to 1130:

if (communicator_->size() < 2) {
GTEST_SKIP() << "Need at least 2 devices to run this test";
}

greptile-apps bot (Contributor):

logic: The GPU count check is too late -- it is placed after the HostIrEvaluator construction on lines 1123-1126. When hic is moved into the evaluator, it is consumed; if the test then skips, the HostIrContainer is already destroyed. Move this check before line 1123.

Suggested change (move the skip check above the evaluator construction, so it runs before):

auto outputs = hie.runWithInput({{tv0, aten_tv0}, {tv2, aten_tv2}});

@xwang233 (Collaborator) commented Jan 5, 2026

!test

@liqiangxl changed the title from "rename to test_tutorial_multidevice" to "rename to test_multidevice_tutorial" Jan 5, 2026

@liqiangxl (Collaborator, Author):

@wujingyue CI script is updated. TEST_F(MultiDeviceTutorial, HostIrKernekPipelining) failed in CI with mpirun -np 3, so there may be a real bug. Error message: Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8); see [CI link](http://nv/e-1).

@liqiangxl (Collaborator, Author):

!test

@wujingyue (Collaborator):

> @wujingyue CI script is updated. TEST_F(MultiDeviceTutorial, HostIrKernekPipelining) failed in CI with mpirun -np 3, so there may be a real bug.

No problem -- disable that test and I'll take a look.

@wujingyue (Collaborator) left a comment:

undo LGTM

*/
// To do so, we will be using new Host IRs: Stream (a Val), SetStream, ForLoop.
TEST_F(MultiDeviceTutorial, HostIrKernekPipelining) {
GTEST_SKIP() << "Caught signal 11 (Segmentation fault: address not mapped to "
Collaborator:

Don't use GTEST_SKIP to disable. Use the DISABLED_ prefix so people can at least temporarily re-enable the test (https://google.github.io/googletest/advanced.html#temporarily-enabling-disabled-tests). In contrast, there's no way to "unskip" a GTEST_SKIP.

@wujingyue (Collaborator):

!test

@wujingyue added the enable-auto-merge label (Auto-merge a PR when: 1) PR mergeable 2) Internal CI complete 3) No failures) Jan 5, 2026

@greptile-apps bot left a comment

3 files reviewed, 1 comment

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

@wujingyue (Collaborator):

!test

@greptile-apps bot left a comment

3 files reviewed, 1 comment

@wujingyue merged commit 30e3a31 into main Jan 6, 2026
58 of 59 checks passed
@wujingyue deleted the llu/rename_test branch January 6, 2026 01:19
@github-actions bot removed the enable-auto-merge label Jan 6, 2026