
Conversation

@liqiangxl (Collaborator) commented Jan 5, 2026

Two minor changes:

  1. Rename from tutorial_multidevice to test_multidevice_tutorial.
  2. Noticed 2 test failures on a local node with 1 GPU. Revised to skip these two tests if there is only 1 GPU. Do we know why CI didn't catch this issue? I'm wondering if CI consistently runs this test on nodes with more than one GPU. @xwang233

After revision (on a node with 1 GPU):

[  SKIPPED ] 2 tests, listed below:
[  SKIPPED ] MultiDeviceTutorial.SimplePipelining
[  SKIPPED ] MultiDeviceTutorial.HostIrKernekPipelining

Original errors:

[ RUN      ] MultiDeviceTutorial.SimplePipelining
unknown file: Failure
C++ exception with description "Expected (requested_n_gpus)<=(communicator_->size()) . Found 2 vs 1. 
Exception raised from validate at /opt/pytorch/nvfuser/csrc/host_ir/evaluator.cpp:134 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x110 (0xbbbd9d5e8530 in ./test_tutorial_multidevice)
frame #1: <unknown function> + 0x6e8e58 (0xbbbd9d988e58 in ./test_tutorial_multidevice)
frame #2: <unknown function> + 0x6ea55c (0xbbbd9d98a55c in ./test_tutorial_multidevice)
frame #3: <unknown function> + 0x971fb8 (0xbbbd9dc11fb8 in ./test_tutorial_multidevice)
frame #4: <unknown function> + 0xddcb84 (0xbbbd9e07cb84 in ./test_tutorial_multidevice)
frame #5: <unknown function> + 0xe2f6a0 (0xbbbd9e0cf6a0 in ./test_tutorial_multidevice)
frame #6: <unknown function> + 0xe15a94 (0xbbbd9e0b5a94 in ./test_tutorial_multidevice)
frame #7: <unknown function> + 0xe15f88 (0xbbbd9e0b5f88 in ./test_tutorial_multidevice)
frame #8: <unknown function> + 0xe16584 (0xbbbd9e0b6584 in ./test_tutorial_multidevice)
frame #9: <unknown function> + 0xe23830 (0xbbbd9e0c3830 in ./test_tutorial_multidevice)
frame #10: <unknown function> + 0xe16760 (0xbbbd9e0b6760 in ./test_tutorial_multidevice)
frame #11: <unknown function> + 0x351104 (0xbbbd9d5f1104 in ./test_tutorial_multidevice)
frame #12: <unknown function> + 0x284c4 (0xfc5a4eb684c4 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #13: __libc_start_main + 0x98 (0xfc5a4eb68598 in /usr/lib/aarch64-linux-gnu/libc.so.6)
frame #14: <unknown function> + 0x36bd70 (0xbbbd9d60bd70 in ./test_tutorial_multidevice)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1767623470 NVFUSER_TEST_ATEN_RANDOM_SEED=0 test_nvfuser --gtest_filter='MultiDeviceTutorial.SimplePipelining'
[  FAILED  ] MultiDeviceTutorial.SimplePipelining (0 ms)
[ RUN      ] MultiDeviceTutorial.HostIrKernekPipelining
[gb-nvl-118-compute03:111994:0:111994] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid: 111994) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2cc) [0xfc5a14ff19fc]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x31bac) [0xfc5a14ff1bac]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x31ed8) [0xfc5a14ff1ed8]
 3  linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xfc5a7edc0968]
 4  ./test_tutorial_multidevice(+0x6f38b4) [0xbbbd9d9938b4]
 5  ./test_tutorial_multidevice(+0xde1ab4) [0xbbbd9e081ab4]
 6  ./test_tutorial_multidevice(+0xe2f6a0) [0xbbbd9e0cf6a0]
 7  ./test_tutorial_multidevice(+0xe15a94) [0xbbbd9e0b5a94]
 8  ./test_tutorial_multidevice(+0xe15f88) [0xbbbd9e0b5f88]
 9  ./test_tutorial_multidevice(+0xe16584) [0xbbbd9e0b6584]
10  ./test_tutorial_multidevice(+0xe23830) [0xbbbd9e0c3830]
11  ./test_tutorial_multidevice(+0xe16760) [0xbbbd9e0b6760]
12  ./test_tutorial_multidevice(+0x351104) [0xbbbd9d5f1104]
13  /usr/lib/aarch64-linux-gnu/libc.so.6(+0x284c4) [0xfc5a4eb684c4]
14  /usr/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xfc5a4eb68598]
15  ./test_tutorial_multidevice(+0x36bd70) [0xbbbd9d60bd70]

@greptile-apps (Contributor) bot commented Jan 5, 2026

Greptile Summary

Renames the test binary from tutorial_multidevice to test_multidevice_tutorial for consistency with naming conventions.

Key Changes:

  • Renamed test binary in CMakeLists.txt and manual_ci.sh
  • Updated test name references in documentation and skip messages
  • Fixed test name typo: "Kernek" → "Kernel"
  • Added SKIP_IF_NOT_ENOUGH_DEVICES check to SimplePipelining test to properly skip when only 1 GPU is available
  • Disabled HostIrKernelPipelining test entirely with DISABLED_ prefix

Critical Issue:
The HostIrKernelPipelining test is being disabled rather than fixed. The test segfaults with 1 GPU, but inspection shows it passes communicator=nullptr to HostIrEvaluator, meaning it shouldn't require multiple GPUs. Disabling the test hides a real bug in the HostIR pipeline logic that needs investigation rather than suppression.

Confidence Score: 2/5

  • This PR has one critical issue that masks a real bug by disabling a test
  • The renaming changes are safe and straightforward, but disabling the HostIrKernelPipelining test instead of fixing the underlying segfault hides a real bug. The test doesn't require multiple GPUs (passes communicator=nullptr), so the segfault indicates a genuine issue in the HostIR execution logic that should be investigated.
  • Pay attention to tests/cpp/test_multidevice_tutorial.cpp - the disabled test needs proper investigation and fix

Important Files Changed

tests/cpp/test_multidevice_tutorial.cpp: Added GPU count check for SimplePipelining; disabled HostIrKernelPipelining test entirely instead of fixing the underlying issue

Sequence Diagram

sequenceDiagram
    participant Test as Test Runner
    participant SimplePipelining as SimplePipelining Test
    participant HostIrKernelPipelining as HostIrKernelPipelining Test
    participant Macro as SKIP_IF_NOT_ENOUGH_DEVICES
    participant Executor as MultiDeviceExecutor
    
    Note over Test: Single GPU Environment
    
    Test->>SimplePipelining: Run test
    SimplePipelining->>Macro: Check device count against fusion requirements
    alt Has 2+ GPUs
        Macro->>SimplePipelining: Continue
        SimplePipelining->>Executor: Create MultiDeviceExecutor
        Executor->>SimplePipelining: Execute test
    else Only 1 GPU
        Macro->>SimplePipelining: GTEST_SKIP
        SimplePipelining->>Test: Test skipped
    end
    
    Test->>HostIrKernelPipelining: Run test
    Note over HostIrKernelPipelining: DISABLED_ prefix prevents execution
    HostIrKernelPipelining->>Test: Test disabled (not run)
    Note right of HostIrKernelPipelining: Underlying segfault bug<br/>remains unaddressed

@liqiangxl (Collaborator, Author):

!test

@github-actions bot commented Jan 5, 2026

Review updated until commit 9386247

Auto-merge Status

❌ Internal CI is finished (nvfuser-ci status not found)
✅ No failed checks
❌ PR is mergeable (blocked)
ℹ️ PR mergeable_state: blocked

Description

  • Rename test binary from tutorial_multidevice to test_multidevice_tutorial for consistency

  • Add GPU availability check to skip SimplePipelining test on single-GPU systems

  • Disable HostIrKernelPipelining test and fix typo in test name

  • Update build configuration and CI scripts to use new test binary name

Changes walkthrough

Relevant files:

Enhancement: tests/cpp/test_multidevice_tutorial.cpp (+4/-3)
Update test names and add GPU availability checks

  • Rename command references from tutorial_multidevice to test_multidevice_tutorial
  • Add SKIP_IF_NOT_ENOUGH_DEVICES(fusion) to SimplePipelining test
  • Disable HostIrKernelPipelining test and fix typo in name

Configuration changes: manual_ci.sh (+1/-1)
Update CI script with new test binary name

  • Update MPI_TESTS array to use test_multidevice_tutorial instead of tutorial_multidevice

Configuration changes: CMakeLists.txt (+2/-2)
Update build configuration for renamed test

  • Change add_test_without_main target from tutorial_multidevice to test_multidevice_tutorial
  • Update TEST_BINARIES list to use new test name
PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Test naming consistency

The PR renames the test from 'HostIrKernekPipelining' to 'DISABLED_HostIrKernelPipelining' (line 996), correcting the 'Kernek' typo to 'Kernel' while also disabling the test.

TEST_F(MultiDeviceTutorial, DISABLED_HostIrKernelPipelining) {

Test skipping logic

The SKIP_IF_NOT_ENOUGH_DEVICES macro is added at line 314 for the SimplePipelining test, but the PR description mentions two tests failing on single-GPU nodes. Verify that HostIrKernelPipelining (now disabled) and any other affected tests have appropriate device checks.

SKIP_IF_NOT_ENOUGH_DEVICES(fusion);

@xwang233 (Collaborator) commented Jan 5, 2026

For binary tests, we only run binaries whose names start with test_.

@liqiangxl requested a review from wujingyue January 5, 2026 18:43

// To do so, we will be using new Host IRs: Stream (a Val), SetStream, ForLoop.
TEST_F(MultiDeviceTutorial, HostIrKernekPipelining) {
if (communicator_->size() < 2) {
GTEST_SKIP() << "Need at least 2 devices to run this test";

Collaborator:

Can you instead call

const auto num_devices = communicator_->size();

right before runWithInput? That gives max coverage -- the HostIrContainer creation doesn't need multiple GPUs.

Collaborator (Author):

Moved the check to right before runWithInput.

Collaborator:

My message must have landed poorly -- I meant the SKIP_IF_NOT_ENOUGH_DEVICES macro:

#define SKIP_IF_NOT_ENOUGH_DEVICES(fusion) \

// Stages.
TEST_F(MultiDeviceTutorial, SimplePipelining) {
if (communicator_->size() < 2) {
GTEST_SKIP() << "Need at least 2 devices to run this test";

Collaborator:

ditto

Collaborator (Author):

Can't move it to right before runWithInput. In this case, constructing HostIrEvaluator calls HostIrEvaluator::validate(), which hits NVF_CHECK_LE(requested_n_gpus, communicator_->size()) with requested_n_gpus = 2.

Collaborator:

Then move it before HostIrEvaluator is constructed.

The reason I'm proposing this change is to avoid coupling. If/when the input fusion is changed to require 4 GPUs, SKIP_IF_NOT_ENOUGH_DEVICES will work just fine without any extra change.

@wujingyue (Collaborator):

> 2. Do we know why CI didn't catch this issue?

http://nv/e-0

I don't think CI runs multidevice_tutorial at all. I think jit.xml should be changed to run all test_multidevice_*

@liqiangxl (Collaborator, Author):

> I don't think CI runs multidevice_tutorial at all. I think jit.xml should be changed to run all test_multidevice_*

Good point. I renamed it to test_multidevice_tutorial.


@greptile-apps bot left a comment

3 files reviewed, 1 comment

Comment on lines 1128 to 1130:

if (communicator_->size() < 2) {
GTEST_SKIP() << "Need at least 2 devices to run this test";
}

greptile-apps bot (Contributor):

logic: The GPU count check is too late -- it is placed after the HostIrEvaluator construction on lines 1123-1126. When hic is moved into the evaluator, it is consumed; if the test then skips, the HostIrContainer is already destroyed. Move this check before line 1123.

Suggested change (move the skip check above the evaluator construction, so it runs before):

auto outputs = hie.runWithInput({{tv0, aten_tv0}, {tv2, aten_tv2}});

@xwang233 (Collaborator) commented Jan 5, 2026

!test

@liqiangxl changed the title from "rename to test_tutorial_multidevice" to "rename to test_multidevice_tutorial" Jan 5, 2026

@liqiangxl (Collaborator, Author):

@wujingyue CI script is updated. TEST_F(MultiDeviceTutorial, HostIrKernekPipelining) failed in CI with mpirun -np 3, so there may be a real bug. Error message: Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8); see [CI link](http://nv/e-1).

@liqiangxl (Collaborator, Author):

!test

@wujingyue (Collaborator):

> @wujingyue CI script is updated. TEST_F(MultiDeviceTutorial, HostIrKernekPipelining) failed in CI with mpirun -np 3, so there may be a real bug.

No problem -- disable that test and I'll take a look.

@wujingyue (Collaborator) left a comment:

undo LGTM

*/
// To do so, we will be using new Host IRs: Stream (a Val), SetStream, ForLoop.
TEST_F(MultiDeviceTutorial, HostIrKernekPipelining) {
GTEST_SKIP() << "Caught signal 11 (Segmentation fault: address not mapped to "
Collaborator:

Don't use GTEST_SKIP to disable. Use the DISABLED_ prefix so people can at least temporarily re-enable the test (https://google.github.io/googletest/advanced.html#temporarily-enabling-disabled-tests). In contrast, there's no way to "unskip" a GTEST_SKIP.

@wujingyue (Collaborator):

!test

@wujingyue added the enable-auto-merge label (Auto-merge a PR when: 1) PR mergeable 2) Internal CI complete 3) No failures) Jan 5, 2026

@greptile-apps bot left a comment

3 files reviewed, 1 comment

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

@wujingyue (Collaborator):

!test

@greptile-apps bot left a comment

3 files reviewed, 1 comment

@wujingyue merged commit 30e3a31 into main Jan 6, 2026
58 of 59 checks passed
@wujingyue deleted the llu/rename_test branch January 6, 2026 01:19
@github-actions bot removed the enable-auto-merge label Jan 6, 2026