hir::Allocate node #6000

Merged
Priya2698 merged 11 commits into main from pm/hir_allocate
Mar 4, 2026

Conversation

@Priya2698 (Collaborator) commented Feb 21, 2026

Creating a new hir::Allocate node that always allocates a new tensor. This is required to create new buffers per stream instead of reusing buffers across streams, which would require synchronization.

I am not modifying kir::Allocate handling; doing so caused errors with MultiDeviceExecutor tests.

@Priya2698 Priya2698 changed the title host IR allocate node hir::Allocate node Feb 21, 2026
github-actions bot commented Feb 21, 2026

Review updated until commit 48838a9

Description

  • Introduces a new hir::Allocate node for per-stream tensor allocation instead of reusing buffers across streams

  • Replaces kir::Allocate with hir::Allocate in host IR lowering, evaluator, and JIT compilation

  • Adds hir::Allocate handler in HostIrEvaluator for runtime tensor allocation

  • Updates tests and dispatch infrastructure to support the new allocation node

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Potential Device Issue

When communicator_ is null, the code defaults to device "cuda:0" which may not be
correct. This fallback behavior should be validated to ensure it handles all
multi-device scenarios properly, especially when no communicator is available.

c10::Device device =
    communicator_ ? communicator_->device() : at::Device("cuda:0");
Const Member Change

Changed my_local_device_index_ from const to non-const member. This change should be
reviewed to ensure proper initialization and thread-safety guarantees are maintained.

int64_t my_local_device_index_;
CacheId Error Handling

Changed from checking has_value() before dereferencing to using valueOrError() which
throws on error. This changes error handling behavior - please verify this is the
intended behavior and that proper error messages are provided.

auto cache_id = valueOrError(args.getCacheId());
expr_evaluator_.bind("cacheId", static_cast<int64_t>(cache_id));

Test failures

  • (Medium, 27) NVFuser HostIrJitTest crashes: "Handle not overriden for Allocate" across multiple suites (A100, GB200, H100):

    - HostIrJitTest.AllocationDomainReorder
    - HostIrJitTest.BroadcastTest
    - HostIrJitTest.Deallocate
    - HostIrJitTest.DynamicSizedTensorAllocate
    - HostIrJitTest.LaunchKernel
    - HostIrJitTest.Linear
    - HostIrJitTest.Matmul
    - HostIrJitTest.Permute
    - HostIrJitTest.Reorder
  • (Medium, 4) nvFuser multidevice communication equality mismatches in test_multidevice_lower_communication.cpp (A100 dist., GB200 dist.):

    - LowerGatherTest.InMesh_1_2_OutMesh_0_2_HostIr
    - LowerSendRecvTest.InMesh_1_2_OutMesh_0_1_HostIr

@Priya2698 (Collaborator Author)

!test


greptile-apps bot commented Feb 21, 2026

Greptile Summary

This PR introduces hir::Allocate, a new host-IR expression node that always allocates a fresh tensor (rather than potentially reusing one via a cache). It systematically replaces kir::Allocate across host-IR lowering, allocation/deallocation insertion, the evaluator, and the LLVM JIT compiler, while deliberately leaving kir::Allocate handling intact for the MultiDeviceExecutor path.

Key changes:

  • hir::Allocate node (ir.h / ir.cpp): registers the TV as an addInput (intentional design choice), carries memoryType and zeroInit as data attributes, and is restricted to HostIrContainer via a constructor guard.
  • allocate_and_deallocate.cpp: the kir::Allocate special-cases in the LCA computation and checkMemoryLeak are removed; the TV is now naturally tracked via hir::Allocate's inputs(), simplifying the pass.
  • lowering.cpp: all sites switch from kir::Allocate to hir::Allocate; the Communication-segment path allocates outputs with proper index access validation needed before async conversion.
  • evaluator.cpp / jit.cpp: new dispatch handlers allocate tensors via at::native::empty_strided_cuda. The evaluator path handles zeroInit; the JIT path does not (tracked in prior review threads).
  • Misc cleanups: std::endl → '\n', valueOrError, std::ranges::find_if, structured bindings for SymmetricMemoryHandle lookups.

The core node implementation, dispatch wiring, allocate/deallocate pass, and evaluator handler are well-implemented. The main concern is in lowering.cpp around the unchecked index access in the Communication-segment block.

Confidence Score: 3/5

  • The PR is functionally correct but has a defensive programming issue with unchecked casts in the Communication-segment lowering that should be addressed.
  • The core implementation of hir::Allocate, the evaluator path, and the allocate/deallocate pass are well-executed. The primary concern is in lowering.cpp where unchecked index accesses and casts occur before proper validation, which could produce unhelpful error messages if the assumptions about expression structure are violated. This is a robustness issue rather than a functional bug, but worth addressing before merge to ensure better diagnostics in edge cases.
  • csrc/host_ir/lowering.cpp requires validation of index access before the unchecked as<TensorView>() casts.

Last reviewed commit: 48838a9

greptile-apps bot left a comment: 10 files reviewed, 7 comments

greptile-apps bot left a comment: 7 files reviewed, no comments

@Priya2698 (Collaborator Author)

!test

greptile-apps bot left a comment: 9 files reviewed, no comments

greptile-apps bot left a comment: 10 files reviewed, 1 comment

@Priya2698 (Collaborator Author)

Thanks for the early feedback @wujingyue.
This PR is blocked on #6007. I will make a workaround in this PR if that takes too long.

@Priya2698 (Collaborator Author)

!test


greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

csrc/host_ir/allocate_and_deallocate.cpp
checkMemoryLeak now over-counts "allocated" TVs

Previously, only the explicitly kir::Allocate-d tensor was added to allocated. Now, the generic loop filterByType<TensorView>(e->inputs()) adds every input TV of every expression, regardless of whether it was ever explicitly allocated. This includes TVs that are fusion intermediates passed as inputs to operations like PostOnStream — they get added to allocated even if they were never the subject of an hir::Allocate.

For the memory-leak check to remain accurate, it should only flag TVs that went through an hir::Allocate allocation step. Relying on the "inputs / outputs of all expressions" heuristic gives a weaker guarantee: it can miss cases where a TV was allocated but never used (it would not appear in allocated, and thus no leak is detected).

Consider adding a specific case for hir::Allocate in the pre-traversal function (analogous to what the old kir::Allocate code did) to precisely track which TVs have live allocations:

if (auto* alloc = dynamic_cast<hir::Allocate*>(e)) {
    allocated.insert(alloc->in());
}

and removing the over-broad filterByType sweep, or at least documenting why the broader sweep is acceptable here.


greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

csrc/host_ir/allocate_and_deallocate.cpp, line 347
Silent memory-leak blind spot for residual kir::Allocate nodes

Both checkMemoryLeak and LowestCommonAncestor::computeLcaMap() now rely exclusively on the general e->inputs() / e->outputs() loops to track buffer tensors. This is correct for hir::Allocate because the buffer TV is registered via addInput. However, kir::Allocate stores its buffer via addAttribute (not addInput/addOutput), so it is invisible to those loops.

The evaluator still contains handle(kir::Allocate*), and the PR description explicitly says that modifying kir::Allocate handling "caused errors with MultiDeviceExecutor tests," implying that kir::Allocate nodes can still reach HostIrContainer at runtime. If such a node is present when AllocateAndDeallocate::runPass executes:

  • LowestCommonAncestor will not track the buffer TV → insertDeallocations will not insert a hir::Deallocate for it.
  • checkMemoryLeak will not add it to allocated → the memory leak goes undetected.

If the pass is never applied to containers that still hold kir::Allocate nodes, this is a non-issue. It would be worth adding an assertion at the start of insertAllocations / checkMemoryLeak that no kir::Allocate nodes are present, to surface any accidental regression clearly:

// At the top of insertDeallocations / checkMemoryLeak:
for (auto* expr : hic.topLevelExprs()) {
  NVF_ERROR(
      !expr->isA<kir::Allocate>(),
      "kir::Allocate found in HostIrContainer; use hir::Allocate instead.");
}


greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

csrc/host_ir/lowering.cpp, line 187
Unchecked index access and cast before validation

The refactored code calls e->input(0)->as<TensorView>() and e->output(0)->as<TensorView>() at lines 186-187, but the validation that these inputs/outputs are TensorViews happens later inside convertSingleOpToCommunication(e, device_id) at line 571. If e doesn't meet the validation criteria, the as<>() casts will fail with an unhelpful assertion before the proper error message can be printed.

Since Communication segments are guaranteed by the scheduler to contain specific expression types, consider adding explicit validation before the unchecked casts to provide better error messages:

NVF_ERROR(
    !e->inputs().empty() && e->inputs().at(0)->isA<TensorView>(),
    "Communication expression must have a TensorView as its first input");
NVF_ERROR(
    !e->outputs().empty() && e->outputs().at(0)->isA<TensorView>(),
    "Communication expression must have a TensorView as its first output");
TensorView* in = e->input(0)->as<TensorView>();
TensorView* out = e->output(0)->as<TensorView>();

@Priya2698 (Collaborator Author)

!test

@Priya2698 Priya2698 merged commit 087b61e into main Mar 4, 2026
52 checks passed
@Priya2698 Priya2698 deleted the pm/hir_allocate branch March 4, 2026 01:38