hir::Allocate node #6000

Merged
Priya2698 merged 11 commits into main from pm/hir_allocate
Mar 4, 2026

Conversation

@Priya2698 (Collaborator) commented Feb 21, 2026

Creating a new hir::Allocate node that always allocates a new tensor. This is required to create new buffers per stream instead of reusing buffers across streams, which would require synchronization.

I am not modifying kir::Allocate handling; doing so caused errors with MultiDeviceExecutor tests.

@Priya2698 Priya2698 changed the title host IR allocate node hir::Allocate node Feb 21, 2026
github-actions bot commented Feb 21, 2026

Review updated until commit 48838a9

Description

  • Introduces a new hir::Allocate node for per-stream tensor allocation instead of reusing buffers across streams

  • Replaces kir::Allocate with hir::Allocate in host IR lowering, evaluator, and JIT compilation

  • Adds hir::Allocate handler in HostIrEvaluator for runtime tensor allocation

  • Updates tests and dispatch infrastructure to support the new allocation node

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Potential Device Issue

When communicator_ is null, the code defaults to device "cuda:0" which may not be
correct. This fallback behavior should be validated to ensure it handles all
multi-device scenarios properly, especially when no communicator is available.

c10::Device device =
    communicator_ ? communicator_->device() : at::Device("cuda:0");
Const Member Change

Changed my_local_device_index_ from const to non-const member. This change should be
reviewed to ensure proper initialization and thread-safety guarantees are maintained.

int64_t my_local_device_index_;
CacheId Error Handling

Changed from checking has_value() before dereferencing to using valueOrError() which
throws on error. This changes error handling behavior - please verify this is the
intended behavior and that proper error messages are provided.

auto cache_id = valueOrError(args.getCacheId());
expr_evaluator_.bind("cacheId", static_cast<int64_t>(cache_id));

Test failures

  • (Medium, 27) NVFuser HostIrJitTest crashes: "Handle not overriden for Allocate" across multiple suites (A100, GB200, H100):

    - HostIrJitTest.AllocationDomainReorder
    - HostIrJitTest.BroadcastTest
    - HostIrJitTest.Deallocate
    - HostIrJitTest.DynamicSizedTensorAllocate
    - HostIrJitTest.LaunchKernel
    - HostIrJitTest.Linear
    - HostIrJitTest.Matmul
    - HostIrJitTest.Permute
    - HostIrJitTest.Reorder
  • (Medium, 4) nvFuser multidevice communication equality mismatches in test_multidevice_lower_communication.cpp (A100 dist., GB200 dist.):

    - LowerGatherTest.InMesh_1_2_OutMesh_0_2_HostIr
    - LowerSendRecvTest.InMesh_1_2_OutMesh_0_1_HostIr

@Priya2698 (Collaborator Author)

!test


greptile-apps bot commented Feb 21, 2026

Greptile Summary

This PR introduces hir::Allocate, a new host-IR expression node that always allocates a fresh tensor (rather than potentially reusing one via a cache). It systematically replaces kir::Allocate across host-IR lowering, allocation/deallocation insertion, the evaluator, and the LLVM JIT compiler, while deliberately leaving kir::Allocate handling intact for the MultiDeviceExecutor path.

Key changes:

  • hir::Allocate node (ir.h / ir.cpp): registers the TV as an addInput (intentional design choice), carries memoryType and zeroInit as data attributes, and is restricted to HostIrContainer via a constructor guard.
  • allocate_and_deallocate.cpp: the kir::Allocate special-cases in the LCA computation and checkMemoryLeak are removed; the TV is now naturally tracked via hir::Allocate's inputs(), simplifying the pass.
  • lowering.cpp: all sites switch from kir::Allocate to hir::Allocate; the Communication-segment path allocates outputs with proper index access validation needed before async conversion.
  • evaluator.cpp / jit.cpp: new dispatch handlers allocate tensors via at::native::empty_strided_cuda. The evaluator path handles zeroInit; the JIT path does not (tracked in prior review threads).
  • Misc cleanups: std::endl → '\n', valueOrError, std::ranges::find_if, structured bindings for SymmetricMemoryHandle lookups.

The core node implementation, dispatch wiring, allocate/deallocate pass, and evaluator handler are well-implemented. The main concern is in lowering.cpp around the unchecked index access in the Communication-segment block.

Confidence Score: 3/5

  • The PR is functionally correct but has a defensive programming issue with unchecked casts in the Communication-segment lowering that should be addressed.
  • The core implementation of hir::Allocate, the evaluator path, and the allocate/deallocate pass are well-executed. The primary concern is in lowering.cpp where unchecked index accesses and casts occur before proper validation, which could produce unhelpful error messages if the assumptions about expression structure are violated. This is a robustness issue rather than a functional bug, but worth addressing before merge to ensure better diagnostics in edge cases.
  • csrc/host_ir/lowering.cpp requires validation of index access before the unchecked as<TensorView>() casts.

Last reviewed commit: 48838a9

greptile-apps bot left a comment: 10 files reviewed, 7 comments

greptile-apps bot left a comment: 7 files reviewed, no comments

@Priya2698 (Collaborator Author)

!test

greptile-apps bot left a comment: 9 files reviewed, no comments

greptile-apps bot left a comment: 10 files reviewed, 1 comment

@Priya2698 (Collaborator Author)

Thanks for the early feedback @wujingyue.
This PR is blocked on #6007. I will make a workaround in this PR if that takes too long.

@Priya2698 (Collaborator Author)

!test


greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

csrc/host_ir/allocate_and_deallocate.cpp
checkMemoryLeak now over-counts "allocated" TVs

Previously, only the explicitly kir::Allocate-d tensor was added to allocated. Now, the generic loop filterByType<TensorView>(e->inputs()) adds every input TV of every expression, regardless of whether it was ever explicitly allocated. This includes TVs that are fusion intermediates passed as inputs to operations like PostOnStream — they get added to allocated even if they were never the subject of an hir::Allocate.

For the memory-leak check to remain accurate, it should only flag TVs that went through an hir::Allocate allocation step. Relying on the "inputs / outputs of all expressions" heuristic gives a weaker guarantee: it can miss cases where a TV was allocated but never used (it would not appear in allocated, and thus no leak is detected).

Consider adding a specific case for hir::Allocate in the pre-traversal function (analogous to what the old kir::Allocate code did) to precisely track which TVs have live allocations:

if (auto* alloc = dynamic_cast<hir::Allocate*>(e)) {
    allocated.insert(alloc->in());
}

and removing the over-broad filterByType sweep, or at least documenting why the broader sweep is acceptable here.


greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

csrc/host_ir/allocate_and_deallocate.cpp, line 347
Silent memory-leak blind spot for residual kir::Allocate nodes

Both checkMemoryLeak and LowestCommonAncestor::computeLcaMap() now rely exclusively on the general e->inputs() / e->outputs() loops to track buffer tensors. This is correct for hir::Allocate because the buffer TV is registered via addInput. However, kir::Allocate stores its buffer via addAttribute (not addInput/addOutput), so it is invisible to those loops.

The evaluator still contains handle(kir::Allocate*), and the PR description explicitly says that modifying kir::Allocate handling "caused errors with MultiDeviceExecutor tests," implying that kir::Allocate nodes can still reach HostIrContainer at runtime. If such a node is present when AllocateAndDeallocate::runPass executes:

  • LowestCommonAncestor will not track the buffer TV → insertDeallocations will not insert a hir::Deallocate for it.
  • checkMemoryLeak will not add it to allocated → the memory leak goes undetected.

If the pass is never applied to containers that still hold kir::Allocate nodes, this is a non-issue. It would be worth adding an assertion at the start of insertAllocations / checkMemoryLeak that no kir::Allocate nodes are present, to surface any accidental regression clearly:

// At the top of insertDeallocations / checkMemoryLeak:
for (auto* expr : hic.topLevelExprs()) {
  NVF_ERROR(
      !expr->isA<kir::Allocate>(),
      "kir::Allocate found in HostIrContainer; use hir::Allocate instead.");
}


greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

csrc/host_ir/lowering.cpp, line 187
Unchecked index access and cast before validation

The refactored code calls e->input(0)->as<TensorView>() and e->output(0)->as<TensorView>() at lines 186-187, but the validation that these inputs/outputs are TensorViews happens later inside convertSingleOpToCommunication(e, device_id) at line 571. If e doesn't meet the validation criteria, the as<>() casts will fail with an unhelpful assertion before the proper error message can be printed.

Since Communication segments are guaranteed by the scheduler to contain specific expression types, consider adding explicit validation before the unchecked casts to provide better error messages:

NVF_ERROR(
    !e->inputs().empty() && e->inputs().at(0)->isA<TensorView>(),
    "Communication expression must have a TensorView as its first input");
NVF_ERROR(
    !e->outputs().empty() && e->outputs().at(0)->isA<TensorView>(),
    "Communication expression must have a TensorView as its first output");
TensorView* in = e->input(0)->as<TensorView>();
TensorView* out = e->output(0)->as<TensorView>();

@Priya2698 (Collaborator Author)

!test

@Priya2698 Priya2698 merged commit 087b61e into main Mar 4, 2026
52 checks passed
@Priya2698 Priya2698 deleted the pm/hir_allocate branch March 4, 2026 01:38