`insertDeallocate` inspects inner scopes by Priya2698 · Pull Request #6007 · NVIDIA/Fuser

Priya2698 · 2026-02-24T05:28:31Z

No description provided.

github-actions · 2026-02-24T05:29:28Z

Review updated until commit 2376857

Description

Add post-dominator tree analysis to find optimal TensorView deallocation points
Implement LowestCommonAncestor class to compute latest safe deallocation location
Add memory leak detection pass to verify all allocations are properly freed
Refactor Node class and depthFirstTraverse to be reusable across tree types

Changes walkthrough

Relevant files

Enhancement

allocate_and_deallocate.cpp `Add post-dominator tree analysis for deallocation` csrc/host_ir/allocate_and_deallocate.cpp Add PostDominatorTree class for reverse-order tree traversal Implement LowestCommonAncestor to find latest safe deallocation point Add checkMemoryLeak function to verify no memory leaks exist Refactor Node class to standalone with parent pointer Extract depthFirstTraverse as reusable function Replace linear last-use scanning with post-dominator tree analysis	+234/-99

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Namespace inconsistency for scope types

In the PostDominatorTree::build function (lines 166-172), there is an inconsistency in namespace usage: hir::ForLoop is used for for loops but kir::IfThenElse is used for if-then-else. This should be verified to ensure both are using the correct types. If both should be from the same namespace (hir or kir), this could be a bug.

if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
  build(loop->body(), &node);
}
if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
  build(ite->thenBody(), &node);
  build(ite->elseBody(), &node);
}

Missing inner scope handling in allocation insertion

The DominatorTree::build function (lines 108-134) only traverses inner scopes (ForLoop, IfThenElse) in the PostDominatorTree but not in the DominatorTree used for insertAllocations. This could lead to incorrect allocation placement when tensors are defined inside inner scopes. The DominatorTree should also handle inner scopes for consistency.

void build(Scope& scope, Node* parent) {
  for (auto scope_it = scope.exprs().begin(); scope_it != scope.exprs().end();
       ++scope_it) {
    Expr* e = *scope_it;
    auto [node_it, inserted] =
        nodes_.try_emplace(e, &scope, scope_it, parent);
    NVF_ERROR(inserted);
    Node& node = node_it->second;
    if (parent != nullptr) {
      parent->addChild(&node);
    }

    if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
      // `e`, the ForLoop, dominates its body. However, the body doesn't
      // dominate the instruction after the loop, because the loop could be
      // executed zero times.
      build(loop->body(), &node);
    }

    if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
      build(ite->thenBody(), &node);
      build(ite->elseBody(), &node);
    }

    parent = &node;
  }
}

Consider adding validation for LCA computation

The LowestCommonAncestor class computes LCA for TensorViews but doesn't validate that the found LCA node's scope actually exists or is valid. Consider adding additional validation to ensure the deallocation insertion point is always valid.

for (const auto& [tv, lca_node] : lcas.getLcaMap()) {
  if (tv->isFusionInput() || tv->isFusionOutput()) {
    continue;
  }
  NVF_ERROR(
      lca_node != nullptr,
      "Could not find least common ancestor for all uses of ",
      tv);
  auto* deallocate = IrBuilder::create<hir::Deallocate>(tv);
  lca_node->scope()->insert(std::next(lca_node->iterator()), deallocate);
}

Test failures

(Medium, 1) nvFuser HopperMatmulTest matmul accuracy mismatch on H100

Test Name H100 Source

HopperMatmulTest.PingPongPersistent ❌ Link

…located when not inside for loop

Priya2698 · 2026-02-24T05:44:45Z

!test

csrc/host_ir/allocate_and_deallocate.cpp

Priya2698 · 2026-02-24T21:43:41Z

!test

Priya2698 · 2026-02-24T21:46:19Z

!test

greptile-apps · 2026-02-25T00:09:18Z

Greptile Summary

This PR replaces the simple linear scan in insertDeallocations with a Post-Dominator Tree + Lowest Common Ancestor (LCA) approach, allowing deallocation points to be placed correctly even when a tensor's last use is inside an inner scope (loop body or if-then-else branch). It also adds a checkMemoryLeak validation pass that verifies every non-fusion-I/O TensorView that is allocated or used is matched by a corresponding hir::Deallocate.

Key structural changes:

DominatorTree::Node is extracted into a standalone Node class that now carries a parent_ pointer, enabling ancestor traversal.
depthFirstTraverse is promoted to a free function reusable by both trees.
PostDominatorTree is introduced; it builds the tree by iterating scopes in reverse execution order, placing each expression's PDT parent at the immediately-following expression.
LowestCommonAncestor walks the PDT in DFS order, tracking depth via pre/post callbacks, and accumulates per-TV LCA nodes across all kir::Allocate buffers, inputs, and outputs.
insertDeallocations now inserts each hir::Deallocate immediately after its TV's LCA node, which correctly handles TVs used only inside inner scopes.
Two robustness issues: findLca's climbing loop has no null-parent guard (safe for a valid tree but undefended against a malformed one), and the NVF_ERROR(lca_node != nullptr) in insertDeallocations is unreachable by construction.

Confidence Score: 4/5

Safe to merge with the noted robustness concerns addressed; core LCA-based deallocation logic is correct.
The post-dominator tree construction and LCA algorithm are algorithmically correct for linear sequences, loops, and if-then-else branches. The checkMemoryLeak pass provides a runtime correctness guarantee. The two flagged issues (missing null guard in findLca and unreachable NVF_ERROR) are minor: the null guard concerns a code path that cannot be reached with a valid IR, and the dead assertion is harmless. No data-safety or memory-safety regressions were identified.
csrc/host_ir/allocate_and_deallocate.cpp — specifically the findLca climbing loop (lines 285–296) and the NVF_ERROR(lca_node != nullptr) guard (lines 322–325).

Important Files Changed

Filename	Overview
csrc/host_ir/allocate_and_deallocate.cpp	Major refactor of insertDeallocations to use a PostDominatorTree + LCA approach that correctly handles TVs whose last use is inside an inner scope (loop body, if-then-else branch). Also adds a checkMemoryLeak validation pass. Core algorithms are sound; two minor robustness concerns flagged: missing null guard in findLca's climbing loop, and an unreachable NVF_ERROR.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[AllocateAndDeallocate::runPass] --> B[insertAllocations]
    A --> C[insertDeallocations]
    A --> D[checkMemoryLeak]

    B --> B1[DominatorTree\nforward DFS order]
    B1 --> B2[pre_fn: insert kir::Allocate\nbefore expr if output needs prealloc]
    B2 --> B3[post_fn: remove output from\ndefined set on scope exit]

    C --> C1[PostDominatorTree\nreverse execution order]
    C1 --> C2[LowestCommonAncestor\ncompute LCA per TensorView]
    C2 --> C3{tv is fusion\ninput or output?}
    C3 -- Yes --> C4[skip]
    C3 -- No --> C5[insert hir::Deallocate\nimmediately after LCA node]

    D --> D1[Build new PostDominatorTree\non modified container]
    D1 --> D2[DFS: pre_fn adds TVs\nto allocated set]
    D2 --> D3[post_fn: erase TV\nwhen Deallocate hit]
    D3 --> D4{remaining allocated\nare all fusion I/O?}
    D4 -- Yes --> D5[pass]
    D4 -- No --> D6[NVF_ERROR: memory leak]

_{Last reviewed commit: 2376857}

greptile-apps

_{3 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

csrc/host_ir/allocate_and_deallocate.cpp

greptile-apps · 2026-02-25T23:40:05Z

Additional Comments (1)

csrc/host_ir/allocate_and_deallocate.cpp, line 269
Add null check before dereferencing definition(). While fusion inputs/outputs are filtered above, a TensorView may not have a definition in other cases (e.g., allocated but unused tensors).

  if (tv->definition() != nullptr && tv->definition()->isA<ShardByStream>()) {

greptile-apps · 2026-02-26T23:53:33Z

Additional Comments (3)

csrc/host_ir/allocate_and_deallocate.cpp, line 45
member variable is iter_ (line 66), not iterator_

    return iter_;

csrc/host_ir/allocate_and_deallocate.cpp, line 49
member variable is iter_, not iterator_

    return *iter_;

csrc/host_ir/allocate_and_deallocate.cpp, line 194
it is a reverse iterator (std::reverse_iterator<ExprList::const_iterator>) but Node constructor expects forward iterator type Scope::Iterator (ExprList::const_iterator). Convert using .base() or use forward iteration with different tree construction logic

greptile-apps · 2026-02-27T00:17:06Z

Additional Comments (2)

csrc/host_ir/allocate_and_deallocate.cpp, line 239
iter() method does not exist on Node

The Node class (defined at line 30) exposes only iterator(), not iter(). This call will fail to compile.

The diff shows this was introduced by this PR, changing node->iterator() → node->iter(), but no corresponding rename was made in the Node class definition.

            node->scope()->insert(node->iterator(), allocate);

csrc/host_ir/allocate_and_deallocate.cpp, line 181
Reverse iterator passed where Scope::Iterator (forward iterator) is expected

scope.exprs() returns const ExprList& (see Scope::exprs() in ir/internal_nodes.h), so rbegin() yields a std::list<Expr*>::const_reverse_iterator. However, the Node constructor signature is:

Node(Scope* scope, Scope::Iterator iterator, const Node* parent)

where Scope::Iterator = ExprList::const_iterator. There is no implicit conversion from const_reverse_iterator to const_iterator, so this will fail to compile.

Beyond the type error, even if the iterator were stored, insertDeallocations calls:

lca_node->scope()->insert(std::next(lca_node->iterator()), deallocate);

std::next on a reverse iterator advances backward in the list, so the insertion point would be wrong (before rather than after the LCA node's expression in program order).

The fix is to convert to a forward iterator before storing. The element pointed to by reverse iterator it can be obtained as a forward iterator via std::prev(it.base()):

    for (auto it = exprs.rbegin(); it != exprs.rend(); ++it) {
      Expr* e = *it;
      Scope::Iterator fwd_it = std::prev(it.base());
      auto [node_it, inserted] = nodes_.try_emplace(e, &scope, fwd_it, parent);

tests/cpp/test_host_ir_passes.cpp

Priya2698 · 2026-02-27T01:00:36Z

!test

Priya2698 · 2026-02-27T23:46:44Z

!test

greptile-apps · 2026-02-27T23:48:17Z

Additional Comments (1)

tests/cpp/test_host_ir_passes.cpp, line 49
Missing recursive check for kir::IfThenElse expressions. The function checks ForLoop bodies but not IfThenElse then/else bodies. If allocations exist inside if-then-else blocks, they won't be validated for proper deallocation.

    if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
      collectPersistentTensorViews(loop->body(), allocated);
    }
    if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
      collectPersistentTensorViews(ite->thenBody(), allocated);
      collectPersistentTensorViews(ite->elseBody(), allocated);
    }

wujingyue

Almost there. I'm reviewing the LCA part...

csrc/host_ir/allocate_and_deallocate.cpp

wujingyue

LGTM otherwise

csrc/host_ir/allocate_and_deallocate.cpp

tests/cpp/test_host_ir_passes.cpp

Made-with: Cursor

greptile-apps · 2026-02-28T04:56:54Z

Additional Comments (1)

tests/cpp/test_host_ir_passes.cpp, line 49
also handle kir::IfThenElse to match the main code's scope traversal

    if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
      collectPersistentTensorViews(loop->body(), allocated);
    }
    if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
      collectPersistentTensorViews(ite->thenBody(), allocated);
      collectPersistentTensorViews(ite->elseBody(), allocated);
    }

Priya2698 · 2026-03-03T00:23:39Z

!test

wujingyue · 2026-03-03T00:28:48Z

csrc/host_ir/allocate_and_deallocate.cpp

+    computeLcaMap();
+  }
+
+  const std::unordered_map<TensorView*, const Node*>& getLcaMap() const {


This still exposes the map. Consider getLca that takes a TV and returns a Node

I attempted using hic.vals() first, but since that contains tvs inside kernels, there is no LCA for them.
One way is to ignore any tv that does not have a LCA, but that would miss any actual error in traversal / recording LCAs. I find using the LCA map more understandable that skipping over some tensorviews.

Am I missing any other method here?

greptile-apps · 2026-03-03T00:33:55Z

Additional Comments (4)

csrc/host_ir/allocate_and_deallocate.cpp, line 313
Precondition check doesn't cover inner scopes

The guard validates only top-level expressions for pre-existing hir::Deallocate nodes, but insertDeallocations now also inserts deallocs inside inner scopes (loop bodies, if/else branches). If this pass were to run twice (or if any upstream pass already placed a Deallocate inside a scope), the duplicate would silently escape detection and a TV could be freed twice at runtime.

Additionally, checkMemoryLeak (which runs after insertion) uses a set-erase pattern that is idempotent — erasing an already-absent key is a no-op — so double-deallocation would not be caught by the post-pass verification either.

The check should recursively walk inner scopes, consistent with where deallocs are now being inserted:

// Recursive helper to assert no pre-existing deallocates in any scope
std::function<void(const Scope&)> assertNoDeallocate;
assertNoDeallocate = [&](const Scope& scope) {
  for (Expr* expr : scope.exprs()) {
    NVF_ERROR(
        !expr->isA<hir::Deallocate>(),
        "Expected hostir container to not have deallocate, but found one anyways: ",
        expr);
    if (auto* loop = dynamic_cast<hir::ForLoop*>(expr)) {
      assertNoDeallocate(loop->body());
    }
    if (auto* ite = dynamic_cast<kir::IfThenElse*>(expr)) {
      assertNoDeallocate(ite->thenBody());
      assertNoDeallocate(ite->elseBody());
    }
  }
};
assertNoDeallocate(hic.topLevel());

csrc/host_ir/allocate_and_deallocate.cpp, line 297
Null parent dereference in findLca when nodes share the root

The while (a != b) loop calls a = a->parent() and b = b->parent() unconditionally. The root node of the PostDominatorTree has parent_ = nullptr (set when the top-level build call processes exprs.back() with parent = nullptr). If a and b reach the root and are still not equal — which could happen if the nodes_ map is keyed on const Expr* but the two tree roots diverge due to two separate build calls somehow sharing depth-0 nodes — the next iteration would call nullptr->parent().

While the tree is correctly structured as a single rooted tree for any valid HostIrContainer, the absence of a defensive check here is fragile. Adding a null-guard makes the invariant explicit and protects against future edge cases:

while (a != b) {
  NVF_ERROR(a != nullptr && b != nullptr,
      "findLca reached null parent before finding a common ancestor");
  a = a->parent();
  b = b->parent();
}

csrc/host_ir/allocate_and_deallocate.cpp, line 270
Including TV outputs in LCA may over-constrain deallocation

The computeLcaMap includes TV outputs — i.e., expressions that produce a TV — as use-points for LCA purposes. This means, for a TV that is produced by e_produce and consumed by e_consume, both nodes anchor the LCA computation. For sequential code this is harmless (the LCA ends up at e_consume, which is later in the post-dominator order).

However, consider a TV that is the output of the last expression before a branch but is consumed only in one branch. Including the producer in the LCA shifts the insertion point to at least the producer level even if the consumer would have been sufficient. While not wrong, it could cause deallocation to be deferred beyond the strictly necessary point when e_produce is structurally deeper (earlier) than the consumer in the post-dominator tree.

Is including outputs here intentional, or was the intent to only track the production point via the kir::Allocate special-case?

csrc/host_ir/allocate_and_deallocate.cpp, line 356
checkMemoryLeak detects dealloc in post_fn but hir::Deallocate stores its TV as an attribute, not an input

From ir.cpp, hir::Deallocate::Deallocate calls addAttribute(tv) rather than addInput/addOutput. This means:

ir_utils::filterByType<TensorView>(e->inputs()) in pre_fn will not see the Deallocate's buffer.
ir_utils::filterByType<TensorView>(e->outputs()) likewise will not see it.

So pre_fn never adds the TV to allocated on behalf of a Deallocate node. This is actually fine for correctness — the TV was already inserted via the kir::Allocate special case or its consumer expressions.

However, the same attribute-vs-input distinction means that, in LowestCommonAncestor::computeLcaMap, any pre-existing hir::Deallocate nodes (if they existed despite the precondition check) would be invisible to the LCA computation. The LCA analysis could then place a second dealloc at a point that conflicts with the first, without any warning. This is another reason the precondition check needs to cover inner scopes (see the earlier comment on lines 307-313).

Priya2698 · 2026-03-03T04:10:21Z

!test

greptile-apps · 2026-03-03T04:19:23Z

Additional Comments (2)

csrc/host_ir/allocate_and_deallocate.cpp, line 296
Potential null dereference in findLca climbing loop

In the final while (a != b) loop (lines 293–296), both a = a->parent() and b = b->parent() are called without checking the return value for nullptr. The Node::parent() method returns const Node*, and the root node of the tree has parent_ = nullptr.

In a valid, single-rooted PDT this is unreachable because any two nodes at the same depth must share the same root as a common ancestor. However, if the tree is somehow malformed (e.g. a bug in PostDominatorTree::build causes disconnected components), both pointers would climb past the root to nullptr and the nullptr != nullptr comparison would exit the loop — returning nullptr. The function has early-exit guards for nullptr inputs (lines 277–282), but those don't protect the climbing loop itself.

A defensive assertion or early-exit would harden this:

  while (a != b) {
    NVF_ERROR(a != nullptr && b != nullptr,
              "findLca: reached tree root without converging; "
              "PostDominatorTree may have multiple roots");
    a = a->parent();
    b = b->parent();
  }

csrc/host_ir/allocate_and_deallocate.cpp, line 328
lca_node != nullptr check is unreachable by construction

The NVF_ERROR(lca_node != nullptr, ...) on lines 322–325 can never trigger for any TV that is present in getLcaMap(). In computeLcaMap, lca_[tv] is always set via findLca(lca_[tv], node). The very first call is findLca(nullptr, node), which immediately returns node (a non-null pointer). All subsequent calls take two non-null pointers and return a non-null ancestor. Therefore no entry in the map can hold a nullptr value.

The guard is harmless but may mislead readers into thinking there is a real code path that produces a null LCA. Consider removing it or replacing it with a comment explaining why the null case is unreachable, e.g.:

// `lca_node` is always non-null: computeLcaMap initialises every entry
// with findLca(nullptr, node) which returns node, and subsequent calls
// return a non-null common ancestor.
NVF_ERROR(lca_node != nullptr, ...);

deallocate

f2ac6bb

deallocate for all tvs in host ir -- expr evaluated tvs may not be al…

678f6ba

…located when not inside for loop

Priya2698 commented Feb 24, 2026

View reviewed changes

csrc/host_ir/allocate_and_deallocate.cpp Show resolved Hide resolved

Priya2698 commented Feb 24, 2026

View reviewed changes

csrc/host_ir/allocate_and_deallocate.cpp Outdated Show resolved Hide resolved

Priya2698 mentioned this pull request Feb 24, 2026

hir::Allocate node #6000

Merged

Priya2698 added 3 commits February 24, 2026 12:42

tighten the test, add conditions for deallocation

fd49660

bad commit

a92f4db

remove evaluate condition

e496482

Priya2698 commented Feb 24, 2026

View reviewed changes

csrc/host_ir/allocate_and_deallocate.cpp Show resolved Hide resolved

Priya2698 added 2 commits February 24, 2026 13:30

clangtidy

8219fe4

move jit changes to separate PR to address clangtidy

0cd2f48

undo a change

aec9e85

Priya2698 marked this pull request as ready for review February 25, 2026 00:05

Priya2698 requested a review from wujingyue February 25, 2026 00:06

greptile-apps bot reviewed Feb 25, 2026

View reviewed changes

csrc/host_ir/allocate_and_deallocate.cpp Outdated Show resolved Hide resolved

csrc/host_ir/allocate_and_deallocate.cpp Outdated Show resolved Hide resolved

wujingyue reviewed Feb 25, 2026

View reviewed changes

csrc/host_ir/allocate_and_deallocate.cpp Show resolved Hide resolved

csrc/host_ir/allocate_and_deallocate.cpp Outdated Show resolved Hide resolved

post dominator

a338a1c

Priya2698 commented Feb 25, 2026

View reviewed changes

csrc/host_ir/allocate_and_deallocate.cpp Show resolved Hide resolved

missing comments

80a701c

decouple build and traversal of post dominator tree

d66d161

fix naming, extract dfs

dfd195c

update the tests

a16de22

remove unused imports

6bf14a6

Priya2698 commented Feb 27, 2026

View reviewed changes

tests/cpp/test_host_ir_passes.cpp Outdated Show resolved Hide resolved

Priya2698 added 2 commits February 26, 2026 16:58

narrow an import

d897e7c

remove comment

08a5999

Priya2698 requested a review from wujingyue February 27, 2026 01:00

tighten the test

52f73dc

wujingyue reviewed Feb 28, 2026

View reviewed changes

Priya2698 and others added 3 commits February 27, 2026 20:36

Merge branch 'main' into pm/deallocate

c72b706

review comments

b72eefb

Made-with: Cursor

review comments

2078dbc

check memory leak in the pass, lca to class

3f78193

Priya2698 requested a review from wujingyue March 3, 2026 00:22

wujingyue approved these changes Mar 3, 2026

View reviewed changes

clangtidy

2376857

Priya2698 merged commit e29c17b into main Mar 3, 2026
52 checks passed

Priya2698 deleted the pm/deallocate branch March 3, 2026 18:52

Conversation

Priya2698 commented Feb 24, 2026

Uh oh!

github-actions bot commented Feb 24, 2026 • edited by xwang233 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes walkthrough

PR Reviewer Guide

Test failures

Uh oh!

Priya2698 commented Feb 24, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Priya2698 commented Feb 24, 2026

Uh oh!

Priya2698 commented Feb 24, 2026

Uh oh!

greptile-apps bot commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot commented Feb 25, 2026

Uh oh!

greptile-apps bot commented Feb 26, 2026

Uh oh!

greptile-apps bot commented Feb 27, 2026

Uh oh!

Uh oh!

Priya2698 commented Feb 27, 2026

Uh oh!

Priya2698 commented Feb 27, 2026

Uh oh!

greptile-apps bot commented Feb 27, 2026

Uh oh!

wujingyue left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wujingyue left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot commented Feb 28, 2026

Uh oh!

Priya2698 commented Mar 3, 2026

Uh oh!

wujingyue Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Priya2698 Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Mar 3, 2026

Uh oh!

Priya2698 commented Mar 3, 2026

Uh oh!

greptile-apps bot commented Mar 3, 2026

github-actions bot commented Feb 24, 2026 •

edited by xwang233

Loading

greptile-apps bot commented Feb 25, 2026 •

edited

Loading