Skip to content

insertDeallocate inspects inner scopes#6007

Merged
Priya2698 merged 22 commits intomainfrom
pm/deallocate
Mar 3, 2026
Merged

insertDeallocate inspects inner scopes#6007
Priya2698 merged 22 commits intomainfrom
pm/deallocate

Conversation

@Priya2698
Copy link
Collaborator

No description provided.

@github-actions
Copy link

github-actions bot commented Feb 24, 2026

Review updated until commit 2376857

Description

  • Add post-dominator tree analysis to find optimal TensorView deallocation points

  • Implement LowestCommonAncestor class to compute latest safe deallocation location

  • Add memory leak detection pass to verify all allocations are properly freed

  • Refactor Node class and depthFirstTraverse to be reusable across tree types

Changes walkthrough

Relevant files
Enhancement
allocate_and_deallocate.cpp
Add post-dominator tree analysis for deallocation               

csrc/host_ir/allocate_and_deallocate.cpp

  • Add PostDominatorTree class for reverse-order tree traversal
  • Implement LowestCommonAncestor to find latest safe deallocation point
  • Add checkMemoryLeak function to verify no memory leaks exist
  • Refactor Node class to standalone with parent pointer
  • Extract depthFirstTraverse as reusable function
  • Replace linear last-use scanning with post-dominator tree analysis
  • +234/-99

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Namespace inconsistency for scope types

    In the PostDominatorTree::build function (lines 166-172), there is an inconsistency in namespace usage: hir::ForLoop is used for for loops but kir::IfThenElse is used for if-then-else. This should be verified to ensure both are using the correct types. If both should be from the same namespace (hir or kir), this could be a bug.

    if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
      build(loop->body(), &node);
    }
    if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
      build(ite->thenBody(), &node);
      build(ite->elseBody(), &node);
    }
    Missing inner scope handling in allocation insertion

    The DominatorTree::build function (lines 108-134) only traverses inner scopes (ForLoop, IfThenElse) in the PostDominatorTree but not in the DominatorTree used for insertAllocations. This could lead to incorrect allocation placement when tensors are defined inside inner scopes. The DominatorTree should also handle inner scopes for consistency.

    void build(Scope& scope, Node* parent) {
      for (auto scope_it = scope.exprs().begin(); scope_it != scope.exprs().end();
           ++scope_it) {
        Expr* e = *scope_it;
        auto [node_it, inserted] =
            nodes_.try_emplace(e, &scope, scope_it, parent);
        NVF_ERROR(inserted);
        Node& node = node_it->second;
        if (parent != nullptr) {
          parent->addChild(&node);
        }
    
        if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
          // `e`, the ForLoop, dominates its body. However, the body doesn't
          // dominate the instruction after the loop, because the loop could be
          // executed zero times.
          build(loop->body(), &node);
        }
    
        if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
          build(ite->thenBody(), &node);
          build(ite->elseBody(), &node);
        }
    
        parent = &node;
      }
    }
    Consider adding validation for LCA computation

    The LowestCommonAncestor class computes LCA for TensorViews but doesn't validate that the found LCA node's scope actually exists or is valid. Consider adding additional validation to ensure the deallocation insertion point is always valid.

    for (const auto& [tv, lca_node] : lcas.getLcaMap()) {
      if (tv->isFusionInput() || tv->isFusionOutput()) {
        continue;
      }
      NVF_ERROR(
          lca_node != nullptr,
          "Could not find least common ancestor for all uses of ",
          tv);
      auto* deallocate = IrBuilder::create<hir::Deallocate>(tv);
      lca_node->scope()->insert(std::next(lca_node->iterator()), deallocate);
    }

    Test failures

    • (Medium, 1) nvFuser HopperMatmulTest matmul accuracy mismatch on H100

      Test Name H100 Source
      HopperMatmulTest.PingPongPersistent Link

    @Priya2698
    Copy link
    Collaborator Author

    !test

    @Priya2698 Priya2698 mentioned this pull request Feb 24, 2026
    @Priya2698
    Copy link
    Collaborator Author

    !test

    @Priya2698
    Copy link
    Collaborator Author

    !test

    @Priya2698 Priya2698 marked this pull request as ready for review February 25, 2026 00:05
    @Priya2698 Priya2698 requested a review from wujingyue February 25, 2026 00:06
    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Feb 25, 2026

    Greptile Summary

    This PR replaces the simple linear scan in insertDeallocations with a Post-Dominator Tree + Lowest Common Ancestor (LCA) approach, allowing deallocation points to be placed correctly even when a tensor's last use is inside an inner scope (loop body or if-then-else branch). It also adds a checkMemoryLeak validation pass that verifies every non-fusion-I/O TensorView that is allocated or used is matched by a corresponding hir::Deallocate.

    Key structural changes:

    • DominatorTree::Node is extracted into a standalone Node class that now carries a parent_ pointer, enabling ancestor traversal.
    • depthFirstTraverse is promoted to a free function reusable by both trees.
    • PostDominatorTree is introduced; it builds the tree by iterating scopes in reverse execution order, placing each expression's PDT parent at the immediately-following expression.
    • LowestCommonAncestor walks the PDT in DFS order, tracking depth via pre/post callbacks, and accumulates per-TV LCA nodes across all kir::Allocate buffers, inputs, and outputs.
    • insertDeallocations now inserts each hir::Deallocate immediately after its TV's LCA node, which correctly handles TVs used only inside inner scopes.
    • Two robustness issues: findLca's climbing loop has no null-parent guard (safe for a valid tree but undefended against a malformed one), and the NVF_ERROR(lca_node != nullptr) in insertDeallocations is unreachable by construction.

    Confidence Score: 4/5

    • Safe to merge with the noted robustness concerns addressed; core LCA-based deallocation logic is correct.
    • The post-dominator tree construction and LCA algorithm are algorithmically correct for linear sequences, loops, and if-then-else branches. The checkMemoryLeak pass provides a runtime correctness guarantee. The two flagged issues (missing null guard in findLca and unreachable NVF_ERROR) are minor: the null guard concerns a code path that cannot be reached with a valid IR, and the dead assertion is harmless. No data-safety or memory-safety regressions were identified.
    • csrc/host_ir/allocate_and_deallocate.cpp — specifically the findLca climbing loop (lines 285–296) and the NVF_ERROR(lca_node != nullptr) guard (lines 322–325).

    Important Files Changed

    Filename Overview
    csrc/host_ir/allocate_and_deallocate.cpp Major refactor of insertDeallocations to use a PostDominatorTree + LCA approach that correctly handles TVs whose last use is inside an inner scope (loop body, if-then-else branch). Also adds a checkMemoryLeak validation pass. Core algorithms are sound; two minor robustness concerns flagged: missing null guard in findLca's climbing loop, and an unreachable NVF_ERROR.

    Flowchart

    %%{init: {'theme': 'neutral'}}%%
    flowchart TD
        A[AllocateAndDeallocate::runPass] --> B[insertAllocations]
        A --> C[insertDeallocations]
        A --> D[checkMemoryLeak]
    
        B --> B1[DominatorTree\nforward DFS order]
        B1 --> B2[pre_fn: insert kir::Allocate\nbefore expr if output needs prealloc]
        B2 --> B3[post_fn: remove output from\ndefined set on scope exit]
    
        C --> C1[PostDominatorTree\nreverse execution order]
        C1 --> C2[LowestCommonAncestor\ncompute LCA per TensorView]
        C2 --> C3{tv is fusion\ninput or output?}
        C3 -- Yes --> C4[skip]
        C3 -- No --> C5[insert hir::Deallocate\nimmediately after LCA node]
    
        D --> D1[Build new PostDominatorTree\non modified container]
        D1 --> D2[DFS: pre_fn adds TVs\nto allocated set]
        D2 --> D3[post_fn: erase TV\nwhen Deallocate hit]
        D3 --> D4{remaining allocated\nare all fusion I/O?}
        D4 -- Yes --> D5[pass]
        D4 -- No --> D6[NVF_ERROR: memory leak]
    
    Loading

    Last reviewed commit: 2376857

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    3 files reviewed, 2 comments

    Edit Code Review Agent Settings | Greptile

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Feb 25, 2026

    Additional Comments (1)

    csrc/host_ir/allocate_and_deallocate.cpp, line 269
    Add null check before dereferencing definition(). While fusion inputs/outputs are filtered above, a TensorView may not have a definition in other cases (e.g., allocated but unused tensors).

      if (tv->definition() != nullptr && tv->definition()->isA<ShardByStream>()) {
    

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Feb 26, 2026

    Additional Comments (3)

    csrc/host_ir/allocate_and_deallocate.cpp, line 45
    member variable is iter_ (line 66), not iterator_

        return iter_;
    

    csrc/host_ir/allocate_and_deallocate.cpp, line 49
    member variable is iter_, not iterator_

        return *iter_;
    

    csrc/host_ir/allocate_and_deallocate.cpp, line 194
    it is a reverse iterator (std::reverse_iterator<ExprList::const_iterator>) but Node constructor expects forward iterator type Scope::Iterator (ExprList::const_iterator). Convert using .base() or use forward iteration with different tree construction logic

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Feb 27, 2026

    Additional Comments (2)

    csrc/host_ir/allocate_and_deallocate.cpp, line 239
    iter() method does not exist on Node

    The Node class (defined at line 30) exposes only iterator(), not iter(). This call will fail to compile.

    The diff shows this was introduced by this PR, changing node->iterator()node->iter(), but no corresponding rename was made in the Node class definition.

                node->scope()->insert(node->iterator(), allocate);
    

    csrc/host_ir/allocate_and_deallocate.cpp, line 181
    Reverse iterator passed where Scope::Iterator (forward iterator) is expected

    scope.exprs() returns const ExprList& (see Scope::exprs() in ir/internal_nodes.h), so rbegin() yields a std::list<Expr*>::const_reverse_iterator. However, the Node constructor signature is:

    Node(Scope* scope, Scope::Iterator iterator, const Node* parent)

    where Scope::Iterator = ExprList::const_iterator. There is no implicit conversion from const_reverse_iterator to const_iterator, so this will fail to compile.

    Beyond the type error, even if the iterator were stored, insertDeallocations calls:

    lca_node->scope()->insert(std::next(lca_node->iterator()), deallocate);

    std::next on a reverse iterator advances backward in the list, so the insertion point would be wrong (before rather than after the LCA node's expression in program order).

    The fix is to convert to a forward iterator before storing. The element pointed to by reverse iterator it can be obtained as a forward iterator via std::prev(it.base()):

        for (auto it = exprs.rbegin(); it != exprs.rend(); ++it) {
          Expr* e = *it;
          Scope::Iterator fwd_it = std::prev(it.base());
          auto [node_it, inserted] = nodes_.try_emplace(e, &scope, fwd_it, parent);

    @Priya2698
    Copy link
    Collaborator Author

    !test

    @Priya2698 Priya2698 requested a review from wujingyue February 27, 2026 01:00
    @Priya2698
    Copy link
    Collaborator Author

    !test

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Feb 27, 2026

    Additional Comments (1)

    tests/cpp/test_host_ir_passes.cpp, line 49
    Missing recursive check for kir::IfThenElse expressions. The function checks ForLoop bodies but not IfThenElse then/else bodies. If allocations exist inside if-then-else blocks, they won't be validated for proper deallocation.

        if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
          collectPersistentTensorViews(loop->body(), allocated);
        }
        if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
          collectPersistentTensorViews(ite->thenBody(), allocated);
          collectPersistentTensorViews(ite->elseBody(), allocated);
        }
    

    Copy link
    Collaborator

    @wujingyue wujingyue left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Almost there. I'm reviewing the LCA part...

    Copy link
    Collaborator

    @wujingyue wujingyue left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    LGTM otherwise

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Feb 28, 2026

    Additional Comments (1)

    tests/cpp/test_host_ir_passes.cpp, line 49
    also handle kir::IfThenElse to match the main code's scope traversal

        if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
          collectPersistentTensorViews(loop->body(), allocated);
        }
        if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
          collectPersistentTensorViews(ite->thenBody(), allocated);
          collectPersistentTensorViews(ite->elseBody(), allocated);
        }
    

    @Priya2698 Priya2698 requested a review from wujingyue March 3, 2026 00:22
    @Priya2698
    Copy link
    Collaborator Author

    !test

    computeLcaMap();
    }

    const std::unordered_map<TensorView*, const Node*>& getLcaMap() const {
    Copy link
    Collaborator

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    This still exposes the map. Consider getLca that takes a TV and returns a Node

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    I attempted using hic.vals() first, but since that contains tvs inside kernels, there is no LCA for them.
    One way is to ignore any tv that does not have a LCA, but that would miss any actual error in traversal / recording LCAs. I find using the LCA map more understandable that skipping over some tensorviews.

    Am I missing any other method here?

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Mar 3, 2026

    Additional Comments (4)

    csrc/host_ir/allocate_and_deallocate.cpp, line 313
    Precondition check doesn't cover inner scopes

    The guard validates only top-level expressions for pre-existing hir::Deallocate nodes, but insertDeallocations now also inserts deallocs inside inner scopes (loop bodies, if/else branches). If this pass were to run twice (or if any upstream pass already placed a Deallocate inside a scope), the duplicate would silently escape detection and a TV could be freed twice at runtime.

    Additionally, checkMemoryLeak (which runs after insertion) uses a set-erase pattern that is idempotent — erasing an already-absent key is a no-op — so double-deallocation would not be caught by the post-pass verification either.

    The check should recursively walk inner scopes, consistent with where deallocs are now being inserted:

    // Recursive helper to assert no pre-existing deallocates in any scope
    std::function<void(const Scope&)> assertNoDeallocate;
    assertNoDeallocate = [&](const Scope& scope) {
      for (Expr* expr : scope.exprs()) {
        NVF_ERROR(
            !expr->isA<hir::Deallocate>(),
            "Expected hostir container to not have deallocate, but found one anyways: ",
            expr);
        if (auto* loop = dynamic_cast<hir::ForLoop*>(expr)) {
          assertNoDeallocate(loop->body());
        }
        if (auto* ite = dynamic_cast<kir::IfThenElse*>(expr)) {
          assertNoDeallocate(ite->thenBody());
          assertNoDeallocate(ite->elseBody());
        }
      }
    };
    assertNoDeallocate(hic.topLevel());

    csrc/host_ir/allocate_and_deallocate.cpp, line 297
    Null parent dereference in findLca when nodes share the root

    The while (a != b) loop calls a = a->parent() and b = b->parent() unconditionally. The root node of the PostDominatorTree has parent_ = nullptr (set when the top-level build call processes exprs.back() with parent = nullptr). If a and b reach the root and are still not equal — which could happen if the nodes_ map is keyed on const Expr* but the two tree roots diverge due to two separate build calls somehow sharing depth-0 nodes — the next iteration would call nullptr->parent().

    While the tree is correctly structured as a single rooted tree for any valid HostIrContainer, the absence of a defensive check here is fragile. Adding a null-guard makes the invariant explicit and protects against future edge cases:

    while (a != b) {
      NVF_ERROR(a != nullptr && b != nullptr,
          "findLca reached null parent before finding a common ancestor");
      a = a->parent();
      b = b->parent();
    }

    csrc/host_ir/allocate_and_deallocate.cpp, line 270
    Including TV outputs in LCA may over-constrain deallocation

    The computeLcaMap includes TV outputs — i.e., expressions that produce a TV — as use-points for LCA purposes. This means, for a TV that is produced by e_produce and consumed by e_consume, both nodes anchor the LCA computation. For sequential code this is harmless (the LCA ends up at e_consume, which is later in the post-dominator order).

    However, consider a TV that is the output of the last expression before a branch but is consumed only in one branch. Including the producer in the LCA shifts the insertion point to at least the producer level even if the consumer would have been sufficient. While not wrong, it could cause deallocation to be deferred beyond the strictly necessary point when e_produce is structurally deeper (earlier) than the consumer in the post-dominator tree.

    Is including outputs here intentional, or was the intent to only track the production point via the kir::Allocate special-case?


    csrc/host_ir/allocate_and_deallocate.cpp, line 356
    checkMemoryLeak detects dealloc in post_fn but hir::Deallocate stores its TV as an attribute, not an input

    From ir.cpp, hir::Deallocate::Deallocate calls addAttribute(tv) rather than addInput/addOutput. This means:

    • ir_utils::filterByType<TensorView>(e->inputs()) in pre_fn will not see the Deallocate's buffer.
    • ir_utils::filterByType<TensorView>(e->outputs()) likewise will not see it.

    So pre_fn never adds the TV to allocated on behalf of a Deallocate node. This is actually fine for correctness — the TV was already inserted via the kir::Allocate special case or its consumer expressions.

    However, the same attribute-vs-input distinction means that, in LowestCommonAncestor::computeLcaMap, any pre-existing hir::Deallocate nodes (if they existed despite the precondition check) would be invisible to the LCA computation. The LCA analysis could then place a second dealloc at a point that conflicts with the first, without any warning. This is another reason the precondition check needs to cover inner scopes (see the earlier comment on lines 307-313).

    @Priya2698
    Copy link
    Collaborator Author

    !test

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Mar 3, 2026

    Additional Comments (2)

    csrc/host_ir/allocate_and_deallocate.cpp, line 296
    Potential null dereference in findLca climbing loop

    In the final while (a != b) loop (lines 293–296), both a = a->parent() and b = b->parent() are called without checking the return value for nullptr. The Node::parent() method returns const Node*, and the root node of the tree has parent_ = nullptr.

    In a valid, single-rooted PDT this is unreachable because any two nodes at the same depth must share the same root as a common ancestor. However, if the tree is somehow malformed (e.g. a bug in PostDominatorTree::build causes disconnected components), both pointers would climb past the root to nullptr and the nullptr != nullptr comparison would exit the loop — returning nullptr. The function has early-exit guards for nullptr inputs (lines 277–282), but those don't protect the climbing loop itself.

    A defensive assertion or early-exit would harden this:

      while (a != b) {
        NVF_ERROR(a != nullptr && b != nullptr,
                  "findLca: reached tree root without converging; "
                  "PostDominatorTree may have multiple roots");
        a = a->parent();
        b = b->parent();
      }

    csrc/host_ir/allocate_and_deallocate.cpp, line 328
    lca_node != nullptr check is unreachable by construction

    The NVF_ERROR(lca_node != nullptr, ...) on lines 322–325 can never trigger for any TV that is present in getLcaMap(). In computeLcaMap, lca_[tv] is always set via findLca(lca_[tv], node). The very first call is findLca(nullptr, node), which immediately returns node (a non-null pointer). All subsequent calls take two non-null pointers and return a non-null ancestor. Therefore no entry in the map can hold a nullptr value.

    The guard is harmless but may mislead readers into thinking there is a real code path that produces a null LCA. Consider removing it or replacing it with a comment explaining why the null case is unreachable, e.g.:

    // `lca_node` is always non-null: computeLcaMap initialises every entry
    // with findLca(nullptr, node) which returns node, and subsequent calls
    // return a non-null common ancestor.
    NVF_ERROR(lca_node != nullptr, ...);

    @Priya2698 Priya2698 merged commit e29c17b into main Mar 3, 2026
    52 checks passed
    @Priya2698 Priya2698 deleted the pm/deallocate branch March 3, 2026 18:52
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants