
Conversation

kurisu6912
Collaborator

@kurisu6912 kurisu6912 commented Sep 10, 2025

  • Fix bug 0905: vectorize with broadcasted value
  • fix lint error

Summary by CodeRabbit

  • Performance
    • Improved loop vectorization by using stride/offset analysis and tighter vector-size bounds for more reliable, faster vector loads/stores on contiguous memory.
  • Refactor
    • Simplified vectorization flow by removing dynamic/conditional paths and making vector-size selection deterministic.
  • Documentation
    • Added a public independence-check utility to clarify when index expressions are safe to vectorize.

Contributor

coderabbitai bot commented Sep 10, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Removes dynamic/conditional vectorization paths and iter_map usage; introduces stride/offset-based vector-size derivation with independence/contiguity checks (CanProveIndependent, IndiceCanVectorize). Simplifies VectorizeRewriter to a fixed vector_size and updates VectorizeLoop to use the planner when no hint is provided.

Changes

Cohort / File(s) Summary
Planner / rewriter refactor
src/transform/loop_vectorize.cc
Removed dynamic/condition APIs and iter_map initialization; deleted GetDynamic(), GetCondition(), and GetVectorizePlanResult(...). Replaced gcd-heavy dynamic logic with stride/offset-based computations, independence checks, and a shrink-to-fit vector-size loop. Added CanProveIndependent(...) and IndiceCanVectorize(...). Updated VectorizeRewriter constructor to VectorizeRewriter(int vector_size) and simplified inner-loop rewriting to a fixed-vector path.
Public header update
src/transform/loop_vectorize.h
Added declaration: bool CanProveIndependent(const PrimExpr &expr, Var var, arith::Analyzer *analyzer);. No other public declarations changed.
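The shrink-to-fit vector-size selection described in the summary can be sketched outside of TVM. `choose_vector_size` and its predicate are illustrative stand-ins for the planner's halving loop and `IndiceCanVectorize`, not the actual TileLang API:

```python
def choose_vector_size(max_size: int, can_vectorize) -> int:
    """Halve the candidate vector size until the predicate accepts it.

    Mirrors the planner's loop: start from the tight upper bound and
    shrink by powers of two, bottoming out at 1 (always vectorizable).
    """
    size = max_size
    while size > 1 and not can_vectorize(size):
        size //= 2
    return size

# Example: suppose accesses are only contiguous in groups of 4 lanes.
print(choose_vector_size(16, lambda s: s <= 4))  # -> 4
# If nothing wider than a scalar works, the result degrades to 1.
print(choose_vector_size(8, lambda s: False))    # -> 1
```

Because the loop only ever halves, the result is always a power-of-two divisor of the starting bound, which is what the rewriter's divisibility check relies on.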

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller as Pass/Caller
  participant VL as VectorizeLoop
  participant Planner as VectorizePlanner
  participant Analyzer as arith::Analyzer
  participant Rewriter as VectorizeRewriter

  Caller->>VL: Invoke(loop, vector_size_hint?)
  alt hint <= 0
    VL->>Planner: Analyze loop (compute strides/elem_offset)
    Planner->>Analyzer: Request independence/stride facts
    Analyzer-->>Planner: Facts
    Planner-->>VL: chosen_vector_size
  else hint > 0
    VL-->>VL: use provided vector_size
  end

  VL->>Rewriter: Construct(vector_size)
  Rewriter->>Analyzer: Call CanProveIndependent / IndiceCanVectorize
  alt contiguity & alignment OK
    Rewriter-->>VL: produce vectorized IR (kVectorized or nested vector/outer)
  else fails
    Rewriter-->>VL: shrink vector_size and retry or fallback
  end
  VL-->>Caller: Transformed IR

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • LeiWang1999

Poem

I hop through loops with nimble nose,
I count the strides where memory goes.
No dynamic tails, just steady pace,
I line up lanes and speed the race.
— a rabbit, vectorizing with grace 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly identifies the primary intent (fixing bug 0905) and gives a concise, concrete code example (B[i,j] = c[i] + A[i,j]) that matches the PR's changes. The diff focuses on loop_vectorize (stride/offset logic, CanProveIndependent, IndiceCanVectorize, and related rewrites) to enable vectorization for broadcasted indices, which directly addresses the bug described. The phrasing is specific and appropriate as a single-sentence PR title for history scanning.



👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @kurisu6912, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the loop vectorization pass, primarily to fix a bug concerning broadcasted values and to streamline the overall vectorization planning and rewriting process. The changes simplify the internal state management of the vectorization planner and introduce a new utility to improve the analysis of memory access patterns for vectorization.

Highlights

  • Bug Fix: Addresses bug 0905 related to vectorization with broadcasted values.
  • Code Refactoring: Simplifies the loop vectorization logic by removing unnecessary dynamic shape handling and related state.
  • Improved Vectorization Analysis: Enhances the IndiceCanVectorize function with a more robust check for contiguous and aligned memory access.
  • New Utility Function: Introduces CanProveIndependent to determine if an expression's value is independent of a given variable.

@kurisu6912 kurisu6912 changed the title kurisu refactor vectorize patch 1 [Fix] Fix bug 0905: tilelang doesn't vectorize B[i,j] = c[i] + A[i,j] Sep 10, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the loop vectorization logic, primarily to fix an issue with handling broadcasted values. The changes simplify the code by removing the complex and possibly defunct logic for dynamic/conditional vectorization and introducing a cleaner way to detect broadcasts. The new implementation is more readable and maintainable. I have one suggestion to further improve the code by using a standard TVM utility function, which will make it more concise and idiomatic.

Comment on lines 198 to 212
struct FindVarVisitor : ExprVisitor {
  Var target;
  bool found = false;
  FindVarVisitor(Var target) : target(std::move(target)) {}
  void run(const PrimExpr &expr) { this->VisitExpr(expr); }
  void VisitExpr_(const VarNode *node) final {
    if (node == target.get()) {
      found = true;
    }
  }
};
FindVarVisitor visitor(var);
visitor.run(expr);
if (!visitor.found)
  return true;
Contributor


medium

The FindVarVisitor struct and its usage can be simplified by using the tvm::tir::UsesVar utility function. This makes the code more concise and idiomatic to the TVM codebase.

You would need to add the following include at the top of the file:

#include <tvm/tir/analysis.h>
  if (!tvm::tir::UsesVar(expr, [&var](const Var& v) { return v.same_as(var); })) {
    return true;
  }

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54aaec9 and 1fb176a.

📒 Files selected for processing (2)
  • src/transform/loop_vectorize.cc (5 hunks)
  • src/transform/loop_vectorize.h (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/transform/loop_vectorize.h (1)
src/transform/loop_vectorize.cc (4)
  • CanProveIndependent (195-220)
  • CanProveIndependent (195-196)
  • expr (202-202)
  • expr (202-202)
src/transform/loop_vectorize.cc (2)
src/transform/atomicadd_vectorize.cc (2)
  • indices (83-131)
  • indices (83-83)
src/transform/loop_vectorize_dynamic.cc (18)
  • indices (141-194)
  • indices (141-141)
  • node (80-85)
  • node (80-80)
  • node (92-96)
  • node (92-92)
  • node (98-112)
  • node (98-98)
  • node (114-120)
  • node (114-114)
  • node (122-125)
  • node (122-122)
  • node (127-135)
  • node (127-127)
  • node (263-266)
  • node (263-263)
  • node (280-283)
  • node (280-280)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-nvidia
  • GitHub Check: bot-task
🔇 Additional comments (10)
src/transform/loop_vectorize.h (1)

40-43: LGTM! Clear and comprehensive documentation.

The function declaration is well-documented with clear parameter names and documentation explaining its purpose - determining if an expression is independent of a variable.

src/transform/loop_vectorize.cc (9)

114-127: LGTM! Improved stride computation with proper fallback.

The stride computation logic correctly handles both cases: when buffer strides are explicitly provided and when they need to be computed from buffer shapes. The element offset calculation using dot product of indices and strides is mathematically sound.
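As a rough model of that computation (plain Python with illustrative names; the real pass builds PrimExprs and simplifies them with TVM's analyzer):

```python
def derive_strides(shape):
    """Row-major strides from a buffer shape, used when explicit strides
    are absent. The innermost dimension gets stride 1."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def elem_offset(indices, strides):
    """Flat element offset as the dot product of indices and strides."""
    return sum(i * s for i, s in zip(indices, strides))

print(derive_strides([4, 8, 16]))          # -> [128, 16, 1]
print(elem_offset([1, 2, 3], [128, 16, 1]))  # -> 163
```

Contiguity along the innermost loop variable then reduces to asking whether `elem_offset` advances by exactly 1 per iteration of that variable.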


129-133: LGTM! Independence check prevents unnecessary vectorization.

The early return when an element offset is proven independent of the loop variable is a good optimization that avoids attempting vectorization on expressions that don't benefit from it.


135-137: LGTM! Simplified vectorization bound computation.

The tight vectorization bound computation is cleaner and more straightforward than the previous GCD-based approach, while still ensuring proper alignment with hardware vector capabilities.


139-144: LGTM! Iterative vectorization size reduction.

The while loop that progressively halves the vector size until IndiceCanVectorize succeeds is a robust approach to finding the largest viable vectorization factor.


155-155: LGTM! Simplified constructor signature.

The constructor now takes a simple int vector_size parameter instead of a complex plan result, which aligns with the simplified vectorization approach.


170-183: LGTM! Clear loop transformation logic.

The loop transformation handles both cases appropriately:

  • When extent equals vector size: mark the loop as vectorized
  • Otherwise: split into outer scalar loop and inner vectorized loop with proper variable substitution
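The index math behind that split can be checked with a small sketch (illustrative Python, not the TIR rewrite itself): every original iteration index is recovered as `outer * vector_size + inner`, so the substitution preserves the iteration space exactly.

```python
def split_iteration(extent, vsize):
    """Enumerate indices produced by splitting a loop of `extent`
    into an outer loop and an inner (vectorized) loop of width `vsize`."""
    assert extent % vsize == 0, "rewriter requires extent divisible by vector size"
    covered = []
    for outer in range(extent // vsize):
        for inner in range(vsize):  # this inner loop is what gets kVectorized
            covered.append(outer * vsize + inner)
    return covered

print(split_iteration(8, 4) == list(range(8)))  # -> True
```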

195-220: LGTM! Robust independence analysis implementation.

The CanProveIndependent function implements a two-step approach:

  1. Check if the variable exists in the expression using a visitor pattern
  2. Use substitution and equality proving to check mathematical independence

This matches the declaration in the header file and provides a solid foundation for the independence checks.
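The substitution idea behind CanProveIndependent can be modeled with a toy check. Here `expr` is a plain Python function of the loop variable and an environment value; unlike the analyzer's symbolic proof, this sketch merely samples a few substitutions, so treat it as intuition, not the real algorithm:

```python
def can_prove_independent(expr, samples=range(4), env=0):
    """Toy independence check: substitute several values for the variable
    and see whether the expression's value ever changes."""
    vals = {expr(v, env) for v in samples}
    return len(vals) == 1  # identical under every substitution tried

# c[i]'s offset does not depend on the inner loop variable j:
print(can_prove_independent(lambda j, i: i * 10))       # -> True
# A[i, j]'s offset does depend on j:
print(can_prove_independent(lambda j, i: i * 10 + j))   # -> False
```

In the pass, an offset proven independent of the loop variable corresponds to a broadcasted value, which is safe to vectorize as a splat rather than a strided load.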


262-264: LGTM! Clean planner integration.

The VectorizeLoop function now properly uses the VectorizePlanner when no hint is provided, creating a clear separation between planning and rewriting phases.


267-268: LGTM! Simplified rewriter instantiation.

The rewriter is now instantiated with just the vector size, consistent with the simplified approach that removed dynamic/conditional vectorization paths.

Comment on lines 240 to 258
PrimExpr access_pos = analyzer->Simplify(
    Substitute(expr, {{var, v0 + v1 * target_vectorized_size}}));
Vectorizer vectorizer(v0, IntImm(v0->dtype, target_vectorized_size));
PrimExpr expr_vectorized = vectorizer.VisitExpr(expr_transformed);

// This simplify is necessary for thread region specified
// optimizations.
expr_vectorized = analyzer->Simplify(expr_vectorized);
auto ramp_node = expr_vectorized.as<RampNode>();
if (!ramp_node) {
  // Broadcast value
  if (expr_vectorized.dtype().lanes() == 1)
    return true;
  else
    return false;
} else {
  return is_one(ramp_node->stride);
  // for (int ph_v = target_vectorized_size; ph_v > 1; ph_v /= 2) {
  // ph_v: physical load/store vectorized size
  // TODO: allow a more generalized vectorize: B[i] = A[i // 2]
  auto ph_v = target_vectorized_size;
  auto group = target_vectorized_size / ph_v;
  // Check if access_pos is contingentous: ap === v0 // group (mod ph_v)
  auto is_contingous =
      analyzer->CanProveEqual(FloorMod(access_pos, ph_v), FloorDiv(v0, group));
  // Check if access is aligned
  auto is_aligned =
      analyzer->CanProveEqual(FloorMod(Substitute(expr, {{var, 0}}), ph_v), 0);
  if (is_contingous && is_aligned) {
    return true;
  }
  // }
  return false;
}
Contributor


💡 Verification agent

🧩 Analysis chain

Verify the contiguity check logic.

The contiguity and alignment checks in IndiceCanVectorize have been significantly simplified. The logic checks:

  • access_pos is contiguous: FloorMod(access_pos, ph_v) == FloorDiv(v0, group)
  • Access is aligned: FloorMod(Substitute(expr, {{var, 0}}), ph_v) == 0

However, there are some concerns:

  1. Line 246: group = target_vectorized_size / ph_v where ph_v = target_vectorized_size results in group = 1, which may not be the intended behavior
  2. The commented-out loop suggests this was meant to handle multiple physical vector sizes, but now only handles one case

Run the following script to check if there are test cases that validate this vectorization logic:


🏁 Script executed:

#!/bin/bash
# Search for tests related to IndiceCanVectorize and vectorization
echo "=== Searching for vectorization tests ==="
fd -e cc -e cpp -e py | xargs rg -l "IndiceCanVectorize|VectorizeLoop" | head -10

echo -e "\n=== Checking test patterns for contiguity/alignment ==="
rg -A5 -B5 "contiguous|aligned" --type cpp --type py | head -20

Length of output: 1758


Restore multi-phase contiguity logic in src/transform/loop_vectorize.cc
The current code sets

auto ph_v = target_vectorized_size;
auto group = target_vectorized_size / ph_v;  // == 1

so the contiguity check

CanProveEqual(FloorMod(access_pos, ph_v), FloorDiv(v0, group))

degenerates and no longer iterates over smaller physical vector widths as intended (see commented-out for on lines 243–246). Reinstate the loop over ph_v (halving each iteration) and recalculate group inside it. Add unit tests for IndiceCanVectorize covering both contiguous and aligned cases.
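The degeneration is plain integer arithmetic, shown here in Python (the widths are example values):

```python
target = 8

# Current code: a single phase, so group is always 1 and only the
# full-width check is ever attempted.
ph_v = target
group = target // ph_v
print(ph_v, group)  # -> 8 1

# Restored halving loop: every candidate physical width and its group,
# as the commented-out `for` intended.
phases = []
ph_v = target
while ph_v > 1:
    phases.append((ph_v, target // ph_v))
    ph_v //= 2
print(phases)  # -> [(8, 1), (4, 2), (2, 4)]
```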

🤖 Prompt for AI Agents
In src/transform/loop_vectorize.cc around lines 240 to 258, the temporary
removal of the multi-phase contiguity logic made ph_v always equal to
target_vectorized_size so the contiguity check degenerates; restore the original
loop that iterates ph_v = target_vectorized_size; ph_v > 1; ph_v /= 2 (halving
each iteration), move the calculation of group = target_vectorized_size / ph_v
inside that loop, perform the CanProveEqual contiguity and alignment checks for
each ph_v and return true if any phase passes, otherwise return false after the
loop; then add unit tests for IndiceCanVectorize that exercise both contiguous
and aligned cases (including smaller physical widths) to validate the restored
behavior.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/transform/loop_vectorize.cc (2)

67-79: Fix invalid return statements in void overrides (compilation blocker).

These overrides return a value from functions with void return type, which is ill-formed C++. Replace the statements to call the base implementation and then return;.

Apply this diff:

--- a/src/transform/loop_vectorize.cc
+++ b/src/transform/loop_vectorize.cc
@@
-        return arith::IRVisitorWithAnalyzer::VisitExpr_(node);
+        arith::IRVisitorWithAnalyzer::VisitExpr_(node);
+        return;
@@
-    UpdateVectorSize(node->indices, node->buffer);
-    return arith::IRVisitorWithAnalyzer::VisitExpr(node->value);
+    UpdateVectorSize(node->indices, node->buffer);
+    arith::IRVisitorWithAnalyzer::VisitStmt_(node);
+    return;
@@
-    CheckConditionVectorized(node->condition);
-    return arith::IRVisitorWithAnalyzer::VisitStmt_(node);
+    CheckConditionVectorized(node->condition);
+    arith::IRVisitorWithAnalyzer::VisitStmt_(node);
+    return;
@@
-    return arith::IRVisitorWithAnalyzer::VisitExpr_(node);
+    arith::IRVisitorWithAnalyzer::VisitExpr_(node);
+    return;

Also applies to: 82-88, 90-104


221-255: Treat Broadcast as vectorizable (enables true broadcasted loads).

If expr_vectorized is a Broadcast, the current code returns false (since lanes() > 1 and it’s not a Ramp). Accept broadcasts explicitly to support patterns like B[i,j] = c[i] + A[i,j].

-  auto ramp_node = expr_vectorized.as<RampNode>();
-  if (!ramp_node) {
-    // Broadcast value
-    if (expr_vectorized.dtype().lanes() == 1)
-      return true;
-    else
-      return false;
-  } else {
-    return is_one(ramp_node->stride);
-  }
+  if (const auto* ramp = expr_vectorized.as<RampNode>()) {
+    return is_one(ramp->stride);
+  }
+  if (expr_vectorized.as<BroadcastNode>()) {
+    // Element is uniform across lanes -> OK for vectorization
+    return true;
+  }
+  // Scalar expression that ignored v0 is also fine.
+  return expr_vectorized.dtype().lanes() == 1;
♻️ Duplicate comments (1)
src/transform/loop_vectorize.cc (1)

197-212: Nice: concise independence check using UsesVar + equality.

This addresses prior feedback to use TVM’s utility and keeps the analyzer logic simple.

🧹 Nitpick comments (4)
src/transform/loop_vectorize.cc (4)

38-39: Use angle brackets for TVM headers to match project style.

Other TVM includes use <...>. Switch these two to angle brackets for consistency and to avoid accidental include-path issues.

-#include "tvm/tir/analysis.h"
-#include "tvm/tir/var.h"
+#include <tvm/tir/analysis.h>
+#include <tvm/tir/var.h>

131-135: Guard against “no-dependent-access” cases to avoid over-vectorization.

Early-returning when the access is independent of the loop var can leave vector_size_ at its initial value (128) if all accesses are independent, causing the rewriter to fail the extent % vector_size_ == 0 check.

Consider clamping at the end of Plan(...):

 int Plan(const For &node) {
   this->operator()(node);
-  return vector_size_;
+  if (inner_for_) {
+    if (const auto* imm = inner_for_->extent.as<IntImmNode>()) {
+      vector_size_ = arith::ZeroAwareGCD(vector_size_, static_cast<int>(imm->value));
+    } else {
+      vector_size_ = 1;
+    }
+  }
+  return vector_size_;
 }

Please run your test suite on a loop like for j in 0..N: B[i,j] = c[i] (no j-dependent loads) to confirm this avoids assertion failures.


142-145: Safety: add a floor for the halving loop.

While IndiceCanVectorize(..., 1, ...) returns true, add a defensive check to prevent accidental zero:

-    while (!IndiceCanVectorize(elem_offset, inner_for_->loop_var,
-                               inner_for_->extent, vector_size_, &analyzer_)) {
-      vector_size_ /= 2;
-    }
+    while (vector_size_ > 1 &&
+           !IndiceCanVectorize(elem_offset, inner_for_->loop_var,
+                               inner_for_->extent, vector_size_, &analyzer_)) {
+      vector_size_ /= 2;
+    }

259-265: Guard user-provided hints; don’t crash on non-divisible extents.

If callers pass vectorize_hint that doesn’t divide the loop extent, the rewriter ICHECKs. Clamp here to a valid factor.

   if (vectorize_hint <= 0) {
     VectorizePlanner planner;
     vectorize_hint = planner.Plan(loop);
   }
-  if (vectorize_hint == 1)
+  if (vectorize_hint == 1)
     return loop;
-  auto rewriter = VectorizeRewriter(vectorize_hint);
+  if (const auto* imm = loop->extent.as<IntImmNode>()) {
+    vectorize_hint = arith::ZeroAwareGCD(vectorize_hint, static_cast<int>(imm->value));
+  } else {
+    vectorize_hint = 1;
+  }
+  if (vectorize_hint == 1) return loop;
+  auto rewriter = VectorizeRewriter(vectorize_hint);
   return Downcast<For>(rewriter(loop));
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1fb176a and c08b815.

📒 Files selected for processing (1)
  • src/transform/loop_vectorize.cc (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/loop_vectorize.cc (2)
src/transform/atomicadd_vectorize.cc (2)
  • indices (83-131)
  • indices (83-83)
src/transform/loop_vectorize_dynamic.cc (2)
  • indices (141-194)
  • indices (141-141)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-nvidia
  • GitHub Check: bot-task
🔇 Additional comments (1)
src/transform/loop_vectorize.cc (1)

116-130: Good: stride-derived element offset computation.

Building elem_offset from explicit/derived strides is correct and makes the contiguity test robust across packed and strided layouts.

Comment on lines +172 to 185
if (extent == vector_size_) {
  fnode.CopyOnWrite()->kind = ForKind::kVectorized;
  return fnode;
} else {
  Var inner_var = Var("vec");
  Var outer_var = Var(old_var->name_hint);
  Map<Var, PrimExpr> vmap;
  vmap.Set(fnode->loop_var, outer_var * vector_size_ + inner_var);
  Stmt body = Substitute(fnode->body, vmap);
  body = For(inner_var, 0, vector_size_, ForKind::kVectorized, body);
  body = For(outer_var, 0, extent / vector_size_, fnode->kind, body,
             fnode->thread_binding, fnode->annotations, fnode->span);
  return body;
}
Contributor


⚠️ Potential issue

Preserve loop var dtype and extents when splitting; current code may mis-type on non-i32 loops.

Var("vec") and Var(old_var->name_hint) default to i32. If old_var is not i32 (e.g., i64 target), this breaks type checking. Also ensure extent literals match the loop var dtype.

-      if (extent == vector_size_) {
+      if (extent == vector_size_) {
         fnode.CopyOnWrite()->kind = ForKind::kVectorized;
         return fnode;
       } else {
-        Var inner_var = Var("vec");
-        Var outer_var = Var(old_var->name_hint);
+        DataType it = old_var->dtype;
+        Var inner_var = Var("vec", it);
+        Var outer_var = Var(old_var->name_hint, it);
         Map<Var, PrimExpr> vmap;
         vmap.Set(fnode->loop_var, outer_var * vector_size_ + inner_var);
         Stmt body = Substitute(fnode->body, vmap);
-        body = For(inner_var, 0, vector_size_, ForKind::kVectorized, body);
-        body = For(outer_var, 0, extent / vector_size_, fnode->kind, body,
+        PrimExpr inner_extent = IntImm(it, vector_size_);
+        PrimExpr outer_extent = IntImm(it, extent / vector_size_);
+        body = For(inner_var, IntImm(it, 0), inner_extent, ForKind::kVectorized, body);
+        body = For(outer_var, IntImm(it, 0), outer_extent, fnode->kind, body,
                    fnode->thread_binding, fnode->annotations, fnode->span);
         return body;
       }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (extent == vector_size_) {
  fnode.CopyOnWrite()->kind = ForKind::kVectorized;
  return fnode;
} else {
  Var inner_var = Var("vec");
  Var outer_var = Var(old_var->name_hint);
  Map<Var, PrimExpr> vmap;
  vmap.Set(fnode->loop_var, outer_var * vector_size_ + inner_var);
  Stmt body = Substitute(fnode->body, vmap);
  body = For(inner_var, 0, vector_size_, ForKind::kVectorized, body);
  body = For(outer_var, 0, extent / vector_size_, fnode->kind, body,
             fnode->thread_binding, fnode->annotations, fnode->span);
  return body;
}
if (extent == vector_size_) {
  fnode.CopyOnWrite()->kind = ForKind::kVectorized;
  return fnode;
} else {
  DataType it = old_var->dtype;
  Var inner_var = Var("vec", it);
  Var outer_var = Var(old_var->name_hint, it);
  Map<Var, PrimExpr> vmap;
  vmap.Set(fnode->loop_var, outer_var * vector_size_ + inner_var);
  Stmt body = Substitute(fnode->body, vmap);
  PrimExpr inner_extent = IntImm(it, vector_size_);
  PrimExpr outer_extent = IntImm(it, extent / vector_size_);
  body = For(inner_var, IntImm(it, 0), inner_extent, ForKind::kVectorized, body);
  body = For(outer_var, IntImm(it, 0), outer_extent, fnode->kind, body,
             fnode->thread_binding, fnode->annotations, fnode->span);
  return body;
}
🤖 Prompt for AI Agents
In src/transform/loop_vectorize.cc around lines 172-185, the split loop uses
Var("vec") and Var(old_var->name_hint) which default to i32 and can break type
checking for non-i32 loop vars; fix by creating inner_var and outer_var with the
same dtype as old_var (e.g. Var("vec", old_var->dtype) and
Var(old_var->name_hint, old_var->dtype)), and ensure any integer
extents/constants and arithmetic use casts to old_var->dtype (use
Cast(old_var->dtype, vector_size_) and Cast(old_var->dtype, extent /
vector_size_) or cast each literal/term as needed) so the substituted index
expression and the For extent PrimExprs all have matching dtype.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/transform/loop_vectorize.cc (1)

86-92: Fix invalid return in VisitStmt_ for BufferStore (build blocker).

VisitStmt_ returns void; returning the result of VisitExpr is a compile error and also skips visiting siblings properly.

   void VisitStmt_(const BufferStoreNode *node) final {
     if (node->buffer.scope() == "shared" || node->buffer.scope() == "global" ||
         node->buffer.scope() == "shared.dyn")
       has_nonlocal_memory_access_ = true;
     UpdateVectorSize(node->indices, node->buffer);
-    return arith::IRVisitorWithAnalyzer::VisitExpr(node->value);
+    arith::IRVisitorWithAnalyzer::VisitExpr(node->value);
   }
♻️ Duplicate comments (1)
src/transform/loop_vectorize.cc (1)

171-183: Preserve loop var dtype and extents when splitting (non‑i32 loops will mis-type).

Inner/outer vars and integer literals default to i32; this breaks when the loop var is i64 (or other). Cast/construct with the loop var dtype.

-      if (extent == vector_size_) {
+      if (extent == vector_size_) {
         fnode.CopyOnWrite()->kind = ForKind::kVectorized;
         return fnode;
       } else {
-        Var inner_var = Var("vec");
-        Var outer_var = Var(old_var->name_hint);
+        DataType it = old_var->dtype;
+        Var inner_var = Var("vec", it);
+        Var outer_var = Var(old_var->name_hint, it);
         Map<Var, PrimExpr> vmap;
-        vmap.Set(fnode->loop_var, outer_var * vector_size_ + inner_var);
+        PrimExpr vsize = IntImm(it, vector_size_);
+        vmap.Set(fnode->loop_var, outer_var * vsize + inner_var);
         Stmt body = Substitute(fnode->body, vmap);
-        body = For(inner_var, 0, vector_size_, ForKind::kVectorized, body);
-        body = For(outer_var, 0, extent / vector_size_, fnode->kind, body,
+        body = For(inner_var, IntImm(it, 0), vsize, ForKind::kVectorized, body);
+        PrimExpr outer_extent = IntImm(it, extent / vector_size_);
+        body = For(outer_var, IntImm(it, 0), outer_extent, fnode->kind, body,
                    fnode->thread_binding, fnode->annotations, fnode->span);
         return body;
       }
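For reference, the split above preserves the original iteration order; a minimal Python model of the index mapping (illustrative only, not the TIR rewrite itself):

```python
def split_iteration_space(extent, vector_size):
    """Model of the loop split: i = outer * vector_size + inner,
    with the inner loop of length vector_size marked vectorized."""
    assert extent % vector_size == 0, "planner guarantees divisibility"
    return [outer * vector_size + inner
            for outer in range(extent // vector_size)
            for inner in range(vector_size)]
```

`split_iteration_space(8, 4)` enumerates 0..7 in order, confirming the `vmap` substitution is a bijection on the iteration space.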
🧹 Nitpick comments (3)
src/transform/loop_vectorize.cc (3)

147-152: Unused state: has_nonlocal_memory_access_.

The field is written but never read; either use it to cap vector width or remove it.


196-211: Independence check: good call to UsesVar; consider short-circuiting on analyzer facts.

If range info proves var has a single value (extent == 1), CanProveIndependent could immediately return true to skip substitution.
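To illustrate the intent, here is a hypothetical plain-Python sketch (not TVM code) of what the check decides; note that the real pass proves equality symbolically with `arith::Analyzer`, whereas sampling, as below, can only refute independence:

```python
def looks_independent(expr, var, probe_values, env):
    """Heuristic model of CanProveIndependent: substitute several values
    for `var` and check that the expression's value does not change.
    (Sampling can refute independence but never prove it; the compiler
    uses a symbolic proof instead.)"""
    results = {expr({**env, var: v}) for v in probe_values}
    return len(results) == 1

# Broadcast access c[i] inside a loop over j: index independent of j.
flat_index_ci = lambda env: env["i"] * 4
# Unit-stride access A[i, j]: index depends on j.
flat_index_aj = lambda env: env["i"] * 4 + env["j"]
```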


213-254: IndiceCanVectorize: alignment/contiguity checks look correct for broadcasts and unit stride. Add tests for the target bug.

This should fix B[i,j] = c[i] + A[i,j] by allowing broadcasted c[i]. Add unit cases for:

  • Broadcast along innermost axis.
  • Stride-1 ramp.
  • Misaligned base offset (should shrink factor).

I can draft minimal TIR tests that construct these patterns and assert vector factor selection. Want me to add them?
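A sketch of the legality condition those cases exercise (hypothetical Python mirroring the stride/offset reasoning, not the actual IndiceCanVectorize implementation):

```python
import math

def max_vector_size(stride, offset, extent, vectorize_hint=4):
    """Largest legal lane count for one index expression:
    - stride == 0: broadcast, every lane reads the same element;
    - stride == 1: contiguous, but the base offset must be lane-aligned.
    Shrink by powers of two until the access fits."""
    size = math.gcd(extent, vectorize_hint)
    while size > 1:
        if stride == 0:
            return size  # broadcast along the lane axis
        if stride == 1 and offset % size == 0:
            return size  # aligned unit-stride ramp
        size //= 2
    return 1
```

`max_vector_size(0, 7, 8)` keeps 4 lanes for a broadcast, while `max_vector_size(1, 2, 8)` shrinks to 2 because offset 2 is not 4-aligned.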

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c08b815 and d8ec462.

📒 Files selected for processing (1)
  • src/transform/loop_vectorize.cc (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/loop_vectorize.cc (1)
src/transform/loop_vectorize_dynamic.cc (18)
  • node (80-85)
  • node (80-80)
  • node (92-96)
  • node (92-92)
  • node (98-112)
  • node (98-98)
  • node (114-120)
  • node (114-114)
  • node (122-125)
  • node (122-122)
  • node (127-135)
  • node (127-127)
  • node (263-266)
  • node (263-263)
  • node (280-283)
  • node (280-280)
  • indices (141-194)
  • indices (141-141)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-test-amd
🔇 Additional comments (2)
src/transform/loop_vectorize.cc (2)

256-265: Planner + Rewriter integration LGTM, pending fixes above.

Uses planner when no hint and rewrites innermost loop accordingly.


231-239: Scope analyzer bindings and align dtype of induction vars.

  • Make v0/v1 carry the same dtype as var and bind with typed bounds; suggested replacement:
-  Var v0("v0"), v1("v1");
-  analyzer->Bind(v0, Range(0, target_vectorized_size));
-  analyzer->Bind(v1, Range(0, analyzer->Simplify(FloorDiv(
-                                  iter_var_size, target_vectorized_size))));
+  Var v0("v0", var.dtype()), v1("v1", var.dtype());
+  analyzer->Bind(v0, Range(IntImm(var.dtype(), 0),
+                           IntImm(var.dtype(), target_vectorized_size)));
+  analyzer->Bind(v1, Range(IntImm(var.dtype(), 0),
+                           analyzer->Simplify(FloorDiv(iter_var_size, target_vectorized_size))));
+  // Consider unbinding after use if available:
+  // analyzer->Remove(v0); analyzer->Remove(v1);
  • Repository search shows no Analyzer::Remove or scoped RAII helper; verify whether an unbind API exists and use it (or add scoping) to avoid leaking analyzer bindings.

Location: src/transform/loop_vectorize.cc (~lines 231–239) and src/transform/loop_vectorize_dynamic.cc (~lines 44–50).

Comment on lines +59 to +67
auto extent_ptr = as_const_int(node->extent);
// Here I disable dynamic shape completely,
// In order to do it, the Planner should accept an analyzer with
// arithmetic info outside to prove the dividiblity of vector size
if (!extent_ptr) {
vector_size_ = 1;
return;
}
vector_size_ = arith::ZeroAwareGCD(vector_size_, *extent_ptr);

⚠️ Potential issue

Don’t blanket-disable vectorization on dynamic extents; this regresses prior behavior.

Early-returning with vector_size_ = 1 for non-constant extents disables vectorization on dynamic loops entirely. At minimum, keep the optimistic upper bound and let UpdateVectorSize shrink it, or gate this behind a flag.

-    auto extent_ptr = as_const_int(node->extent);
-    // Here I disable dynamic shape completely,
-    //   In order to do it, the Planner should accept an analyzer with
-    //   arithmetic info outside to prove the dividiblity of vector size
-    if (!extent_ptr) {
-      vector_size_ = 1;
-      return;
-    }
-    vector_size_ = arith::ZeroAwareGCD(vector_size_, *extent_ptr);
+    if (const int64_t* extent_ptr = as_const_int(node->extent)) {
+      vector_size_ = arith::ZeroAwareGCD(vector_size_, *extent_ptr);
+    }  // else keep current vector_size_; UpdateVectorSize will shrink as needed
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto extent_ptr = as_const_int(node->extent);
// Here I disable dynamic shape completely,
// In order to do it, the Planner should accept an analyzer with
// arithmetic info outside to prove the dividiblity of vector size
if (!extent_ptr) {
vector_size_ = 1;
return;
}
vector_size_ = arith::ZeroAwareGCD(vector_size_, *extent_ptr);
if (const int64_t* extent_ptr = as_const_int(node->extent)) {
vector_size_ = arith::ZeroAwareGCD(vector_size_, *extent_ptr);
} // else keep current vector_size_; UpdateVectorSize will shrink as needed
🤖 Prompt for AI Agents
In src/transform/loop_vectorize.cc around lines 59 to 67, the code currently
sets vector_size_ = 1 and returns when the extent is not a constant, which
disables vectorization for dynamic extents. Instead, preserve the existing
optimistic upper bound: remove the early return and do not force vector_size_
to 1 when extent_ptr is null, letting UpdateVectorSize (or subsequent
analysis) shrink vector_size_ as needed, or optionally gate the conservative
fallback behind a runtime/config flag. Update the logic so that when
extent_ptr is null you skip the ZeroAwareGCD step but do not reset
vector_size_, and add a comment describing the conservative vs. optimistic
behavior (and the optional flag, if implemented).

Comment on lines +117 to +126
auto strides = buffer->strides;
if (buffer->strides.empty()) {
PrimExpr stride = 1;
for (int i = indices.size() - 1; i >= 0; --i) {
strides.push_back(stride);
stride = stride * buffer->shape[i];
}
strides = Array<PrimExpr>{strides.rbegin(), strides.rend()};
}
PrimExpr elem_offset = 0;

⚠️ Potential issue

Reverse-stride construction is broken; Array doesn’t support rbegin/rend (build blocker).

Array<PrimExpr>{strides.rbegin(), strides.rend()} won’t compile for TVM’s Array. Build the reversed array explicitly.

-    auto strides = buffer->strides;
-    if (buffer->strides.empty()) {
-      PrimExpr stride = 1;
-      for (int i = indices.size() - 1; i >= 0; --i) {
-        strides.push_back(stride);
-        stride = stride * buffer->shape[i];
-      }
-      strides = Array<PrimExpr>{strides.rbegin(), strides.rend()};
-    }
+    Array<PrimExpr> strides = buffer->strides;
+    if (strides.empty()) {
+      // Build row-major strides then reverse to match index order.
+      Array<PrimExpr> tmp;
+      PrimExpr stride = 1;
+      for (int i = static_cast<int>(indices.size()) - 1; i >= 0; --i) {
+        tmp.push_back(stride);
+        stride = stride * buffer->shape[i];
+      }
+      strides = Array<PrimExpr>();
+      for (int i = static_cast<int>(tmp.size()) - 1; i >= 0; --i) {
+        strides.push_back(tmp[i]);
+      }
+    }

Also applies to: 127-129

🤖 Prompt for AI Agents
In src/transform/loop_vectorize.cc around lines 117–126 (and similarly
127–129), the code attempts to construct a TVM Array using
strides.rbegin()/rend(), which tvm::Array does not support. Replace the
rbegin/rend construction with an explicit reverse build: after filling the
temporary strides container, push its elements into a new Array<PrimExpr> in
reverse order (or copy them into a std::vector<PrimExpr>, std::reverse it,
and construct the Array from its begin/end), then assign that Array<PrimExpr>
to strides. Apply the same explicit reverse construction to the other
occurrence at lines 127–129.
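The explicit reverse build computes standard row-major strides; a quick Python equivalent of the intended result (illustrative, not TVM code):

```python
def row_major_strides(shape):
    """Innermost dimension has stride 1; each outer stride is the
    product of all inner extents (counted in elements, not bytes)."""
    strides = []
    stride = 1
    for extent in reversed(shape):
        strides.append(stride)
        stride *= extent
    return list(reversed(strides))
```

For example, `row_major_strides([2, 3, 4])` yields `[12, 4, 1]`, matching the loop in the suggested fix.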

@yyttt6
Copy link
Contributor

yyttt6 commented Sep 24, 2025

Thank you for submitting this PR! We have performed some initial testing to evaluate the performance changes.
As shown in the results below, we observed slight performance improvements in the gemm, gemv, conv, and mha operations.
(benchmark results image)
The following are the specific commands I executed:
python example_gemm_autotune.py
python example_gemv.py
python example_convolution_autotune.py
python example_mha_sink_fwd_bhsd.py --batch 4 --heads 8 --seq_q 512 --seq_kv 512 --dim 64 --window_size 256 --tune

@LeiWang1999
Copy link
Member

Awesome, let it in.

@LeiWang1999 LeiWang1999 merged commit 2d4b848 into tile-ai:main Sep 24, 2025
10 of 16 checks passed