[Fix] Fix bug 0905: tilelang doesn't vectorize B[i,j] = c[i] + A[i,j] (#798)
Conversation
Walkthrough

Removes dynamic/conditional vectorization paths and iter_map usage; introduces stride/offset-based vector-size derivation with independence/contiguity checks (CanProveIndependent, IndiceCanVectorize). Simplifies VectorizeRewriter to a fixed vector_size and updates VectorizeLoop to use the planner when no hint is provided.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Caller as Pass/Caller
    participant VL as VectorizeLoop
    participant Planner as VectorizePlanner
    participant Analyzer as arith::Analyzer
    participant Rewriter as VectorizeRewriter
    Caller->>VL: Invoke(loop, vector_size_hint?)
    alt hint <= 0
        VL->>Planner: Analyze loop (compute strides/elem_offset)
        Planner->>Analyzer: Request independence/stride facts
        Analyzer-->>Planner: Facts
        Planner-->>VL: chosen_vector_size
    else hint > 0
        VL-->>VL: use provided vector_size
    end
    VL->>Rewriter: Construct(vector_size)
    Rewriter->>Analyzer: Call CanProveIndependent / IndiceCanVectorize
    alt contiguity & alignment OK
        Rewriter-->>VL: produce vectorized IR (kVectorized or nested vector/outer)
    else fails
        Rewriter-->>VL: shrink vector_size and retry or fallback
    end
    VL-->>Caller: Transformed IR
```
Summary of Changes
Hello @kurisu6912, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refactors the loop vectorization pass, primarily to fix a bug concerning broadcasted values and to streamline the overall vectorization planning and rewriting process. The changes simplify the internal state management of the vectorization planner and introduce a new utility to improve the analysis of memory access patterns for vectorization.
Highlights
- Bug Fix: Addresses bug 0905 related to vectorization with broadcasted values.
- Code Refactoring: Simplifies the loop vectorization logic by removing unnecessary dynamic shape handling and related state.
- Improved Vectorization Analysis: Enhances the IndiceCanVectorize function with a more robust check for contiguous and aligned memory access.
- New Utility Function: Introduces CanProveIndependent to determine if an expression's value is independent of a given variable.
Code Review
This pull request refactors the loop vectorization logic, primarily to fix an issue with handling broadcasted values. The changes simplify the code by removing the complex and possibly defunct logic for dynamic/conditional vectorization and introducing a cleaner way to detect broadcasts. The new implementation is more readable and maintainable. I have one suggestion to further improve the code by using a standard TVM utility function, which will make it more concise and idiomatic.
src/transform/loop_vectorize.cc
Outdated
```cpp
struct FindVarVisitor : ExprVisitor {
  Var target;
  bool found = false;
  FindVarVisitor(Var target) : target(std::move(target)) {}
  void run(const PrimExpr &expr) { this->VisitExpr(expr); }
  void VisitExpr_(const VarNode *node) final {
    if (node == target.get()) {
      found = true;
    }
  }
};
FindVarVisitor visitor(var);
visitor.run(expr);
if (!visitor.found)
  return true;
```
The FindVarVisitor struct and its usage can be simplified by using the tvm::tir::UsesVar utility function. This makes the code more concise and idiomatic to the TVM codebase.

You would need to add the following include at the top of the file:

```cpp
#include <tvm/tir/analysis.h>
```

```cpp
if (!tvm::tir::UsesVar(expr, [&var](const Var& v) { return v.same_as(var); })) {
  return true;
}
```
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/transform/loop_vectorize.cc (5 hunks)
- src/transform/loop_vectorize.h (1 hunk)
🔇 Additional comments (10)

src/transform/loop_vectorize.h (1)

- Lines 40-43: LGTM! Clear and comprehensive documentation. The function declaration is well documented, with clear parameter names and an explanation of its purpose: determining whether an expression is independent of a variable.

src/transform/loop_vectorize.cc (9)
- Lines 114-127: LGTM! Improved stride computation with proper fallback. The stride computation correctly handles both cases: when buffer strides are explicitly provided and when they must be computed from the buffer shape. The element-offset calculation as a dot product of indices and strides is mathematically sound.
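The fallback stride derivation and the dot-product element offset can be illustrated with a small plain-Python sketch (the real code builds TVM PrimExpr arithmetic; the function names here are just for illustration):

```python
def row_major_strides(shape):
    """Derive strides for a buffer with no explicit strides: the last
    axis is contiguous, and each earlier stride is the product of the
    extents of the faster-varying axes."""
    strides = []
    stride = 1
    for extent in reversed(shape):
        strides.append(stride)
        stride *= extent
    return list(reversed(strides))

def elem_offset(indices, strides):
    """Flat element offset as the dot product of indices and strides."""
    return sum(i * s for i, s in zip(indices, strides))

# A (2, 3, 4) buffer has row-major strides (12, 4, 1).
strides = row_major_strides([2, 3, 4])
print(strides)                          # [12, 4, 1]
print(elem_offset([1, 2, 3], strides))  # 12 + 8 + 3 = 23
```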
- Lines 129-133: LGTM! Independence check prevents unnecessary vectorization. The early return when the element offset is proven independent of the loop variable is a good optimization: it avoids attempting vectorization on accesses that would not benefit from it.
- Lines 135-137: LGTM! Simplified vectorization bound computation. The tight vectorization bound is cleaner and more straightforward than the previous GCD-based approach, while still ensuring proper alignment with hardware vector capabilities.
- Lines 139-144: LGTM! Iterative vectorization size reduction. The while loop that progressively halves the vector size until IndiceCanVectorize succeeds is a robust way to find the largest viable vectorization factor.
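The halving search can be sketched in plain Python; here a placeholder predicate stands in for the real IndiceCanVectorize check over TVM expressions:

```python
def largest_vector_size(initial, can_vectorize):
    """Halve the candidate vector size until the predicate accepts it.
    Terminates because every access is trivially vectorizable at size 1."""
    size = initial
    while size > 1 and not can_vectorize(size):
        size //= 2
    return size

# Example predicate: the access is only contiguous/aligned up to 4 lanes,
# so starting from 16 the search settles at 4.
print(largest_vector_size(16, lambda s: s <= 4))  # 4
```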
- Line 155: LGTM! Simplified constructor signature. The constructor now takes a plain int vector_size parameter instead of a complex plan result, which aligns with the simplified vectorization approach.
- Lines 170-183: LGTM! Clear loop transformation logic. The transformation handles both cases appropriately: when the extent equals the vector size, the loop is simply marked as vectorized; otherwise it is split into an outer scalar loop and an inner vectorized loop with proper variable substitution.
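The split preserves the original iteration order through the index mapping i = outer * vector_size + inner; a minimal Python sketch of that mapping:

```python
def split_iteration_order(extent, vector_size):
    """Enumerate (outer, inner) pairs and recover the original loop index.
    Assumes extent is divisible by vector_size, as the rewriter requires."""
    assert extent % vector_size == 0
    order = []
    for outer in range(extent // vector_size):
        for inner in range(vector_size):  # this inner loop becomes kVectorized
            order.append(outer * vector_size + inner)
    return order

# Splitting an extent-8 loop with vector_size 4 still visits 0..7 in order.
print(split_iteration_order(8, 4))  # [0, 1, 2, 3, 4, 5, 6, 7]
```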
- Lines 195-220: LGTM! Robust independence analysis implementation. The CanProveIndependent function implements a two-step approach: first check whether the variable occurs in the expression at all (via a visitor pattern), then use substitution and equality proving to establish mathematical independence. This matches the declaration in the header file and provides a solid foundation for the independence checks.
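The substitution idea can be illustrated in Python. Note this toy sketch merely samples the expression at a few values of the variable, whereas the real implementation calls analyzer->CanProveEqual on symbolically substituted expressions, which is an actual proof rather than a heuristic:

```python
def looks_independent(expr, samples=range(8)):
    """Heuristic stand-in for CanProveIndependent: evaluate the expression
    at several values of the variable and see whether the result changes.
    A real analyzer proves equality symbolically instead of sampling."""
    values = {expr(v) for v in samples}
    return len(values) == 1

# (i // 4) * 0 + 7 ignores i entirely; i % 4 does not.
print(looks_independent(lambda i: (i // 4) * 0 + 7))  # True
print(looks_independent(lambda i: i % 4))             # False
```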
- Lines 262-264: LGTM! Clean planner integration. VectorizeLoop now properly uses the VectorizePlanner when no hint is provided, creating a clear separation between the planning and rewriting phases.
- Lines 267-268: LGTM! Simplified rewriter instantiation. The rewriter is now instantiated with just the vector size, consistent with the simplified approach that removed the dynamic/conditional vectorization paths.
src/transform/loop_vectorize.cc
Outdated
```cpp
PrimExpr access_pos = analyzer->Simplify(
    Substitute(expr, {{var, v0 + v1 * target_vectorized_size}}));
Vectorizer vectorizer(v0, IntImm(v0->dtype, target_vectorized_size));
PrimExpr expr_vectorized = vectorizer.VisitExpr(expr_transformed);

// This simplify is necessary for thread region specified
// optimizations.
expr_vectorized = analyzer->Simplify(expr_vectorized);
auto ramp_node = expr_vectorized.as<RampNode>();
if (!ramp_node) {
  // Broadcast value
  if (expr_vectorized.dtype().lanes() == 1)
    return true;
  else
    return false;
} else {
  return is_one(ramp_node->stride);
  // for (int ph_v = target_vectorized_size; ph_v > 1; ph_v /= 2) {
  // ph_v: physical load/store vectorized size
  // TODO: allow a more generalized vectorize: B[i] = A[i // 2]
  auto ph_v = target_vectorized_size;
  auto group = target_vectorized_size / ph_v;
  // Check if access_pos is contingentous: ap === v0 // group (mod ph_v)
  auto is_contingous =
      analyzer->CanProveEqual(FloorMod(access_pos, ph_v), FloorDiv(v0, group));
  // Check if access is aligned
  auto is_aligned =
      analyzer->CanProveEqual(FloorMod(Substitute(expr, {{var, 0}}), ph_v), 0);
  if (is_contingous && is_aligned) {
    return true;
  }
  // }
  return false;
}
```
Verify the contiguity check logic.

The contiguity and alignment checks in IndiceCanVectorize have been significantly simplified. The logic checks:

- access_pos is contiguous: FloorMod(access_pos, ph_v) == FloorDiv(v0, group)
- Access is aligned: FloorMod(Substitute(expr, {{var, 0}}), ph_v) == 0

However, there are some concerns:

- Line 246: group = target_vectorized_size / ph_v where ph_v = target_vectorized_size results in group = 1, which may not be the intended behavior
- The commented-out loop suggests this was meant to handle multiple physical vector sizes, but now only handles one case
Restore multi-phase contiguity logic in src/transform/loop_vectorize.cc

The current code sets

```cpp
auto ph_v = target_vectorized_size;
auto group = target_vectorized_size / ph_v;  // == 1
```

so the contiguity check

```cpp
CanProveEqual(FloorMod(access_pos, ph_v), FloorDiv(v0, group))
```

degenerates and no longer iterates over smaller physical vector widths as intended (see the commented-out for loop on lines 243-246). Reinstate the loop over ph_v (halving each iteration) and recalculate group inside it. Add unit tests for IndiceCanVectorize covering both contiguous and aligned cases.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/transform/loop_vectorize.cc (2)
Lines 67-79: Fix invalid return statements in void overrides (compilation blocker).

These overrides return a value from functions with a void return type, which is ill-formed C++. Replace the statements to call the base implementation and then return;.

Apply this diff:

```diff
--- a/src/transform/loop_vectorize.cc
+++ b/src/transform/loop_vectorize.cc
@@
-    return arith::IRVisitorWithAnalyzer::VisitExpr_(node);
+    arith::IRVisitorWithAnalyzer::VisitExpr_(node);
+    return;
@@
-    UpdateVectorSize(node->indices, node->buffer);
-    return arith::IRVisitorWithAnalyzer::VisitExpr(node->value);
+    UpdateVectorSize(node->indices, node->buffer);
+    arith::IRVisitorWithAnalyzer::VisitStmt_(node);
+    return;
@@
-    CheckConditionVectorized(node->condition);
-    return arith::IRVisitorWithAnalyzer::VisitStmt_(node);
+    CheckConditionVectorized(node->condition);
+    arith::IRVisitorWithAnalyzer::VisitStmt_(node);
+    return;
@@
-    return arith::IRVisitorWithAnalyzer::VisitExpr_(node);
+    arith::IRVisitorWithAnalyzer::VisitExpr_(node);
+    return;
```

Also applies to: 82-88, 90-104
Lines 221-255: Treat Broadcast as vectorizable (enables true broadcasted loads).

If expr_vectorized is a Broadcast, the current code returns false (since lanes() > 1 and it's not a Ramp). Accept broadcasts explicitly to support patterns like B[i,j] = c[i] + A[i,j].

```diff
- auto ramp_node = expr_vectorized.as<RampNode>();
- if (!ramp_node) {
-   // Broadcast value
-   if (expr_vectorized.dtype().lanes() == 1)
-     return true;
-   else
-     return false;
- } else {
-   return is_one(ramp_node->stride);
- }
+ if (const auto* ramp = expr_vectorized.as<RampNode>()) {
+   return is_one(ramp->stride);
+ }
+ if (expr_vectorized.as<BroadcastNode>()) {
+   // Element is uniform across lanes -> OK for vectorization
+   return true;
+ }
+ // Scalar expression that ignored v0 is also fine.
+ return expr_vectorized.dtype().lanes() == 1;
```
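The decision table this suggestion encodes — a unit-stride ramp or a broadcast is vectorizable, anything else falls back — can be sketched as a toy Python classifier (the real code pattern-matches RampNode/BroadcastNode on TVM IR; the string tags here are illustrative):

```python
def can_vectorize_access(kind, stride=None, lanes=1):
    """Toy classifier mirroring the suggested logic: accept a 'ramp'
    with unit stride, a 'broadcast' (uniform across lanes, e.g. c[i]),
    or a plain scalar with lanes == 1."""
    if kind == "ramp":
        return stride == 1   # contiguous lanes only
    if kind == "broadcast":
        return True          # the fix: broadcasts are vectorizable
    return lanes == 1        # plain scalar expression

print(can_vectorize_access("ramp", stride=1))  # True
print(can_vectorize_access("broadcast"))       # True
print(can_vectorize_access("ramp", stride=2))  # False
```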
♻️ Duplicate comments (1)

src/transform/loop_vectorize.cc (1)

Lines 197-212: Nice: concise independence check using UsesVar + equality. This addresses prior feedback to use TVM's utility and keeps the analyzer logic simple.
🧹 Nitpick comments (4)

src/transform/loop_vectorize.cc (4)

Lines 38-39: Use angle brackets for TVM headers to match project style. Other TVM includes use <...>. Switch these two to angle brackets for consistency and to avoid accidental include-path issues.

```diff
-#include "tvm/tir/analysis.h"
-#include "tvm/tir/var.h"
+#include <tvm/tir/analysis.h>
+#include <tvm/tir/var.h>
```
Lines 131-135: Guard against "no-dependent-access" cases to avoid over-vectorization. Early-returning when the access is independent of the loop var can leave vector_size_ at its initial value (128) if all accesses are independent, causing the rewriter to fail the extent % vector_size_ == 0 check.

Consider clamping at the end of Plan(...):

```diff
 int Plan(const For &node) {
   this->operator()(node);
-  return vector_size_;
+  if (inner_for_) {
+    if (const auto* imm = inner_for_->extent.as<IntImmNode>()) {
+      vector_size_ = arith::ZeroAwareGCD(vector_size_, static_cast<int>(imm->value));
+    } else {
+      vector_size_ = 1;
+    }
+  }
+  return vector_size_;
 }
```

Please run your test suite on a loop like for j in 0..N: B[i,j] = c[i] (no j-dependent loads) to confirm this avoids assertion failures.
Lines 142-145: Safety: add a floor for the halving loop. While IndiceCanVectorize(..., 1, ...) returns true, add a defensive check to prevent an accidental zero:

```diff
- while (!IndiceCanVectorize(elem_offset, inner_for_->loop_var,
-                            inner_for_->extent, vector_size_, &analyzer_)) {
-   vector_size_ /= 2;
- }
+ while (vector_size_ > 1 &&
+        !IndiceCanVectorize(elem_offset, inner_for_->loop_var,
+                            inner_for_->extent, vector_size_, &analyzer_)) {
+   vector_size_ /= 2;
+ }
```
Lines 259-265: Guard user-provided hints; don't crash on non-divisible extents. If callers pass a vectorize_hint that doesn't divide the loop extent, the rewriter ICHECKs. Clamp here to a valid factor.

```diff
   if (vectorize_hint <= 0) {
     VectorizePlanner planner;
     vectorize_hint = planner.Plan(loop);
   }
-  if (vectorize_hint == 1)
+  if (vectorize_hint == 1)
     return loop;
-  auto rewriter = VectorizeRewriter(vectorize_hint);
+  if (const auto* imm = loop->extent.as<IntImmNode>()) {
+    vectorize_hint = arith::ZeroAwareGCD(vectorize_hint, static_cast<int>(imm->value));
+  } else {
+    vectorize_hint = 1;
+  }
+  if (vectorize_hint == 1) return loop;
+  auto rewriter = VectorizeRewriter(vectorize_hint);
   return Downcast<For>(rewriter(loop));
```
📒 Files selected for processing (1)
- src/transform/loop_vectorize.cc (5 hunks)
🔇 Additional comments (1)

src/transform/loop_vectorize.cc (1)

Lines 116-130: Good: stride-derived element offset computation. Building elem_offset from explicit or derived strides is correct and makes the contiguity test robust across packed and strided layouts.
```cpp
if (extent == vector_size_) {
  fnode.CopyOnWrite()->kind = ForKind::kVectorized;
  return fnode;
} else {
  Var inner_var = Var("vec");
  Var outer_var = Var(old_var->name_hint);
  Map<Var, PrimExpr> vmap;
  vmap.Set(fnode->loop_var, outer_var * vector_size_ + inner_var);
  Stmt body = Substitute(fnode->body, vmap);
  body = For(inner_var, 0, vector_size_, ForKind::kVectorized, body);
  body = For(outer_var, 0, extent / vector_size_, fnode->kind, body,
             fnode->thread_binding, fnode->annotations, fnode->span);
  return body;
}
```
Preserve loop var dtype and extents when splitting; the current code may mis-type on non-i32 loops.

Var("vec") and Var(old_var->name_hint) default to i32. If old_var is not i32 (e.g., an i64 target), this breaks type checking. Also ensure the extent literals match the loop var dtype.

```diff
-  if (extent == vector_size_) {
+  if (extent == vector_size_) {
     fnode.CopyOnWrite()->kind = ForKind::kVectorized;
     return fnode;
   } else {
-    Var inner_var = Var("vec");
-    Var outer_var = Var(old_var->name_hint);
+    DataType it = old_var->dtype;
+    Var inner_var = Var("vec", it);
+    Var outer_var = Var(old_var->name_hint, it);
     Map<Var, PrimExpr> vmap;
     vmap.Set(fnode->loop_var, outer_var * vector_size_ + inner_var);
     Stmt body = Substitute(fnode->body, vmap);
-    body = For(inner_var, 0, vector_size_, ForKind::kVectorized, body);
-    body = For(outer_var, 0, extent / vector_size_, fnode->kind, body,
+    PrimExpr inner_extent = IntImm(it, vector_size_);
+    PrimExpr outer_extent = IntImm(it, extent / vector_size_);
+    body = For(inner_var, IntImm(it, 0), inner_extent, ForKind::kVectorized, body);
+    body = For(outer_var, IntImm(it, 0), outer_extent, fnode->kind, body,
                fnode->thread_binding, fnode->annotations, fnode->span);
     return body;
   }
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```cpp
if (extent == vector_size_) {
  fnode.CopyOnWrite()->kind = ForKind::kVectorized;
  return fnode;
} else {
  DataType it = old_var->dtype;
  Var inner_var = Var("vec", it);
  Var outer_var = Var(old_var->name_hint, it);
  Map<Var, PrimExpr> vmap;
  vmap.Set(fnode->loop_var, outer_var * vector_size_ + inner_var);
  Stmt body = Substitute(fnode->body, vmap);
  PrimExpr inner_extent = IntImm(it, vector_size_);
  PrimExpr outer_extent = IntImm(it, extent / vector_size_);
  body = For(inner_var, IntImm(it, 0), inner_extent, ForKind::kVectorized, body);
  body = For(outer_var, IntImm(it, 0), outer_extent, fnode->kind, body,
             fnode->thread_binding, fnode->annotations, fnode->span);
  return body;
}
```
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)

src/transform/loop_vectorize.cc (1)

Lines 86-92: Fix invalid return in VisitStmt_ for BufferStore (build blocker). VisitStmt_ returns void; returning the result of VisitExpr is a compile error and also skips visiting siblings properly.

```diff
 void VisitStmt_(const BufferStoreNode *node) final {
   if (node->buffer.scope() == "shared" || node->buffer.scope() == "global" ||
       node->buffer.scope() == "shared.dyn")
     has_nonlocal_memory_access_ = true;
   UpdateVectorSize(node->indices, node->buffer);
-  return arith::IRVisitorWithAnalyzer::VisitExpr(node->value);
+  arith::IRVisitorWithAnalyzer::VisitExpr(node->value);
 }
```
♻️ Duplicate comments (1)

src/transform/loop_vectorize.cc (1)

Lines 171-183: Preserve loop var dtype and extents when splitting (non-i32 loops will mis-type). Inner/outer vars and integer literals default to i32; this breaks when the loop var is i64 (or another type). Construct them with the loop var dtype.

```diff
-  if (extent == vector_size_) {
+  if (extent == vector_size_) {
     fnode.CopyOnWrite()->kind = ForKind::kVectorized;
     return fnode;
   } else {
-    Var inner_var = Var("vec");
-    Var outer_var = Var(old_var->name_hint);
+    DataType it = old_var->dtype;
+    Var inner_var = Var("vec", it);
+    Var outer_var = Var(old_var->name_hint, it);
     Map<Var, PrimExpr> vmap;
-    vmap.Set(fnode->loop_var, outer_var * vector_size_ + inner_var);
+    PrimExpr vsize = IntImm(it, vector_size_);
+    vmap.Set(fnode->loop_var, outer_var * vsize + inner_var);
     Stmt body = Substitute(fnode->body, vmap);
-    body = For(inner_var, 0, vector_size_, ForKind::kVectorized, body);
-    body = For(outer_var, 0, extent / vector_size_, fnode->kind, body,
+    body = For(inner_var, IntImm(it, 0), vsize, ForKind::kVectorized, body);
+    PrimExpr outer_extent = IntImm(it, extent / vector_size_);
+    body = For(outer_var, IntImm(it, 0), outer_extent, fnode->kind, body,
                fnode->thread_binding, fnode->annotations, fnode->span);
     return body;
   }
```
🧹 Nitpick comments (3)

src/transform/loop_vectorize.cc (3)

- Lines 147-152: Unused state: has_nonlocal_memory_access_. The field is written but never read; either use it to cap the vector width or remove it.
- Lines 196-211: Independence check: good call to UsesVar; consider short-circuiting on analyzer facts. If range info proves var has a single value (extent == 1), CanProveIndependent could immediately return true and skip the substitution.
- Lines 213-254: IndiceCanVectorize: the alignment/contiguity checks look correct for broadcasts and unit stride. Add tests for the target bug. This should fix B[i,j] = c[i] + A[i,j] by allowing the broadcasted c[i]. Add unit cases for:
  - Broadcast along the innermost axis.
  - Stride-1 ramp.
  - Misaligned base offset (should shrink the factor).

I can draft minimal TIR tests that construct these patterns and assert vector factor selection. Want me to add them?
📒 Files selected for processing (1)
- src/transform/loop_vectorize.cc (6 hunks)
🔇 Additional comments (2)

src/transform/loop_vectorize.cc (2)

Lines 256-265: Planner + Rewriter integration LGTM, pending the fixes above. Uses the planner when no hint is given and rewrites the innermost loop accordingly.

Lines 231-239: Scope analyzer bindings and align the dtype of induction vars. Make v0/v1 carry the same dtype as var and bind them with typed bounds; suggested replacement:

```diff
-    Var v0("v0"), v1("v1");
-    analyzer->Bind(v0, Range(0, target_vectorized_size));
-    analyzer->Bind(v1, Range(0, analyzer->Simplify(FloorDiv(
-                           iter_var_size, target_vectorized_size))));
+    Var v0("v0", var.dtype()), v1("v1", var.dtype());
+    analyzer->Bind(v0, Range(IntImm(var.dtype(), 0),
+                             IntImm(var.dtype(), target_vectorized_size)));
+    analyzer->Bind(v1, Range(IntImm(var.dtype(), 0),
+                             analyzer->Simplify(FloorDiv(iter_var_size, target_vectorized_size))));
+    // Consider unbinding after use if available:
+    // analyzer->Remove(v0); analyzer->Remove(v1);
```

A repository search shows no Analyzer::Remove or scoped RAII helper; verify whether an unbind API exists and use it (or add scoping) to avoid leaking analyzer bindings. Location: src/transform/loop_vectorize.cc (~lines 231-239) and src/transform/loop_vectorize_dynamic.cc (~lines 44-50).
```cpp
auto extent_ptr = as_const_int(node->extent);
// Here I disable dynamic shape completely,
// In order to do it, the Planner should accept an analyzer with
// arithmetic info outside to prove the dividiblity of vector size
if (!extent_ptr) {
  vector_size_ = 1;
  return;
}
vector_size_ = arith::ZeroAwareGCD(vector_size_, *extent_ptr);
```
Don't blanket-disable vectorization on dynamic extents; this regresses prior behavior.

Early-returning with vector_size_ = 1 for non-constant extents disables vectorization on dynamic loops entirely. At minimum, keep the optimistic upper bound and let UpdateVectorSize shrink it, or gate this behind a flag.

```diff
- auto extent_ptr = as_const_int(node->extent);
- // Here I disable dynamic shape completely,
- // In order to do it, the Planner should accept an analyzer with
- // arithmetic info outside to prove the dividiblity of vector size
- if (!extent_ptr) {
-   vector_size_ = 1;
-   return;
- }
- vector_size_ = arith::ZeroAwareGCD(vector_size_, *extent_ptr);
+ if (const int64_t* extent_ptr = as_const_int(node->extent)) {
+   vector_size_ = arith::ZeroAwareGCD(vector_size_, *extent_ptr);
+ } // else keep current vector_size_; UpdateVectorSize will shrink as needed
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Thoroughly test & benchmark the code to ensure it meets the requirements.

```cpp
if (const int64_t* extent_ptr = as_const_int(node->extent)) {
  vector_size_ = arith::ZeroAwareGCD(vector_size_, *extent_ptr);
}  // else keep current vector_size_; UpdateVectorSize will shrink as needed
```
🤖 Prompt for AI Agents
In src/transform/loop_vectorize.cc around lines 59 to 67, the code currently
sets vector_size_ = 1 and returns when extent is not a constant, which disables
vectorization for dynamic extents; instead preserve the existing optimistic
upper bound by removing the early return and not forcing vector_size_ to 1 when
extent_ptr is null, allow UpdateVectorSize (or subsequent analysis) to reduce
vector_size_ based on runtime/shrunken information, or optionally gate the
conservative fallback behind a runtime/config flag; update the logic so that
when extent_ptr is null you skip the ZeroAwareGCD step but do not reset
vector_size_, and add a comment describing the conservative vs optimistic
behavior and the optional flag if you choose to implement one.
```cpp
auto strides = buffer->strides;
if (buffer->strides.empty()) {
  PrimExpr stride = 1;
  for (int i = indices.size() - 1; i >= 0; --i) {
    strides.push_back(stride);
    stride = stride * buffer->shape[i];
  }
  strides = Array<PrimExpr>{strides.rbegin(), strides.rend()};
}
PrimExpr elem_offset = 0;
```
Reverse-stride construction is broken; Array doesn't support rbegin/rend (build blocker).

Array<PrimExpr>{strides.rbegin(), strides.rend()} won't compile for TVM's Array. Build the reversed array explicitly.

```diff
- auto strides = buffer->strides;
- if (buffer->strides.empty()) {
-   PrimExpr stride = 1;
-   for (int i = indices.size() - 1; i >= 0; --i) {
-     strides.push_back(stride);
-     stride = stride * buffer->shape[i];
-   }
-   strides = Array<PrimExpr>{strides.rbegin(), strides.rend()};
- }
+ Array<PrimExpr> strides = buffer->strides;
+ if (strides.empty()) {
+   // Build row-major strides then reverse to match index order.
+   Array<PrimExpr> tmp;
+   PrimExpr stride = 1;
+   for (int i = static_cast<int>(indices.size()) - 1; i >= 0; --i) {
+     tmp.push_back(stride);
+     stride = stride * buffer->shape[i];
+   }
+   strides = Array<PrimExpr>();
+   for (int i = static_cast<int>(tmp.size()) - 1; i >= 0; --i) {
+     strides.push_back(tmp[i]);
+   }
+ }
```

Also applies to: 127-129
🤖 Prompt for AI Agents
In src/transform/loop_vectorize.cc around lines 117–126 (and similarly 127–129),
the code attempts to construct a TVM Array using strides.rbegin()/rendre which
TVM::Array does not support; replace the rbegin/rend construction with an
explicit reverse build: after filling the std::vector<PrimExpr> strides, create
a new temporary container and push the elements from strides in reverse order
(or std::reverse the vector and then construct Array from its begin/end), then
assign that Array<PrimExpr> to strides; apply the same explicit reverse
construction to the other occurrence at lines 127–129.
Awesome, let it in.