@mafeguimaraes
Contributor

This patch introduces the HWVectorization pass, which identifies bitwise patterns in hardware modules that can be represented as vectorized operations instead of per-bit logic.
The pass aims to simplify the IR by grouping related scalar bit operations (such as comb.extract and comb.concat) into higher-level vector constructs like comb.reverse, comb.replicate, or direct multi-bit comb.and, comb.or, and comb.xor.

The pass scans each hw.module for groups of bit-level operations that can be merged into vector-level constructs. This version supports the patterns listed below, detected through bit-level dataflow analysis and structural analysis.

This patch was co-authored by @RosaUlisses.

Supported transformations include:

1. Linear concatenations (identity):

  • Pattern: Bits are extracted in ascending order (identity permutation) and concatenated.

  • Transformation: The entire comb.concat chain is replaced with the original input vector.

// Before
%0 = comb.extract %in from 0 : (i4) -> i1
%1 = comb.extract %in from 1 : (i4) -> i1
%2 = comb.extract %in from 2 : (i4) -> i1
%3 = comb.extract %in from 3 : (i4) -> i1
%concat = comb.concat %3, %2, %1, %0 : i1, i1, i1, i1
hw.output %concat : i4

// After
hw.output %in : i4

2. Bit reversal:

  • Pattern: Bits are extracted in descending (reverse) order and concatenated.

  • Transformation: The chain is replaced with a single comb.reverse.

// Before
%0 = comb.extract %in from 0 : (i4) -> i1
%1 = comb.extract %in from 1 : (i4) -> i1
%2 = comb.extract %in from 2 : (i4) -> i1
%3 = comb.extract %in from 3 : (i4) -> i1
%rev = comb.concat %0, %1, %2, %3 : i1, i1, i1, i1
hw.output %rev : i4

// After
%0 = comb.reverse %in : i4
hw.output %0 : i4
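
Patterns 1 and 2 both come down to asking which source bit feeds each output bit. The following is a minimal C++ sketch of that classification, independent of MLIR. `BitOrder` and `classifyPermutation` are hypothetical names used only for illustration, not the pass's API, and the input is assumed to list the source bit index for each output bit starting from bit 0.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical classification of how output bits map onto one source value.
enum class BitOrder { Identity, Reverse, Other };

// srcIndex[i] is the source bit that drives output bit i (LSB first).
BitOrder classifyPermutation(const std::vector<std::size_t> &srcIndex) {
  const std::size_t width = srcIndex.size();
  if (width == 0)
    return BitOrder::Other;
  bool identity = true;
  bool reverse = true;
  for (std::size_t i = 0; i < width; ++i) {
    identity = identity && (srcIndex[i] == i);
    reverse = reverse && (srcIndex[i] == width - 1 - i);
  }
  if (identity)
    return BitOrder::Identity; // the output is the input itself
  if (reverse)
    return BitOrder::Reverse; // the output is the bit-reversed input
  return BitOrder::Other;
}
```

A one-bit value satisfies both conditions; this sketch reports it as `Identity`, since forwarding the input is the simpler rewrite.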

3. Structural patterns (e.g., vectorized mux):

  • Pattern: Isomorphic, bit-parallel logic cones are detected. For example, a scalarized mux structure that uses a replicated i1 control signal for each bit.

  • Transformation: The replicated scalar operations are collapsed into equivalent vector-level operations (e.g., comb.replicate, comb.and, comb.xor, comb.or).

// Before (scalarized mux)
%sel_inv = comb.xor %sel, %true : i1
%and_a = comb.and %a, %sel : i1
%and_b = comb.and %b, %sel_inv : i1
%mux = comb.or %and_a, %and_b : i1
...
(repeated for each bit)

// After (vectorized mux)
%true = hw.constant true
%sel_vec = comb.replicate %sel : (i1) -> i4
%a_masked = comb.and %a, %sel_vec : i4
%true_vec = comb.replicate %true : (i1) -> i4
%sel_inv_vec = comb.xor %sel_vec, %true_vec : i4
%b_masked = comb.and %b, %sel_inv_vec : i4
%mux = comb.or %a_masked, %b_masked : i4
hw.output %mux : i4
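
The rewrite relies on the per-bit mux and the vector mux computing the same value. The C++ sketch below models both forms on plain integers so the equivalence can be checked exhaustively for small widths. `muxPerBit` and `muxVector` are illustrative names, and the integer model is an assumption made for the example rather than anything taken from the pass.

```cpp
#include <cassert>
#include <cstdint>

// Per-bit mux, mirroring the scalarized form: one and/and/or per bit.
std::uint64_t muxPerBit(std::uint64_t a, std::uint64_t b, bool sel,
                        unsigned width) {
  assert(width >= 1 && width <= 64);
  std::uint64_t out = 0;
  for (unsigned i = 0; i < width; ++i) {
    const bool ai = (a >> i) & 1U;
    const bool bi = (b >> i) & 1U;
    const bool bit = (ai && sel) || (bi && !sel);
    out |= static_cast<std::uint64_t>(bit) << i;
  }
  return out;
}

// Vector mux, mirroring the vectorized form: replicate sel, then wide ops.
std::uint64_t muxVector(std::uint64_t a, std::uint64_t b, bool sel,
                        unsigned width) {
  assert(width >= 1 && width <= 64);
  const std::uint64_t ones = width == 64 ? ~0ULL : ((1ULL << width) - 1);
  const std::uint64_t selVec = sel ? ones : 0;   // replicated select
  const std::uint64_t selInvVec = selVec ^ ones; // xor with all-ones
  return (a & selVec) | (b & selInvVec);
}
```

Comparing the two functions over all 4-bit inputs and both select values is a cheap sanity check for the pattern.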

4. Partial vectorization (chunking):

  • Pattern: The pass identifies contiguous sub-ranges (chunks) that can be vectorized independently, even if the entire bus cannot be.

  • Transformation: The pass vectorizes the identifiable chunks (e.g., a linear chunk) and leaves the remaining scalar or structural logic as another chunk, then concatenates the chunks back together.

// Before (Mixed linear and structural patterns)
// out[3:1] = in[3:1] (linear)
// out[0]   = in[1] ^ in[0] (structural)
%in_3 = comb.extract %in from 3 : (i4) -> i1
%in_2 = comb.extract %in from 2 : (i4) -> i1
%in_1 = comb.extract %in from 1 : (i4) -> i1
// Logic for bit 0
%in_1_for_0 = comb.extract %in from 1 : (i4) -> i1
%in_0 = comb.extract %in from 0 : (i4) -> i1
%bit_0 = comb.xor %in_1_for_0, %in_0 : i1
// Final concatenation
%concat = comb.concat %in_3, %in_2, %in_1, %bit_0 : i1, i1, i1, i1
hw.output %concat : i4

// After (Partially vectorized)
// Chunk 1: [3:1] (vectorized)
%chunk_1 = comb.extract %in from 1 : (i4) -> i3
// Chunk 0: [0] (scalar logic)
%in_1 = comb.extract %in from 1 : (i4) -> i1
%in_0 = comb.extract %in from 0 : (i4) -> i1
%chunk_0 = comb.xor %in_1, %in_0 : i1
// Re-concat the vectorized chunks
%final = comb.concat %chunk_1, %chunk_0 : i3, i1
hw.output %final : i4
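
Chunking can be pictured as splitting the output bits into maximal runs that copy consecutive source bits, with everything else left as scalar logic. The C++ sketch below shows one way to do that split. `Chunk` and `splitIntoChunks` are hypothetical names, and encoding a bit that is not a plain copy as a negative index is an assumption made for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one contiguous range of output bits.
struct Chunk {
  std::size_t lowBit; // lowest output bit covered by the chunk
  std::size_t width;  // number of output bits in the chunk
  bool linear;        // true if the chunk copies consecutive source bits
};

// srcIndex[i] >= 0: output bit i is a plain copy of that source bit.
// srcIndex[i] <  0: output bit i is driven by other logic.
std::vector<Chunk> splitIntoChunks(const std::vector<int> &srcIndex) {
  std::vector<Chunk> chunks;
  std::size_t i = 0;
  while (i < srcIndex.size()) {
    std::size_t len = 1;
    if (srcIndex[i] >= 0) {
      // Grow the run while consecutive outputs read consecutive source bits.
      while (i + len < srcIndex.size() &&
             srcIndex[i + len] == srcIndex[i] + static_cast<int>(len))
        ++len;
    }
    chunks.push_back({i, len, srcIndex[i] >= 0});
    i += len;
  }
  return chunks;
}
```

For the example above, the input `{-1, 1, 2, 3}` yields a one-bit scalar chunk at bit 0 and a three-bit linear chunk starting at bit 1. Chunks come out LSB first, so they would be concatenated in reverse order, because comb.concat lists its most significant operand first.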

Patterns not transformed
The pass leaves modules with cross-bit dependencies or non-linear control flow unchanged.
For example:

// cross-dependency example (should remain unchanged)
hw.module @cross_dependency(in %in : i2, out out : i2) {
  %0 = comb.extract %in from 0 : (i2) -> i1
  %1 = comb.extract %6 from 1 : (i2) -> i1
  %2 = comb.xor %0, %1 : i1
  %3 = comb.extract %in from 1 : (i2) -> i1
  %4 = comb.extract %6 from 0 : (i2) -> i1
  %5 = comb.xor %3, %4 : i1
  %6 = comb.concat %5, %2 : i1, i1
  hw.output %6 : i2
}
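
One way to state the cross-bit restriction: the logic driving output bit i may only read bit i of the values it uses. The C++ sketch below checks that property on a toy model. `Leaf` and `isBitParallel` are hypothetical names, and the rule itself is an assumption about what "cross-bit dependency" means here, not the pass's actual criterion.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// (source id, bit index) read by a per-bit logic cone. Toy model only.
using Leaf = std::pair<int, std::size_t>;

// cones[i] lists every leaf read by the logic driving output bit i.
bool isBitParallel(const std::vector<std::vector<Leaf>> &cones) {
  for (std::size_t i = 0; i < cones.size(); ++i)
    for (const Leaf &leaf : cones[i])
      if (leaf.second != i)
        return false; // bit i reads another lane: cross-bit dependency
  return true;
}
```

In the module above, output bit 0 reads bit 1 of %6 and output bit 1 reads bit 0 of %6, so a check of this kind would reject it.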

@mafeguimaraes force-pushed the feature/hw-vectorization-pass branch 3 times, most recently from 59a27df to 7dfbad0 (November 10, 2025 19:23)
@pronesto

Hi everyone, just a gentle ping on this PR. It has been open for a while, and I wanted to check whether there is anything we can do on our side to help move the review forward. Many thanks!

bit &operator=(const bit &other);
bool operator==(const bit &other) const;

bool left_adjacent(const bit &other);
Member

nit: please use camelBack (as noted in MLIR style guide https://mlir.llvm.org/getting_started/DeveloperGuide/#style-guide)


Block &block = module.getBody().front();
auto outputOp = dyn_cast<hw::OutputOp>(block.getTerminator());
if (!outputOp)
Member

This if-statement is unnecessary: the verifier guarantees that the terminator is an hw::OutputOp.


bool containsLLHD = false;
module.walk([&](mlir::Operation *op) {
if (op->getDialect()->getNamespace() == "llhd") {
Member

Why does this give up when LLHD ops are present?

Comment on lines 816 to 875
} else if (auto andOp = dyn_cast<comb::AndOp>(op)) {
Value lhs = andOp.getInputs()[0];
Value rhs = andOp.getInputs()[1];
if (isa_and_nonnull<hw::ConstantOp>(rhs.getDefiningOp()))
return findBitSource(lhs, bitIndex, depth + 1);
if (isa_and_nonnull<hw::ConstantOp>(lhs.getDefiningOp()))
return findBitSource(rhs, bitIndex, depth + 1);
Member

I'm not sure I'm following the code correctly, but these parts don't seem correct. Is it necessary to check the value of the constant? Also, can't we simply treat and/or/xor as the source op here?
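
The value of the constant matters because `and` with a constant forwards the other operand's bit only where the constant bit is 1, and `or` only where it is 0. Looking through any constant operand would treat a masked-off bit as if it still came from the other input. The C++ sketch below spells out the per-bit check. `BitwiseKind` and `isIdentityMaskBit` are hypothetical names for illustration, not code from this PR.

```cpp
#include <cstdint>

enum class BitwiseKind { And, Or };

// Returns true if, at position bitIndex, `x op mask` always equals x.
// and: the mask bit must be 1. or: the mask bit must be 0.
// Assumes bitIndex < 64.
bool isIdentityMaskBit(BitwiseKind kind, std::uint64_t mask,
                       unsigned bitIndex) {
  const bool maskBit = (mask >> bitIndex) & 1U;
  return kind == BitwiseKind::And ? maskBit : !maskBit;
}
```

With a mask of 0x6, an `and` forwards bits 1 and 2 but forces bits 0 and 3 to zero, so only the first two could be traced back to the other operand.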

Comment on lines 828 to 848
bool vectorizer::cleanup_dead_ops(Block &block) {
bool overallChanged = false;
bool changedInIteration = true;
while (changedInIteration) {
changedInIteration = false;
llvm::SmallVector<Operation *, 16> deadOps;
for (Operation &op : block) {
if (op.use_empty() && !op.hasTrait<mlir::OpTrait::IsTerminator>()) {
deadOps.push_back(&op);
}
}
if (!deadOps.empty()) {
changedInIteration = true;
overallChanged = true;
for (Operation *op : deadOps) {
op->erase();
}
}
}
return overallChanged;
}
Member

Could you use https://github.com/llvm/llvm-project/blob/db557bee1e2c128e77805deb86c1f364b5c29e70/mlir/lib/Transforms/Utils/RegionUtils.cpp#L495? There are a few issues around side-effecting ops and O(N^2) fixpoint iterations here, so it would be nice to simply use a library function.
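
To illustrate the two concerns, the C++ sketch below runs dead-op elimination with a worklist and use counts instead of rescanning the block until nothing changes, and it never erases an op marked as must-keep. It uses a toy IR: `ToyOp` and `eraseDeadOps` are hypothetical names, and this is not the library function linked above.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for an operation; not an MLIR type.
struct ToyOp {
  std::vector<std::size_t> operands; // indices of the ops defining operands
  bool mustKeep = false;             // terminator or side-effecting op
};

// Returns a vector where erased[i] is true for every op removed as dead.
std::vector<bool> eraseDeadOps(const std::vector<ToyOp> &ops) {
  std::vector<std::size_t> useCount(ops.size(), 0);
  for (const ToyOp &op : ops)
    for (std::size_t def : op.operands)
      ++useCount[def];

  std::vector<bool> erased(ops.size(), false);
  std::vector<std::size_t> worklist;
  for (std::size_t i = 0; i < ops.size(); ++i)
    if (useCount[i] == 0 && !ops[i].mustKeep)
      worklist.push_back(i);

  // Each op enters the worklist at most once, when its last use disappears.
  while (!worklist.empty()) {
    const std::size_t dead = worklist.back();
    worklist.pop_back();
    erased[dead] = true;
    for (std::size_t def : ops[dead].operands)
      if (--useCount[def] == 0 && !ops[def].mustKeep)
        worklist.push_back(def);
  }
  return erased;
}
```

Total work is proportional to the number of ops plus operands, whereas the quoted loop rescans the whole block after every round of erasures.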

Comment on lines 188 to 195
llvm::DenseSet<mlir::Value> sources;
for (const auto &[_, bit] : bits) {
if (!sources.contains(bit.source))
sources.insert(bit.source);
if (sources.size() >= 2)
return false;
}
return true;
Member

super nit: using DenseSet is certainly overkill here, e.g.:

Suggested change
- llvm::DenseSet<mlir::Value> sources;
- for (const auto &[_, bit] : bits) {
- if (!sources.contains(bit.source))
- sources.insert(bit.source);
- if (sources.size() >= 2)
- return false;
- }
- return true;
+ mlir::Value source;
+ for (const auto &[_, bit] : bits) {
+ if (source && source != bit.source) return false;
+ source = bit.source;
+ }
+ return true;

@mafeguimaraes force-pushed the feature/hw-vectorization-pass branch 3 times, most recently from a756a7e to ae4c6b6 (December 16, 2025 11:43)
@mafeguimaraes force-pushed the feature/hw-vectorization-pass branch from ae4c6b6 to 1f3df90 (December 16, 2025 11:46)
@mafeguimaraes force-pushed the feature/hw-vectorization-pass branch from dcbe132 to 80f2389 (December 16, 2025 12:32)
@mafeguimaraes
Contributor Author

Hi @uenoku,

Thank you very much for the review and for pointing out the issues with the previous approach. It was really helpful.

I’ve reworked findBitSource to keep it strictly structural again, and moved all boolean reasoning into a separate helper (isBitConstant). This helper is intentionally limited: it only proves constants through structural traversal and identity propagation (e.g., and(x, 1) and or(x, 0)), and does not attempt general boolean simplification.

The helper is used only to recognize identity masks in and/or, which allows handling the mux-like pattern in test_mux without turning findBitSource into a semantic evaluator.

Please let me know if this direction looks more reasonable to you, or if you’d prefer an even more conservative restriction.

Thanks again for the review!

@mafeguimaraes
Contributor Author

Hi @uenoku, hope you’re doing well and had a great holiday season!

I just wanted to gently follow up on this PR. It’s been rebased, all checks are passing, and it addresses the feedback about the isBitConstant helper.

Happy to make any further changes if needed. Thanks a lot!

Comment on lines +281 to +283
if (!allBitsHaveSameSource() || bits.empty()) {
return nullptr;
}
Member

Suggested change
- if (!allBitsHaveSameSource() || bits.empty()) {
- return nullptr;
- }
+ if (!allBitsHaveSameSource() || bits.empty())
+ return nullptr;


bool BitArray::allBitsHaveSameSource() const {
mlir::Value source;
for (const auto &[_, bit] : bits) {
Member

Note that the iteration order of DenseMap is non-deterministic. Is there any place that depends on the order?
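
An all-equal check such as allBitsHaveSameSource gives the same answer in any visiting order, but code that builds IR while iterating would not. A common remedy is to collect the keys and sort them before use, as in the C++ sketch below. `std::unordered_map` stands in for `llvm::DenseMap` here because it also has an unspecified iteration order, and `sortedBitIndices` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

// Collects the keys of a hash map and sorts them, so later processing visits
// bits in a fixed order regardless of the container's internal layout.
std::vector<int> sortedBitIndices(const std::unordered_map<int, char> &bits) {
  std::vector<int> keys;
  keys.reserve(bits.size());
  for (const auto &entry : bits)
    keys.push_back(entry.first);
  std::sort(keys.begin(), keys.end());
  return keys;
}
```

Iterating over the sorted keys and looking up each bit keeps the emitted IR stable from run to run.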

return true;
}

Bit BitArray::getBit(int n) { return bits[n]; }
Member

Is it fine to mutate bits here? If n is not registered yet, it returns nullptr, but is that handled in the caller?
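
The mutation comes from `operator[]`, which inserts a default-constructed entry when the key is missing. A const lookup avoids the insertion and makes the missing case explicit to the caller. The C++ sketch below shows the idea with `std::unordered_map` as a stand-in for the map used in the PR. `ToyBit` and `ToyBitArray` are hypothetical types, not the PR's `Bit` and `BitArray`.

```cpp
#include <cstddef>
#include <optional>
#include <unordered_map>

// Toy stand-in for a tracked bit: which source it comes from, and which bit.
struct ToyBit {
  int source = -1;
  int index = 0;
};

class ToyBitArray {
public:
  void setBit(int n, ToyBit bit) { bits[n] = bit; }

  // Const lookup: never inserts, and reports an unregistered bit explicitly.
  std::optional<ToyBit> getBit(int n) const {
    auto it = bits.find(n);
    if (it == bits.end())
      return std::nullopt;
    return it->second;
  }

  std::size_t size() const { return bits.size(); }

private:
  std::unordered_map<int, ToyBit> bits;
};
```

Querying an unregistered index leaves the container unchanged, so a stray lookup cannot silently add an empty bit.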

IRRewriter rewriter(module.getContext());
bool changed = false;

for (Value oldOutputVal : outputOp->getOperands()) {
Member

Does this only apply the transformation to output values? That makes sense as a first step, but you might want to consider other operations such as hw.instance, seq.compreg, etc.

}

bool Bit::operator==(const Bit &other) const {
return source == other.source and index == other.index;
Member

nit: for consistency with other places.

Suggested change
- return source == other.source and index == other.index;
+ return source == other.source && index == other.index;

return false;
}

bool Vectorizer::canVectorizeStructurally(mlir::Value output) {
Member

The current pattern separates analysis (can* functions) from transformation (apply* functions), which requires analyzing the IR twice and makes it very difficult to understand what's been validated before mutation occurs.

Please consider combining these into try* methods that return LogicalResult:

LogicalResult tryStructuralVectorization(OpBuilder &builder, Value value);
LogicalResult tryPartialVectorization(OpBuilder &builder, Value value);
if (succeeded(tryLinearVectorization(oldOutputVal, sourceInput)))
  continue;

if (succeeded(tryReverseVectorization(rewriter, oldOutputVal, sourceInput)))
  continue;

if (succeeded(tryStructuralVectorization(rewriter, oldOutputVal)))
  continue;

if (succeeded(tryPartialVectorization(rewriter, oldOutputVal)))
  continue;

while ((i - len) >= 0) {
Value nextBitSource = findBitSource(oldOutputVal, i - len);
auto nextExtractOp =
dyn_cast_or_null<comb::ExtractOp>(nextBitSource.getDefiningOp());
Member

Suggested change
- dyn_cast_or_null<comb::ExtractOp>(nextBitSource.getDefiningOp());
+ nextBitSource.getDefiningOp<comb::ExtractOp>();

cone.insert(val);

Operation *definingOp = val.getDefiningOp();
if (!definingOp || isa<BlockArgument>(val) ||
Member

nit: !definingOp means isa<BlockArgument>(val)

}
}

bool Vectorizer::isSafeSharedValue(
Member

Could you elaborate on why this is safe to share, and leave comments? I feel it always returns true.


if (auto *op = val.getDefiningOp()) {
for (auto operand : op->getOperands()) {
if (!isSafeSharedValue(operand, visited))
Member

Also, because visited is initialized every time at line 593, I think this function visits the entire use-def chain.
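
The cost difference is easy to see on a toy use-def graph: when one visited set is shared across queries, each value is walked once in total, while a fresh set per query re-walks the shared operands every time. The C++ sketch below counts newly visited values per call. `ToyGraph` and `walkOperands` are hypothetical names, and this is an illustration of the concern rather than a proposed replacement for isSafeSharedValue.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Toy use-def graph: operands[v] lists the values that v is computed from.
using ToyGraph = std::vector<std::vector<std::size_t>>;

// Walks the use-def chain from root, skipping anything already in visited.
// Returns how many values were newly visited by this call.
std::size_t walkOperands(const ToyGraph &operands, std::size_t root,
                         std::unordered_set<std::size_t> &visited) {
  std::vector<std::size_t> stack;
  stack.push_back(root);
  std::size_t newlyVisited = 0;
  while (!stack.empty()) {
    const std::size_t v = stack.back();
    stack.pop_back();
    if (!visited.insert(v).second)
      continue; // already handled by this or an earlier query
    ++newlyVisited;
    for (std::size_t operand : operands[v])
      stack.push_back(operand);
  }
  return newlyVisited;
}
```

With two roots that share a chain of operands, the second query touches only the root itself when the set is shared, and the whole chain again when the set is reset.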

struct HWVectorizationPass
: public hw::impl::HWVectorizationBase<HWVectorizationPass> {

void getDependentDialects(mlir::DialectRegistry &registry) const override {
Member

Please define this in TableGen. Also, the SV dialect is not necessary, I think.

return false;

if (auto c = dyn_cast<hw::ConstantOp>(defOp)) {
if (bitIndex < c.getValue().getBitWidth())
Member

Does this condition ever happen?

Member

@uenoku left a comment

I apologize for the long delay - I'm still working through understanding what the pass does, and it's taking me some time.

Since this PR is quite large and does non-trivial IR transformations, it might be much easier to review and merge quickly if you could split it into smaller PRs. Something like:

  1. Pass boilerplate and basic infrastructure
  2. BitArray data structure + ExtractOp preprocessing + linear vectorization (applyLinearVectorization)
  3. Reverse and mixed permutation vectorization + ConcatOp handling
  4. Logical op preprocessing (and/or/xor) + structural vectorization
  5. Partial vectorization

Keeping each PR under ~300 LOC would make it much easier to review, provide targeted feedback, and merge incrementally. It would also help me (and other reviewers) understand the design better by seeing it build up piece by piece.

What do you think?
