[TIR][Schedule] Add FuseReductionEpilogue primitive to fuse epilogue into reduction init #18418
Conversation
Summary of Changes: Hello @kimm240, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant optimization to the TVM Tensor IR (TIR) schedule by adding a new primitive, fuse_reduction_epilogue, which fuses an epilogue operation into a reduction block's initialization.
Code Review
This pull request introduces a new and useful schedule primitive, fuse_reduction_epilogue, to fuse an epilogue operation into a reduction block's initialization. The implementation is well-structured, with a clear separation of concerns for analysis and transformation. The changes are consistently applied across the scheduling infrastructure, and the new functionality is well-tested, including checks for numerical correctness and trace round-tripping. I've found one high-severity correctness issue in the pattern matching logic and a couple of medium-severity opportunities to improve code clarity and robustness. Overall, this is a great addition to TVM's scheduling capabilities.
Force-pushed from 59f14e6 to a1c9681
[TIR][Schedule] Add FuseReductionEpilogue primitive to fuse epilogue into reduction init

Currently it is not possible to fuse an epilogue operation (e.g., bias addition) into a reduction block's initialization statement. This limitation prevents leveraging hardware-specific instructions that support bias accumulation in vector ISAs, such as MACC (multiply-accumulate with bias) instructions.

This commit implements a new schedule primitive 'fuse_reduction_epilogue' that addresses the problem described in:
https://discuss.tvm.apache.org/t/tir-problem-inlining-addition-into-matmul-block/18066

The primitive transforms the following pattern:

Before:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            with T.init():
                temp[vi, vj] = 0
            temp[vi, vj] = temp[vi, vj] + A[vi, vk] * B[vj, vk]
    for i, j in T.grid(M, N):
        with T.block("bias_add"):
            D[vi, vj] = temp[vi, vj] + C[vi, vj]

After:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            T.reads(C[vi, vj], A[vi, vk], B[vj, vk])
            T.writes(D[vi, vj])
            with T.init():
                D[vi, vj] = C[vi, vj]  # Fused epilogue into init
            D[vi, vj] = D[vi, vj] + A[vi, vk] * B[vj, vk]

The transformation removes the intermediate temp buffer and the separate epilogue block, enabling better tensorization opportunities for hardware with bias accumulation support.

Implementation:
- ReductionEpilogueFuser class for pattern validation and IR transformation
- BodyPatternAllowFusion: Validates that the epilogue can be fused
- AnalyzeEpiloguePattern: Detects the addition pattern (D = temp + C)
- ExtractEpilogueInfo: Extracts buffer and region information
- CreateFusedReductionBlock: Creates a single block with modified T.init()
- SingleBlockFusionReplacer: Replaces blocks and removes the temp buffer
- Variable mapping between epilogue and reduction block iter vars
- Proper buffer and region updates with correct read/write ordering
- FFI bindings and Python API following TVM conventions

Changes:
- src/tir/schedule/primitive/compute_inline.cc: Core implementation (~430 lines)
- src/tir/schedule/primitive.h: Function declaration
- include/tvm/tir/schedule/schedule.h: Virtual method in ScheduleNode
- src/tir/schedule/concrete_schedule.{h,cc}: ConcreteScheduleNode implementation
- src/tir/schedule/traced_schedule.{h,cc}: TracedScheduleNode implementation
- src/tir/schedule/schedule.cc: FFI binding registration
- python/tvm/tir/schedule/schedule.py: Python API with documentation
- tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py: Comprehensive tests including basic fusion, a float32 variant, numerical correctness verification, and trace roundtrip validation

Run tests with:
pytest tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py -v
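For concreteness, here is a minimal, self-contained usage sketch of the new primitive, assuming the Python API described above; the module below just restates the Before pattern with illustrative float32 16x16 shapes:

    import tvm
    from tvm import tir
    from tvm.script import tir as T

    # Self-contained restatement of the Before pattern (shapes are illustrative).
    @tvm.script.ir_module
    class MatmulBias:
        @T.prim_func
        def main(
            A: T.Buffer((16, 16), "float32"),
            B: T.Buffer((16, 16), "float32"),
            C: T.Buffer((16, 16), "float32"),
            D: T.Buffer((16, 16), "float32"),
        ):
            temp = T.alloc_buffer((16, 16), dtype="float32")
            for i, j, k in T.grid(16, 16, 16):
                with T.block("matmul"):
                    vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                    with T.init():
                        temp[vi, vj] = T.float32(0)
                    temp[vi, vj] = temp[vi, vj] + A[vi, vk] * B[vj, vk]
            for i, j in T.grid(16, 16):
                with T.block("bias_add"):
                    vi, vj = T.axis.remap("SS", [i, j])
                    D[vi, vj] = temp[vi, vj] + C[vi, vj]

    sch = tir.Schedule(MatmulBias)
    # Fuse the "bias_add" epilogue into the init of the "matmul" reduction block.
    sch.fuse_reduction_epilogue("matmul", "bias_add")
    print(sch.mod)  # init becomes D[vi, vj] = C[vi, vj]; temp is removed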
Force-pushed from a1c9681 to 0fc40e7
wrongtest-intellif left a comment
@kimm240 Overall, it seems to be a very useful primitive! Could you also check more dataflow pattern cases?
For example,
    @T.prim_func
    def matmul_bias_before(
        A: T.Buffer((16, 16), "int8"),
        B: T.Buffer((16, 16), "int8"),
        C: T.Buffer((16, 16), "int32"),
        D: T.Buffer((16, 16), "int32"),
        E: T.Buffer((16, 16), "int32"),
    ) -> None:
        """Original function with separate reduction and epilogue blocks."""
        temp = T.alloc_buffer((16, 16), dtype="int32")
        for i, j, k in T.grid(16, 16, 16):
            with T.block("multiply"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    temp[vi, vj] = T.int32(0)
                temp[vi, vj] = temp[vi, vj] + T.cast(A[vi, vk], "int32") * T.cast(B[vj, vk], "int32")
        for i, j in T.grid(16, 16):
            with T.block("add"):
                vi, vj = T.axis.remap("SS", [i, j])
                D[vi, vj] = temp[vi, vj] + C[vi, vj]
        for i, j in T.grid(16, 16):
            with T.block("add2"):
                vi, vj = T.axis.remap("SS", [i, j])
                E[vi, vj] = temp[vi, vj] + C[vi, vj]

with sch.fuse_reduction_epilogue("multiply", "add")?
…ogue

- Add CheckBufferStillUsed helper function to check if the reduction buffer is still referenced by other blocks after fusion
- Only remove the intermediate temp buffer if no other blocks reference it
- Add a test case for the multiple-epilogue-blocks scenario, where one epilogue is fused while another still uses the intermediate buffer
- This addresses the case where multiple epilogue blocks use the same reduction output, ensuring the temp buffer is preserved when needed

Related issue: https://discuss.tvm.apache.org/t/...
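A conceptual Python sketch of the liveness check this commit describes; the real helper is part of the C++ implementation in compute_inline.cc, so the function name and traversal below are illustrative only:

    from tvm import tir

    def buffer_still_used(root: tir.Stmt, buf: tir.Buffer, fused_block: tir.Block) -> bool:
        """Return True if any block other than `fused_block` still touches `buf`."""
        used = False

        def visit(stmt):
            nonlocal used
            # Check the read/write regions of every block except the fused one.
            if isinstance(stmt, tir.Block) and not stmt.same_as(fused_block):
                for region in list(stmt.reads) + list(stmt.writes):
                    if region.buffer.same_as(buf):
                        used = True

        tir.stmt_functor.post_order_visit(root, visit)
        return used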
[TIR][Schedule] Add FuseReductionEpilogue primitive to fuse epilogue into reduction init

Currently it is not possible to fuse an epilogue operation (e.g., bias addition) into a reduction block's initialization statement. This limitation prevents leveraging hardware-specific instructions that support bias accumulation in vector ISAs, such as MACC (multiply-accumulate with bias) instructions.

This commit implements a new schedule primitive 'fuse_reduction_epilogue' that addresses the problem described in:
https://discuss.tvm.apache.org/t/tir-problem-inlining-addition-into-matmul-block/18066

The primitive transforms the following pattern:

Before:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            with T.init():
                temp[vi, vj] = 0
            temp[vi, vj] = temp[vi, vj] + A[vi, vk] * B[vj, vk]
    for i, j in T.grid(M, N):
        with T.block("bias_add"):
            D[vi, vj] = temp[vi, vj] + C[vi, vj]

After:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            T.reads(C[vi, vj], A[vi, vk], B[vj, vk])
            T.writes(D[vi, vj])
            with T.init():
                D[vi, vj] = C[vi, vj]  # Fused epilogue into init
            D[vi, vj] = D[vi, vj] + A[vi, vk] * B[vj, vk]

The transformation removes the intermediate temp buffer and the separate epilogue block, enabling better tensorization opportunities for hardware with bias accumulation support.

To resolve the issue where multiple epilogue blocks use the same reduction output, this commit adds a CheckBufferStillUsed function that checks whether other blocks still reference the reduction buffer, and keeps the temp buffer if it is still referenced. This ensures that when one epilogue block is fused, other epilogue blocks that still use the intermediate buffer continue to work correctly.

Implementation:
- ReductionEpilogueFuser class for pattern validation and IR transformation
- BodyPatternAllowFusion: Validates that the epilogue can be fused
- AnalyzeEpiloguePattern: Detects the addition pattern (D = temp + C)
- ExtractEpilogueInfo: Extracts buffer and region information
- CreateFusedReductionBlock: Creates a single block with modified T.init()
- SingleBlockFusionReplacer: Replaces blocks and removes the temp buffer
- CheckBufferStillUsed: Helper function to check if the reduction buffer is still referenced by other blocks after fusion
- Conditionally removes the temp buffer only if no other blocks reference it
- Variable mapping between epilogue and reduction block iter vars
- Proper buffer and region updates with correct read/write ordering
- FFI bindings and Python API following TVM conventions

Changes:
- src/tir/schedule/primitive/compute_inline.cc: Core implementation (~430 lines)
- src/tir/schedule/primitive.h: Function declaration
- include/tvm/tir/schedule/schedule.h: Virtual method in ScheduleNode
- src/tir/schedule/concrete_schedule.{h,cc}: ConcreteScheduleNode implementation
- src/tir/schedule/traced_schedule.{h,cc}: TracedScheduleNode implementation
- src/tir/schedule/schedule.cc: FFI binding registration
- python/tvm/tir/schedule/schedule.py: Python API with documentation
- tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py: Comprehensive tests including basic fusion, a float32 variant, numerical correctness verification, trace roundtrip validation, and a multiple-epilogue-blocks test case

Tests can be verified through the test_fuse_reduction_epilogue_multiple_epilogue function in tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py. Tests can be run using:
python -m pytest tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py
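As a sketch of the multiple-epilogue scenario this commit covers, reusing the reviewer's matmul_bias_before example above (the assertions are illustrative, not the exact test code):

    import tvm
    from tvm import tir

    # Assumes `matmul_bias_before` from the review comment above is in scope.
    sch = tir.Schedule(matmul_bias_before)
    sch.fuse_reduction_epilogue("multiply", "add")

    # "add" is fused into the init of "multiply", but "add2" still reads temp,
    # so the intermediate buffer must be preserved rather than removed.
    script = sch.mod.script()
    assert "temp" in script   # buffer kept because "add2" still references it
    assert "add2" in script   # the unfused epilogue block remains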
Force-pushed from f02dedb to b19f547
@wrongtest-intellif
wrongtest-intellif left a comment
LGTM, with some small comments
Currently it is not possible to fuse an epilogue operation (e.g., bias addition) into a reduction block's initialization statement. This limitation prevents leveraging hardware-specific instructions that support bias accumulation in vector ISAs, such as MACC (multiply-accumulate with bias) instructions.
This commit implements a new schedule primitive 'fuse_reduction_epilogue' that addresses the problem described in:
https://discuss.tvm.apache.org/t/tir-problem-inlining-addition-into-matmul-block/18066
The primitive transforms the following pattern:
Before:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            with T.init():
                temp[vi, vj] = 0
            temp[vi, vj] = temp[vi, vj] + A[vi, vk] * B[vj, vk]
    for i, j in T.grid(M, N):
        with T.block("bias_add"):
            D[vi, vj] = temp[vi, vj] + C[vi, vj]

After:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            T.reads(C[vi, vj], A[vi, vk], B[vj, vk])
            T.writes(D[vi, vj])
            with T.init():
                D[vi, vj] = C[vi, vj]  # Fused epilogue into init
            D[vi, vj] = D[vi, vj] + A[vi, vk] * B[vj, vk]
The transformation removes the intermediate temp buffer and the separate epilogue block, enabling better tensorization opportunities for hardware with bias accumulation support.
Implementation:
- ReductionEpilogueFuser class for pattern validation and IR transformation
- BodyPatternAllowFusion: Validates that the epilogue can be fused
- AnalyzeEpiloguePattern: Detects the addition pattern (D = temp + C)
- ExtractEpilogueInfo: Extracts buffer and region information
- CreateFusedReductionBlock: Creates a single block with modified T.init()
- SingleBlockFusionReplacer: Replaces blocks and removes the temp buffer
- CheckBufferStillUsed: Checks if the reduction buffer is still referenced by other blocks after fusion
- Variable mapping between epilogue and reduction block iter vars
- Proper buffer and region updates with correct read/write ordering
- FFI bindings and Python API following TVM conventions

Changes:
- src/tir/schedule/primitive/compute_inline.cc: Core implementation (~430 lines)
- src/tir/schedule/primitive.h: Function declaration
- include/tvm/tir/schedule/schedule.h: Virtual method in ScheduleNode
- src/tir/schedule/concrete_schedule.{h,cc}: ConcreteScheduleNode implementation
- src/tir/schedule/traced_schedule.{h,cc}: TracedScheduleNode implementation
- src/tir/schedule/schedule.cc: FFI binding registration
- python/tvm/tir/schedule/schedule.py: Python API with documentation
- tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py: Comprehensive tests
Run tests with:
pytest tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py -v
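The trace roundtrip validation mentioned above can be sketched with TVM's schedule testing helper; here `mod` is assumed to be an IRModule containing the Before pattern:

    from tvm import tir
    from tvm.tir.schedule.testing import verify_trace_roundtrip

    sch = tir.Schedule(mod)
    sch.fuse_reduction_epilogue("matmul", "bias_add")
    # Replays the recorded trace on a fresh schedule of `mod` and checks the
    # result matches, exercising the traced-schedule and FFI paths in this PR.
    verify_trace_roundtrip(sch, mod=mod)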
Also, could you please take a look at #18240? Thx :)