[TIR][Schedule] Add FuseReductionEpilogue primitive to fuse epilogue into reduction init #18418
Conversation
Summary of Changes: Hello @kimm240, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant optimization to the TVM Tensor IR (TIR) schedule by adding a new primitive, fuse_reduction_epilogue, which fuses an epilogue operation into a reduction block's initialization.
Code Review
This pull request introduces a new and useful schedule primitive, fuse_reduction_epilogue, to fuse an epilogue operation into a reduction block's initialization. The implementation is well-structured, with a clear separation of concerns for analysis and transformation. The changes are consistently applied across the scheduling infrastructure, and the new functionality is well-tested, including checks for numerical correctness and trace round-tripping. I've found one high-severity correctness issue in the pattern matching logic and a couple of medium-severity opportunities to improve code clarity and robustness. Overall, this is a great addition to TVM's scheduling capabilities.
Force-pushed from 59f14e6 to a1c9681
[TIR][Schedule] Add FuseReductionEpilogue primitive to fuse epilogue into reduction init

Currently it is not possible to fuse an epilogue operation (e.g., bias addition) into a reduction block's initialization statement. This limitation prevents leveraging hardware-specific instructions that support bias accumulation in vector ISAs, such as MACC (multiply-accumulate with bias) instructions.

This commit implements a new schedule primitive 'fuse_reduction_epilogue' that addresses the problem described in:
https://discuss.tvm.apache.org/t/tir-problem-inlining-addition-into-matmul-block/18066

The primitive transforms the following pattern:

Before:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            with T.init():
                temp[vi, vj] = 0
            temp[vi, vj] = temp[vi, vj] + A[vi, vk] * B[vj, vk]
    for i, j in T.grid(M, N):
        with T.block("bias_add"):
            D[vi, vj] = temp[vi, vj] + C[vi, vj]

After:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            T.reads(C[vi, vj], A[vi, vk], B[vj, vk])
            T.writes(D[vi, vj])
            with T.init():
                D[vi, vj] = C[vi, vj]  # Fused epilogue into init
            D[vi, vj] = D[vi, vj] + A[vi, vk] * B[vj, vk]

The transformation removes the intermediate temp buffer and the separate epilogue block, enabling better tensorization opportunities for hardware with bias accumulation support.

Implementation:
- ReductionEpilogueFuser class for pattern validation and IR transformation
- BodyPatternAllowFusion: Validates that the epilogue can be fused
- AnalyzeEpiloguePattern: Detects the addition pattern (D = temp + C)
- ExtractEpilogueInfo: Extracts buffer and region information
- CreateFusedReductionBlock: Creates a single block with modified T.init()
- SingleBlockFusionReplacer: Replaces blocks and removes the temp buffer
- Variable mapping between epilogue and reduction block iter vars
- Proper buffer and region updates with correct read/write ordering
- FFI bindings and Python API following TVM conventions

Changes:
- src/tir/schedule/primitive/compute_inline.cc: Core implementation (~430 lines)
- src/tir/schedule/primitive.h: Function declaration
- include/tvm/tir/schedule/schedule.h: Virtual method in ScheduleNode
- src/tir/schedule/concrete_schedule.{h,cc}: ConcreteScheduleNode implementation
- src/tir/schedule/traced_schedule.{h,cc}: TracedScheduleNode implementation
- src/tir/schedule/schedule.cc: FFI binding registration
- python/tvm/tir/schedule/schedule.py: Python API with documentation
- tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py: Comprehensive tests including basic fusion, a float32 variant, numerical correctness verification, and trace roundtrip validation

Run tests with:
pytest tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py -v
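For concreteness, here is a minimal, self-contained usage sketch of the new primitive, assuming the Python API described above; the module below just restates the Before pattern with illustrative float32 16x16 shapes:

    import tvm
    from tvm import tir
    from tvm.script import tir as T

    # Self-contained restatement of the Before pattern (shapes are illustrative).
    @tvm.script.ir_module
    class MatmulBias:
        @T.prim_func
        def main(
            A: T.Buffer((16, 16), "float32"),
            B: T.Buffer((16, 16), "float32"),
            C: T.Buffer((16, 16), "float32"),
            D: T.Buffer((16, 16), "float32"),
        ):
            temp = T.alloc_buffer((16, 16), dtype="float32")
            for i, j, k in T.grid(16, 16, 16):
                with T.block("matmul"):
                    vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                    with T.init():
                        temp[vi, vj] = T.float32(0)
                    temp[vi, vj] = temp[vi, vj] + A[vi, vk] * B[vj, vk]
            for i, j in T.grid(16, 16):
                with T.block("bias_add"):
                    vi, vj = T.axis.remap("SS", [i, j])
                    D[vi, vj] = temp[vi, vj] + C[vi, vj]

    sch = tir.Schedule(MatmulBias)
    # Fuse the "bias_add" epilogue into the init of the "matmul" reduction block.
    sch.fuse_reduction_epilogue("matmul", "bias_add")
    print(sch.mod)  # init becomes D[vi, vj] = C[vi, vj]; temp is removed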
Force-pushed from a1c9681 to 0fc40e7
wrongtest-intellif left a comment
@kimm240 Overall, it seems to be a very useful primitive! Could you also check more dataflow pattern cases?
For example,
    @T.prim_func
    def matmul_bias_before(
        A: T.Buffer((16, 16), "int8"),
        B: T.Buffer((16, 16), "int8"),
        C: T.Buffer((16, 16), "int32"),
        D: T.Buffer((16, 16), "int32"),
        E: T.Buffer((16, 16), "int32"),
    ) -> None:
        """Original function with separate reduction and epilogue blocks."""
        temp = T.alloc_buffer((16, 16), dtype="int32")
        for i, j, k in T.grid(16, 16, 16):
            with T.block("multiply"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    temp[vi, vj] = T.int32(0)
                temp[vi, vj] = temp[vi, vj] + T.cast(A[vi, vk], "int32") * T.cast(B[vj, vk], "int32")
        for i, j in T.grid(16, 16):
            with T.block("add"):
                vi, vj = T.axis.remap("SS", [i, j])
                D[vi, vj] = temp[vi, vj] + C[vi, vj]
        for i, j in T.grid(16, 16):
            with T.block("add2"):
                vi, vj = T.axis.remap("SS", [i, j])
                E[vi, vj] = temp[vi, vj] + C[vi, vj]

with sch.fuse_reduction_epilogue("multiply", "add")?
…ogue

- Add CheckBufferStillUsed helper function to check if the reduction buffer is still referenced by other blocks after fusion
- Only remove the intermediate temp buffer if no other blocks reference it
- Add a test case for the multiple-epilogue-blocks scenario, where one epilogue is fused while another still uses the intermediate buffer
- This addresses the case where multiple epilogue blocks use the same reduction output, ensuring the temp buffer is preserved when needed

Related issue: https://discuss.tvm.apache.org/t/...
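A conceptual Python sketch of the liveness check this commit describes; the real helper is part of the C++ implementation in compute_inline.cc, so the function name and traversal below are illustrative only:

    from tvm import tir

    def buffer_still_used(root: tir.Stmt, buf: tir.Buffer, fused_block: tir.Block) -> bool:
        """Return True if any block other than `fused_block` still touches `buf`."""
        used = False

        def visit(stmt):
            nonlocal used
            # Check the read/write regions of every block except the fused one.
            if isinstance(stmt, tir.Block) and not stmt.same_as(fused_block):
                for region in list(stmt.reads) + list(stmt.writes):
                    if region.buffer.same_as(buf):
                        used = True

        tir.stmt_functor.post_order_visit(root, visit)
        return used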
[TIR][Schedule] Add FuseReductionEpilogue primitive to fuse epilogue into reduction init

Currently it is not possible to fuse an epilogue operation (e.g., bias addition) into a reduction block's initialization statement. This limitation prevents leveraging hardware-specific instructions that support bias accumulation in vector ISAs, such as MACC (multiply-accumulate with bias) instructions.

This commit implements a new schedule primitive 'fuse_reduction_epilogue' that addresses the problem described in:
https://discuss.tvm.apache.org/t/tir-problem-inlining-addition-into-matmul-block/18066

The primitive transforms the following pattern:

Before:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            with T.init():
                temp[vi, vj] = 0
            temp[vi, vj] = temp[vi, vj] + A[vi, vk] * B[vj, vk]
    for i, j in T.grid(M, N):
        with T.block("bias_add"):
            D[vi, vj] = temp[vi, vj] + C[vi, vj]

After:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            T.reads(C[vi, vj], A[vi, vk], B[vj, vk])
            T.writes(D[vi, vj])
            with T.init():
                D[vi, vj] = C[vi, vj]  # Fused epilogue into init
            D[vi, vj] = D[vi, vj] + A[vi, vk] * B[vj, vk]

The transformation removes the intermediate temp buffer and the separate epilogue block, enabling better tensorization opportunities for hardware with bias accumulation support.

To resolve the issue where multiple epilogue blocks use the same reduction output, this commit adds a CheckBufferStillUsed function that checks whether other blocks still reference the reduction buffer, and keeps the temp buffer if it is still referenced. This ensures that when one epilogue block is fused, other epilogue blocks that still use the intermediate buffer continue to work correctly.

Implementation:
- ReductionEpilogueFuser class for pattern validation and IR transformation
- BodyPatternAllowFusion: Validates that the epilogue can be fused
- AnalyzeEpiloguePattern: Detects the addition pattern (D = temp + C)
- ExtractEpilogueInfo: Extracts buffer and region information
- CreateFusedReductionBlock: Creates a single block with modified T.init()
- SingleBlockFusionReplacer: Replaces blocks and removes the temp buffer
- CheckBufferStillUsed: Helper function to check if the reduction buffer is still referenced by other blocks after fusion
- Conditionally removes the temp buffer only if no other blocks reference it
- Variable mapping between epilogue and reduction block iter vars
- Proper buffer and region updates with correct read/write ordering
- FFI bindings and Python API following TVM conventions

Changes:
- src/tir/schedule/primitive/compute_inline.cc: Core implementation (~430 lines)
- src/tir/schedule/primitive.h: Function declaration
- include/tvm/tir/schedule/schedule.h: Virtual method in ScheduleNode
- src/tir/schedule/concrete_schedule.{h,cc}: ConcreteScheduleNode implementation
- src/tir/schedule/traced_schedule.{h,cc}: TracedScheduleNode implementation
- src/tir/schedule/schedule.cc: FFI binding registration
- python/tvm/tir/schedule/schedule.py: Python API with documentation
- tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py: Comprehensive tests including basic fusion, a float32 variant, numerical correctness verification, trace roundtrip validation, and a multiple-epilogue-blocks test case

Tests can be verified through the test_fuse_reduction_epilogue_multiple_epilogue function in tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py. Tests can be run using:
python -m pytest tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py
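As a sketch of the multiple-epilogue scenario this commit covers, reusing the reviewer's matmul_bias_before example above (the assertions are illustrative, not the exact test code):

    import tvm
    from tvm import tir

    # Assumes `matmul_bias_before` from the review comment above is in scope.
    sch = tir.Schedule(matmul_bias_before)
    sch.fuse_reduction_epilogue("multiply", "add")

    # "add" is fused into the init of "multiply", but "add2" still reads temp,
    # so the intermediate buffer must be preserved rather than removed.
    script = sch.mod.script()
    assert "temp" in script   # buffer kept because "add2" still references it
    assert "add2" in script   # the unfused epilogue block remains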
Force-pushed from f02dedb to b19f547
@wrongtest-intellif
wrongtest-intellif left a comment
LGTM, with some small comments
Currently it is not possible to fuse an epilogue operation (e.g., bias addition) into a reduction block's initialization statement. This limitation prevents leveraging hardware-specific instructions that support bias accumulation in vector ISAs, such as MACC (multiply-accumulate with bias) instructions.
This commit implements a new schedule primitive 'fuse_reduction_epilogue' that addresses the problem described in:
https://discuss.tvm.apache.org/t/tir-problem-inlining-addition-into-matmul-block/18066
The primitive transforms the following pattern:
Before:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            with T.init():
                temp[vi, vj] = 0
            temp[vi, vj] = temp[vi, vj] + A[vi, vk] * B[vj, vk]
    for i, j in T.grid(M, N):
        with T.block("bias_add"):
            D[vi, vj] = temp[vi, vj] + C[vi, vj]

After:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            T.reads(C[vi, vj], A[vi, vk], B[vj, vk])
            T.writes(D[vi, vj])
            with T.init():
                D[vi, vj] = C[vi, vj]  # Fused epilogue into init
            D[vi, vj] = D[vi, vj] + A[vi, vk] * B[vj, vk]
The transformation removes the intermediate temp buffer and the separate epilogue block, enabling better tensorization opportunities for hardware with bias accumulation support.
Implementation:
- ReductionEpilogueFuser class for pattern validation and IR transformation
- BodyPatternAllowFusion: Validates that the epilogue can be fused
- AnalyzeEpiloguePattern: Detects the addition pattern (D = temp + C)
- ExtractEpilogueInfo: Extracts buffer and region information
- CreateFusedReductionBlock: Creates a single block with modified T.init()
- SingleBlockFusionReplacer: Replaces blocks and removes the temp buffer
- CheckBufferStillUsed: Checks if the reduction buffer is still referenced by other blocks after fusion
- Variable mapping between epilogue and reduction block iter vars
- Proper buffer and region updates with correct read/write ordering
- FFI bindings and Python API following TVM conventions

Changes:
- src/tir/schedule/primitive/compute_inline.cc: Core implementation (~430 lines)
- src/tir/schedule/primitive.h: Function declaration
- include/tvm/tir/schedule/schedule.h: Virtual method in ScheduleNode
- src/tir/schedule/concrete_schedule.{h,cc}: ConcreteScheduleNode implementation
- src/tir/schedule/traced_schedule.{h,cc}: TracedScheduleNode implementation
- src/tir/schedule/schedule.cc: FFI binding registration
- python/tvm/tir/schedule/schedule.py: Python API with documentation
- tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py: Comprehensive tests
Run tests with:
pytest tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py -v
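The trace roundtrip validation mentioned above can be sketched with TVM's schedule testing helper; here `mod` is assumed to be an IRModule containing the Before pattern:

    from tvm import tir
    from tvm.tir.schedule.testing import verify_trace_roundtrip

    sch = tir.Schedule(mod)
    sch.fuse_reduction_epilogue("matmul", "bias_add")
    # Replays the recorded trace on a fresh schedule of `mod` and checks the
    # result matches, exercising the traced-schedule and FFI paths in this PR.
    verify_trace_roundtrip(sch, mod=mod)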
Also, could you please take a look at #18240? Thx :)