
[Enhancement] Optimize templates for half/bfloat16 #1845

Merged
LeiWang1999 merged 4 commits into tile-ai:main from LJC00118:qwq11 on Feb 14, 2026

Conversation

@LJC00118 (Collaborator) commented Feb 13, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Corrected min/max behavior for bfloat16 and float16 scalars and added explicit bfloat16/float16 comparisons in reductions.
    • Ensured proper initialization, duplication, and update semantics for max/min/absmax reduction paths when using temporary buffers.
  • Refactor

    • Optimized 16-bit float shuffle and math helpers to use bitwise-safe pathways for improved correctness and performance.
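
For illustration of the "bitwise-safe pathway" idea, a minimal hand-written sketch (hypothetical helper name, not the repository's actual shfl_*_sync specialization) of shuffling a bfloat16 value by moving its bit pattern through an integer warp shuffle instead of promoting it to float:

#include <cuda_bf16.h>

// Hypothetical sketch: shuffle a bfloat16 across lanes by its raw bits.
// The 16-bit pattern is widened into a 32-bit unsigned integer, shuffled,
// and reinterpreted back, so the value never round-trips through float.
__device__ inline __nv_bfloat16 shfl_xor_bf16(unsigned mask, __nv_bfloat16 val,
                                              int lane_mask) {
  unsigned int bits = __bfloat16_as_ushort(val);   // bf16 -> raw 16 bits
  bits = __shfl_xor_sync(mask, bits, lane_mask);   // shuffle as an integer
  return __ushort_as_bfloat16(static_cast<unsigned short>(bits));
}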

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai bot (Contributor) commented Feb 13, 2026

📝 Walkthrough

Splits scalar bf16/fp16 Min/Max codegen into explicit type-specific branches using native __hmin/__hmax with casts, rewrites 16-bit shuffle/math helpers to operate via integer bitcasts for half_t/bfloat16_t, and extends reduction paths to handle max/min/absmax with proper init/dup/update semantics.
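
For context, a minimal sketch of the expression shape the new scalar branches produce (illustrative only, with hypothetical helper names; the real codegen wraps operands in the project's to_nv_bfloat16/to_half casts before calling the intrinsics, and __hmax/__hmin on __half and __nv_bfloat16 require a recent CUDA toolkit plus sm_80+ when compiled for device code):

#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Hypothetical helpers showing the intended lowering: compare directly on
// the native NV 16-bit types via __hmax/__hmin, with no float promotion.
__device__ inline __nv_bfloat16 tl_max_bf16(__nv_bfloat16 a, __nv_bfloat16 b) {
  return __hmax(a, b);
}
__device__ inline __half tl_min_fp16(__half a, __half b) {
  return __hmin(a, b);
}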

Changes

  • Scalar Min/Max codegen (src/target/codegen_cuda.cc): MinNode/MaxNode scalar paths split into explicit bfloat16 and fp16 branches that emit __hmin/__hmax on native NV representations, with to_nv_bfloat16/to_half casts and early returns.
  • 16-bit warp shuffle & math helpers (src/tl_templates/cuda/common.h): Replaced per-lane float promotion with integer bitcast-based shuffles for half_t/bfloat16_t (shfl_*_sync specializations); updated __habs/hrsqrt wrappers to use native-type reinterpret casts and operate on the underlying bit representations.
  • CUDA reduce templates (src/tl_templates/cuda/reduce.h): Added MinOp/MaxOp overloads for bfloat16_t and half_t that call __hmin/__hmax on NV native representations.
  • Runtime reduction logic (src/op/reduce.cc): Extends non-clear reduction handling to treat max/min/absmax like other reductions: sets duplication/update flags, initializes temporary clear buffers when needed, and applies per-element max/min/absmax updates.
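
Relating to the common.h math-helper item above, a sketch of what "operate on the underlying bit representations" can look like for an absolute-value helper (hypothetical name and approach; the repository's __habs wrapper may differ):

#include <cuda_bf16.h>

// Hypothetical bit-level |x| for bfloat16: clear the sign bit of the raw
// 16-bit pattern rather than converting to float and back. Bit 15 is the
// sign bit in the bfloat16 layout (1 sign, 8 exponent, 7 mantissa bits).
__device__ inline __nv_bfloat16 bf16_abs_bits(__nv_bfloat16 x) {
  unsigned short bits = __bfloat16_as_ushort(x);
  bits &= 0x7FFFu;
  return __ushort_as_bfloat16(bits);
}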

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • bucket-xv

Poem

🐰 I hop through bits with nimble paws and grin,
Half and bfloat find order tucked within.
Shuffles hum, intrinsics sing so neat,
Min and Max now march on careful feet. 🥕

🚥 Pre-merge checks (✅ 2 passed | ❌ 2 warnings)

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 15.00%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Merge Conflict Detection (⚠️ Warning): Merge conflicts detected (11 files):

⚔️ .github/workflows/dist.yml (content)
⚔️ src/op/finalize_reducer.cc (content)
⚔️ src/op/reduce.cc (content)
⚔️ src/target/codegen_cuda.cc (content)
⚔️ src/target/stubs/cuda.cc (content)
⚔️ src/target/stubs/cuda.h (content)
⚔️ src/target/stubs/cudart.cc (content)
⚔️ src/target/stubs/nvrtc.cc (content)
⚔️ src/tl_templates/cuda/common.h (content)
⚔️ src/tl_templates/cuda/reduce.h (content)
⚔️ testing/python/language/test_tilelang_language_reduce.py (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title clearly and concisely summarizes the main change: optimizing templates for half/bfloat16 data types across multiple files (codegen, reduce operations, and shuffles).



No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
src/op/reduce.cc (1)

430-437: Pre-existing: dead branch in BitAnd update.

Line 433's ternary this->clear ? src_val : bitwise_and(…) is only ever evaluated in the !this->clear branch, because need_update is only set when !this->clear (line 290), so the this->clear ? src_val arm is unreachable. Not introduced by this PR, but worth a cleanup.

♻️ Suggested simplification
-          update = this->clear ? src_val : bitwise_and(dst_val, src_val);
+          update = bitwise_and(dst_val, src_val);


@LJC00118 (Collaborator, Author) commented:

@regression-perf

@github-actions

Performance Regression Test Report

Triggered by: @LJC00118
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/21982441237

Results

File Original Latency Current Latency Speedup
example_per_token_cast_to_fp8 0.00726207 0.00739707 0.98175
example_mha_fwd_bhsd_wgmma_pipelined 0.0141607 0.0141931 0.997719
example_mha_sink_bwd_bhsd 0.0614047 0.0615298 0.997966
example_blocksparse_gemm 0.0223992 0.0224367 0.998328
example_mha_sink_fwd_bhsd_sliding_window 0.0155575 0.0155803 0.998536
example_tilelang_nsa_fwd 0.00693401 0.00694153 0.998918
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0152667 0.0152828 0.998946
example_mha_fwd_bshd 0.0258086 0.0258326 0.999072
example_gqa_bwd_wgmma_pipelined 0.0686826 0.0687396 0.999171
fp8_lighting_indexer 0.0353876 0.0354145 0.999241
sparse_mla_bwd 0.376535 0.376819 0.999248
example_tilelang_sparse_gqa_decode_varlen_indice 0.0168886 0.0169 0.999326
example_tilelang_gemm_fp8_2xAcc 0.18393 0.184049 0.999351
example_linear_attn_bwd 0.151305 0.151398 0.999385
tilelang_example_sparse_tensorcore 0.0148845 0.0148936 0.99939
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.40557 3.40758 0.999408
example_mha_inference 0.0790229 0.0790642 0.999478
example_warp_specialize_gemm_softpipe_stage2 0.038313 0.0383321 0.999501
example_tilelang_nsa_decode 0.00730649 0.0073099 0.999535
example_mha_bwd_bshd 0.0405979 0.0406131 0.999624
example_mha_fwd_varlen 0.0449958 0.0450076 0.999738
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0144177 0.0144212 0.999762
example_dequant_gemm_w4a8 5.30173 5.30264 0.999828
example_tilelang_gemm_fp8 0.318683 0.318728 0.99986
example_dequant_gemm_bf16_mxfp4_hopper 0.50253 0.502597 0.999865
example_fusedmoe_tilelang 0.131413 0.13143 0.999865
example_vertical_slash_sparse_attn 0.231587 0.231616 0.999872
example_mha_bwd_bhsd 0.0399914 0.0399959 0.999887
example_topk 0.0108931 0.0108938 0.999944
example_gemm_intrinsics 0.0342493 0.0342506 0.999962
example_tilelang_block_sparse_attn 0.0100717 0.0100719 0.999979
example_tilelang_gemm_fp8_intrinsic 0.822594 0.82261 0.999981
example_gemm_autotune 0.0223542 0.0223542 1
example_convolution 1.3094 1.30934 1.00004
example_dynamic 0.651544 0.651512 1.00005
example_gemv 0.281807 0.281789 1.00006
example_mla_decode 0.449423 0.449394 1.00006
example_linear_attn_fwd 0.0365443 0.03654 1.00012
example_tilelang_sparse_gqa_decode_varlen_mask 0.0231304 0.0231276 1.00012
example_convolution_autotune 0.989753 0.989612 1.00014
example_gemm_schedule 0.0322498 0.0322439 1.00018
example_gqa_bwd_tma_reduce_varlen 0.0515255 0.0515157 1.00019
sparse_mla_fwd_pipelined 0.0953252 0.0953051 1.00021
example_mha_sink_fwd_bhsd 0.0157255 0.0157222 1.00021
example_gqa_sink_bwd_bhsd_sliding_window 0.0251411 0.0251336 1.0003
example_gqa_fwd_bshd 0.0708461 0.0708247 1.0003
example_tilelang_gemm_splitk_vectorize_atomicadd 1.4009 1.40047 1.00031
example_mha_sink_bwd_bhsd_sliding_window 0.0442566 0.0442416 1.00034
example_gqa_bwd 0.0490331 0.0490159 1.00035
example_warp_specialize_gemm_copy_0_gemm_1 0.0387823 0.0387662 1.00041
example_tilelang_gemm_splitk 1.40237 1.40167 1.0005
example_dequant_gemm_fp4_hopper 1.03574 1.0352 1.00052
example_mha_fwd_bhsd 0.011079 0.0110732 1.00052
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0142977 0.0142901 1.00053
example_mha_fwd_bshd_wgmma_pipelined 0.0144949 0.0144867 1.00057
block_sparse_attn_tilelang 0.0101613 0.0101556 1.00057
example_gqa_sink_bwd_bhsd 0.0408343 0.0408093 1.00061
example_warp_specialize_gemm_copy_1_gemm_0 0.0382955 0.0382707 1.00065
example_gqa_fwd_bshd_wgmma_pipelined 0.0551782 0.0551324 1.00083
example_mha_bwd_bshd_wgmma_pipelined 0.0254386 0.0254151 1.00093
example_gemm 0.0227593 0.0227352 1.00106
example_gqa_decode 0.0478334 0.0477821 1.00107
topk_selector 0.0528789 0.0528185 1.00114
example_dequant_gemv_fp16xint4 0.0284072 0.0283728 1.00121
example_elementwise_add 0.294226 0.293841 1.00131
sparse_mla_fwd 0.129205 0.129035 1.00132
example_dequant_gemm_bf16_fp4_hopper 0.566159 0.56514 1.0018
example_group_per_split_token_cast_to_fp8 0.01035 0.0103289 1.00204
example_warp_specialize_gemm_barrierpipe_stage2 0.0394405 0.0393564 1.00214
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0153477 0.0153118 1.00234
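
Reading the table: Speedup appears to be computed as Original Latency / Current Latency (e.g., first row: 0.00726207 / 0.00739707 ≈ 0.98175), so values below 1.0 indicate a marginal slowdown and values above 1.0 a marginal improvement.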

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

… Update initialization logic for temporary buffers and improve reduction operations in reduce.cc. Add print statements for debugging in test_tilelang_language_reduce.py.
@LeiWang1999 merged commit 5e3c6b3 into tile-ai:main on Feb 14, 2026 (6 checks passed).