
[Enhancement] Optimize templates for half/bfloat16 #1845

Merged
LeiWang1999 merged 4 commits into tile-ai:main from LJC00118:qwq11 on Feb 14, 2026

Conversation

@LJC00118 (Collaborator) commented Feb 13, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Corrected min/max behavior for bfloat16 and float16 scalars and added explicit bfloat16/float16 comparisons in reductions.
    • Ensured proper initialization, duplication, and update semantics for max/min/absmax reduction paths when using temporary buffers.
  • Refactor

    • Optimized 16-bit float shuffle and math helpers to use bitwise-safe pathways for improved correctness and performance.
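
For illustration of the "bitwise-safe pathway" idea, a minimal hand-written sketch (hypothetical helper name, not the repository's actual shfl_*_sync specialization) of shuffling a bfloat16 value by moving its bit pattern through an integer warp shuffle instead of promoting it to float:

#include <cuda_bf16.h>

// Hypothetical sketch: shuffle a bfloat16 across lanes by its raw bits.
// The 16-bit pattern is widened into a 32-bit unsigned integer, shuffled,
// and reinterpreted back, so the value never round-trips through float.
__device__ inline __nv_bfloat16 shfl_xor_bf16(unsigned mask, __nv_bfloat16 val,
                                              int lane_mask) {
  unsigned int bits = __bfloat16_as_ushort(val);   // bf16 -> raw 16 bits
  bits = __shfl_xor_sync(mask, bits, lane_mask);   // shuffle as an integer
  return __ushort_as_bfloat16(static_cast<unsigned short>(bits));
}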

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai bot (Contributor) commented Feb 13, 2026

📝 Walkthrough

Splits scalar bf16/fp16 Min/Max codegen into explicit type-specific branches using native __hmin/__hmax with casts, rewrites 16-bit shuffle/math helpers to operate via integer bitcasts for half_t/bfloat16_t, and extends reduction paths to handle max/min/absmax with proper init/dup/update semantics.
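
For context, a minimal sketch of the expression shape the new scalar branches produce (illustrative only, with hypothetical helper names; the real codegen wraps operands in the project's to_nv_bfloat16/to_half casts before calling the intrinsics, and __hmax/__hmin on __half and __nv_bfloat16 require a recent CUDA toolkit plus sm_80+ when compiled for device code):

#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Hypothetical helpers showing the intended lowering: compare directly on
// the native NV 16-bit types via __hmax/__hmin, with no float promotion.
__device__ inline __nv_bfloat16 tl_max_bf16(__nv_bfloat16 a, __nv_bfloat16 b) {
  return __hmax(a, b);
}
__device__ inline __half tl_min_fp16(__half a, __half b) {
  return __hmin(a, b);
}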

Changes

  • Scalar Min/Max codegen (src/target/codegen_cuda.cc): MinNode/MaxNode scalar paths split into explicit bfloat16 and fp16 branches that emit __hmin/__hmax on native NV representations, with to_nv_bfloat16/to_half casts and early returns.
  • 16-bit warp shuffle & math helpers (src/tl_templates/cuda/common.h): Replaced per-lane float promotion with integer bitcast-based shuffles for half_t/bfloat16_t (shfl_*_sync specializations); updated __habs/hrsqrt wrappers to use native-type reinterpret casts and operate on the underlying bit representations.
  • CUDA reduce templates (src/tl_templates/cuda/reduce.h): Added MinOp/MaxOp overloads for bfloat16_t and half_t that call __hmin/__hmax on NV native representations.
  • Runtime reduction logic (src/op/reduce.cc): Extends non-clear reduction handling to treat max/min/absmax like other reductions: sets duplication/update flags, initializes temporary clear buffers when needed, and applies per-element max/min/absmax updates.
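
Relating to the common.h math-helper item above, a sketch of what "operate on the underlying bit representations" can look like for an absolute-value helper (hypothetical name and approach; the repository's __habs wrapper may differ):

#include <cuda_bf16.h>

// Hypothetical bit-level |x| for bfloat16: clear the sign bit of the raw
// 16-bit pattern rather than converting to float and back. Bit 15 is the
// sign bit in the bfloat16 layout (1 sign, 8 exponent, 7 mantissa bits).
__device__ inline __nv_bfloat16 bf16_abs_bits(__nv_bfloat16 x) {
  unsigned short bits = __bfloat16_as_ushort(x);
  bits &= 0x7FFFu;
  return __ushort_as_bfloat16(bits);
}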

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • bucket-xv

Poem

🐰 I hop through bits with nimble paws and grin,
Half and bfloat find order tucked within.
Shuffles hum, intrinsics sing so neat,
Min and Max now march on careful feet. 🥕

🚥 Pre-merge checks (✅ 2 passed | ❌ 2 warnings)

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 15.00%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Merge Conflict Detection (⚠️ Warning): Merge conflicts detected (11 files):

⚔️ .github/workflows/dist.yml (content)
⚔️ src/op/finalize_reducer.cc (content)
⚔️ src/op/reduce.cc (content)
⚔️ src/target/codegen_cuda.cc (content)
⚔️ src/target/stubs/cuda.cc (content)
⚔️ src/target/stubs/cuda.h (content)
⚔️ src/target/stubs/cudart.cc (content)
⚔️ src/target/stubs/nvrtc.cc (content)
⚔️ src/tl_templates/cuda/common.h (content)
⚔️ src/tl_templates/cuda/reduce.h (content)
⚔️ testing/python/language/test_tilelang_language_reduce.py (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title clearly and concisely summarizes the main change: optimizing templates for half/bfloat16 data types across multiple files (codegen, reduce operations, and shuffles).



No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
src/op/reduce.cc (1)

430-437: Pre-existing: dead branch in BitAnd update.

Line 433's ternary this->clear ? src_val : bitwise_and(…) is only ever evaluated in the !this->clear branch, because need_update is only set when !this->clear (line 290), so the this->clear ? src_val arm is unreachable. Not introduced by this PR, but worth a cleanup.

♻️ Suggested simplification
-          update = this->clear ? src_val : bitwise_and(dst_val, src_val);
+          update = bitwise_and(dst_val, src_val);


@LJC00118 (Collaborator, Author) commented:

@regression-perf

@github-actions

Performance Regression Test Report

Triggered by: @LJC00118
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/21982441237

Results

File Original Latency Current Latency Speedup
example_per_token_cast_to_fp8 0.00726207 0.00739707 0.98175
example_mha_fwd_bhsd_wgmma_pipelined 0.0141607 0.0141931 0.997719
example_mha_sink_bwd_bhsd 0.0614047 0.0615298 0.997966
example_blocksparse_gemm 0.0223992 0.0224367 0.998328
example_mha_sink_fwd_bhsd_sliding_window 0.0155575 0.0155803 0.998536
example_tilelang_nsa_fwd 0.00693401 0.00694153 0.998918
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0152667 0.0152828 0.998946
example_mha_fwd_bshd 0.0258086 0.0258326 0.999072
example_gqa_bwd_wgmma_pipelined 0.0686826 0.0687396 0.999171
fp8_lighting_indexer 0.0353876 0.0354145 0.999241
sparse_mla_bwd 0.376535 0.376819 0.999248
example_tilelang_sparse_gqa_decode_varlen_indice 0.0168886 0.0169 0.999326
example_tilelang_gemm_fp8_2xAcc 0.18393 0.184049 0.999351
example_linear_attn_bwd 0.151305 0.151398 0.999385
tilelang_example_sparse_tensorcore 0.0148845 0.0148936 0.99939
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.40557 3.40758 0.999408
example_mha_inference 0.0790229 0.0790642 0.999478
example_warp_specialize_gemm_softpipe_stage2 0.038313 0.0383321 0.999501
example_tilelang_nsa_decode 0.00730649 0.0073099 0.999535
example_mha_bwd_bshd 0.0405979 0.0406131 0.999624
example_mha_fwd_varlen 0.0449958 0.0450076 0.999738
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0144177 0.0144212 0.999762
example_dequant_gemm_w4a8 5.30173 5.30264 0.999828
example_tilelang_gemm_fp8 0.318683 0.318728 0.99986
example_dequant_gemm_bf16_mxfp4_hopper 0.50253 0.502597 0.999865
example_fusedmoe_tilelang 0.131413 0.13143 0.999865
example_vertical_slash_sparse_attn 0.231587 0.231616 0.999872
example_mha_bwd_bhsd 0.0399914 0.0399959 0.999887
example_topk 0.0108931 0.0108938 0.999944
example_gemm_intrinsics 0.0342493 0.0342506 0.999962
example_tilelang_block_sparse_attn 0.0100717 0.0100719 0.999979
example_tilelang_gemm_fp8_intrinsic 0.822594 0.82261 0.999981
example_gemm_autotune 0.0223542 0.0223542 1
example_convolution 1.3094 1.30934 1.00004
example_dynamic 0.651544 0.651512 1.00005
example_gemv 0.281807 0.281789 1.00006
example_mla_decode 0.449423 0.449394 1.00006
example_linear_attn_fwd 0.0365443 0.03654 1.00012
example_tilelang_sparse_gqa_decode_varlen_mask 0.0231304 0.0231276 1.00012
example_convolution_autotune 0.989753 0.989612 1.00014
example_gemm_schedule 0.0322498 0.0322439 1.00018
example_gqa_bwd_tma_reduce_varlen 0.0515255 0.0515157 1.00019
sparse_mla_fwd_pipelined 0.0953252 0.0953051 1.00021
example_mha_sink_fwd_bhsd 0.0157255 0.0157222 1.00021
example_gqa_sink_bwd_bhsd_sliding_window 0.0251411 0.0251336 1.0003
example_gqa_fwd_bshd 0.0708461 0.0708247 1.0003
example_tilelang_gemm_splitk_vectorize_atomicadd 1.4009 1.40047 1.00031
example_mha_sink_bwd_bhsd_sliding_window 0.0442566 0.0442416 1.00034
example_gqa_bwd 0.0490331 0.0490159 1.00035
example_warp_specialize_gemm_copy_0_gemm_1 0.0387823 0.0387662 1.00041
example_tilelang_gemm_splitk 1.40237 1.40167 1.0005
example_dequant_gemm_fp4_hopper 1.03574 1.0352 1.00052
example_mha_fwd_bhsd 0.011079 0.0110732 1.00052
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0142977 0.0142901 1.00053
example_mha_fwd_bshd_wgmma_pipelined 0.0144949 0.0144867 1.00057
block_sparse_attn_tilelang 0.0101613 0.0101556 1.00057
example_gqa_sink_bwd_bhsd 0.0408343 0.0408093 1.00061
example_warp_specialize_gemm_copy_1_gemm_0 0.0382955 0.0382707 1.00065
example_gqa_fwd_bshd_wgmma_pipelined 0.0551782 0.0551324 1.00083
example_mha_bwd_bshd_wgmma_pipelined 0.0254386 0.0254151 1.00093
example_gemm 0.0227593 0.0227352 1.00106
example_gqa_decode 0.0478334 0.0477821 1.00107
topk_selector 0.0528789 0.0528185 1.00114
example_dequant_gemv_fp16xint4 0.0284072 0.0283728 1.00121
example_elementwise_add 0.294226 0.293841 1.00131
sparse_mla_fwd 0.129205 0.129035 1.00132
example_dequant_gemm_bf16_fp4_hopper 0.566159 0.56514 1.0018
example_group_per_split_token_cast_to_fp8 0.01035 0.0103289 1.00204
example_warp_specialize_gemm_barrierpipe_stage2 0.0394405 0.0393564 1.00214
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0153477 0.0153118 1.00234
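
Reading the table: Speedup appears to be computed as Original Latency / Current Latency (e.g., first row: 0.00726207 / 0.00739707 ≈ 0.98175), so values below 1.0 indicate a marginal slowdown and values above 1.0 a marginal improvement.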

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

… Update initialization logic for temporary buffers and improve reduction operations in reduce.cc. Add print statements for debugging in test_tilelang_language_reduce.py.
@LeiWang1999 merged commit 5e3c6b3 into tile-ai:main on Feb 14, 2026 (6 checks passed).