PR Digest Iteration 56

PR Summary

Period: 2026-04-26 to 2026-05-09 (Iteration 56) | Total PRs: 27 (19 from Xe2/Xe3/Xe3P, 8 from Xe4) | Lines changed: +21,099 / -5,145

Triton XPU BE (Xe2/Xe3/Xe3P) (19 PRs, +3,216 / -660)

The team advanced the 2D block I/O infrastructure by introducing new dialect-level ops that cleanly separate the decision to use hardware 2D block loads from the lowering mechanics, and relanded an important 1D-to-2D load reshape optimization with a correctness fix. Two new compiler passes target memory traffic on elementwise and streaming workloads: cache control annotation, which improved memory bandwidth by 11–46% on streaming kernels, and 256-bit store widening on Xe3P, which halves the number of store instructions for aligned types. Three upstream synchronizations were completed, with test pass rates between 99.59% and 99.75%.

Key accomplishments:

Introduced new TTGIR-level 2D block load ops, decoupling the hardware acceleration decision from LLVM lowering
Added a cache control annotation pass that improves memory bandwidth by 11–46% on streaming and elementwise kernels
Enabled 256-bit store vectorization on Xe3P, halving the number of store instructions for aligned workloads
Relanded the 1D-to-2D load reshape optimization with a fix for a layout edge case that caused a regression
Fixed a vLLM startup crash caused by Level Zero being accessed before initialization, with improved diagnostics

Memory Access & Lowering

New dialect ops and lowering infrastructure that lay the groundwork for a clean, TTGIR-level representation of 2D block I/O.

#6805 Move vBlocks capping into getBlockIOTileSize<true> — Centralizes vBlocks capping logic for load operations, eliminating duplication across three conversion classes. (+23/-23, @whitneywhtsang)
#6792 Add ttig.2d_block_load and ttig.2d_block_load_from_ptr ops — Introduces TTGIR-level ops for descriptor-based and pointer-based 2D block loads, separating the hardware-use decision from LLVM lowering. (+200/-0, @whitneywhtsang)
#6786 Extract block IO utilities from LoadStoreOpToLLVM — Moves BlockIOTileSizeInfo and related helpers to shared files so both TTGIR passes and LLVM lowering can reuse them. (+448/-377, @whitneywhtsang)
#6782 Return failure instead of asserting on unsupported tileWidth — Replaces an assertion in LoadOpToBlockIOConversion with a graceful fallback to scalar/gather lowering (closes #6740). (+46/-0, @wdziurdz)
#6747 Reland 1D→2D load reshape with W<tpw bail-out — Relands the strided-load-to-2D-block-load reshape from #6738 with a fix that skips the reshape when the tile width is narrower than the subgroup size. (+640/-31, @etiotto)
#6734 Fix adjust base width bug — Corrects a base-width adjustment error in 2D block-load lowering where pointer misalignment compensation produced an incorrect minimum-width constraint. (+0/-5, @chengjunlu)
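
The "return failure instead of asserting" change in #6782 follows a common pattern in lowering code: a conversion that cannot handle its input reports a match failure so the next, more general conversion can take over. The sketch below illustrates that control flow in plain Python. It is not the code from the PR; the function names and the set of supported tile widths are hypothetical placeholders, since the digest does not list the real ones.

```python
from typing import Callable, Optional, Tuple

# HYPOTHETICAL placeholder values: the tile widths actually supported by the
# hardware path are not stated in this digest.
SUPPORTED_TILE_WIDTHS = frozenset({8, 16, 32, 64})


def lower_with_block_io(tile_width: int) -> Optional[str]:
    """Try the 2D block I/O path; return None to signal a match failure."""
    if tile_width not in SUPPORTED_TILE_WIDTHS:
        # An assertion here would abort compilation. Returning None lets the
        # caller try the next lowering instead.
        return None
    return f"2d-block-load(width={tile_width})"


def lower_with_gather(tile_width: int) -> Optional[str]:
    """General fallback that accepts any tile width."""
    return f"gather(width={tile_width})"


# Patterns are tried in order, most specialized first.
PATTERNS: Tuple[Callable[[int], Optional[str]], ...] = (
    lower_with_block_io,
    lower_with_gather,
)


def lower_load(tile_width: int) -> str:
    """Return the result of the first pattern that does not fail."""
    for pattern in PATTERNS:
        result = pattern(tile_width)
        if result is not None:
            return result
    raise ValueError(f"no lowering available for tile width {tile_width}")
```

With these placeholder values, `lower_load(16)` takes the block I/O path, while `lower_load(3)` falls through to the gather path instead of aborting.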

Performance Optimizations

New compiler passes and kernel-level tuning that directly improve throughput on real workloads, plus a block-pointer boundary-check fix for SGLANG.

#6723 Add tritonintelgpu-annotate-cache-control pass — Annotates non-dot-operand loads with the streaming cache modifier to avoid L1 thrashing, yielding 11–46% bandwidth improvements on streaming kernels (closes #6715). (+933/-0, @etiotto)
#6727 Widen elementwise store vectorization to 256 bits (Xe3P) — Adds a post-coalesce pass that widens qualifying stores from 128b to 256b per thread on Xe3P hardware, halving the number of store instructions for aligned types (closes #6728). (+446/-17, @etiotto)
#6750 Use twisted grid order for flash attention kernel — Adopts a twisted program-ID ordering that improves flash attention forward kernel performance by ~1.05x geomean. (+16/-9, @chengjunlu)
#6768 Fix boundary_check bug of tl.make_block_ptr for column-major for SGLANG — Corrects the column-major boundary check logic in make_block_ptr to fix incorrect results in SGLANG workloads. (+23/-6, @anmyachev)
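
The "halving" claim for #6727 is plain arithmetic: if a thread writes a fixed number of contiguous bits, doubling the vector width from 128b to 256b halves the store count whenever the total is a multiple of 256 bits. The sketch below works through that calculation. It assumes each thread's elements are contiguous and suitably aligned; the actual criteria the pass uses to decide which stores qualify are not described in this digest, and the helper name is hypothetical.

```python
def store_instructions(elems_per_thread: int, elem_bits: int, vector_bits: int) -> int:
    """Number of vector stores one thread needs to write its elements.

    Assumes the elements are contiguous and aligned to the vector width.
    """
    total_bits = elems_per_thread * elem_bits
    # Ceiling division: a partially filled vector still costs one store.
    return -(-total_bits // vector_bits)


if __name__ == "__main__":
    # 16 fp16 elements are 256 bits: 2 stores at 128b, 1 store at 256b.
    for name, bits in (("fp16", 16), ("fp32", 32)):
        narrow = store_instructions(16, bits, 128)
        wide = store_instructions(16, bits, 256)
        print(f"{name}: {narrow} stores at 128b, {wide} stores at 256b")
```

When the per-thread total is not a multiple of 256 bits the saving is smaller: 12 fp16 elements (192 bits) need two 128b stores but still one 256b store only if a partial vector is allowed, which is why alignment matters.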

Correctness & Robustness

Bug fixes and defensive changes that prevent silent failures or hard crashes in production deployments.

#6777 Guard sqrt_cr/divide_cr builtins behind SPV_INTEL_rounded_divide_sqrt capability — Falls back to __imf functions on LTS drivers that lack the rounded divide/sqrt SPIR-V extension (related to #6261). (+134/-31, @whitneywhtsang)
#6767 Fix vLLM startup segfault and improve XPU init error handling — Calls zeInit before any Level Zero API usage in the JIT module and adds null-handle guards that turn opaque segfaults into readable Python errors. (+119/-48, @lslusarczyk)
#6772 Allow hoistConvertDotOperand across UpcastFpOpInterface — Aligns with upstream by adding UpcastFpOpInterface to the no-data-movement set, preventing spurious layout conversions around upcasting ops. (+7/-2, @etiotto)
#6771 Fix test_enable_fp_fusion by properly respecting default_fp_fusion knob — Corrects the FP fusion gate so the test honors the knob rather than always enabling fusion. (+10/-10, @anmyachev)
#6370 Add handleArgPtrDatatype to pipeline manager — Aligns Intel's FuncOpConversion with upstream by calling handleArgPtrDatatype, fixing missing kernel argument debug info in LLVM IR. (+15/-48, @dev-tomek)
#6774 Add explicit Utility.h include — Adds a direct include for the header providing lookupNumWarps and getDefaultBlockedEncoding, removing a transitive-include dependency. (+7/-2, @etiotto)
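
The fix in #6767 combines two defensive ideas: initialize the driver before any other call, and check handles for null so a failure surfaces as a Python exception rather than a segfault. The sketch below shows that shape in plain Python. Everything in it is a stand-in: `FakeDriver` and `XpuRuntime` are hypothetical classes written for illustration, not the Level Zero API or the code in the PR.

```python
class FakeDriver:
    """HYPOTHETICAL stand-in for a native driver binding."""

    def __init__(self) -> None:
        self.initialized = False

    def init(self) -> None:
        self.initialized = True

    def get_device_handle(self):
        # Mimics a native call that yields a null handle when the driver
        # has not been initialized.
        return object() if self.initialized else None


class XpuRuntime:
    """Wraps a driver so initialization always precedes any other call."""

    def __init__(self, driver) -> None:
        self._driver = driver
        self._ready = False

    def _ensure_init(self) -> None:
        # Idempotent: the driver's init runs once, on first use.
        if not self._ready:
            self._driver.init()
            self._ready = True

    def device(self):
        self._ensure_init()
        handle = self._driver.get_device_handle()
        if handle is None:
            # Null-handle guard: raise a readable error instead of passing
            # the null handle on to native code.
            raise RuntimeError(
                "XPU device handle is null; check that a supported driver is installed"
            )
        return handle
```

The design choice is to keep the guard at the single point where handles enter Python-visible code, so callers never see a null handle.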

Developer Tooling

Improvements to diagnostics and debugging infrastructure.

#6800 Add debug statistics and TRITON_INTEL_HLC_STATS env var — Instruments HoistLayoutConversions with LLVM STATISTIC counters; setting TRITON_INTEL_HLC_STATS=1 prints a human-readable hoist/reject summary at pass completion (fixes #6799). (+42/-6, @etiotto)
#6720 Refactor SPIRVRunner arg dump to kernel cache, add dump-dir overrides, and support runner path option — Moves kernel argument dumps into the kernel cache directory and adds CLI options for custom dump paths and runner binary location. (+93/-36, @app/copilot-swe-agent)
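
To try the statistics summary from #6800, `TRITON_INTEL_HLC_STATS=1` has to be present in the environment of the process that runs the compiler. One way to scope it to a single compilation from Python is a small context manager, sketched below. The helper name is hypothetical, and the sketch assumes the pass reads the variable at the time it runs, which the digest does not confirm.

```python
import os
from contextlib import contextmanager

ENV_NAME = "TRITON_INTEL_HLC_STATS"


@contextmanager
def hlc_stats_enabled():
    """Set TRITON_INTEL_HLC_STATS=1 inside the block, then restore the old value."""
    previous = os.environ.get(ENV_NAME)
    os.environ[ENV_NAME] = "1"
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(ENV_NAME, None)
        else:
            os.environ[ENV_NAME] = previous


if __name__ == "__main__":
    with hlc_stats_enabled():
        # Compile or launch the kernel of interest here; the summary is
        # expected at pass completion.
        pass
```

Setting the variable in the shell before launching the script is the simpler alternative when the summary is wanted for every compilation in the run.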

Test & CI Reliability

Skiplist maintenance to keep CI green after upstream test changes.

#6761 Fix xe2 triton_kernels skiplist after upstream PR #9986 — Updates 14 stale skip-list entries on Xe2 that were broken when upstream added a new Case field and new swiglu test cases (fixes #6760). (+14/-14, @exolyr)

Upstream Alignment

Three upstream merges from OpenAI Triton (commits 9f34338, 88e8e52, 3123400), with pass rates between 99.59% and 99.75%.
