PR Digest Iteration 56
Period: 2026-04-26 to 2026-05-09 (Iteration 56) | Total PRs: 27 (19 from Xe2/Xe3/Xe3P, 8 from Xe4) | Lines changed: +21,099 / -5,145
The team advanced the 2D block I/O infrastructure by introducing new dialect-level ops that cleanly separate the decision to use hardware 2D block loads from the lowering mechanics, and relanded an important 1D-to-2D load reshape optimization with a correctness fix. Two new compiler passes — cache control annotation and 256-bit store widening — directly improve memory bandwidth for elementwise and streaming workloads on Xe3P. Three upstream synchronizations were completed with a 99.59%–99.75% test pass rate.
Key accomplishments:
- Introduced new TTGIR-level 2D block load ops, decoupling the hardware acceleration decision from LLVM lowering
- Added a cache control annotation pass that improves memory bandwidth 11–46% on streaming and elementwise kernels
- Enabled 256-bit store vectorization on Xe3P, halving the number of store instructions for aligned workloads
- Relanded the 1D-to-2D load reshape optimization with a fix for a layout edge case that caused a regression
- Fixed a vLLM startup crash caused by Level Zero being accessed before initialization, with improved diagnostics
New dialect ops and lowering infrastructure that lay the groundwork for a clean, TTGIR-level representation of 2D block I/O.
- #6805 Move vBlocks capping into getBlockIOTileSize<true> — Centralizes vBlocks capping logic for load operations, eliminating duplication across three conversion classes. (+23/-23, @whitneywhtsang)
- #6792 Add ttig.2d_block_load and ttig.2d_block_load_from_ptr ops — Introduces TTGIR-level ops for descriptor-based and pointer-based 2D block loads, separating the hardware-use decision from LLVM lowering. (+200/-0, @whitneywhtsang)
- #6786 Extract block IO utilities from LoadStoreOpToLLVM — Moves BlockIOTileSizeInfo and related helpers to shared files so both TTGIR passes and LLVM lowering can reuse them. (+448/-377, @whitneywhtsang)
- #6782 Return failure instead of asserting on unsupported tileWidth — Replaces an assertion in LoadOpToBlockIOConversion with a graceful fallback to scalar/gather lowering (closes #6740). (+46/-0, @wdziurdz)
- #6747 Reland 1D→2D load reshape with W<tpw bail-out — Relands the strided-load-to-2D-block-load reshape from #6738 with a fix that skips the reshape when the tile width is narrower than the subgroup size. (+640/-31, @etiotto)
- #6734 Fix adjust base width bug — Corrects a base-width adjustment error in 2D block-load lowering where pointer misalignment compensation produced an incorrect minimum-width constraint. (+0/-5, @chengjunlu)
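The bail-out described for #6747 can be pictured as a simple guard. The sketch below is only an illustration of the condition as stated in the digest (skip the reshape when the tile width W is narrower than the subgroup size, tpw); the function name and signature are hypothetical and are not taken from the PR.

```python
def should_reshape_to_2d(tile_width: int, threads_per_warp: int) -> bool:
    """Hypothetical guard: return False (bail out) when W < tpw.

    tile_width       -- width W of the candidate 2D tile, in elements
    threads_per_warp -- subgroup size (tpw)
    """
    # Narrow tiles keep their original 1D strided load.
    return tile_width >= threads_per_warp


# A tile of width 8 with a 16-wide subgroup is left untouched,
# while a tile at least as wide as the subgroup is eligible.
print(should_reshape_to_2d(8, 16))
print(should_reshape_to_2d(32, 16))
```

The real pass presumably checks further legality conditions before reshaping; only the width comparison is shown here.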
New compiler passes and kernel-level tuning that directly improve throughput on real workloads.
- #6723 Add tritonintelgpu-annotate-cache-control pass — Annotates non-dot-operand loads with the streaming cache modifier to avoid L1 thrashing, yielding 11–46% bandwidth improvements on streaming kernels (closes #6715). (+933/-0, @etiotto)
- #6727 Widen elementwise store vectorization to 256 bits (Xe3P) — Adds a post-coalesce pass that widens qualifying stores from 128b to 256b per thread on Xe3P hardware, halving the number of store instructions for aligned types (closes #6728). (+446/-17, @etiotto)
- #6750 Use twisted grid order for flash attention kernel — Adopts a twisted program-ID ordering that improves flash attention forward kernel performance by ~1.05x geomean. (+16/-9, @chengjunlu)
- #6768 Fix boundary_check bug of tl.make_block_ptr for column-major for SGLANG — Corrects the column-major boundary check logic in make_block_ptr to fix incorrect results in SGLANG workloads. (+23/-6, @anmyachev)
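To make the "halving" claim for #6727 concrete, the following back-of-the-envelope sketch counts store instructions per thread at the two store widths. It assumes each thread owns a contiguous, suitably aligned run of elements; the helper name and the example workload (16 fp32 values per thread) are illustrative assumptions, not figures from the PR.

```python
from math import ceil


def stores_per_thread(elems_per_thread: int, elem_bits: int, store_bits: int) -> int:
    """Count the store instructions one thread needs for its contiguous elements."""
    total_bits = elems_per_thread * elem_bits
    return ceil(total_bits / store_bits)


# Hypothetical workload: 16 contiguous fp32 values per thread (512 bits).
narrow = stores_per_thread(16, 32, 128)  # 128-bit stores
wide = stores_per_thread(16, 32, 256)    # 256-bit stores
print(narrow, wide)  # 4 2
```

The count only halves when the per-thread data fills whole 256-bit stores, which is consistent with the digest restricting the benefit to qualifying, aligned stores.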
Bug fixes and defensive changes that prevent silent failures or hard crashes in production deployments.
- #6777 Guard sqrt_cr/divide_cr builtins behind SPV_INTEL_rounded_divide_sqrt capability — Falls back to __imf functions on LTS drivers that lack the rounded divide/sqrt SPIR-V extension (related to #6261). (+134/-31, @whitneywhtsang)
- #6767 Fix vLLM startup segfault and improve XPU init error handling — Calls zeInit before any Level Zero API usage in the JIT module and adds null-handle guards that turn opaque segfaults into readable Python errors. (+119/-48, @lslusarczyk)
- #6772 Allow hoistConvertDotOperand across UpcastFpOpInterface — Aligns with upstream by adding UpcastFpOpInterface to the no-data-movement set, preventing spurious layout conversions around upcasting ops. (+7/-2, @etiotto)
- #6771 Fix test_enable_fp_fusion by properly respecting default_fp_fusion knob — Corrects the FP fusion gate so the test honors the knob rather than always enabling fusion. (+10/-10, @anmyachev)
- #6370 Add handleArgPtrDatatype to pipeline manager — Aligns Intel's FuncOpConversion with upstream by calling handleArgPtrDatatype, fixing missing kernel argument debug info in LLVM IR. (+15/-48, @dev-tomek)
- #6774 Add explicit Utility.h include — Adds a direct include for the header providing lookupNumWarps and getDefaultBlockedEncoding, removing a transitive-include dependency. (+7/-2, @etiotto)
Improvements to diagnostics and debugging infrastructure.
- #6800 Add debug statistics and TRITON_INTEL_HLC_STATS env var — Instruments HoistLayoutConversions with LLVM STATISTIC counters; setting TRITON_INTEL_HLC_STATS=1 prints a human-readable hoist/reject summary at pass completion (fixes #6799). (+42/-6, @etiotto)
- #6720 Refactor SPIRVRunner arg dump to kernel cache, add dump-dir overrides, and support runner path option — Moves kernel argument dumps into the kernel cache directory and adds CLI options for custom dump paths and runner binary location. (+93/-36, @app/copilot-swe-agent)
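One way to turn on the hoist statistics from #6800 is to set the variable in the environment of the process that compiles the kernel. The sketch below is a minimal helper under that assumption; the helper name and the script path in the usage comment are hypothetical, and only the variable name and the value 1 come from the digest.

```python
import os


def hlc_stats_env(base_env=None):
    """Return a copy of the environment with TRITON_INTEL_HLC_STATS=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TRITON_INTEL_HLC_STATS"] = "1"
    return env


# Hypothetical usage: launch a Triton script in a child process with stats on.
# import subprocess, sys
# subprocess.run([sys.executable, "my_kernel.py"], env=hlc_stats_env(), check=True)
print(hlc_stats_env({})["TRITON_INTEL_HLC_STATS"])
```

Building a copy of the environment, rather than mutating os.environ, keeps the setting scoped to the one run being inspected.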
Skiplist maintenance to keep CI green after upstream test changes.
- #6761 Fix xe2 triton_kernels skiplist after upstream PR #9986 — Updates 14 stale skip-list entries on Xe2 that were broken when upstream added a new Case field and new swiglu test cases (fixes #6760). (+14/-14, @exolyr)
3 upstream merges from OpenAI Triton (commits 9f34338, 88e8e52, 3123400) — pass rate 99.59%–99.75%.