Change compile_kernel to use threads_per_warp specified in metadata (#4814)

whitneywhtsang · web-flow · commit 9d65e0849a23 · 2025-08-01T11:41:29.000-04:00
Intel Triton selects a different `threads_per_warp` depending on the kernel and stores the selected value in the metadata. This PR changes `compile_kernel` to use the `threads_per_warp` stored in the metadata.

This PR fixes the following error with `igc-19724`:
```
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  The specified local size {1, 1, 32} doesn't match the required work-group size specified in the program source {1, 1, 16}
```
CI with `igc-19724` + this change: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/16662411889
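As a rough illustration of the mismatch described above, the sketch below uses plain `SimpleNamespace` objects as hypothetical stand-ins for the parsed CLI `args` and for `ccinfo.metadata`. The helper name `use_compiled_threads_per_warp` and the formula `num_warps * threads_per_warp` for the local size are assumptions made for illustration, not code taken from `compile.py`.

```python
from types import SimpleNamespace


def use_compiled_threads_per_warp(args, metadata):
    """Overwrite the requested threads_per_warp with the value the compiler chose.

    Hypothetical helper: mirrors the one-line assignment added by this PR.
    """
    args.threads_per_warp = metadata.threads_per_warp
    return args


# Stand-in for the CLI arguments: the caller asked for 32 threads per warp.
args = SimpleNamespace(num_warps=1, threads_per_warp=32)
# Stand-in for ccinfo.metadata: the compiler selected 16 for this kernel.
metadata = SimpleNamespace(threads_per_warp=16)

# Assumed launch geometry: the innermost local-size dimension is
# num_warps * threads_per_warp.
stale_local_size = (1, 1, args.num_warps * args.threads_per_warp)

use_compiled_threads_per_warp(args, metadata)
local_size = (1, 1, args.num_warps * args.threads_per_warp)

print(stale_local_size)  # (1, 1, 32): disagrees with the compiled kernel
print(local_size)        # (1, 1, 16): agrees with the compiled kernel
```

Under these assumptions, reading the value back from the metadata keeps the launch configuration consistent with whatever the compiler actually chose, rather than with what was requested.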
diff --git a/python/triton/tools/compile.py b/python/triton/tools/compile.py
@@ -153,6 +153,7 @@ def constexpr(s):
         }
     options = backend.parse_options(kwargs)
     ccinfo = triton.compile(src, target=target, options=options.__dict__)
+    args.threads_per_warp = ccinfo.metadata.threads_per_warp
 
     if getattr(ccinfo.metadata, "global_scratch_size", 0) > 0:
         raise RuntimeError("AOT compiling kernels with global scratch requirements is not yet implemented")
