Return n_regs for binaries compiled explicitly with a register size mode option (#2391)
Intel Data Center Max GPUs will dynamically scale the number of hardware
threads available per XVE depending on the specified GRF mode. With
small GRF mode (default), a single hardware thread can access 128 GRF
registers and each XVE engine has 8 hardware threads. In large GRF mode,
a single hardware thread can access 256 GRF registers but each XVE
engine only has 4 hardware threads. There is also an auto mode.
([see the docs for more
info](https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-2/small-register-mode-vs-large-register-mode.html))
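As a quick sanity check on that trade-off, here is a minimal sketch using only the figures quoted above. The dictionary and function names are illustrative and do not come from the driver code. It shows that both modes split the same per-XVE register budget differently:

```python
# Per-mode figures as stated in the description above.
# GRF_MODES and regs_per_xve are hypothetical names used for illustration.
GRF_MODES = {
    "small": {"regs_per_thread": 128, "threads_per_xve": 8},
    "large": {"regs_per_thread": 256, "threads_per_xve": 4},
}


def regs_per_xve(mode: str) -> int:
    """Total GRF registers across all hardware threads of one XVE."""
    cfg = GRF_MODES[mode]
    return cfg["regs_per_thread"] * cfg["threads_per_xve"]


for mode in GRF_MODES:
    print(mode, regs_per_xve(mode))
# small 1024
# large 1024
```

Large GRF mode therefore trades half of the hardware threads for twice the registers per thread.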
This PR adds support for populating the `n_regs` parameter returned from
loading a binary with information about the selected GRF mode. Because
L0 does not return the number of registers and our register size info
does not work like NVIDIA, the semantics are a bit different from
upstream Triton. We _only_ return a value if the user has specified a
small or large GRF mode build flag. Upstream Triton/Torch Inductor returns
`n_regs` because NVIDIA GPUs can dynamically adjust the occupancy of an SM
based on the register pressure per warp. This means
high register pressure can result in fewer running warps which reduces
parallelism and performance. Theoretically, you can have many different
"GRF modes" on an NVIDIA GPU as you adjust SM occupancy. For Intel GPUs,
the choice is binary - large or small - and the performance penalty for
register spills in small always outweighs any parallelism gains (at
least, in our testing so far). It is not clear that returning 128 is
actionable, as further reductions in register usage will not affect
occupancy - only the large GRF mode affects occupancy. So, I focused on
making sure large GRF mode was properly handled and other cases were
handled as we were able, with any ambiguous case returning 0 (which will
cause Torch Inductor to skip any register-specific optimization).
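The semantics described above can be summarized in a short sketch. It is written in Python for brevity and is not the `driver.c` implementation. The function names are hypothetical, and it assumes an explicit small-mode build reports 128:

```python
from typing import Optional


def n_regs_for_grf_mode(mode: Optional[str]) -> int:
    """Map an explicitly requested GRF mode to the reported n_regs.

    Hypothetical helper: 0 means "unknown", which tells the caller
    to skip register-specific optimizations.
    """
    if mode == "large":
        return 256
    if mode == "small":
        return 128  # assumption: small mode reports its per-thread size
    # Auto mode, no flag, or anything ambiguous.
    return 0


def should_use_register_heuristics(n_regs: int) -> bool:
    """Hypothetical consumer-side check mirroring the skip-on-zero behavior."""
    return n_regs > 0
```

Returning 0 rather than guessing keeps the ambiguous cases safe: a consumer that sees 0 simply does nothing register-specific.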
The approach to returning GRF size depends on parsing the build
flags passed to the binary loader. Because the build flags are modified
in the `make_spv` step when native code is generated instead of a
SPIR-V file, this approach should work for the native code POC recently
merged in #2148.
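A minimal sketch of the flag-parsing idea follows, in Python rather than the C++ used in `driver.c`. The two flag spellings are assumptions based on Intel's OpenCL-style compiler options and are not quoted from this PR. The function name is hypothetical as well:

```python
import shlex

# Assumed flag spellings; check them against the actual build options in use.
LARGE_GRF_FLAG = "-cl-intel-256-GRF-per-thread"
SMALL_GRF_FLAG = "-cl-intel-128-GRF-per-thread"


def n_regs_from_build_flags(build_flags: str) -> int:
    """Derive n_regs from a build-flag string; return 0 when unclear."""
    tokens = set(shlex.split(build_flags))
    has_large = LARGE_GRF_FLAG in tokens
    has_small = SMALL_GRF_FLAG in tokens
    if has_large and not has_small:
        return 256
    if has_small and not has_large:
        return 128
    # Neither flag, both flags, or an auto mode: treat as ambiguous.
    return 0
```

Because the result depends only on the flag string handed to the loader, the same logic would apply whether the payload is SPIR-V or native code.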
Note that I had to introduce exceptions into our `driver.c` code to make
the error handling acceptable. This cleaned up a lot of the code, and I
believe should be acceptable both because we already depend on C++ in
`driver.c` (just not in the external signatures) and because exceptions
are used in other parts of the Triton codebase.
I marked this as a draft PR because I would like to do a bit more
testing, but it is ready for review.
Close #1641