To head things off, the way we've been using buffer instructions will still work fine.
However, there have been changes to what buffer descriptors support that will enable us to use them in more situations.
When not to use buffers
I'll note that our current pervasive use of buffer instructions wherever possible is probably not the right move once we can do optimizations that reflect the 32-bit-ness of indexing math more accurately. The hardware has addressing modes that take a 64-bit SGPR base plus a 32-bit VGPR offset, and if a given tensor or memref won't need to be loaded in a masked way, we shouldn't be paying the overhead of going through the texture unit for a bounds check - from other people's experiments, that check does cost us.
This sort of care about where we construct buffer descriptors will be more important on gfx1250, since none of the new DMA operations introduced there support buffers, making it something of an either-or choice.
Addressing all of memory from a buffer descriptor on gfx1250
One nice thing that RDNA4 and its descendants give us is the stride_scale modifier, which lets us multiply our 14-bit stride value by 4, 8, or 32. It's worth noting that there is a structured addressing mode where the hardware reads from index * stride + offset, where index and offset are two separate arguments to a buffer load instruction - we've historically never used it because it's very un-pointer-like.
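As a rough sketch (illustrative only, not hardware-exact - the function name and argument layout here are made up), the structured addressing mode computes its address like this, with the stride and base coming from the descriptor and the index/offset coming from the instruction's operands:

```python
def structured_address(base, stride, index, offset, stride_scale=1):
    """Model of the structured buffer addressing computation.

    stride is the 14-bit stride field from the descriptor;
    stride_scale models the RDNA4+ modifier (1, 4, 8, or 32).
    """
    return base + index * (stride * stride_scale) + offset

# A flat, pointer-like access is just base + byte_offset; the structured
# form splits the address across two fields, which is why it's awkward
# for pointer-based codegen.
print(structured_address(0, 2048, 3, 5, stride_scale=32))
```

Note that a 14-bit stride of 2048 with stride_scale=32 gives an effective stride of exactly 2^16, which is what makes the trick below work.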
However, if you have a stride of 2^16, you can get 48 bits of address by just making the index field into addr >> 16 and the offset field into the low 16 bits of the address. Before gfx1250, this wasn't all that useful since you couldn't really do the bounds checking that you'd want.
gfx1250, however, extended the num_records field to 45 bits - more than enough, I claim - and allows for "raw" bounds checking even when you're using an index and stride. That is, you still express num_records in bytes, and that whole offset (including the scaled stride) will get compared against that value to check in-bounds-ness.
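Concretely, the split and the bounds check described above can be modeled as follows (a sketch under the assumption of an effective stride of exactly 2^16; the helper names are mine, not anything from the ISA):

```python
STRIDE = 1 << 16  # effective stride: 2048 (14-bit field) * 32 (stride_scale)

def split_address(addr):
    """Split a 48-bit byte address into (index, offset) fields."""
    assert addr < (1 << 48)
    return addr >> 16, addr & 0xFFFF

def in_bounds(index, offset, num_records):
    """Model of gfx1250 'raw' bounds checking with index + stride:
    the whole byte offset, scaled stride included, is compared against
    num_records, which is now a 45-bit byte count."""
    assert num_records < (1 << 45)
    return index * STRIDE + offset < num_records

addr = 0x123456789ABC              # some 48-bit byte address
index, offset = split_address(addr)
assert index * STRIDE + offset == addr  # the split round-trips exactly
```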
So, in principle, gfx1250 lets us remove the size restrictions on using buffer operations for masked loads.
In practice ... this'll want a lot of plumbing.
- If we want to do this splitting trick on the MLIR side, we'll want to go add ptr addrspace(9) to the buffer fat pointer lowering.
  - That will require setting up structured GEP or one of those other proposals that have been flying around for replacing attempts to divine structure from GEPs.
- We could also add a special type of pointer that's a buffer resource plus 64 bits of address, with very specific restrictions on how it's set up, but I don't think the LLVM folks will like that.
- We'll certainly need some MLIR-level construct to represent this usage of buffers, especially since we'll probably have a pattern that runs during or after LLVM conversion that splits up "ptradd"s.
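To make the ptradd-splitting idea concrete, here's a hypothetical model of the lowering (everything here - the type name, the helpers - is illustrative, not an existing MLIR or LLVM construct): the fat pointer carries the full 48-bit address, ptradd stays an ordinary add, and the index/offset split happens only when a memory operation is actually emitted.

```python
from dataclasses import dataclass

@dataclass
class LongBufferPtr:
    resource: object   # stands in for the 128-bit buffer descriptor
    addr: int          # 48-bit byte address

def ptradd(p, delta):
    # The add happens on the full 48-bit address, so carries out of the
    # low 16 bits are handled for free, before any split.
    return LongBufferPtr(p.resource, (p.addr + delta) & ((1 << 48) - 1))

def lower_load(p):
    # Only at the load do we split: index = addr >> 16, offset = low 16 bits.
    return p.addr >> 16, p.addr & 0xFFFF
```

The appeal of doing the split late is visible here: if we split eagerly, every ptradd would have to propagate carries between the offset and index fields itself.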
Feasibility and impact
If we put in all the work to make these long buffers usable, it'll enable us to do masked loads using buffer instructions where we previously couldn't.
However, many use cases for such loads are being subsumed by TDM, so this may not be something that absolutely needs to get done before release, if at all.