Update on "[ET-VK][ez] Fix 8 bit linear compute shader dispatch"
## Context
Currently, for the `q_8w_linear` shader, both the texture and the buffer variants use the same global work group and local work group settings.
Specifically, the global work group is set to `{out.numel(), 1, 1}` and the local work group is set to `{64, 1, 1}`.
However, I believe this results in very poor memory re-use for the texture shader. In this configuration:
* Within a work group each invocation will be requesting a different row of A - 64 rows of A requested in total
* All invocations in the work group will be requesting the same row of B
* One work group will load 65 unique rows from A and B
Compare this to a local work group size of `{8, 8, 1}`:
* Across the work group, 8 rows will be loaded from A and 8 rows will be loaded from B
* One work group will load 16 unique rows total from A and B
Evidently, there is better memory re-use in the latter configuration: each work group still has 64 invocations, but loads 16 unique rows instead of 65.
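The row counts above can be reproduced with a small model. This is only a sketch in Python, not the shader code: the mapping from local invocation ID to rows of A and B is an assumption chosen to match the description above, and `unique_rows` is a hypothetical helper.

```python
from itertools import product


def unique_rows(local_wg, rows_for_invocation):
    """Count distinct rows of A and of B touched by one work group."""
    a_rows, b_rows = set(), set()
    for local_id in product(*(range(n) for n in local_wg)):
        a_row, b_row = rows_for_invocation(local_id)
        a_rows.add(a_row)
        b_rows.add(b_row)
    return len(a_rows), len(b_rows)


# Assumed mapping for {64, 1, 1}: x selects the row of A, the row of B is fixed.
linear = unique_rows((64, 1, 1), lambda i: (i[0], 0))

# Assumed mapping for {8, 8, 1}: y selects the row of A, x selects the row of B.
tiled = unique_rows((8, 8, 1), lambda i: (i[1], i[0]))

print(linear, sum(linear))  # (64, 1) 65
print(tiled, sum(tiled))    # (8, 8) 16
```

Both shapes contain 64 invocations, so the difference in unique rows translates directly into rows loaded per output element.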
## Changes
Modify the `q_8w_linear` shader to use a local work group size of `{8, 8, 1}` if possible. If `M` is small, then instead use `{4, 16, 1}` or `{2, 32, 1}` to reduce the number of inactive invocations.
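A rough sketch of how such a selection could look, written in Python for illustration only. The thresholds (`M >= 8`, `M >= 4`) and the assumption that the first component of the local work group spans the `M` dimension are not stated in this change; they are inferred from the three listed sizes. `pick_local_wg` and `idle_rows` are hypothetical names.

```python
def pick_local_wg(m):
    """Pick one of the three listed local sizes (hypothetical thresholds)."""
    if m >= 8:
        return (8, 8, 1)
    if m >= 4:
        return (4, 16, 1)
    return (2, 32, 1)


def idle_rows(m, local_wg):
    """Padded M extent minus M, assuming local_wg[0] spans the M dimension."""
    extent = local_wg[0]
    padded = -(-m // extent) * extent  # round m up to a multiple of extent
    return padded - m


for m in (128, 4, 1):
    wg = pick_local_wg(m)
    print(m, wg, idle_rows(m, wg))
```

Under these assumptions, `M = 4` with `{8, 8, 1}` would leave 4 of the 8 positions along `M` idle, while `{4, 16, 1}` leaves none. All three shapes keep 64 invocations per work group.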
Differential Revision: [D71706489](https://our.internmc.facebook.com/intern/diff/D71706489/)
[ghstack-poisoned]