-
Notifications
You must be signed in to change notification settings - Fork 13
Fixes for Adreno inference (Q8/Q4) and OUT_PROD #34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: temp-latest-finetuning
Are you sure you want to change the base?
Fixes for Adreno inference (Q8/Q4) and OUT_PROD #34
Conversation
c53efd4
to
082fa25
Compare
Fix GGML_VK_CHECK_RESULTS build option by properly supporting IM2COL_3D.
082fa25
to
0f376cd
Compare
vec2 acc = vec2(0.0); | ||
|
||
for (uint k = 0; k < p.ne01; k++) { | ||
if (i0 + 1 >= p.ne20) { // XXX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does XXX mean? why i0+1? wont it skip last element?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I forgot to remove this comment. Usually XXX
is used in some codebases to bring attention to some line of code that could be problematic, to be reviewed later. It's similar to TODO
or FIXME
.
In this case what you asked about i0+1
is the reason I marked it with XXX
, so I would verify it worked later.
This shader uses i0
to calculate an index iqs
into a Q8 data block. This index is then used to call dequantize(ib, iqs, 0)
, which retrieves two Q8 values from data_a[]
and store them into a vec2
, this is why I use i0+1
, otherwise the second value would be out of range.
This math here is a bit tricky, since i0
is calculated from get_dst_indices/data_d
and indexes into data_a
so it could be wrong, but I believe it's correct at the moment. Let me know if you see a mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is for not accessing 32th value, right? p.ne20 is output index? what if output is larger than 32? also if i0 is 31 isn't it still valid (only 32 not valid)?
for example:
Scenario 1: p.ne20 = 64 (64 elements)
- i0 = 31: Check is 32 >= 64? NO → processes element 31 → BUG (accesses qs[32])
- i0 = 63: Check is 64 >= 64? YES → skips element 63 → WRONG RESULT (misses last element)
Shouldn't change any behavior since currently nb00 is always 1. Robustness is usually disabled for Q8/Q4 shaders since having it enabled impacts performance more significantly for those types than F16/F32.
Introduce a CMAKE option for disabling Adreno-specific shaders if needed, this improves build time, but should not be used when targeting Adreno devices.
Extend device detection to classify Qualcomm Adreno GPUs, enabling targeted workarounds and shader selection when those devices are present.
Avoid subgroup operations on Adreno by selecting safer paths to sidestep compiler/driver bugs while preserving behavior.
Similar to what we do for other vendors such as Intel.
This optimization broke inference on Adreno.
Add build-time generation of Adreno-targeted shader variants under a guard, so Adreno devices use safer code paths while other GPUs remain unaffected.
Increase OUT_PROD Q8 performance through improving memory locality.
This makes finetuning work without crashing on Adreno 830.
0f376cd
to
0f6d6da
Compare
@olyasir I've addressed your comments and pushed some cleanups, please feel free to give it another look. I've also edited the PR to merge into |
Make sure to read the contributing guidelines before submitting a PR