Skip to content

Conversation

infinitalo
Copy link

Make sure to read the contributing guidelines before submitting a PR

Fix GGML_VK_CHECK_RESULTS build option by properly supporting IM2COL_3D.
vec2 acc = vec2(0.0);

for (uint k = 0; k < p.ne01; k++) {
if (i0 + 1 >= p.ne20) { // XXX
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does XXX mean? why i0+1? wont it skip last element?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot to remove this comment. Usually XXX is used in some codebases to bring attention to some line of code that could be problematic, to be reviewed later. It's similar to TODO or FIXME.

In this case what you asked about i0+1 is the reason I marked it with XXX, so I would verify it worked later.

This shader uses i0 to calculate an index iqs into a Q8 data block. This index is then used to call dequantize(ib, iqs, 0), which retrieves two Q8 values from data_a[] and store them into a vec2, this is why I use i0+1, otherwise the second value would be out of range.

This math here is a bit tricky, since i0 is calculated from get_dst_indices/data_d and indexes into data_a so it could be wrong, but I believe it's correct at the moment. Let me know if you see a mistake.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for not accessing 32th value, right? p.ne20 is output index? what if output is larger than 32? also if i0 is 31 isn't it still valid (only 32 not valid)?

for example:
Scenario 1: p.ne20 = 64 (64 elements)

  • i0 = 31: Check is 32 >= 64? NO → processes element 31 → BUG (accesses qs[32])
  • i0 = 63: Check is 64 >= 64? YES → skips element 63 → WRONG RESULT (misses last element)

Italo Nicola added 6 commits October 16, 2025 16:31
Shouldn't change any behavior since currently nb00 is always 1.
Robustness is usually disabled for Q8/Q4 shaders since having it enabled
impacts performance more significantly for those types than F16/F32.
Introduce a CMAKE option for disabling Adreno-specific shaders if
needed, this improves build time, but should not be used when targeting
Adreno devices.
Extend device detection to classify Qualcomm Adreno GPUs, enabling
targeted workarounds and shader selection when those devices are
present.
Avoid subgroup operations on Adreno by selecting safer paths to sidestep
compiler/driver bugs while preserving behavior.
Similar to what we do for other vendors such as Intel.
This optimization broke inference on Adreno.
Italo Nicola added 3 commits October 16, 2025 16:32
Add build-time generation of Adreno-targeted shader variants under a
guard, so Adreno devices use safer code paths while other GPUs remain
unaffected.
Increase OUT_PROD Q8 performance through improving memory locality.
This makes finetuning work without crashing on Adreno 830.
@infinitalo infinitalo force-pushed the italo/adreno_inference branch from 0f376cd to 0f6d6da Compare October 16, 2025 20:15
@infinitalo infinitalo changed the base branch from temp-latest to temp-latest-finetuning October 16, 2025 20:39
@infinitalo
Copy link
Author

infinitalo commented Oct 16, 2025

@olyasir I've addressed your comments and pushed some cleanups, please feel free to give it another look.

I've also edited the PR to merge into temp-latest-finetuning instead of temp-latest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants