Conversation

@jeffbolznv (Collaborator)

Add variants of the im2col shaders that use buffer_device_address/buffer_reference, and use 64-bit address calculations. This is needed for large convolutions used in stable-diffusion.cpp.

I've been working on getting leejet/stable-diffusion.cpp#778 to work in Vulkan. The main thing that's missing is that it does 2d and 3d convolutions that have intermediate im2col buffers that are larger than 4GB. This change fixes the im2col part, I'll make a separate change for the matmul part.

Memory allocations larger than maxMemoryAllocationSize are not technically forbidden, and at least NVIDIA's Windows driver will allocate more than 4GB.

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner September 20, 2025 19:30
@github-actions github-actions bot added the labels testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) on Sep 20, 2025
@0cc4m (Collaborator)

0cc4m commented Sep 21, 2025

> Memory allocations larger than maxMemoryAllocationSize are not technically forbidden, and at least NVIDIA's Windows driver will allocate more than 4GB.

This is true for allocations, but not for buffers. If I disable the allocation size check in ggml_vk_create_buffer your new test kinda runs on all my devices, but I'm not sure if it runs correctly. Validation layers complain about the buffer size and the descriptor range, of course.

I tried running your new im2col and im2col_3d tests:

- On AMD (RADV) it does the allocation, but fails the test runs.
- On Intel (ANV) it gets the correct result for im2col. im2col_3d fails because 16GB of VRAM wasn't enough.
- On Nvidia (proprietary Linux driver) it runs correctly.

On all three it takes a very long time to finish the test. The tests also used huge amounts of RAM (>80GB); I'm not sure if that's the CPU backend or something else.

@jeffbolznv (Collaborator, Author)

I've pushed a fix for the descriptor range validation failure. I'm not aware of one related to the buffer size.

The large memory usage and slowness is expected. The test framework ends up with multiple copies of the huge tensor, converted to f32. I don't intend to enable these tests by default.

I'm surprised the AMD driver is failing. It may have been related to the validation failure, but that would be a bit surprising, since that descriptor isn't actually used.

@0cc4m (Collaborator)

0cc4m commented Sep 21, 2025

Mesa driver development seems to work by building stuff, optimizing it, and fixing issues when they come up. If things don't come up, they often don't work, so this is probably another case of "nobody tried to do this yet". We'll most likely have to open an issue about it.

@0cc4m (Collaborator)

0cc4m commented Sep 27, 2025

> I've pushed a fix for the descriptor range validation failure. I'm not aware of one related to the buffer size.

I mean this:

```
vkCreateBuffer(): pCreateInfo->size (11041505280) is larger than the maximum allowed buffer size VkPhysicalDeviceMaintenance4Properties.maxBufferSize (4294967292).
```

It does not happen on Nvidia, because Nvidia reports a very large maxBufferSize, while AMD and Intel do not.

Besides that, how do you plan to handle the allocation size check in ggml_vk_create_buffer?

@jeffbolznv (Collaborator, Author)

OK, that explains why I didn't see it on NVIDIA. I don't know how to get around that on other implementations. Maybe they can eventually relax the limit in their drivers.

> Besides that, how do you plan to handle the allocation size check in ggml_vk_create_buffer?

Based on all this, maybe I should change it to check maxBufferSize. I've been planning to do it in a separate change.

Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.
@jeffbolznv (Collaborator, Author)

Rebased, to hopefully resolve old CI failures.

@0cc4m 0cc4m merged commit d8359f5 into ggml-org:master Sep 28, 2025
110 of 120 checks passed
@0cc4m (Collaborator)

0cc4m commented Sep 29, 2025

@jeffbolznv I'm looking into using buffer_reference to reduce the integer dot mmq shader shared memory size. As a first basic test I did this:

```glsl
layout(buffer_reference) buffer ShmemTypeB { int32_t data[BN * SHMEM_STRIDE]; };

shared ShmemTypeB buf_b_qs;
[...]
buf_b_qs.data[i] = ...
```

It works on Intel and Nvidia, but completely crashes AMD RADV, to the point that it automatically reboots the entire server, so something is very wrong. Can you tell me if that's correct usage? If so, I need to open an issue with Mesa.

@jeffbolznv (Collaborator, Author)

buffer_reference types always point to buffer memory, so I can't quite tell what this snippet is supposed to do. It looks like it declares a pointer to buffer memory in shared memory.

If what you're trying to do is reuse the same shared memory bytes for different parts of the shader, e.g. make coopmat_stage use the same memory as buf_a_qs/buf_b_qs, then GL_EXT_shared_memory_block (https://github.com/KhronosGroup/GLSL/blob/main/extensions/ext/GL_EXT_shared_memory_block.txt) is the extension you want (warning: the spec text is not very helpful).
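As a rough sketch of what that extension looks like in practice, assuming my reading of the spec is right (this is not code from the PR, and the block/member names are made up): shared memory is declared in blocks, and shared blocks alias the same underlying storage, which is what lets different phases of a shader reuse the bytes.

```glsl
#version 450
#extension GL_EXT_shared_memory_block : require

// Two views of (overlapping) shared storage. Reading one view after
// writing the other is only meaningful with a barrier in between and
// once the old view's contents are no longer needed.
shared BufQs {
    int buf_a_qs[128];
    int buf_b_qs[128];
};

shared CoopmatStage {
    float coopmat_stage[256];
};

layout(local_size_x = 64) in;
void main() {
    buf_a_qs[gl_LocalInvocationID.x] = 0;
    barrier();
    // Phase two reuses the same shared memory bytes.
    coopmat_stage[gl_LocalInvocationID.x] = 1.0;
}
```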

@0cc4m (Collaborator)

0cc4m commented Sep 29, 2025

Oh, alright. I was trying to get pointer casting for shared memory, basically, to get some more flexibility with buffering. I need to spend more time trying to understand these extensions first; they are quite hard to grasp, and I can find barely any examples.

yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* vulkan: 64-bit im2col

Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.

* fix validation error for large im2col
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025
* vulkan: 64-bit im2col

Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.

* fix validation error for large im2col
