Conversation

@jeffbolznv (Collaborator)

Add variants of the im2col shaders that use buffer_device_address/buffer_reference, and use 64-bit address calculations. This is needed for large convolutions used in stable-diffusion.cpp.

I've been working on getting leejet/stable-diffusion.cpp#778 to work in Vulkan. The main thing that's missing is that it does 2d and 3d convolutions that have intermediate im2col buffers that are larger than 4GB. This change fixes the im2col part, I'll make a separate change for the matmul part.

Memory allocations larger than maxMemoryAllocationSize are not technically forbidden, and at least NVIDIA's Windows driver will allocate more than 4GB.

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner September 20, 2025 19:30
@github-actions github-actions bot added the labels testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) on Sep 20, 2025
@0cc4m (Collaborator)

0cc4m commented Sep 21, 2025

> Memory allocations larger than maxMemoryAllocationSize are not technically forbidden, and at least NVIDIA's Windows driver will allocate more than 4GB.

This is true for allocations, but not for buffers. If I disable the allocation size check in ggml_vk_create_buffer your new test kinda runs on all my devices, but I'm not sure if it runs correctly. Validation layers complain about the buffer size and the descriptor range, of course.

I tried running your new im2col and im2col_3d tests:

- On AMD (RADV) it does the allocation, but fails the test runs.
- On Intel (ANV) it gets the correct result for im2col. im2col_3d fails because 16GB of VRAM wasn't enough.
- On Nvidia (proprietary Linux driver) it runs correctly.

On all three it takes a very long time to finish the test. The tests also used huge amounts of RAM (>80GB); I'm not sure if that's the CPU backend or something else.

@jeffbolznv (Collaborator, Author)

I've pushed a fix for the descriptor range validation failure. I'm not aware of one related to the buffer size.

The large memory usage and slowness is expected. The test framework ends up with multiple copies of the huge tensor, converted to f32. I don't intend to enable these tests by default.

I'm surprised the AMD driver is failing. It may have been related to the validation failure, but that would be a bit surprising, since that descriptor isn't actually used.

@0cc4m (Collaborator)

0cc4m commented Sep 21, 2025

Mesa driver development seems to work by building stuff, optimizing it, and fixing issues when they come up. If things don't come up, they often don't work, so this is probably another case of "nobody tried to do this yet". We'll most likely have to open an issue about it.

@0cc4m (Collaborator)

0cc4m commented Sep 27, 2025

> I've pushed a fix for the descriptor range validation failure. I'm not aware of one related to the buffer size.

I mean this:

```
vkCreateBuffer(): pCreateInfo->size (11041505280) is larger than the maximum allowed buffer size VkPhysicalDeviceMaintenance4Properties.maxBufferSize (4294967292).
```

It does not happen on Nvidia, because Nvidia reports a very large maxBufferSize, while AMD and Intel do not.

Besides that, how do you plan to handle the allocation size check in ggml_vk_create_buffer?

@jeffbolznv (Collaborator, Author)

OK, that explains why I didn't see it on NVIDIA. I don't know how to get around that on other implementations. Maybe they can eventually relax the limit in their drivers.

> Besides that, how do you plan to handle the allocation size check in ggml_vk_create_buffer?

Based on all this, maybe I should change it to check maxBufferSize. I've been planning to do it in a separate change.

Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.
@jeffbolznv (Collaborator, Author)

Rebased, to hopefully resolve old CI failures.

@0cc4m 0cc4m merged commit d8359f5 into ggml-org:master Sep 28, 2025
110 of 120 checks passed
@0cc4m (Collaborator)

0cc4m commented Sep 29, 2025

@jeffbolznv I'm looking into using buffer_reference to reduce the integer dot mmq shader shared memory size. As a first basic test I did this:

```glsl
layout(buffer_reference) buffer ShmemTypeB { int32_t data[BN * SHMEM_STRIDE]; };

shared ShmemTypeB buf_b_qs;
[...]
buf_b_qs.data[i] = ...
```

It works on Intel and Nvidia, but completely crashes AMD RADV, to the point that it automatically reboots the entire server, so something is very wrong. Can you tell me if that's correct usage? If so, I need to open an issue with Mesa.

@jeffbolznv (Collaborator, Author)

buffer_reference types always point to buffer memory, so I can't quite tell what this snippet is supposed to do. It looks like it declares a pointer to buffer memory in shared memory.

If what you're trying to do is reuse the same shared memory bytes for different parts of the shader, e.g. make coopmat_stage use the same memory as buf_a_qs/buf_b_qs, then GL_EXT_shared_memory_block (https://github.com/KhronosGroup/GLSL/blob/main/extensions/ext/GL_EXT_shared_memory_block.txt) is the extension you want (warning: the spec text is not very helpful).
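As a rough sketch of what that extension looks like in practice, assuming my reading of the spec is right (this is not code from the PR, and the block/member names are made up): shared memory is declared in blocks, and shared blocks alias the same underlying storage, which is what lets different phases of a shader reuse the bytes.

```glsl
#version 450
#extension GL_EXT_shared_memory_block : require

// Two views of (overlapping) shared storage. Reading one view after
// writing the other is only meaningful with a barrier in between and
// once the old view's contents are no longer needed.
shared BufQs {
    int buf_a_qs[128];
    int buf_b_qs[128];
};

shared CoopmatStage {
    float coopmat_stage[256];
};

layout(local_size_x = 64) in;
void main() {
    buf_a_qs[gl_LocalInvocationID.x] = 0;
    barrier();
    // Phase two reuses the same shared memory bytes.
    coopmat_stage[gl_LocalInvocationID.x] = 1.0;
}
```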

@0cc4m (Collaborator)

0cc4m commented Sep 29, 2025

Oh, alright. I was trying to get pointer casting for shared memory, basically, to get some more flexibility with buffering. I need to spend more time trying to understand these extensions first; they are quite hard to grasp, and I can find barely any examples.

yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* vulkan: 64-bit im2col

Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.

* fix validation error for large im2col
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025
* vulkan: 64-bit im2col

Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.

* fix validation error for large im2col
