Replies: 9 comments 36 replies
-
llama.cpp's Vulkan backend is faster and uses less memory on my 7900 XTX as well (I'm using the latest ROCm on Arch, so it's not that).
-
I'm working on bringing ik_llama.cpp up to date with llama.cpp's Vulkan backend. It is actually easier than I expected.
-
So, what is the "approved" way of installing the necessary dependencies for Vulkan development on Ubuntu? I ended up installing the LunarG Vulkan SDK, but the thing almost bricked my system because I hadn't run … Anyhow, in the end I got the mainline Vulkan build working, but performance is very far from CUDA on my RTX 4080.
Vulkan sweep-bench, Llama-3.1-8B
CUDA sweep-bench, Llama-3.1-8B
So, PP is 3X lower, TG is 20-25% lower. Given this, does it make sense to spend time on Vulkan? When I forked …
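For reference, a minimal sketch of installing Vulkan build dependencies from Ubuntu's own repositories rather than the full LunarG SDK; the package names are assumptions and can differ between releases:

```bash
# Vulkan loader + headers, the glslc shader compiler, and vulkaninfo for sanity checks
# (package names may vary by Ubuntu release; this is a sketch, not the "approved" way)
sudo apt update
sudo apt install libvulkan-dev glslc vulkan-tools

# Quick check that the loader sees a driver and device
vulkaninfo --summary
```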
-
@jeffbolznv Thank you for chiming in. Above is the log. Is there something additional I need to do to improve performance? I did …
-
What's the llama-bench equivalent of the …
-
But a port of the mainline Vulkan back-end to …
-
So, the Vulkan back-end is usable, and performance is better than …
So, if you feel that Vulkan performance improvement in …
-
I don't know if this is the right place, but since it is Vulkan related... I compiled with Vulkan using your line and tried to load DeepSeek IQ2 with this: numactl -N 0 -m 0 … It loaded, with my MI50's VRAM full, but the speed (prompt and output) is half of CPU alone! Did I make an error in the compilation or the launch? If I use -mla 1 to 3, only 12GB out of 32 are used and the speed is always the same, slower than CPU.
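For context, a launch of the general shape described above might look like the sketch below; the binary, model path, and flag values are purely illustrative and not the poster's actual command (`-mla` is ik_llama.cpp's flag for selecting the MLA attention implementation):

```bash
# Illustrative only: pin CPU and memory to NUMA node 0, offload layers to the GPU,
# and enable ik_llama.cpp's MLA attention (-mla takes small integer modes, e.g. 1-3)
numactl -N 0 -m 0 ./build/bin/llama-server \
  -m /models/DeepSeek-IQ2.gguf \
  -c 16384 -ngl 99 -mla 3 -fa \
  --host 127.0.0.1 --port 8080
```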
-
Background
I've been asked a few times now about AMD GPU support with ik's fork. I recently got access to an AMD RX 7900 XTX to try it out, and as discussed in Issue 503, the Vulkan and ROCm backends are not the focus of this fork, hence the limited support on AMD GPU hardware.
I'm starting this discussion to have a place to point folks who might be interested in the current state of AMD GPU backend support, and especially anyone who wants to attempt updates and work on it.
Current State
ik_llama.cpp actually does compile with Vulkan and can do some limited inferencing. As its Vulkan backend is unmaintained, it is slower than mainline at the moment. However, I couldn't get it to compile with ROCm/HIP support. I only tried the official AMD open-source AMDVLK driver, not the community open-source RADV driver.
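As an aside, the Vulkan loader makes it fairly easy to check which driver is in use and to switch between AMDVLK and RADV per run; a sketch is below, with the caveat that the ICD manifest paths vary by distro and driver packaging:

```bash
# Show which Vulkan driver and GPU the loader currently selects
vulkaninfo --summary

# Force a specific driver for a single run by pointing the loader at its ICD manifest
# (paths below are typical Linux locations; adjust for your distro)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json ./build/bin/llama-bench -m model.gguf  # RADV (Mesa)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json ./build/bin/llama-bench -m model.gguf          # AMDVLK
```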
There is a good benchmarking discussion on mainline maintained by @netrunnereve which was very helpful for establishing baseline expectations and trying to understand the various AMD GPU driver development environments.
Benchmarks
I did a comparison between mainline llama.cpp and ik_llama.cpp at the given SHAs, for what I could get working.
Methodology
To keep things somewhat consistent with the established methodology, I used TheBloke's now-vintage Llama-2-7B at the classic Q4_0 quantization. The following is how compilation was done, as well as how `llama-sweep-bench` was run with and without flash attention (`-fa`).
Compiling and running sweep-bench
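A representative sketch of that build and the sweep-bench runs is below; the GGML_VULKAN cmake option is the standard way to enable the Vulkan backend, while the model filename, context size, and offload values are illustrative rather than the exact flags behind the numbers here:

```bash
# Configure and build with the Vulkan backend enabled
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j "$(nproc)"

# Sweep the context without and with flash attention (model path and sizes illustrative)
./build/bin/llama-sweep-bench -m llama-2-7b.Q4_0.gguf -c 8192 -ngl 99
./build/bin/llama-sweep-bench -m llama-2-7b.Q4_0.gguf -c 8192 -ngl 99 -fa
```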
Observations
- ik_llama.cpp asserted at `N_KV=7680` with `iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed`.
- The output contained `<|im_start|>` and `<|im_end|>` type tokens, which don't usually come back from the chat endpoint.
Conclusion
Well, sorry if you have AMD GPU hardware and were hoping to try out the latest and greatest stuff on ik's fork. You can still make use of the CPU-only optimizations, FWIW. You can see the relative performance of native CUDA in the linked benchmark thread for one of my other tests, and ik's fork does run faster than mainline with CUDA.
Finally, I saw an interesting NVIDIA slide deck from the Vulkanised 2025 Developer Conference which discusses llama.cpp on pages 14 and 15, even showing what looks like some of ik's IQ4_NL code with implementation discussion. I was surprised that some models benchmark faster on NVIDIA GPUs using the Vulkan backend, beating out the native CUDA implementation, but perhaps that is for another day...
Thanks, and I'm curious whether anyone else has tried this or is interested in improving support here. Cheers!