Replies: 15 comments 28 replies
-
On AMD, Vulkan is faster and more memory efficient than ROCm.
-
Currently, owners of Nvidia GPUs have access to a wide range of inference engines (e.g., vllm, exllama, sglang, mlc, aphrodite-engine) that are optimized for CUDA. This allows them to fully utilize their hardware, which is great. In contrast, Vulkan support could provide significant benefits to users of AMD and Intel GPUs, who currently have less mature tooling and support. AMD appears not so friendly toward regular consumers; e.g., ROCm barely supports their top GPUs. Also, with Vulkan support it would be possible to run the fast ...
I want to acknowledge the effort and quality of your work, so whatever you choose (improved speed, quant quality, Vulkan, features, ...) doesn't matter in the end: it will benefit us, the users/community.
-
You don't need to make the decision so soon. You can wait and see whether this improvement in Vulkan draws more interest from Vulkan users or even developers. It's more important for AMD and Intel users, but they may not know about this yet.
-
I personally voted against Vulkan, and only because the community's opinion was asked. @ikawrakow: My argument would basically go along yours. If there's demand, and most importantly if there's motivation, and even better if there's help, then I'd love to see IKL support Vulkan, because this backend seems to have a future. But as of now, your developments are so valuable in what you master that it might be more pertinent to focus on your art rather than learn a new technique. A technique which could be provided by skilled Vulkan devs rolling in your wheel, rather than you having to do it yourself. Skilled Vulkan devs who might eventually come to IKL and join you, firecoperana, and the fray, because IKL is where the good stuff is, quants- and big-MoE-support-wise, and also "welcoming to all good wills"-wise. Just my opinion; I'll be happy whatever you choose. Especially after the IQ2_KL surprise! :)
-
I voted 'no' but regret it / can't remove my vote; I'd rather abstain :) For me personally, I use this app to get more performance out of my used Nvidia hardware + CPU with MoEs. The biggest win for me would be if someone could improve RPC server performance, as this would make it viable for us to link multiple rigs without cutting prompt processing in half. But Vulkan would help both Intel and AMD users. Intel are releasing a 24GB GPU later this year. And while OpenVINO and SYCL are way faster, there's an issue with OpenVINO whereby you can't use the KV cache with multiple GPUs. The 48GB dual-GPU card one of the board partners is releasing will effectively be 2x24GB GPUs, so people buying that card would benefit from faster Vulkan performance.
ik_llama is a passion project, right? So perhaps just do what would be most interesting?
-
Found this discussion while searching for references to SYCL to see if building for SYCL is supported (I'm having a lot of compilation errors). I voted for improving the Vulkan backend, but here are my two cents:
-
You are correct to ask this question. Your target users are those with a single powerful GPU and a decent DRAM/CPU combo. Those users are power users and small businesses. Further, most serious ones are using 24GB machines or better. They have ROCm and CUDA, and if Intel ever comes out with a 24GB single card that is actually available, they'll support it properly as well. Vulkan helps old hardware and people who love hassle-free setups. I don't think you should be doing that hassle-free work yourself, given your users are all very capable of that work/setup, as much as we would like to have that ease of use. If your goal is mass popularity like llama.cpp, then yeah, get started on Vulkan, and also get some help, 'cause that's a tall order. Just my thoughts.
-
I think improvements to Vulkan performance would be a positive. This would allow users greater flexibility when deciding on hardware. Also, Arc and AMD GPU users would benefit from these improvements.
-
Vote for Vulkan. It's the API that all vendors are pushing hard to support. AMD's RADV driver is really solid, Intel's ANV is steadily improving, and Jeff Bolz from NVIDIA has been contributing to llama.cpp's Vulkan backend for several months now.
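For anyone curious which of those drivers their own machine actually exposes, here is a minimal sketch (my own illustration, not part of ik_llama.cpp; it assumes the Vulkan SDK headers and loader are installed) that enumerates every Vulkan-capable GPU:

```cpp
// List every GPU whose driver exposes Vulkan, e.g. RADV (AMD),
// ANV (Intel), or the NVIDIA proprietary driver.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app = {};
    app.sType      = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo ci = {};
    ci.sType            = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ci.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS) {
        std::fprintf(stderr, "No Vulkan loader/driver available\n");
        return 1;
    }

    // First call gets the device count, second fills the list.
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);
        std::printf("%s (Vulkan %u.%u)\n", props.deviceName,
                    VK_API_VERSION_MAJOR(props.apiVersion),
                    VK_API_VERSION_MINOR(props.apiVersion));
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```

Build with something like `g++ list_gpus.cpp -lvulkan`; if RADV, ANV, or the NVIDIA driver is installed, the corresponding device should show up in the output.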
-
Wow, I see 18 new votes since I last checked yesterday. For people who came here to vote for Vulkan but are not familiar with this project, the mainline ...
-
Intel Arc GPUs would greatly benefit from Vulkan improvements. Thanks for your hard work and for dedicating your time to this great project.
-
So, my account was suspended for 2 days, and I'm going on vacation two days from now, so I'll look into the Vulkan situation when I come back in 2 weeks.
-
Does the Vulkan backend support the IQ1_KT quant? If not, is that something that is planned, by any chance? I would love to play around with the new Qwen3 coder model, but with 128GB of VRAM I can't fit anything bigger. I am using the AMD 395 with an iGPU, so using CUDA is unfortunately not possible.
-
All PCs and laptops with an iGPU (no Nvidia/AMD dGPU) would benefit hugely from Vulkan. If you check the ollama project, their users have been begging for it for more than a year. Great that ik_llama.cpp will have it.
-
If I may add my two cents:
TL;DR:
Thank you!
-
The Vulkan back-end in `ik_llama.cpp` is now usable, and performance is better than `llama.cpp` (see, e.g., PR #584, which has a comparison for a MoE model). But compared to CUDA on the same GPU, performance is much lower, especially for MoE models (and most users appear to be using `ik_llama.cpp` exactly for one of the giant MoE models). I have mixed feelings about how to proceed: ... the `llama.cpp` Vulkan back-end in particular, hence, at least initially, it will be an uphill battle. Without significant interest from the user base, I don't feel particularly motivated to do this to myself.

82 votes