Replies: 15 comments 28 replies
-
On AMD, Vulkan is faster and more memory efficient than ROCm.
-
Currently, owners of Nvidia GPUs have access to a wide range of inference engines (e.g., vllm, exllama, sglang, mlc, aphrodite-engine) that are optimized for CUDA. This allows them to fully utilize their hardware, which is great. In contrast, Vulkan support could provide significant benefits to users of AMD and Intel GPUs, who currently have less mature tooling and support. AMD appears not so friendly toward regular consumers; e.g., ROCm barely supports their top GPUs. Also, with Vulkan support it would be possible to run the fast ...
I want to acknowledge the effort and quality of your work, so whatever you choose (improved speed, quant quality, Vulkan, features, ...) doesn't matter in the end: it will benefit us, the users/community.
-
You don't need to make the decision so soon. You can wait and see whether this improvement in Vulkan draws more interest from Vulkan users or even developers. It's more important for AMD and Intel users, but they may not know about this yet.
-
I personally voted against Vulkan, and only because the community's opinion was asked. @ikawrakow: My argument would basically go along yours. If there's demand, and most importantly if there's motivation, and even better if there's help, then I'd love to see IKL support Vulkan, because this backend seems to have a future. But as of now, your developments are so valuable in what you master that it might be more pertinent to focus on your art rather than learn a new technique. A technique which could be provided by skilled Vulkan devs rolling in your wheel, rather than you having to do it yourself. Skilled Vulkan devs who might eventually come to IKL and join you, firecoperana, and the fray, because IKL is where the good stuff is, quants- and big-MoE-support-wise, and also "welcoming to all good wills"-wise. Just my opinion; I'll be happy whatever you choose. Especially after the IQ2_KL surprise! :)
-
I voted 'no' but regret it / can't remove my vote; I'd rather abstain :) For me personally, I use this app to get more performance out of my used Nvidia hardware + CPU with MoEs. The biggest win for me would be if someone could improve RPC server performance, as this would make it viable for us to link multiple rigs without cutting prompt processing in half. But Vulkan would help both Intel and AMD users. Intel are releasing a 24GB GPU later this year. And while OpenVINO and SYCL are way faster, there's an issue with OpenVINO whereby you can't use the KV cache with multiple GPUs. The 48GB dual-GPU card one of the board partners is releasing will effectively be 2x24GB GPUs, so people buying that card would benefit from faster Vulkan performance.
ik_llama is a passion project, right? So perhaps just do what would be most interesting?
-
Found this discussion while searching for references to SYCL to see if building for SYCL is supported (I'm having a lot of compilation errors). I voted for improving the Vulkan backend, but here are my two cents:
-
You are correct to ask this question. Your target users are those with a single powerful GPU and a decent DRAM/CPU combo. Those users are power users and small businesses. Further, most serious ones are using 24GB machines or better. They have ROCm and CUDA, and if Intel ever comes out with a 24GB single card that is actually available, they'll support it properly as well. Vulkan helps old hardware and people who love hassle-free setups. I don't think you should be doing that hassle-free work yourself, given your users are all very capable of that work/setup, as much as we would like to have that ease of use. If your goal is mass popularity like llama.cpp, then yeah, get started on Vulkan, and also get some help, 'cause that's a tall order. Just my thoughts.
-
I think improvements to Vulkan performance would be a positive. This would allow users greater flexibility when deciding on hardware. Also, Arc and AMD GPU users would benefit from these improvements.
-
Vote for Vulkan. It's the API that all vendors are pushing hard to support. AMD's RADV driver is really solid, Intel's ANV is steadily improving, and Jeff Bolz from NVIDIA has been contributing to llama.cpp's Vulkan backend for several months now.
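For anyone curious which of those drivers their own machine actually exposes, here is a minimal sketch (my own illustration, not part of ik_llama.cpp; it assumes the Vulkan SDK headers and loader are installed) that enumerates every Vulkan-capable GPU:

```cpp
// List every GPU whose driver exposes Vulkan, e.g. RADV (AMD),
// ANV (Intel), or the NVIDIA proprietary driver.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app = {};
    app.sType      = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo ci = {};
    ci.sType            = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ci.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS) {
        std::fprintf(stderr, "No Vulkan loader/driver available\n");
        return 1;
    }

    // First call gets the device count, second fills the list.
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);
        std::printf("%s (Vulkan %u.%u)\n", props.deviceName,
                    VK_API_VERSION_MAJOR(props.apiVersion),
                    VK_API_VERSION_MINOR(props.apiVersion));
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```

Build with something like `g++ list_gpus.cpp -lvulkan`; if RADV, ANV, or the NVIDIA driver is installed, the corresponding device should show up in the output.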
-
Wow, I see 18 new votes since I last checked yesterday. For people who came here to vote for Vulkan but are not familiar with this project, the mainline ...
-
Intel Arc GPUs would greatly benefit from Vulkan improvements. Thanks for your hard work and for dedicating your time to this great project.
-
So, my account was suspended for 2 days, and I'm going on vacation two days from now, so I'll look into the Vulkan situation when I come back in 2 weeks.
-
Does the Vulkan backend support the IQ1_KT quant? If not, is that something that is planned, by any chance? I would love to play around with the new Qwen3 coder model, but with 128GB of VRAM I can't fit anything bigger. I am using the AMD 395 with an iGPU, so using CUDA is unfortunately not possible.
-
All PCs and laptops with an iGPU (no Nvidia/AMD dGPU) would benefit hugely from Vulkan. If you check the ollama project, their users have been begging for it for more than a year. Great that ik_llama.cpp will have it.
-
If I may add my two cents:
TL;DR:
Thank you!
-
The Vulkan back-end in `ik_llama.cpp` is now usable, and performance is better than `llama.cpp` (see, e.g., PR #584, which has a comparison for a MoE model). But compared to CUDA on the same GPU, performance is much lower, especially for MoE models (and most users appear to be using `ik_llama.cpp` exactly for one of the giant MoE models). I have mixed feelings about how to proceed: ... the `llama.cpp` Vulkan back-end in particular, hence, at least initially, it will be an uphill battle. Without significant interest from the user base, I don't feel particularly motivated to do this to myself.

82 votes