Replies: 9 comments 36 replies
-
llama.cpp's Vulkan backend is faster and uses less memory on my 7900 XTX as well (I'm using the latest ROCm on Arch, so it's not that).
-
I'm working on bringing ik_llama.cpp up to date with llama.cpp's Vulkan backend. It is actually easier than I expected.
-
So, what is the "approved" way of installing the necessary dependencies for Vulkan development on Ubuntu? I ended up installing the LunarG Vulkan SDK, but the thing almost bricked my system because I hadn't run … Anyhow, in the end I got the mainline Vulkan build working, but performance is very far from CUDA on my RTX 4080.
Vulkan sweep-bench, Llama-3.1-8B
CUDA sweep-bench, Llama-3.1-8B
So, PP is 3X lower, TG is 20-25% lower. Given this, does it make sense to spend time on Vulkan? When I forked …
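For reference, a minimal sketch of installing Vulkan build dependencies from Ubuntu's own repositories rather than the full LunarG SDK; the package names are assumptions and can differ between releases:

```bash
# Vulkan loader + headers, the glslc shader compiler, and vulkaninfo for sanity checks
# (package names may vary by Ubuntu release; this is a sketch, not the "approved" way)
sudo apt update
sudo apt install libvulkan-dev glslc vulkan-tools

# Quick check that the loader sees a driver and device
vulkaninfo --summary
```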
-
@jeffbolznv Thank you for chiming in. Above is the log. Is there something additional I need to do to improve performance? I did …
-
What's the llama-bench equivalent of the …
-
But a port of the mainline Vulkan back-end to …
-
So, the Vulkan back-end is usable, and performance is better than …
So, if you feel that Vulkan performance improvement in …
-
I don't know if this is the right place, but since it is Vulkan related... I compiled with Vulkan using your line and tried to load DeepSeek IQ2 with this: numactl -N 0 -m 0 … It loaded, with my MI50's VRAM full, but the speed (prompt and output) is half of CPU alone! Did I make an error in the compilation or the launch? If I use -mla 1 to 3, only 12GB out of 32 are used and the speed is always the same, slower than CPU.
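For context, a launch of the general shape described above might look like the sketch below; the binary, model path, and flag values are purely illustrative and not the poster's actual command (`-mla` is ik_llama.cpp's flag for selecting the MLA attention implementation):

```bash
# Illustrative only: pin CPU and memory to NUMA node 0, offload layers to the GPU,
# and enable ik_llama.cpp's MLA attention (-mla takes small integer modes, e.g. 1-3)
numactl -N 0 -m 0 ./build/bin/llama-server \
  -m /models/DeepSeek-IQ2.gguf \
  -c 16384 -ngl 99 -mla 3 -fa \
  --host 127.0.0.1 --port 8080
```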
-
Background
I've been asked a few times now about AMD GPU support with ik's fork. I recently got access to an AMD RX 7900 XTX to try it out, and as discussed in Issue 503, the Vulkan and ROCm backends are not the focus of this fork, hence the limited support on AMD GPU hardware.
I'm starting this discussion to have a place to point folks who might be interested in the current state of AMD GPU backend support, and especially anyone who wants to attempt updates and work on it.
Current State
ik_llama.cpp actually does compile with Vulkan and can do some limited inferencing. As its Vulkan backend is unmaintained, it is slower than mainline at the moment. However, I couldn't get it to compile with ROCm/HIP support. I only tried the official AMD open-source AMDVLK driver, not the community open-source RADV driver.
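As an aside, the Vulkan loader makes it fairly easy to check which driver is in use and to switch between AMDVLK and RADV per run; a sketch is below, with the caveat that the ICD manifest paths vary by distro and driver packaging:

```bash
# Show which Vulkan driver and GPU the loader currently selects
vulkaninfo --summary

# Force a specific driver for a single run by pointing the loader at its ICD manifest
# (paths below are typical Linux locations; adjust for your distro)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json ./build/bin/llama-bench -m model.gguf  # RADV (Mesa)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json ./build/bin/llama-bench -m model.gguf          # AMDVLK
```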
There is a good benchmarking discussion on mainline maintained by @netrunnereve which was very helpful for establishing baseline expectations and trying to understand the various AMD GPU driver development environments.
Benchmarks
I did a comparison between mainline llama.cpp and ik_llama.cpp at the given SHAs, for what I could get working.
Methodology
To keep things somewhat consistent with the established methodology, I used TheBloke's now-vintage Llama-2-7B at the classic Q4_0 quantization. The following is how compilation was done, as well as how `llama-sweep-bench` was run with and without flash attention (`-fa`).
Compiling and running sweep-bench
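A representative sketch of that build and the sweep-bench runs is below; the GGML_VULKAN cmake option is the standard way to enable the Vulkan backend, while the model filename, context size, and offload values are illustrative rather than the exact flags behind the numbers here:

```bash
# Configure and build with the Vulkan backend enabled
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j "$(nproc)"

# Sweep the context without and with flash attention (model path and sizes illustrative)
./build/bin/llama-sweep-bench -m llama-2-7b.Q4_0.gguf -c 8192 -ngl 99
./build/bin/llama-sweep-bench -m llama-2-7b.Q4_0.gguf -c 8192 -ngl 99 -fa
```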
Observations
- ik_llama.cpp asserted at `N_KV=7680` with `iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed`.
- The output contained `<|im_start|>` and `<|im_end|>` type tokens, which don't usually come back from the chat endpoint.
Conclusion
Well, sorry if you have AMD GPU hardware and were hoping to try out the latest and greatest stuff on ik's fork. You can still make use of the CPU-only optimizations, FWIW. You can see the relative performance of native CUDA in the linked benchmark thread for one of my other tests, and ik's fork does run faster than mainline with CUDA.
Finally, I saw an interesting NVIDIA slide deck from the Vulkanised 2025 Developer Conference which discusses llama.cpp on pages 14 and 15, even showing what looks like some of ik's IQ4_NL code with implementation discussion. I was surprised that some models benchmark faster on NVIDIA GPUs using the Vulkan backend, beating out the native CUDA implementation, but perhaps that is for another day...
Thanks, and I'm curious whether anyone else has tried this or is interested in improving support here. Cheers!