Performance of llama.cpp with Vulkan #10879
122 comments · 214 replies
-
AMD FirePro W8100
-
AMD RX 470
-
Ubuntu 24.04, Vulkan and CUDA installed from official APT packages.
build: 4da69d1 (4351)
vs CUDA on the same build/setup
build: 4da69d1 (4351)
-
MacBook Air M2 on Asahi Linux
ggml_vulkan: Found 1 Vulkan devices:
-
Gentoo Linux on ROG Ally (2023), Ryzen Z1 Extreme
ggml_vulkan: Found 1 Vulkan devices:
-
ggml_vulkan: Found 4 Vulkan devices:
-
build: 0d52a69 (4439)
NVIDIA GeForce RTX 3090 (NVIDIA)
AMD Radeon RX 6800 XT (RADV NAVI21) (radv)
AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)
Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)
-
@netrunnereve Some of the tg results here are a little low, I think they might be debug builds. The cmake step (at least on Linux) might require an explicit -DCMAKE_BUILD_TYPE=Release to avoid that.
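If a debug build is the culprit, here is a minimal sketch of forcing an optimized Vulkan build (assuming the standard llama.cpp CMake options):

```sh
# Single-config generators on Linux produce an unoptimized build
# unless CMAKE_BUILD_TYPE is set explicitly
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```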
-
Build: 8d59d91 (4450)
Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
edit: retested both with the default batch size.
-
Here's something exotic: an AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5.
build: 914a82d (4452)
-
Latest Arch. For the sake of consistency I run everything from a script and build every target from scratch. Each benchmark is wrapped like this to pause all other processes while it runs:

```sh
kill -STOP -1      # suspend every other process (needs root)
timeout 240s $COMMAND
kill -CONT -1      # resume them afterwards
```

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: ff3fcab (4459)
This build seems to underutilise both GPU and CPU in real conditions.
-
Intel ARC A770 on Windows:
build: ba8a1f9 (4460)
-
Single GPU Vulkan
Radeon Instinct MI25
ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Multi GPU Vulkan
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Single GPU ROCm
Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Multi GPU ROCm
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Layer split
build: 2739a71 (4461)
Row split
build: 2739a71 (4461)
Single GPU speed is decent, but multi GPU trails ROCm by a wide margin, especially with large models, due to the lack of row split.
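For reference, a sketch of how the two split modes are selected with llama-bench (the model path is a placeholder; at the time of this comment the Vulkan backend only supported layer split, which is the gap described above):

```sh
# Layer split: whole layers are distributed across GPUs
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm layer
# Row split: each weight matrix is sharded across GPUs,
# supported by the CUDA/ROCm backend
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm row
```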
-
AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
I also think it could be interesting to add the flash attention results to the scoreboard (even if the support for it still isn't as mature as CUDA's).
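For anyone who wants to report both, llama-bench can produce the two sets of numbers in one run by sweeping the flag (a sketch; the model path is a placeholder):

```sh
# -fa 0,1 benchmarks with flash attention disabled and enabled
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
```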
-
I tried, but there was nothing after an hour (OK, maybe 40 minutes), so I ran llama-cli for a sample eval instead.
Meanwhile, OpenBLAS:
-
GTX 1660 Ti Mobile
-
Saw some updates that may improve Vulkan performance, so I ran the benchmarks again.
7900 XTX (PowerColor Red Devil)
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
-
I used the trick from here: #10879 (comment) to get a usable amdvlk setup with llama.cpp. RX 6800M benchmarks, radv vs amdvlk:
radv
amdvlk
Interesting results: amdvlk is slower at PP for me (maybe because of the absence of matrix cores?)
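For anyone who wants to reproduce the radv vs amdvlk comparison on a system with both drivers installed, the Vulkan loader can be pointed at a specific ICD manifest (a sketch; manifest paths vary by distro):

```sh
# Force RADV (Mesa)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
# Force AMDVLK
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```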
-
CPU: "AMD Ryzen 7 9800X3D 8-Core Processor" Vulkan and ROCm installed following the official docs(llama.cpp + lunarX + amd docs), rocWMMA linked to HIP
I never expected Vulkan to go toe-to-toe/beat AMD's HIP runtime. :) |
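In case it helps others reproduce the ROCm side, a sketch of a HIP build with rocWMMA-accelerated flash attention (flag names as in current llama.cpp; the gfx1100 target is an example, adjust to your GPU):

```sh
# HIP backend with rocWMMA used for flash attention
cmake -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON \
      -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```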
-
Here are some results for three different cards
Benchmarks for the Radeon Pro V620 are difficult to find. I picked one up off eBay to see if 2 or 4 of them could make a reasonably inexpensive option to increase VRAM density over something like the RTX 5060 Ti 16GB. As the results below show, while the token generation speeds are decent for this card, the prompt processing speeds are abysmal compared to even the RTX 5060 Ti. What is great to see is that the Vulkan numbers for the V620 are competitive with the ROCm numbers (even beating ROCm for the small Llama 7B model). I tried three different drivers for the AMD card: RADV, the AMD open-source driver, and the AMD proprietary driver. RADV gave the best results on this particular card. I'm not sure if this matters, but the motherboards I'm using for these tests don't support Resizable BAR; I don't know if that has a negative impact on the Intel Arc card's inference speeds.
AMD Radeon Pro V620 32GB
ROCm 6.4.2 for comparison
Intel Arc A770 16GB
Unfortunately I've been having problems getting a SYCL build working on this Fedora Rawhide installation. For comparison I'll swap in an SSD running Ubuntu 24.something that I believe has a working SYCL build of llama.cpp.
NVIDIA RTX 5060 Ti 16GB
CUDA 12.8 for comparison
For additional comparisons, here is a 51B model quantized to IQ4_XS: Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ4_XS.gguf. I chose this as a relatively large model that could comfortably fit in 32GB of VRAM.
AMD Radeon Pro V620 32GB
NVIDIA RTX 5060 Ti 16GB
(Obviously I can't fit this 51B model on a single 16GB card, so no benchmarks of my only Arc A770.)
-
RX 7800 XT (Sapphire Pulse, 280W)
ggml_vulkan: Found 1 Vulkan devices:
build: baad948 (6056)
ggml_cuda_init: found 1 ROCm devices:
build: 00131d6 (6031)
-
Radeon Pro V620
root@llama:~# /root/llama-builds/llama.cpp/bin/llama-bench -m /mnt/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -mg 1 -sm none
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon (TM) Pro WX 3200 Series (RADV POLARIS12) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon PRO V620 (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
build: 03d4698 (6074), Linux
-
AMD Radeon RX Vega 64
build: ec428b0 (3)
ROCm pp is way slower
build: ec428b0 (3)
Using Arch Linux
-
PowerColor Red Devil 7900 XTX
Adrenalin 25.8.1 just came out, so time to test again.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
build: 2572689 (6099)
A loss against May 26th (3419 and 3187), and a loss against July 22nd (3489 and 3225).
-
Intel Core Ultra 7 155H iGPU
build: 1d72c84 (6109)
Strange that the CPU is faster in tg128.
build: 1d72c84 (6109)
Am I doing something incorrectly?
-
Got a nice 4-5% performance increase in tg128 since I last tested in late June, using build: fd1234c (6096).
ggml_vulkan: 0 = AMD Radeon RX 9070 XT (AMD open-source driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 0 = AMD Radeon RX 9070 XT (RADV GFX1201) (radv) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
-
About 5% performance increase vs 8e6f8bc. GTX 1660 Ti Mobile
build: e54d41b (6121)
CUDA pp is over 2x slower, but TG is 10% faster.
-
New Metal build vs Vulkan build!
./build/bin/llama-bench -ngl 99 -m ../Models/llama-2-7b-q4_0.gguf
ggml_metal_init: found device: AMD Radeon RX 6900 XT
./build/bin/llama-bench -ngl 99 -m /Users/xionz/Models/llama-2-7b-q4_0.gguf -sm none -mg 0
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 79c1160 (6123)
-
This is similar to the Apple Silicon benchmark thread, but for Vulkan! Many improvements have been made to the Vulkan backend and I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.
Instructions
Either run the commands below or download one of our Vulkan releases. If you have multiple GPUs, please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER, unless the model is too big to fit in VRAM. Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
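For reference, a typical Vulkan build and single-GPU bench run looks like this (a sketch assuming the standard llama.cpp CMake options; the model path and GPU number are placeholders):

```sh
# Build with the Vulkan backend
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# Benchmark on a single GPU
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm none -mg 0
```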
If multiple entries are posted for the same setup I'll prioritize newer commits with substantial Vulkan updates; otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that the memory speed and number of channels will greatly affect your inference speed!
Vulkan Scoreboard for Llama 2 7B, Q4_0 (no FA)
Vulkan Scoreboard for Llama 2 7B, Q4_0 (with FA)