Description
What happened?
Prompt processing (PP) throughput dropped by roughly 25% following the CUDA Graphs PR #689.
I had suspected a slowdown for a while from seat-of-the-pants impressions, but didn't take the time to measure until now.
Hardware is 2S EPYC 9115 w/RTX 8000 (Turing)
Tested with CUDA 12.6 and 13.0
aside: I don't observe any problems using CUDA 13.0
aside2: Unfortunately, the graphs PR also mixes in a lot of refactoring, the gpt-oss addition, and whitespace changes, which makes it difficult for an outsider to read the diff and get a sense of which change could be the cause.
-b 4096 -ub 4096
-mla 2 -fa -fmoe
-ngl 999 -ot exps=CPU
-m Kimi-K2-0905-DQ4_K.gguf
The quant is my own, but it uses the same config as anikifoss' K2.
Note that -DGGML_CUDA_USE_GRAPHS=OFF in the build invocation was added only for testing at the CUDA Graphs commit. Normally I build without this option at all, rather than setting it to either ON or OFF.
Commit prior to CUDA Graphs (e082df4)
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 44.583 | 91.87 | 93.665 | 10.93 |
4096 | 1024 | 4096 | 46.379 | 88.32 | 98.202 | 10.43 |
4096 | 1024 | 8192 | 48.673 | 84.15 | 104.752 | 9.78 |
4096 | 1024 | 12288 | 51.006 | 80.30 | 111.377 | 9.19 |
4096 | 1024 | 16384 | 53.209 | 76.98 | 119.035 | 8.60 |
4096 | 1024 | 20480 | 55.804 | 73.40 | 130.798 | 7.83 |
4096 | 1024 | 24576 | 57.336 | 71.44 | 137.132 | 7.47 |
4096 | 1024 | 28672 | 59.623 | 68.70 | 144.862 | 7.07 |
CUDA_GRAPHS (fc06bc9) (GGML_CUDA_USE_GRAPHS=default)
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 61.110 | 67.03 | 93.308 | 10.97 |
4096 | 1024 | 4096 | 63.197 | 64.81 | 96.553 | 10.61 |
4096 | 1024 | 8192 | 65.332 | 62.69 | 101.798 | 10.06 |
4096 | 1024 | 12288 | 67.504 | 60.68 | 107.475 | 9.53 |
4096 | 1024 | 16384 | 69.694 | 58.77 | 113.288 | 9.04 |
4096 | 1024 | 20480 | 71.882 | 56.98 | 121.111 | 8.46 |
4096 | 1024 | 24576 | 74.105 | 55.27 | 128.320 | 7.98 |
4096 | 1024 | 28672 | 76.336 | 53.66 | 136.589 | 7.50 |
CUDA_GRAPHS (fc06bc9) (GGML_CUDA_USE_GRAPHS=off)
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 61.094 | 67.04 | 93.771 | 10.92 |
4096 | 1024 | 4096 | 63.175 | 64.84 | 96.601 | 10.60 |
4096 | 1024 | 8192 | 65.287 | 62.74 | 101.879 | 10.05 |
4096 | 1024 | 12288 | 67.474 | 60.71 | 107.304 | 9.54 |
4096 | 1024 | 16384 | 69.637 | 58.82 | 113.538 | 9.02 |
4096 | 1024 | 20480 | 71.847 | 57.01 | 121.361 | 8.44 |
4096 | 1024 | 24576 | 74.121 | 55.26 | 128.583 | 7.96 |
4096 | 1024 | 28672 | 76.310 | 53.68 | 135.303 | 7.57 |
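To put a number on the regression at each context depth, here is a small sketch that computes the relative S_PP drop from the three tables above. The values are copied from those tables; the variable and function names are my own. By this calculation the drop is about 27% at N_KV = 0 and narrows to about 22% at N_KV = 28672, and the default and off builds at fc06bc9 stay within 0.1 percentage points of each other.

```python
# S_PP (prompt-processing tokens/s) copied from the tables above, one entry per N_KV row.
n_kv = [0, 4096, 8192, 12288, 16384, 20480, 24576, 28672]
s_pp_before = [91.87, 88.32, 84.15, 80.30, 76.98, 73.40, 71.44, 68.70]          # e082df4
s_pp_graphs_default = [67.03, 64.81, 62.69, 60.68, 58.77, 56.98, 55.27, 53.66]  # fc06bc9, default
s_pp_graphs_off = [67.04, 64.84, 62.74, 60.71, 58.82, 57.01, 55.26, 53.68]      # fc06bc9, off


def percent_drop(before, after):
    """Relative throughput loss in percent, row by row."""
    return [100.0 * (b - a) / b for b, a in zip(before, after)]


drop_default = percent_drop(s_pp_before, s_pp_graphs_default)
drop_off = percent_drop(s_pp_before, s_pp_graphs_off)

for kv, d_def, d_off in zip(n_kv, drop_default, drop_off):
    print(f"N_KV={kv:>5}  default: {d_def:5.1f}%  off: {d_off:5.1f}%")
```

Since the two fc06bc9 builds are effectively identical, the slowdown does not appear to depend on the GGML_CUDA_USE_GRAPHS setting itself.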
HEAD (c519d41)
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 62.670 | 65.36 | 93.086 | 11.00 |
4096 | 1024 | 4096 | 64.768 | 63.24 | 96.333 | 10.63 |
4096 | 1024 | 8192 | 66.890 | 61.23 | 101.669 | 10.07 |
4096 | 1024 | 12288 | 69.054 | 59.32 | 107.363 | 9.54 |
4096 | 1024 | 16384 | 71.233 | 57.50 | 113.207 | 9.05 |
4096 | 1024 | 20480 | 73.414 | 55.79 | 121.171 | 8.45 |
4096 | 1024 | 24576 | 75.642 | 54.15 | 128.810 | 7.95 |
4096 | 1024 | 28672 | 77.859 | 52.61 | 136.124 | 7.52 |
4096 | 1024 | 32768 | 80.073 | 51.15 | 145.548 | 7.04 |
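Another way to read the same data is as absolute extra time per 4096-token batch rather than as a percentage. The sketch below subtracts the e082df4 T_PP values from the fc06bc9 (default) and HEAD values; again the numbers are copied from the tables and the names are my own. The HEAD row at N_KV = 32768 is left out because the baseline has no matching row.

```python
# T_PP (seconds per 4096-token batch) copied from the tables above, N_KV = 0 .. 28672.
t_pp_before = [44.583, 46.379, 48.673, 51.006, 53.209, 55.804, 57.336, 59.623]  # e082df4
t_pp_graphs = [61.110, 63.197, 65.332, 67.504, 69.694, 71.882, 74.105, 76.336]  # fc06bc9, default
t_pp_head = [62.670, 64.768, 66.890, 69.054, 71.233, 73.414, 75.642, 77.859]    # c519d41

# Extra seconds spent per batch relative to the pre-graphs commit.
extra_graphs = [g - b for g, b in zip(t_pp_graphs, t_pp_before)]
extra_head = [h - b for h, b in zip(t_pp_head, t_pp_before)]

for i, (eg, eh) in enumerate(zip(extra_graphs, extra_head)):
    print(f"N_KV={i * 4096:>5}  fc06bc9: +{eg:.3f} s  HEAD: +{eh:.3f} s")
```

The added time comes out nearly flat across context depths: between 16 and 17 s at fc06bc9 and around 18 s at HEAD. That might point to a fixed per-batch cost rather than something that scales with N_KV, but this is only an observation from the numbers, not a diagnosis.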
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_NATIVE=ON \
-DGGML_CCACHE=OFF \
-DGGML_CUDA=ON \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_SCHED_MAX_COPIES=1 \
-DGGML_VULKAN=OFF \
-DGGML_RPC=OFF \
-DGGML_BLAS=OFF \
-DGGML_CUDA_F16=ON \
-DGGML_CUDA_USE_GRAPHS=OFF \
-DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build build --config Release -j16
Name and Version
version: 3881 (c519d41)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux