Bug: PP Perf Regression after CUDA Graphs PR#689 #765

@usrlocalben

What happened?

PP performance decreased by ~25% following the CUDA Graphs PR #689 (S_PP drops from 91.87 to 67.03 t/s at N_KV = 0).

I've suspected this for a while based on a seat-of-the-pants impression, but didn't take the time to measure until now.

Hardware is 2S EPYC 9115 w/RTX 8000 (Turing)

Tested with CUDA 12.6 and 13.0

aside: I don't observe any problems using CUDA 13.0

aside2: It seems unfortunate that so much refactoring, the gpt-oss addition, and whitespace changes are mixed in with the graphs PR. It is difficult, at least for an outsider, to read it and get a sense of which change could be the cause.

-b 4096 -ub 4096
-mla 2 -fa -fmoe
-ngl 999 -ot exps=CPU
-m Kimi-K2-0905-DQ4_K.gguf

The quant is my own, but uses the same config as anikifoss' K2.

-DGGML_CUDA_USE_GRAPHS=OFF in the build invocation was only added for the CUDA Graphs commit. Normally I build without this option, regardless of its ON/OFF value.

Commit prior to CUDA Graphs (e082df4)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 44.583 | 91.87 | 93.665 | 10.93 |
| 4096 | 1024 | 4096 | 46.379 | 88.32 | 98.202 | 10.43 |
| 4096 | 1024 | 8192 | 48.673 | 84.15 | 104.752 | 9.78 |
| 4096 | 1024 | 12288 | 51.006 | 80.30 | 111.377 | 9.19 |
| 4096 | 1024 | 16384 | 53.209 | 76.98 | 119.035 | 8.60 |
| 4096 | 1024 | 20480 | 55.804 | 73.40 | 130.798 | 7.83 |
| 4096 | 1024 | 24576 | 57.336 | 71.44 | 137.132 | 7.47 |
| 4096 | 1024 | 28672 | 59.623 | 68.70 | 144.862 | 7.07 |

CUDA_GRAPHS (fc06bc9) (GGML_CUDA_USE_GRAPHS=default)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 61.110 | 67.03 | 93.308 | 10.97 |
| 4096 | 1024 | 4096 | 63.197 | 64.81 | 96.553 | 10.61 |
| 4096 | 1024 | 8192 | 65.332 | 62.69 | 101.798 | 10.06 |
| 4096 | 1024 | 12288 | 67.504 | 60.68 | 107.475 | 9.53 |
| 4096 | 1024 | 16384 | 69.694 | 58.77 | 113.288 | 9.04 |
| 4096 | 1024 | 20480 | 71.882 | 56.98 | 121.111 | 8.46 |
| 4096 | 1024 | 24576 | 74.105 | 55.27 | 128.320 | 7.98 |
| 4096 | 1024 | 28672 | 76.336 | 53.66 | 136.589 | 7.50 |
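
To put a number on the regression, here is a minimal sketch that compares the PP columns of the e082df4 run with the fc06bc9 (default) run. The values are copied by hand from the two tables above; the variable and function names are my own and are not part of any tool.

```python
# PP figures copied from the e082df4 and fc06bc9 (default) tables above.
N_KV = [0, 4096, 8192, 12288, 16384, 20480, 24576, 28672]
S_PP_BEFORE = [91.87, 88.32, 84.15, 80.30, 76.98, 73.40, 71.44, 68.70]  # t/s
S_PP_AFTER = [67.03, 64.81, 62.69, 60.68, 58.77, 56.98, 55.27, 53.66]   # t/s
T_PP_BEFORE = [44.583, 46.379, 48.673, 51.006, 53.209, 55.804, 57.336, 59.623]  # s
T_PP_AFTER = [61.110, 63.197, 65.332, 67.504, 69.694, 71.882, 74.105, 76.336]   # s


def pct_drop(before: float, after: float) -> float:
    """Relative throughput loss in percent (positive means slower)."""
    return 100.0 * (before - after) / before


# Per-depth throughput loss and extra wall time per 4096-token batch.
drops = [pct_drop(b, a) for b, a in zip(S_PP_BEFORE, S_PP_AFTER)]
extra_s = [a - b for b, a in zip(T_PP_BEFORE, T_PP_AFTER)]

for n, d, e in zip(N_KV, drops, extra_s):
    print(f"N_KV={n:>6}  S_PP drop={d:5.1f}%  extra T_PP={e:6.3f} s")
```

On these numbers the S_PP loss runs from about 27% at N_KV = 0 to about 22% at N_KV = 28672, while the extra time per batch stays between 16 and 17 s at every depth. That pattern would be consistent with a roughly fixed per-batch cost rather than one that scales with context, though this is only a reading of the measurements, not a diagnosis.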

CUDA_GRAPHS (fc06bc9) (GGML_CUDA_USE_GRAPHS=off)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 61.094 | 67.04 | 93.771 | 10.92 |
| 4096 | 1024 | 4096 | 63.175 | 64.84 | 96.601 | 10.60 |
| 4096 | 1024 | 8192 | 65.287 | 62.74 | 101.879 | 10.05 |
| 4096 | 1024 | 12288 | 67.474 | 60.71 | 107.304 | 9.54 |
| 4096 | 1024 | 16384 | 69.637 | 58.82 | 113.538 | 9.02 |
| 4096 | 1024 | 20480 | 71.847 | 57.01 | 121.361 | 8.44 |
| 4096 | 1024 | 24576 | 74.121 | 55.26 | 128.583 | 7.96 |
| 4096 | 1024 | 28672 | 76.310 | 53.68 | 135.303 | 7.57 |
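
A quick check of whether the build flag matters for PP at all: the sketch below compares S_PP between the two fc06bc9 runs (GGML_CUDA_USE_GRAPHS=default and =off). The values are copied by hand from the two tables above, and the names are my own.

```python
# S_PP (t/s) at fc06bc9, copied from the two tables above.
S_PP_DEFAULT = [67.03, 64.81, 62.69, 60.68, 58.77, 56.98, 55.27, 53.66]
S_PP_OFF = [67.04, 64.84, 62.74, 60.71, 58.82, 57.01, 55.26, 53.68]

# Absolute relative difference between the two builds, in percent.
rel_diff_pct = [
    100.0 * abs(off - default) / default
    for default, off in zip(S_PP_DEFAULT, S_PP_OFF)
]

print(f"largest S_PP difference: {max(rel_diff_pct):.3f}%")
```

The two builds differ by less than 0.1% at every depth, so on this hardware the flag does not appear to affect PP. That suggests, without proving it, that the slowdown comes from something else in the commit rather than from graph capture itself.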

HEAD (c519d41)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 62.670 | 65.36 | 93.086 | 11.00 |
| 4096 | 1024 | 4096 | 64.768 | 63.24 | 96.333 | 10.63 |
| 4096 | 1024 | 8192 | 66.890 | 61.23 | 101.669 | 10.07 |
| 4096 | 1024 | 12288 | 69.054 | 59.32 | 107.363 | 9.54 |
| 4096 | 1024 | 16384 | 71.233 | 57.50 | 113.207 | 9.05 |
| 4096 | 1024 | 20480 | 73.414 | 55.79 | 121.171 | 8.45 |
| 4096 | 1024 | 24576 | 75.642 | 54.15 | 128.810 | 7.95 |
| 4096 | 1024 | 28672 | 77.859 | 52.61 | 136.124 | 7.52 |
| 4096 | 1024 | 32768 | 80.073 | 51.15 | 145.548 | 7.04 |

cmake -B build \
        -DCMAKE_BUILD_TYPE=Release \
        -DGGML_NATIVE=ON \
        -DGGML_CCACHE=OFF \
        -DGGML_CUDA=ON \
        -DBUILD_SHARED_LIBS=OFF \
        -DGGML_SCHED_MAX_COPIES=1 \
        -DGGML_VULKAN=OFF \
        -DGGML_RPC=OFF \
        -DGGML_BLAS=OFF \
        -DGGML_CUDA_F16=ON \
        -DGGML_CUDA_USE_GRAPHS=OFF \
        -DGGML_CUDA_IQK_FORCE_BF16=1

cmake --build build --config Release -j16

Name and Version

version: 3881 (c519d41)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output
