
Conversation

@ikawrakow
Owner

I only have a 2xGPU system, so I have no way to test the best graph-splitting strategy on a larger multi-GPU setup.
On the main branch I force a second graph split when combining the partial tensor-parallel results. This may not be the best strategy, so this PR removes the second split.

Please test with split mode "graph" on your multi-GPU system and let me know if this PR gives better performance.
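
To make the question concrete, here is a rough, plain-C++ sketch of where that extra split boundary would fall. It is not the actual ggml/ik_llama.cpp scheduler code; the names (`build_graph`, `combine partial results`) are made up purely for illustration of the two strategies being compared.

```cpp
// Conceptual sketch only (assumed names, not the real ggml scheduler code).
// With split mode "graph", each GPU computes a partial matmul result and the
// partials are then summed. Main forces that combine step into its own graph
// split; this PR keeps it in the same split as the ops that follow it.
#include <cstdio>
#include <string>
#include <vector>

struct Op { std::string name; int split_id; };

static std::vector<Op> build_graph(bool force_split_at_combine) {
    int split = 0;
    std::vector<Op> ops;
    ops.push_back({"partial matmul (one shard per GPU)", split});
    if (force_split_at_combine) ++split;   // main branch: a new split starts at the combine
    ops.push_back({"combine partial results (reduce-add)", split});
    ops.push_back({"ops following the combine", split});
    return ops;
}

int main() {
    const bool variants[] = {true, false};
    for (bool force : variants) {
        std::printf("%s\n", force ? "main branch (extra split):" : "this PR (no extra split):");
        for (const auto & op : build_graph(force)) {
            std::printf("  split %d: %s\n", op.split_id, op.name.c_str());
        }
    }
    return 0;
}
```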


Ph0rk0z commented Dec 2, 2025

So I did some 4-GPU testing; on my system the results are not so great:

    CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-sweep-bench \
    -m /models/GLM-4.6-GGUF-UD-Q3_K_XL/GLM-4.6-UD-Q3_K_XL-00001-of-00004.gguf \
    -t 48 \
    -c 32768 \
    --numa distribute \
    -ngl 94 \
    -ctk q8_0 \
    -ctv q8_0 \
    -rtr \
    -ub 1024 \
    -gr \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*_exps.=CUDA0" \
    -ot "blk\.(14|15|16|17|18|19|20|21|22|23|24)\.ffn_.*_exps.=CUDA1" \
    -ot "blk\.(25|26|27|28|29|30|31|32|33|34|35)\.ffn_.*_exps.=CUDA2" \
    -ot "blk\.(36|37|38|39|40|41|42|43|44|45|46)\.ffn_.*_exps.=CUDA3" \
    -ot "\.ffn_.*_exps.=CPU" \
    -sm graph \
    -cuda offload-batch-size=7,fusion=1 \
    -mqkv \
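
The `-ot` patterns above pin each listed layer's `ffn_*_exps` expert tensors to one CUDA device, with the final catch-all pattern sending the remaining expert layers to the CPU. A hypothetical helper along these lines (not part of the repo) would print even-split patterns of the same shape; the command above uses a hand-tuned, slightly uneven split with a few extra layers on CUDA0.

```cpp
// Hypothetical helper (not part of ik_llama.cpp): prints -ot override patterns
// that spread the expert layers kept on GPU (layers 0..46 in the command above)
// roughly evenly across 4 devices.
#include <cstdio>
#include <string>

int main() {
    const int n_exp_layers_on_gpu = 47;   // layers 0..46, as in the command above
    const int n_gpus = 4;
    int layer = 0;
    for (int gpu = 0; gpu < n_gpus; ++gpu) {
        // roughly even share of layers per device
        int count = n_exp_layers_on_gpu / n_gpus + (gpu < n_exp_layers_on_gpu % n_gpus ? 1 : 0);
        std::string alt;
        for (int i = 0; i < count; ++i, ++layer) {
            if (!alt.empty()) alt += "|";
            alt += std::to_string(layer);
        }
        std::printf("-ot \"blk\\.(%s)\\.ffn_.*_exps.=CUDA%d\"\n", alt.c_str(), gpu);
    }
    std::printf("-ot \"\\.ffn_.*_exps.=CPU\"\n");   // remaining expert layers stay on the CPU
    return 0;
}
```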

Baseline

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 8.594 | 119.16 | 16.887 | 15.16 |
| 1024 | 256 | 1024 | 8.443 | 121.28 | 17.253 | 14.84 |
| 1024 | 256 | 2048 | 8.543 | 119.86 | 17.776 | 14.40 |
| 1024 | 256 | 3072 | 8.595 | 119.14 | 18.456 | 13.87 |
| 1024 | 256 | 4096 | 8.633 | 118.61 | 19.056 | 13.43 |
| 1024 | 256 | 5120 | 8.637 | 118.55 | 19.719 | 12.98 |
| 1024 | 256 | 6144 | 8.682 | 117.94 | 20.188 | 12.68 |
| 1024 | 256 | 7168 | 8.764 | 116.84 | 20.540 | 12.46 |
| 1024 | 256 | 8192 | 8.970 | 114.16 | 20.908 | 12.24 |
| 1024 | 256 | 9216 | 8.953 | 114.37 | 21.391 | 11.97 |
| 1024 | 256 | 10240 | 9.193 | 111.39 | 21.808 | 11.74 |
| 1024 | 256 | 11264 | 9.029 | 113.42 | 22.371 | 11.44 |
| 1024 | 256 | 12288 | 9.266 | 110.51 | 22.837 | 11.21 |
| 1024 | 256 | 13312 | 9.278 | 110.37 | 23.335 | 10.97 |

Baseline with layers rearranged as required for the split (GR/MQKV/fusion enabled)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 9.070 | 112.90 | 16.654 | 15.37 |
| 1024 | 256 | 1024 | 9.145 | 111.97 | 17.146 | 14.93 |
| 1024 | 256 | 2048 | 9.101 | 112.51 | 17.611 | 14.54 |
| 1024 | 256 | 3072 | 9.139 | 112.05 | 18.254 | 14.02 |
| 1024 | 256 | 4096 | 9.219 | 111.08 | 18.804 | 13.61 |
| 1024 | 256 | 5120 | 9.260 | 110.58 | 19.565 | 13.08 |
| 1024 | 256 | 6144 | 9.290 | 110.22 | 19.968 | 12.82 |
| 1024 | 256 | 7168 | 9.410 | 108.82 | 20.347 | 12.58 |
| 1024 | 256 | 8192 | 9.395 | 108.99 | 20.716 | 12.36 |

Main branch -sm graph

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 10.749 | 95.27 | 27.024 | 9.47 |
| 1024 | 256 | 1024 | 10.662 | 96.04 | 27.035 | 9.47 |
| 1024 | 256 | 2048 | 10.932 | 93.67 | 27.031 | 9.47 |
| 1024 | 256 | 3072 | 10.717 | 95.55 | 27.086 | 9.45 |
| 1024 | 256 | 4096 | 10.750 | 95.25 | 27.805 | 9.21 |

Pull 1026

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 10.766 | 95.12 | 26.725 | 9.58 |
| 1024 | 256 | 1024 | 10.717 | 95.55 | 26.669 | 9.60 |
| 1024 | 256 | 2048 | 10.849 | 94.39 | 26.769 | 9.56 |
| 1024 | 256 | 3072 | 10.745 | 95.30 | 26.937 | 9.50 |
| 1024 | 256 | 4096 | 10.797 | 94.84 | 27.076 | 9.45 |
| 1024 | 256 | 5120 | 10.776 | 95.02 | 27.276 | 9.39 |

Pull 1027

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 10.749 | 95.26 | 27.028 | 9.47 |
| 1024 | 256 | 1024 | 10.695 | 95.74 | 27.165 | 9.42 |
| 1024 | 256 | 2048 | 11.123 | 92.06 | 27.213 | 9.41 |
| 1024 | 256 | 3072 | 10.751 | 95.25 | 27.264 | 9.39 |
| 1024 | 256 | 4096 | 10.744 | 95.31 | 27.583 | 9.28 |

GPU utilization is up a little. I left off MQKV for the graph runs since you said it doesn't work, and I had to remove a layer from each GPU or it would go OOM.
