
Conversation

@ikawrakow
Owner

I only have a 2xGPU system, so I have no way to test the best graph-splitting strategy on a larger multi-GPU setup.
On the main branch I force a second graph split when combining the partial tensor-parallel results. This may not be the best strategy, so this PR removes the second split.

Please test with split mode "graph" on your multi-GPU system and let me know if this PR gives better performance.
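
To make the question concrete, here is a rough, plain-C++ sketch of where that extra split boundary would fall. It is not the actual ggml/ik_llama.cpp scheduler code; the names (`build_graph`, `combine partial results`) are made up purely for illustration of the two strategies being compared.

```cpp
// Conceptual sketch only (assumed names, not the real ggml scheduler code).
// With split mode "graph", each GPU computes a partial matmul result and the
// partials are then summed. Main forces that combine step into its own graph
// split; this PR keeps it in the same split as the ops that follow it.
#include <cstdio>
#include <string>
#include <vector>

struct Op { std::string name; int split_id; };

static std::vector<Op> build_graph(bool force_split_at_combine) {
    int split = 0;
    std::vector<Op> ops;
    ops.push_back({"partial matmul (one shard per GPU)", split});
    if (force_split_at_combine) ++split;   // main branch: a new split starts at the combine
    ops.push_back({"combine partial results (reduce-add)", split});
    ops.push_back({"ops following the combine", split});
    return ops;
}

int main() {
    const bool variants[] = {true, false};
    for (bool force : variants) {
        std::printf("%s\n", force ? "main branch (extra split):" : "this PR (no extra split):");
        for (const auto & op : build_graph(force)) {
            std::printf("  split %d: %s\n", op.split_id, op.name.c_str());
        }
    }
    return 0;
}
```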


Ph0rk0z commented Dec 2, 2025

So I did some 4-GPU testing; on my system the results are not so great:

    CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-sweep-bench \
    -m /models/GLM-4.6-GGUF-UD-Q3_K_XL/GLM-4.6-UD-Q3_K_XL-00001-of-00004.gguf \
    -t 48 \
    -c 32768 \
    --numa distribute \
    -ngl 94 \
    -ctk q8_0 \
    -ctv q8_0 \
    -rtr \
    -ub 1024 \
    -gr \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*_exps.=CUDA0" \
    -ot "blk\.(14|15|16|17|18|19|20|21|22|23|24)\.ffn_.*_exps.=CUDA1" \
    -ot "blk\.(25|26|27|28|29|30|31|32|33|34|35)\.ffn_.*_exps.=CUDA2" \
    -ot "blk\.(36|37|38|39|40|41|42|43|44|45|46)\.ffn_.*_exps.=CUDA3" \
    -ot "\.ffn_.*_exps.=CPU" \
    -sm graph \
    -cuda offload-batch-size=7,fusion=1 \
    -mqkv \
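
The `-ot` patterns above pin each listed layer's `ffn_*_exps` expert tensors to one CUDA device, with the final catch-all pattern sending the remaining expert layers to the CPU. A hypothetical helper along these lines (not part of the repo) would print even-split patterns of the same shape; the command above uses a hand-tuned, slightly uneven split with a few extra layers on CUDA0.

```cpp
// Hypothetical helper (not part of ik_llama.cpp): prints -ot override patterns
// that spread the expert layers kept on GPU (layers 0..46 in the command above)
// roughly evenly across 4 devices.
#include <cstdio>
#include <string>

int main() {
    const int n_exp_layers_on_gpu = 47;   // layers 0..46, as in the command above
    const int n_gpus = 4;
    int layer = 0;
    for (int gpu = 0; gpu < n_gpus; ++gpu) {
        // roughly even share of layers per device
        int count = n_exp_layers_on_gpu / n_gpus + (gpu < n_exp_layers_on_gpu % n_gpus ? 1 : 0);
        std::string alt;
        for (int i = 0; i < count; ++i, ++layer) {
            if (!alt.empty()) alt += "|";
            alt += std::to_string(layer);
        }
        std::printf("-ot \"blk\\.(%s)\\.ffn_.*_exps.=CUDA%d\"\n", alt.c_str(), gpu);
    }
    std::printf("-ot \"\\.ffn_.*_exps.=CPU\"\n");   // remaining expert layers stay on the CPU
    return 0;
}
```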

Baseline

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 8.594 | 119.16 | 16.887 | 15.16 |
| 1024 | 256 | 1024 | 8.443 | 121.28 | 17.253 | 14.84 |
| 1024 | 256 | 2048 | 8.543 | 119.86 | 17.776 | 14.40 |
| 1024 | 256 | 3072 | 8.595 | 119.14 | 18.456 | 13.87 |
| 1024 | 256 | 4096 | 8.633 | 118.61 | 19.056 | 13.43 |
| 1024 | 256 | 5120 | 8.637 | 118.55 | 19.719 | 12.98 |
| 1024 | 256 | 6144 | 8.682 | 117.94 | 20.188 | 12.68 |
| 1024 | 256 | 7168 | 8.764 | 116.84 | 20.540 | 12.46 |
| 1024 | 256 | 8192 | 8.970 | 114.16 | 20.908 | 12.24 |
| 1024 | 256 | 9216 | 8.953 | 114.37 | 21.391 | 11.97 |
| 1024 | 256 | 10240 | 9.193 | 111.39 | 21.808 | 11.74 |
| 1024 | 256 | 11264 | 9.029 | 113.42 | 22.371 | 11.44 |
| 1024 | 256 | 12288 | 9.266 | 110.51 | 22.837 | 11.21 |
| 1024 | 256 | 13312 | 9.278 | 110.37 | 23.335 | 10.97 |

Baseline with layers rearranged as required for the split (GR/MQKV/fusion enabled)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 9.070 | 112.90 | 16.654 | 15.37 |
| 1024 | 256 | 1024 | 9.145 | 111.97 | 17.146 | 14.93 |
| 1024 | 256 | 2048 | 9.101 | 112.51 | 17.611 | 14.54 |
| 1024 | 256 | 3072 | 9.139 | 112.05 | 18.254 | 14.02 |
| 1024 | 256 | 4096 | 9.219 | 111.08 | 18.804 | 13.61 |
| 1024 | 256 | 5120 | 9.260 | 110.58 | 19.565 | 13.08 |
| 1024 | 256 | 6144 | 9.290 | 110.22 | 19.968 | 12.82 |
| 1024 | 256 | 7168 | 9.410 | 108.82 | 20.347 | 12.58 |
| 1024 | 256 | 8192 | 9.395 | 108.99 | 20.716 | 12.36 |

Main branch -sm graph

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 10.749 | 95.27 | 27.024 | 9.47 |
| 1024 | 256 | 1024 | 10.662 | 96.04 | 27.035 | 9.47 |
| 1024 | 256 | 2048 | 10.932 | 93.67 | 27.031 | 9.47 |
| 1024 | 256 | 3072 | 10.717 | 95.55 | 27.086 | 9.45 |
| 1024 | 256 | 4096 | 10.750 | 95.25 | 27.805 | 9.21 |

Pull 1026

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 10.766 | 95.12 | 26.725 | 9.58 |
| 1024 | 256 | 1024 | 10.717 | 95.55 | 26.669 | 9.60 |
| 1024 | 256 | 2048 | 10.849 | 94.39 | 26.769 | 9.56 |
| 1024 | 256 | 3072 | 10.745 | 95.30 | 26.937 | 9.50 |
| 1024 | 256 | 4096 | 10.797 | 94.84 | 27.076 | 9.45 |
| 1024 | 256 | 5120 | 10.776 | 95.02 | 27.276 | 9.39 |

Pull 1027

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 10.749 | 95.26 | 27.028 | 9.47 |
| 1024 | 256 | 1024 | 10.695 | 95.74 | 27.165 | 9.42 |
| 1024 | 256 | 2048 | 11.123 | 92.06 | 27.213 | 9.41 |
| 1024 | 256 | 3072 | 10.751 | 95.25 | 27.264 | 9.39 |
| 1024 | 256 | 4096 | 10.744 | 95.31 | 27.583 | 9.28 |

GPU utilization is up a little. I left off MQKV for the graph runs since you said it doesn't work, and I had to remove a layer from each GPU or it would go OOM.
