how row split works #15036

Kas1o · 2025-08-02T13:00:10Z

Kas1o
Aug 2, 2025

in my guess, it will split weight matrix to multiple matrix by row. copy each of them to diffrent device and with same complete input.
then, It should be able to increase the work rate during multi-GPU running. and decrease the decode latency.
but after test, it even worse than normal layer split. my 2080ti * 2 only get 3.11 tps output.

run command:

./llama-server -m model_path -fa -cb --ctx-size 32768 -ngl 99 -np 1 -ts 1,1 -sm row

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

how row split works #15036

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

how row split works #15036

Uh oh!

Kas1o Aug 2, 2025

Replies: 0 comments

Kas1o
Aug 2, 2025