Replies: 1 comment 3 replies
Can you post the …
I am benchmarking gpt-oss-120b with llama-batched-bench and I am seeing very nice speedups all the way up to a batch size of 32. However, I am not seeing those same improvements with llama-server. As a matter of fact, the aggregate token generation speed actually drops as I increase concurrency and only recovers at a batch size of 16 or higher. I am using the following tool to test concurrent requests: https://github.com/Yoosu-L/llmapibenchmark

I am using `LLAMA_SET_ROWS=1` for the split KV cache. I AM getting a warning about not using `swa-full`, so I am not sure if that is related, but llama-batched-bench didn't require any changes in that department to see nice speedups. I am using the Vulkan backend on a gfx1151 Strix Halo APU. I am testing with the pro, radv, and amdvlk drivers on Linux with similar results. Any ideas?
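For context, the concurrency test boils down to something like the sketch below. This is a simplified stand-in, not the actual llmapibenchmark code; the server URL, model prompt, and token counts are placeholder assumptions for a llama-server instance on its default local port.

```python
# Simplified sketch of the concurrent-request throughput test (not the actual
# llmapibenchmark code): fire CONCURRENCY identical completion requests at
# llama-server's OpenAI-compatible endpoint and report aggregate tokens/sec.
import concurrent.futures
import time

import requests  # assumes `pip install requests`

SERVER_URL = "http://localhost:8080/v1/completions"  # default llama-server port (placeholder)
PROMPT = "Write a short story about a robot."        # placeholder prompt
MAX_TOKENS = 256
CONCURRENCY = 8

def one_request() -> int:
    """Send one completion request and return the number of generated tokens."""
    resp = requests.post(
        SERVER_URL,
        json={"prompt": PROMPT, "max_tokens": MAX_TOKENS, "temperature": 0.0},
        timeout=600,
    )
    resp.raise_for_status()
    # Assumes the response carries an OpenAI-style "usage" object.
    return resp.json()["usage"]["completion_tokens"]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    total_tokens = sum(pool.map(lambda _: one_request(), range(CONCURRENCY)))
elapsed = time.time() - start

print(f"{CONCURRENCY} parallel requests: {total_tokens} tokens in {elapsed:.1f}s "
      f"-> {total_tokens / elapsed:.1f} tok/s aggregate")
```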