I am working on a project and have found DeepSeek R1 671B Q8 to work flawlessly for the project's needs. I conducted this testing via API using framework.ai. Due to privacy concerns, I need to run the instance locally.
Now I am trying to determine what hardware to purchase, but one major question remains unanswered: throughput, and how settings like '--max-batch-size' factor into it.
After reviewing various videos, guides, articles, and documentation, I've seen reported performance ranging from 1 to 15 tokens per second when running outside a traditional H200 stack.
However, I haven't found any mention of throughput in relation to concurrency. I may have a flawed understanding of how DeepSeek scales with concurrent requests, but what scaling can be expected? Can it even handle concurrent requests?
If I had a setup achieving 6 tokens per second on a single request, would I expect 3 tokens per second with two concurrent requests? And then 1 token per second with six concurrent requests? If so, the total throughput would remain 6 tokens per second.
From what I understand, this may not be the case.
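My current mental model (which may well be wrong, and which this question is really about) is roughly this: with an engine that does batched decoding, each decode step streams the model's active weights from memory once, and that cost is shared across every sequence in the batch, so aggregate throughput should grow with concurrency rather than staying flat. Here is a toy Python sketch of that assumption, using made-up placeholder timings rather than real measurements:

```python
# Toy model of why concurrency might NOT simply divide single-stream speed.
# Assumption (mine, not measured): decoding is memory-bandwidth bound, and a
# batched engine reads the active weights once per decode step and reuses them
# for every sequence in the batch. All timing constants are illustrative only.

def aggregate_tokens_per_sec(batch_size: int,
                             weight_read_s: float = 0.15,      # time to stream active weights once per step
                             per_seq_kv_read_s: float = 0.005  # extra KV-cache traffic per sequence
                             ) -> float:
    """One token per sequence per step; step time grows only slightly with
    batch size because the large weight read is shared across the batch."""
    step_time = weight_read_s + batch_size * per_seq_kv_read_s
    return batch_size / step_time

for b in (1, 2, 6, 25):
    total = aggregate_tokens_per_sec(b)
    print(f"batch={b:2d}  total ~ {total:5.1f} tok/s  per-request ~ {total / b:4.1f} tok/s")
```

If that picture is right, per-request speed degrades only gradually while total tokens per second keeps climbing until some other limit (compute, KV-cache bandwidth, memory capacity) kicks in. I would appreciate correction if that is not how these engines actually behave.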
I'm looking for an explanation of how concurrent requests are handled and, if possible, how to calculate the maximum theoretical throughput from a given hardware configuration. My project would work well with a system that can handle concurrent requests at a total throughput of 25 tokens per second, even if that is spread across 25 requests, and I am trying to determine how to achieve this.
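For reference, here is the back-of-envelope calculation I have been using to compare hardware, under the common rule of thumb that single-stream decoding is limited by memory bandwidth (tokens/s roughly equals bandwidth divided by bytes of active weights read per token). DeepSeek R1 is an MoE model with about 37B active parameters per token, so at Q8 that is roughly 37 GB streamed per token. The bandwidth figures below are nominal spec-sheet numbers, not measurements, so treat the outputs as ceilings rather than expected speeds:

```python
# Rough single-stream ceiling from memory bandwidth alone (my assumption of the
# usual rule of thumb, not a guarantee). Real-world numbers will be lower.

ACTIVE_PARAMS = 37e9      # DeepSeek R1 active parameters per token (MoE routing)
BYTES_PER_PARAM = 1.0     # Q8 quantization ~ 1 byte per parameter

def single_stream_ceiling(bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s for one request if limited purely by streaming
    the active weights from memory on every decode step."""
    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
    return bandwidth_gb_s * 1e9 / bytes_per_token

hardware = {
    "dual-socket DDR5 server, 24 channels (~920 GB/s)": 920,
    "Apple M2 Ultra (~800 GB/s)": 800,
    "single H200 (~4.8 TB/s)": 4800,
}
for name, bw in hardware.items():
    print(f"{name}: ~{single_stream_ceiling(bw):.0f} tok/s per-stream ceiling")
```

If batching really does amortize the weight reads as in the first sketch, then reaching 25 tok/s in aggregate across many requests should be considerably easier than reaching 25 tok/s on a single stream, but I would like confirmation from people who have actually run this model locally.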
Any help would be greatly appreciated.