I am working on a project and have found DeepSeek R1 671B Q8 to work flawlessly for the project's needs. I conducted this testing via API using framework.ai. Due to privacy concerns, I need to run the instance locally.
Now I am trying to determine what hardware to purchase, but one major question remains unanswered: throughput, and how settings like '--max-batch-size' factor into it.
After reviewing various videos, guides, articles, and documentation, I've seen reported performance ranging from 1 to 15 tokens per second when running outside a traditional H200 stack.
However, I haven't found any mention of throughput in relation to concurrency. I may have a flawed understanding of how DeepSeek scales with concurrent requests, but what scaling can be expected? Can it even handle concurrent requests?
If I had a setup achieving 6 tokens per second on a single request, would I expect 3 tokens per second with two concurrent requests? And then 1 token per second with six concurrent requests? If so, the total throughput would remain 6 tokens per second.
From what I understand, this may not be the case.
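My current mental model (which may well be wrong, and which this question is really about) is roughly this: with an engine that does batched decoding, each decode step streams the model's active weights from memory once, and that cost is shared across every sequence in the batch, so aggregate throughput should grow with concurrency rather than staying flat. Here is a toy Python sketch of that assumption, using made-up placeholder timings rather than real measurements:

```python
# Toy model of why concurrency might NOT simply divide single-stream speed.
# Assumption (mine, not measured): decoding is memory-bandwidth bound, and a
# batched engine reads the active weights once per decode step and reuses them
# for every sequence in the batch. All timing constants are illustrative only.

def aggregate_tokens_per_sec(batch_size: int,
                             weight_read_s: float = 0.15,      # time to stream active weights once per step
                             per_seq_kv_read_s: float = 0.005  # extra KV-cache traffic per sequence
                             ) -> float:
    """One token per sequence per step; step time grows only slightly with
    batch size because the large weight read is shared across the batch."""
    step_time = weight_read_s + batch_size * per_seq_kv_read_s
    return batch_size / step_time

for b in (1, 2, 6, 25):
    total = aggregate_tokens_per_sec(b)
    print(f"batch={b:2d}  total ~ {total:5.1f} tok/s  per-request ~ {total / b:4.1f} tok/s")
```

If that picture is right, per-request speed degrades only gradually while total tokens per second keeps climbing until some other limit (compute, KV-cache bandwidth, memory capacity) kicks in. I would appreciate correction if that is not how these engines actually behave.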
I'm looking for an explanation of how concurrent requests are handled and, if possible, how to calculate the maximum theoretical throughput from a given hardware configuration. My project would work well with a system that can handle concurrent requests at a total throughput of 25 tokens per second, even if that is spread across 25 requests, and I am trying to determine how to achieve this.
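For reference, here is the back-of-envelope calculation I have been using to compare hardware, under the common rule of thumb that single-stream decoding is limited by memory bandwidth (tokens/s roughly equals bandwidth divided by bytes of active weights read per token). DeepSeek R1 is an MoE model with about 37B active parameters per token, so at Q8 that is roughly 37 GB streamed per token. The bandwidth figures below are nominal spec-sheet numbers, not measurements, so treat the outputs as ceilings rather than expected speeds:

```python
# Rough single-stream ceiling from memory bandwidth alone (my assumption of the
# usual rule of thumb, not a guarantee). Real-world numbers will be lower.

ACTIVE_PARAMS = 37e9      # DeepSeek R1 active parameters per token (MoE routing)
BYTES_PER_PARAM = 1.0     # Q8 quantization ~ 1 byte per parameter

def single_stream_ceiling(bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s for one request if limited purely by streaming
    the active weights from memory on every decode step."""
    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
    return bandwidth_gb_s * 1e9 / bytes_per_token

hardware = {
    "dual-socket DDR5 server, 24 channels (~920 GB/s)": 920,
    "Apple M2 Ultra (~800 GB/s)": 800,
    "single H200 (~4.8 TB/s)": 4800,
}
for name, bw in hardware.items():
    print(f"{name}: ~{single_stream_ceiling(bw):.0f} tok/s per-stream ceiling")
```

If batching really does amortize the weight reads as in the first sketch, then reaching 25 tok/s in aggregate across many requests should be considerably easier than reaching 25 tok/s on a single stream, but I would like confirmation from people who have actually run this model locally.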
Any help would be greatly appreciated.