chore(http-client): tune tcp settings, update defaults/utils #168

Merged
viraatc merged 13 commits into main from feat/viraatc-perf-utils
Mar 23, 2026

Conversation

@viraatc
Collaborator

@viraatc viraatc commented Mar 12, 2026

What does this PR do?

changes:

  • new variable-throughput-server to imitate real-world long-tailed LLM servers at various throughputs (0-10,000)
  • fix: reduce socket buffer sizes; the previous 4 MB buffers were overkill and caused dropped connections in offline/max-concurrency scenarios
  • fix: tune keepalive probe settings for faster detection of dropped connections
  • chore: use a drain pattern in benchmark_httpclient.py instead of sleep
  • chore: update default min workers from 8 to 10; update the affinity plan to assign 4 cores to the main rank for more stable performance
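The direction of the TCP changes above can be sketched with Python's standard socket options. The probe timings below are illustrative placeholders, not the PR's actual constants (those live in `endpoint_client/http.py`); only the 4 MB to 128 KB buffer reduction is taken from the PR itself.

```python
import socket

# Reduced from 4 MB to 128 KB per the PR; large buffers were overkill and
# caused dropped connections under high concurrency.
RCV_SND_BUF = 128 * 1024

def tune_socket(sock: socket.socket) -> None:
    """Apply keepalive probing and smaller buffers to a TCP socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs; guarded so the sketch stays portable.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Illustrative values: idle seconds before the first probe,
        # seconds between probes, and probes before the kernel drops
        # the connection.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, RCV_SND_BUF)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, RCV_SND_BUF)
```

Lowering `TCP_KEEPIDLE` and `TCP_KEEPCNT` shrinks the window in which a dead peer goes unnoticed, which is the "faster detection" effect the bullet describes.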

Variable-Throughput-Server:

* **Variable output lengths** — lognormal distribution, configurable mean + spread
* **Per-request response rate** — lognormal distribution (responses/sec per request)
* **First-chunk latency (TTFT)** — lognormal delay before first data
* **Per-chunk latency jitter** — lognormal inter-chunk delays in streaming mode

Two mutually exclusive timing modes:

* **Response-rate mode** (``--response-rate-mean``): controls total response time
  per request.  TPOT is derived from ``(1/rate - TTFT) / num_chunks``.
* **Inter-chunk mode** (``--inter-chunk-latency``): controls per-chunk delay
  directly.  Total response time = TTFT + num_chunks × TPOT.
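The two modes are duals of each other; a minimal sketch of the arithmetic (function names are illustrative, not the server's actual API; times in seconds):

```python
def tpot_from_response_rate(rate: float, ttft: float, num_chunks: int) -> float:
    """Response-rate mode: total time per request is fixed at 1/rate;
    TPOT is whatever remains after TTFT, split evenly across chunks."""
    return (1.0 / rate - ttft) / num_chunks

def total_time_from_tpot(ttft: float, tpot: float, num_chunks: int) -> float:
    """Inter-chunk mode: per-chunk delay is fixed; total time follows."""
    return ttft + num_chunks * tpot

# e.g. 0.5 responses/sec (2 s per request), 0.2 s TTFT, 100 chunks
tpot = tpot_from_response_rate(rate=0.5, ttft=0.2, num_chunks=100)
# feeding that TPOT back into inter-chunk mode recovers the 2 s total
```

Note that response-rate mode goes negative when `TTFT > 1/rate`, which is why input validation on these parameters matters (see the review comments below on division-by-zero and parameter validation).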

Usage::

    # Basic non-streaming
    python -m inference_endpoint.testing.variable_throughput_server --stats

    # Offline with response-rate jitter
    python -m inference_endpoint.testing.variable_throughput_server --stats \
        --output-len-mean 1000 --output-len-spread 0.4 \
        --response-rate-mean 10000 --response-rate-spread 2.0

    # Streaming with inter-chunk latency (ms) + TTFT (s)
    python -m inference_endpoint.testing.variable_throughput_server --stream --stats \
        --stream-interval 2 \
        --inter-chunk-latency 20 --inter-chunk-spread 0.05 \
        --first-chunk-latency 0.1 --first-chunk-spread 0.02

Example: DeepSeek-R1 interactive streaming threshold::

    python -m inference_endpoint.testing.variable_throughput_server --stats --num-workers 8 \
        --inter-chunk-latency 15 --first-chunk-latency 1.5 --stream --stream-interval 10

Endpoints run (max-concurrency = 16k):
Summary 
 Total samples issued: 20000
 Total samples completed: 20000
 Duration: 45.59 seconds
 QPS: 438.73
 TPS: 392897.72
 
 
Latency Breakdowns 
 TTFT:
   Min: 246.16 ms
   Max: 10555.38 ms
   Median: 1403.47 ms
   Avg.: 1563.97 ms
   Std Dev.: 754.75 ms
 
   Percentiles:
   99.9: 5845.89 ms
     99: 4103.93 ms
     97: 3316.63 ms
     95: 2960.90 ms
     90: 2527.95 ms
     80: 2069.87 ms
     75: 1915.98 ms
     50: 1403.47 ms
     25: 1039.96 ms
     10: 790.67 ms
      5: 667.05 ms
      1: 492.16 ms
 
 TPOT (request_weighted):
   Min: 16.62 ms
   Max: 17.08 ms
   Median: 16.73 ms
   Avg.: 16.74 ms
   Std Dev.: 0.07 ms
 
   Percentiles:
   99.9: 17.00 ms
     99: 16.91 ms
     97: 16.87 ms
     95: 16.85 ms
     90: 16.83 ms
     80: 16.80 ms
     75: 16.78 ms
     50: 16.73 ms
     25: 16.69 ms
     10: 16.66 ms
      5: 16.65 ms
      1: 16.64 ms

NOTE:

  • spread controls the coefficient of variation (CV); larger values mean more spread (0.1 is considered low)
  • most low-latency simulations will require a high --stream-interval (the server becomes CPU-limited when too many short timers wake up at the same time)
  • control overall server-side throughput with --max-concurrency; it has a similar effect to client-side max-concurrency
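Treating spread as the CV pins down the lognormal parameters given a mean. The mapping below is the standard convention and an assumption about the server's internals, not its actual code:

```python
import math
import random

def lognormal_params(mean: float, spread: float) -> tuple[float, float]:
    """Map (mean, spread) to lognormal (mu, sigma), taking spread as the
    coefficient of variation (std / mean). Standard moment-matching:
    sigma^2 = ln(1 + CV^2), mu = ln(mean) - sigma^2 / 2."""
    sigma2 = math.log(1.0 + spread * spread)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)

random.seed(0)
# e.g. --output-len-mean 1000 --output-len-spread 0.4
mu, sigma = lognormal_params(1000.0, 0.4)
samples = [random.lognormvariate(mu, sigma) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)  # close to 1000 by construction
```

The `- sigma^2 / 2` correction is what keeps the arithmetic mean at the requested value; omitting it would make larger spreads silently inflate the mean.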

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@viraatc viraatc requested a review from a team as a code owner March 12, 2026 08:21
Copilot AI review requested due to automatic review settings March 12, 2026 08:21
@github-actions

github-actions bot commented Mar 12, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions bot requested review from arekay-nv and nvzhihanj March 12, 2026 08:21
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the HTTP client's resilience and testing infrastructure. It introduces a sophisticated VariableResponseServer to better mimic real-world LLM inference characteristics, allowing for more accurate client benchmarking. Concurrently, it refines core TCP socket configurations, specifically reducing buffer sizes and adjusting keepalive probes, to improve connection stability and error detection, particularly under high load. The benchmarking utility itself also sees an optimization for more efficient result processing.

Highlights

  • Introduced Variable Throughput Server: A new variable-throughput-server was added to simulate realistic long-tailed LLM server behaviors with varying throughputs (0-10,000).
  • Optimized TCP Settings: Socket buffer sizes were reduced from 4MB to 128KB to prevent dropped connections in high-concurrency scenarios, and TCP keepalive probe settings were tuned for faster detection of dead connections.
  • Improved Benchmarking Utility: The benchmark-httpclient.py utility was updated to use a "drain pattern" instead of asyncio.sleep(0) for more efficient processing of received results.
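The "drain pattern" in the last highlight can be sketched with a `multiprocessing` Connection. Names here are illustrative, not the PR's actual code; the point is that `poll()` is non-blocking, so every already-queued result is consumed synchronously before the caller goes back to a blocking `recv()` (or, in the real async client, back to awaiting).

```python
from multiprocessing import Pipe

def drain(conn, handle) -> int:
    """Consume everything already buffered on the connection without
    waiting; returns how many results were processed."""
    n = 0
    while conn.poll():       # non-blocking readiness check
        handle(conn.recv())  # guaranteed not to block after poll() is True
        n += 1
    return n

parent_end, child_end = Pipe()
for i in range(3):
    child_end.send(i)

received = []
drained = drain(parent_end, received.append)
```

Compared with the old `asyncio.sleep(0)` approach, this avoids a scheduler round-trip per result when many completions are already queued.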


Changelog
  • src/inference_endpoint/endpoint_client/http.py
    • Tuned TCP keepalive probe settings (TCP_KEEPIDLE, TCP_KEEPCNT) for faster connection drop detection.
    • Reduced SO_RCVBUF and SO_SNDBUF from 4MB to 128KB, with updated comments explaining the buffer sizing rationale.
  • src/inference_endpoint/testing/max_throughput_server.py
    • Added a check for the main process ID in the signal handler to ensure only the main process handles signals, preventing child processes from prematurely exiting.
  • src/inference_endpoint/testing/variable_throughput_server.py
    • Added a new VariableResponseServer to simulate OpenAI-compatible LLM API servers with variable response lengths, rates, and inter-chunk latency jitter.
    • Implemented _TokenBucket for global rate limiting and VariableResponseProtocol for handling HTTP requests with sampled output lengths and streaming capabilities.
    • Included CLI arguments for configuring server behavior, such as output length distribution, response rate, streaming options, and worker processes.
  • src/inference_endpoint/utils/benchmark_httpclient.py
    • Modified the receiver coroutine to implement a "fast drain" pattern, synchronously polling all available results before awaiting the next response.
    • Updated the logic for checking completion, stall, and overall time limits within the receiver.
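The `_TokenBucket` mentioned in the changelog can be sketched as follows. The PR's implementation is asyncio-based and shared across workers; this synchronous version only shows the accounting, and all names and values are illustrative.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: tokens refill continuously at
    `rate` per second, capped at `capacity`; each request spends one."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

With `rate` set from the target responses/sec and `capacity` bounding the burst, this is the mechanism behind capping global throughput across all in-flight requests.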

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new variable-throughput-server for more realistic testing, tunes TCP settings, and refactors the benchmark client. The changes are generally positive, but I've identified a few critical issues in the new server related to potential division-by-zero errors from unvalidated user inputs. Additionally, there's an inconsistency in the TCP keepalive setting comments and a minor code duplication in the benchmark client that could be improved.

Copilot AI left a comment

Pull request overview

This PR updates the HTTP client benchmarking and transport defaults to improve stability under high concurrency, and adds a new OpenAI-compatible test server that simulates more realistic (long-tailed) response sizes and streaming behavior.

Changes:

  • Update the benchmark HTTP client receiver loop to drain completed results via poll() before blocking on recv().
  • Add variable_throughput_server.py, a multi-process OpenAI-compatible stub server with variable output lengths and optional streaming jitter.
  • Tune TCP socket defaults (keepalive probes + smaller socket buffers) and harden test-server signal handling in max_throughput_server.py.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
|------|-------------|
| src/inference_endpoint/utils/benchmark_httpclient.py | Adjust receiver logic to drain queued results via poll() before awaiting recv(). |
| src/inference_endpoint/testing/variable_throughput_server.py | New realistic variable-length / streaming-jitter test server with multi-worker SO_REUSEPORT support. |
| src/inference_endpoint/testing/max_throughput_server.py | Ensure SIGINT/SIGTERM handler doesn't run full shutdown logic in worker processes. |
| src/inference_endpoint/endpoint_client/http.py | Update keepalive probe settings and reduce socket buffer sizes. |
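The signal-handler hardening in max_throughput_server.py amounts to a main-PID guard: forked workers inherit the handler, so without the check each worker would run the full shutdown path. A minimal sketch (illustrative names; the actual handler differs):

```python
import os
import signal

MAIN_PID = os.getpid()  # recorded before forking worker processes

def handle_term(signum, frame) -> bool:
    """Return True only when the main process should run full shutdown.
    A real handler would reap workers and close listening sockets here;
    workers that inherited the handler simply fall through."""
    if os.getpid() != MAIN_PID:
        return False  # running in a worker: skip main-only teardown
    return True

signal.signal(signal.SIGTERM, handle_term)
```

The guard works because `MAIN_PID` is captured once, pre-fork, while `os.getpid()` is evaluated at signal time in whichever process received it.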


@viraatc viraatc changed the title chore(http-client): tune tcp settings, update utils WIP: chore(http-client): tune tcp settings, update utils Mar 12, 2026
@viraatc viraatc changed the title WIP: chore(http-client): tune tcp settings, update utils chore(http-client): tune tcp settings, update defaults/utils Mar 12, 2026
Copilot AI review requested due to automatic review settings March 12, 2026 21:52
Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.



Copilot AI review requested due to automatic review settings March 13, 2026 21:23
@viraatc viraatc force-pushed the feat/viraatc-perf-utils branch from 0f0fae0 to 0bb540a on March 13, 2026
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI review requested due to automatic review settings March 13, 2026 22:07
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.



Copilot AI review requested due to automatic review settings March 13, 2026 22:18
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.



@nvzhihanj
Collaborator

Review Council — Multi-AI Code Review Council

Reviewed by: Codex + Claude | Depth: thorough

Found 6 new issues across 3 files (after deduplicating against 28 existing review comments).

🔴 Must Fix (high)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | worker.py | 234 | bug | Claude | Removing fatal sys.exit(1) when warmed == 0 lets workers silently proceed with zero connections |

🟡 Should Fix (medium)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 2 | worker.py | 207 | bug | Codex | max(1, ...) over-allocates sockets when max_connections < num_workers |
| 3 | http.py | 68 | performance | Claude | 32x socket buffer reduction (4MB→128KB) may regress offline throughput |
| 4 | variable_throughput_server.py | 1 | testing | Claude | 947-line new file with zero test coverage |

🔵 Consider (low)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 5 | variable_throughput_server.py | 580 | concurrency | Claude | Fire-and-forget tasks may be GC'd before completion |
| 6 | variable_throughput_server.py | 767 | performance | Claude | Unconditional gc.disable() with no relaxed mode |

Note: Several issues found by both reviewers were already covered by the 28 existing review comments on this PR (recv() blocking forever, DEFAULT_LOADGEN_CORES=5 on small machines, max_concurrency splitting, TCP_KEEPCNT timing, _lognormal_params validation). These were deduplicated and not re-posted.

🤖 Generated with Claude Code

- worker: restore fatal exit when warmup establishes 0 connections
- worker: warn when max_connections < num_workers (over-allocation)
- config: resolve min_required_connections=-1 regardless of max_connections
- cpu_affinity: update DEFAULT_LOADGEN_CORES comment to explain why 5
- benchmark_httpclient: extract _process_result helper, add receiver_done
  event so sender stops when receiver decides to stop
- variable_throughput_server: validate output_len_mean > 0, warn on
  max_concurrency < num_workers, prevent task GC via _tasks set

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 23, 2026 20:37
viraatc and others added 2 commits March 23, 2026 13:40
Move os.environ.setdefault below imports to fix E402 ruff errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.


@viraatc viraatc merged commit 56971a5 into main Mar 23, 2026
4 checks passed
@viraatc viraatc deleted the feat/viraatc-perf-utils branch March 23, 2026 22:48
@github-actions github-actions bot locked and limited conversation to collaborators Mar 23, 2026