chore(http-client): tune tcp settings, update defaults/utils #168

Merged
viraatc merged 13 commits into main from feat/viraatc-perf-utils
Mar 23, 2026

Conversation

@viraatc
Collaborator

@viraatc viraatc commented Mar 12, 2026

What does this PR do?

changes:

  • new variable-throughput-server to imitate real-world long-tailed LLM servers at various throughputs (0-10,000)
  • fix: reduce socket buffer sizes; the previous 4 MB buffers were overkill and caused dropped connections in offline/max-concurrency scenarios
  • fix: tune keepalive probe settings for faster detection of dropped connections
  • chore: use a drain pattern in benchmark_httpclient.py instead of sleep
  • chore: update default min workers from 8 to 10; update the affinity plan to assign 4 cores to the main rank for more stable performance
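The direction of the TCP changes above can be sketched with Python's standard socket options. The probe timings below are illustrative placeholders, not the PR's actual constants (those live in `endpoint_client/http.py`); only the 4 MB to 128 KB buffer reduction is taken from the PR itself.

```python
import socket

# Reduced from 4 MB to 128 KB per the PR; large buffers were overkill and
# caused dropped connections under high concurrency.
RCV_SND_BUF = 128 * 1024

def tune_socket(sock: socket.socket) -> None:
    """Apply keepalive probing and smaller buffers to a TCP socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs; guarded so the sketch stays portable.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Illustrative values: idle seconds before the first probe,
        # seconds between probes, and probes before the kernel drops
        # the connection.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, RCV_SND_BUF)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, RCV_SND_BUF)
```

Lowering `TCP_KEEPIDLE` and `TCP_KEEPCNT` shrinks the window in which a dead peer goes unnoticed, which is the "faster detection" effect the bullet describes.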

Variable-Throughput-Server:

* **Variable output lengths** — lognormal distribution, configurable mean + spread
* **Per-request response rate** — lognormal distribution (responses/sec per request)
* **First-chunk latency (TTFT)** — lognormal delay before first data
* **Per-chunk latency jitter** — lognormal inter-chunk delays in streaming mode

Two mutually exclusive timing modes:

* **Response-rate mode** (``--response-rate-mean``): controls total response time
  per request.  TPOT is derived from ``(1/rate - TTFT) / num_chunks``.
* **Inter-chunk mode** (``--inter-chunk-latency``): controls per-chunk delay
  directly.  Total response time = TTFT + num_chunks × TPOT.
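The two modes are duals of each other; a minimal sketch of the arithmetic (function names are illustrative, not the server's actual API; times in seconds):

```python
def tpot_from_response_rate(rate: float, ttft: float, num_chunks: int) -> float:
    """Response-rate mode: total time per request is fixed at 1/rate;
    TPOT is whatever remains after TTFT, split evenly across chunks."""
    return (1.0 / rate - ttft) / num_chunks

def total_time_from_tpot(ttft: float, tpot: float, num_chunks: int) -> float:
    """Inter-chunk mode: per-chunk delay is fixed; total time follows."""
    return ttft + num_chunks * tpot

# e.g. 0.5 responses/sec (2 s per request), 0.2 s TTFT, 100 chunks
tpot = tpot_from_response_rate(rate=0.5, ttft=0.2, num_chunks=100)
# feeding that TPOT back into inter-chunk mode recovers the 2 s total
```

Note that response-rate mode goes negative when `TTFT > 1/rate`, which is why input validation on these parameters matters (see the review comments below on division-by-zero and parameter validation).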

Usage::

    # Basic non-streaming
    python -m inference_endpoint.testing.variable_throughput_server --stats

    # Offline with response-rate jitter
    python -m inference_endpoint.testing.variable_throughput_server --stats \
        --output-len-mean 1000 --output-len-spread 0.4 \
        --response-rate-mean 10000 --response-rate-spread 2.0

    # Streaming with inter-chunk latency (ms) + TTFT (s)
    python -m inference_endpoint.testing.variable_throughput_server --stream --stats \
        --stream-interval 2 \
        --inter-chunk-latency 20 --inter-chunk-spread 0.05 \
        --first-chunk-latency 0.1 --first-chunk-spread 0.02

Example: DeepSeek-R1 interactive streaming threshold::

    python -m inference_endpoint.testing.variable_throughput_server --stats --num-workers 8 \
        --inter-chunk-latency 15 --first-chunk-latency 1.5 --stream --stream-interval 10

Endpoints run (max-concurrency = 16k):
Summary 
 Total samples issued: 20000
 Total samples completed: 20000
 Duration: 45.59 seconds
 QPS: 438.73
 TPS: 392897.72
 
 
Latency Breakdowns 
 TTFT:
   Min: 246.16 ms
   Max: 10555.38 ms
   Median: 1403.47 ms
   Avg.: 1563.97 ms
   Std Dev.: 754.75 ms
 
   Percentiles:
   99.9: 5845.89 ms
     99: 4103.93 ms
     97: 3316.63 ms
     95: 2960.90 ms
     90: 2527.95 ms
     80: 2069.87 ms
     75: 1915.98 ms
     50: 1403.47 ms
     25: 1039.96 ms
     10: 790.67 ms
      5: 667.05 ms
      1: 492.16 ms
 
 TPOT (request_weighted):
   Min: 16.62 ms
   Max: 17.08 ms
   Median: 16.73 ms
   Avg.: 16.74 ms
   Std Dev.: 0.07 ms
 
   Percentiles:
   99.9: 17.00 ms
     99: 16.91 ms
     97: 16.87 ms
     95: 16.85 ms
     90: 16.83 ms
     80: 16.80 ms
     75: 16.78 ms
     50: 16.73 ms
     25: 16.69 ms
     10: 16.66 ms
      5: 16.65 ms
      1: 16.64 ms

NOTE:

  • spread controls the coefficient of variation (CV); larger values mean more spread (0.1 is considered low)
  • most low-latency simulations will require a high --stream-interval (the server becomes CPU-limited when too many short timers wake up at the same time)
  • control overall server-side throughput with --max-concurrency; it has a similar effect to client-side max-concurrency
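Treating spread as the CV pins down the lognormal parameters given a mean. The mapping below is the standard convention and an assumption about the server's internals, not its actual code:

```python
import math
import random

def lognormal_params(mean: float, spread: float) -> tuple[float, float]:
    """Map (mean, spread) to lognormal (mu, sigma), taking spread as the
    coefficient of variation (std / mean). Standard moment-matching:
    sigma^2 = ln(1 + CV^2), mu = ln(mean) - sigma^2 / 2."""
    sigma2 = math.log(1.0 + spread * spread)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)

random.seed(0)
# e.g. --output-len-mean 1000 --output-len-spread 0.4
mu, sigma = lognormal_params(1000.0, 0.4)
samples = [random.lognormvariate(mu, sigma) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)  # close to 1000 by construction
```

The `- sigma^2 / 2` correction is what keeps the arithmetic mean at the requested value; omitting it would make larger spreads silently inflate the mean.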

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@viraatc viraatc requested a review from a team as a code owner March 12, 2026 08:21
Copilot AI review requested due to automatic review settings March 12, 2026 08:21
@github-actions

github-actions bot commented Mar 12, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions bot requested review from arekay-nv and nvzhihanj March 12, 2026 08:21
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the HTTP client's resilience and testing infrastructure. It introduces a sophisticated VariableResponseServer to better mimic real-world LLM inference characteristics, allowing for more accurate client benchmarking. Concurrently, it refines core TCP socket configurations, specifically reducing buffer sizes and adjusting keepalive probes, to improve connection stability and error detection, particularly under high load. The benchmarking utility itself also sees an optimization for more efficient result processing.

Highlights

  • Introduced Variable Throughput Server: A new variable-throughput-server was added to simulate realistic long-tailed LLM server behaviors with varying throughputs (0-10,000).
  • Optimized TCP Settings: Socket buffer sizes were reduced from 4MB to 128KB to prevent dropped connections in high-concurrency scenarios, and TCP keepalive probe settings were tuned for faster detection of dead connections.
  • Improved Benchmarking Utility: The benchmark-httpclient.py utility was updated to use a "drain pattern" instead of asyncio.sleep(0) for more efficient processing of received results.
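The "drain pattern" in the last highlight can be sketched with a `multiprocessing` Connection. Names here are illustrative, not the PR's actual code; the point is that `poll()` is non-blocking, so every already-queued result is consumed synchronously before the caller goes back to a blocking `recv()` (or, in the real async client, back to awaiting).

```python
from multiprocessing import Pipe

def drain(conn, handle) -> int:
    """Consume everything already buffered on the connection without
    waiting; returns how many results were processed."""
    n = 0
    while conn.poll():       # non-blocking readiness check
        handle(conn.recv())  # guaranteed not to block after poll() is True
        n += 1
    return n

parent_end, child_end = Pipe()
for i in range(3):
    child_end.send(i)

received = []
drained = drain(parent_end, received.append)
```

Compared with the old `asyncio.sleep(0)` approach, this avoids a scheduler round-trip per result when many completions are already queued.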


Changelog
  • src/inference_endpoint/endpoint_client/http.py
    • Tuned TCP keepalive probe settings (TCP_KEEPIDLE, TCP_KEEPCNT) for faster connection drop detection.
    • Reduced SO_RCVBUF and SO_SNDBUF from 4MB to 128KB, with updated comments explaining the buffer sizing rationale.
  • src/inference_endpoint/testing/max_throughput_server.py
    • Added a check for the main process ID in the signal handler to ensure only the main process handles signals, preventing child processes from prematurely exiting.
  • src/inference_endpoint/testing/variable_throughput_server.py
    • Added a new VariableResponseServer to simulate OpenAI-compatible LLM API servers with variable response lengths, rates, and inter-chunk latency jitter.
    • Implemented _TokenBucket for global rate limiting and VariableResponseProtocol for handling HTTP requests with sampled output lengths and streaming capabilities.
    • Included CLI arguments for configuring server behavior, such as output length distribution, response rate, streaming options, and worker processes.
  • src/inference_endpoint/utils/benchmark_httpclient.py
    • Modified the receiver coroutine to implement a "fast drain" pattern, synchronously polling all available results before awaiting the next response.
    • Updated the logic for checking completion, stall, and overall time limits within the receiver.
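The `_TokenBucket` mentioned in the changelog can be sketched as follows. The PR's implementation is asyncio-based and shared across workers; this synchronous version only shows the accounting, and all names and values are illustrative.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: tokens refill continuously at
    `rate` per second, capped at `capacity`; each request spends one."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

With `rate` set from the target responses/sec and `capacity` bounding the burst, this is the mechanism behind capping global throughput across all in-flight requests.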

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new variable-throughput-server for more realistic testing, tunes TCP settings, and refactors the benchmark client. The changes are generally positive, but I've identified a few critical issues in the new server related to potential division-by-zero errors from unvalidated user inputs. Additionally, there's an inconsistency in the TCP keepalive setting comments and a minor code duplication in the benchmark client that could be improved.

Copilot AI left a comment

Pull request overview

This PR updates the HTTP client benchmarking and transport defaults to improve stability under high concurrency, and adds a new OpenAI-compatible test server that simulates more realistic (long-tailed) response sizes and streaming behavior.

Changes:

  • Update the benchmark HTTP client receiver loop to drain completed results via poll() before blocking on recv().
  • Add variable_throughput_server.py, a multi-process OpenAI-compatible stub server with variable output lengths and optional streaming jitter.
  • Tune TCP socket defaults (keepalive probes + smaller socket buffers) and harden test-server signal handling in max_throughput_server.py.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
|------|-------------|
| src/inference_endpoint/utils/benchmark_httpclient.py | Adjust receiver logic to drain queued results via poll() before awaiting recv(). |
| src/inference_endpoint/testing/variable_throughput_server.py | New realistic variable-length / streaming-jitter test server with multi-worker SO_REUSEPORT support. |
| src/inference_endpoint/testing/max_throughput_server.py | Ensure SIGINT/SIGTERM handler doesn't run full shutdown logic in worker processes. |
| src/inference_endpoint/endpoint_client/http.py | Update keepalive probe settings and reduce socket buffer sizes. |
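The signal-handler hardening in max_throughput_server.py amounts to a main-PID guard: forked workers inherit the handler, so without the check each worker would run the full shutdown path. A minimal sketch (illustrative names; the actual handler differs):

```python
import os
import signal

MAIN_PID = os.getpid()  # recorded before forking worker processes

def handle_term(signum, frame) -> bool:
    """Return True only when the main process should run full shutdown.
    A real handler would reap workers and close listening sockets here;
    workers that inherited the handler simply fall through."""
    if os.getpid() != MAIN_PID:
        return False  # running in a worker: skip main-only teardown
    return True

signal.signal(signal.SIGTERM, handle_term)
```

The guard works because `MAIN_PID` is captured once, pre-fork, while `os.getpid()` is evaluated at signal time in whichever process received it.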


@viraatc viraatc changed the title chore(http-client): tune tcp settings, update utils WIP: chore(http-client): tune tcp settings, update utils Mar 12, 2026
@viraatc viraatc changed the title WIP: chore(http-client): tune tcp settings, update utils chore(http-client): tune tcp settings, update defaults/utils Mar 12, 2026
Copilot AI review requested due to automatic review settings March 12, 2026 21:52
Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.



Copilot AI review requested due to automatic review settings March 13, 2026 21:23
@viraatc viraatc force-pushed the feat/viraatc-perf-utils branch from 0f0fae0 to 0bb540a on March 13, 2026
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI review requested due to automatic review settings March 13, 2026 22:07
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.



Copilot AI review requested due to automatic review settings March 13, 2026 22:18
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.



@nvzhihanj
Collaborator

Review Council — Multi-AI Code Review Council

Reviewed by: Codex + Claude | Depth: thorough

Found 6 new issues across 3 files (after deduplicating against 28 existing review comments).

🔴 Must Fix (high)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | worker.py | 234 | bug | Claude | Removing fatal sys.exit(1) when warmed == 0 lets workers silently proceed with zero connections |

🟡 Should Fix (medium)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 2 | worker.py | 207 | bug | Codex | max(1, ...) over-allocates sockets when max_connections < num_workers |
| 3 | http.py | 68 | performance | Claude | 32x socket buffer reduction (4MB→128KB) may regress offline throughput |
| 4 | variable_throughput_server.py | 1 | testing | Claude | 947-line new file with zero test coverage |

🔵 Consider (low)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 5 | variable_throughput_server.py | 580 | concurrency | Claude | Fire-and-forget tasks may be GC'd before completion |
| 6 | variable_throughput_server.py | 767 | performance | Claude | Unconditional gc.disable() with no relaxed mode |

Note: Several issues found by both reviewers were already covered by the 28 existing review comments on this PR (recv() blocking forever, DEFAULT_LOADGEN_CORES=5 on small machines, max_concurrency splitting, TCP_KEEPCNT timing, _lognormal_params validation). These were deduplicated and not re-posted.

🤖 Generated with Claude Code

- worker: restore fatal exit when warmup establishes 0 connections
- worker: warn when max_connections < num_workers (over-allocation)
- config: resolve min_required_connections=-1 regardless of max_connections
- cpu_affinity: update DEFAULT_LOADGEN_CORES comment to explain why 5
- benchmark_httpclient: extract _process_result helper, add receiver_done
  event so sender stops when receiver decides to stop
- variable_throughput_server: validate output_len_mean > 0, warn on
  max_concurrency < num_workers, prevent task GC via _tasks set

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 23, 2026 20:37
viraatc and others added 2 commits March 23, 2026 13:40
Move os.environ.setdefault below imports to fix E402 ruff errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.


@viraatc viraatc merged commit 56971a5 into main Mar 23, 2026
4 checks passed
@viraatc viraatc deleted the feat/viraatc-perf-utils branch March 23, 2026 22:48
@github-actions github-actions bot locked and limited conversation to collaborators Mar 23, 2026