
feat: add custom http apis, drop aiohttp #92

Merged
viraatc merged 3 commits into main from feat/viraatc-http-transport on Jan 21, 2026

Conversation

@viraatc (Collaborator) commented Jan 10, 2026

What does this PR do?

This PR requires both:

  1. feat: add zmq transport protocol #90 (zmq transport)
  2. feat: enable cpu affinity, pin loadgen to CPU-0 #69 (CPU affinity)
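Dependency (2) pins the load generator to CPU-0 and spreads workers across the remaining cores. A minimal sketch of that partitioning idea, assuming a simple round-robin scheme (the function and parameter names here are illustrative, not the PR's actual cpu_affinity.py API, which is NUMA-aware):

```python
def plan_affinity(num_workers: int, cpus: list[int]) -> dict[int, list[int]]:
    """Assign CPUs to workers round-robin, reserving CPU-0 for the loadgen."""
    pool = [c for c in cpus if c != 0]  # CPU-0 stays with the load generator
    plan: dict[int, list[int]] = {w: [] for w in range(num_workers)}
    for i, cpu in enumerate(pool):
        plan[i % num_workers].append(cpu)
    return plan

# A pinned worker process would then apply its slice (Linux-only), e.g.:
#   os.sched_setaffinity(0, plan[worker_id])
plan = plan_affinity(2, [0, 1, 2, 3, 4])
assert plan == {0: [1, 3], 1: [2, 4]}
```

`os.sched_setaffinity` is the stdlib call for the actual pinning step; the PR's helpers additionally derive worker counts from NUMA topology.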

This PR includes:

  • http.py (httptools-based HTTP/1.1 protocol, running on uvloop)
  • eager-task-factory (main, http-client, worker)
  • relaxed-gc (worker-procs)
  • optimized worker.py flow
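The eager-task-factory and relaxed-GC items above boil down to two stdlib knobs. A hedged sketch (not the PR's actual code; eager task factories require Python 3.12+, and the threshold values are illustrative):

```python
import asyncio
import gc

def tune_loop_and_gc(loop: asyncio.AbstractEventLoop) -> None:
    # Eager tasks begin running synchronously up to their first await,
    # saving one event-loop scheduling round-trip per task (Python 3.12+).
    factory = getattr(asyncio, "eager_task_factory", None)
    if factory is not None:
        loop.set_task_factory(factory)
    # "Relaxed" GC: freeze long-lived startup objects out of collection
    # and raise the gen-0 allocation threshold so collections interrupt
    # the hot path far less often.
    gc.freeze()
    gc.set_threshold(50_000, 10, 10)

async def main() -> int:
    tune_loop_and_gc(asyncio.get_running_loop())
    task = asyncio.create_task(asyncio.sleep(0, result=42))
    return await task

assert asyncio.run(main()) == 42
```

The trade-off of a high threshold is more memory held between collections, which is usually acceptable for short-lived benchmark worker processes.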

The new implementation (http.py) improves over the old one (aiohttp) by:

  • a custom TCP pool implementation (LIFO reuse vs aiohttp's random-based selection)
  • reduced overhead (minimal implementation, no cookie handling)
  • collections.deque instead of the slower asyncio.Queue
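The LIFO-plus-deque combination can be illustrated with a toy sketch (the class below is a stand-in for illustration, not the PR's actual pool, which manages real TCP transports):

```python
from collections import deque

class LifoPool:
    """Toy LIFO connection pool; object() stands in for a TCP connection."""

    def __init__(self, max_idle: int = 1024) -> None:
        self._idle: deque = deque()  # pop()/append() are O(1), no lock or waiter machinery
        self._max_idle = max_idle

    def acquire(self):
        # LIFO reuse: hand back the most recently released connection,
        # which is the most likely to still be warm (TCP window, kernel caches).
        return self._idle.pop() if self._idle else object()

    def release(self, conn) -> None:
        if len(self._idle) < self._max_idle:
            self._idle.append(conn)
        # else: the surplus connection would be closed

pool = LifoPool(max_idle=2)
first = pool.acquire()
pool.release(first)
assert pool.acquire() is first  # most recently released comes back first
```

A plain deque also avoids asyncio.Queue's per-operation future allocation, which matters at the request rates in the benchmarks below.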

dependencies updated in pyproject.toml (dropped aiohttp, added httptools directly):

new:
    "orjson==3.11.5",
    "pyzmq==27.1.0",
    "uvloop==0.22.1",
    "msgspec==0.20.0",
    "httptools==0.7.1"

old:
    "orjson==3.11.0",
    "aiohttp==3.12.15",
    "pyzmq==27.0.2",
    "uvloop==0.21.0",
    "msgspec==0.19.0",

(1) better error rates when oversubscribing the ephemeral port limit

example: offline mode with 60k queries:

vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 &

inference-endpoint benchmark offline --endpoint http://localhost:8000  --model Qwen/Qwen2.5-0.5B-Instruct --max-output-tokens 2 --num-samples 60000 --streaming on --timeout 900 --dataset tests/datasets/ds_samples.pkl

old:

QPS: 563.80
TPS: 733.76
Errors: 20956 (Cannot Assign Given Address, Connection Timed Out, more)

new: (max-connections = "auto-min", i.e. the default of 1024 TCP connections)

QPS: 721.62
TPS: 1443.23
(No Errors :))

new: (max-connections = "auto", i.e. the ephemeral port limit, 22k in this case)

QPS: 595.75
TPS: 1170.80
Errors: 1042 (Connection timed out; a fresh TCP socket is needed to continue transmission)

better error rates at high issue rates.
example: offline mode with 20k queries (within the ephemeral port limit):

vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 &

inference-endpoint benchmark offline --endpoint http://localhost:8000  --model Qwen/Qwen2.5-0.5B-Instruct --max-output-tokens 2 --num-samples 20000 --streaming on --timeout 900 --dataset tests/datasets/ds_samples.pkl

old:

QPS: 532.04
TPS: 959.06
Errors: 1974

new: (max-connections = "auto-min", i.e. the default of 1024 TCP connections)

QPS: 696.28
TPS: 1392.56
(No Errors :))

new: (max-connections = "auto", i.e. the ephemeral port limit, 22k in this case)

QPS: 648.97  (within run-to-run variance)
TPS: 1297.94
(No Errors :))

(2) higher throughput (microbenchmarks)

Benchmark                   Throughput Speedup      p99 Improvement
----------------------------------------------------------------------
Request Building                         2.20x                2.73x
Pool Acquire/Release                     5.11x                5.79x
Full Request                             7.81x                7.43x
Streaming                                3.19x                4.94x

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@viraatc viraatc requested a review from a team as a code owner January 10, 2026 07:29
Copilot AI review requested due to automatic review settings January 10, 2026 07:29

github-actions bot commented Jan 10, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Copilot AI left a comment
Pull request overview

This PR removes aiohttp as a dependency and replaces it with a custom HTTP client implementation built directly on asyncio's Protocol/Transport API, using httptools for HTTP parsing. The change also removes numerous ZMQ configuration parameters from test fixtures and client instantiation, replacing them with a cleaner transport abstraction.

Changes:

  • Replaced aiohttp with custom HTTP client using httptools (llhttp parser)
  • Refactored ZMQ transport into protocol-based abstraction with factory pattern
  • Added CPU affinity utilities and automatic worker count detection based on NUMA topology

Reviewed changes

Copilot reviewed 40 out of 40 changed files in this pull request and generated no comments.

Summary per file:

  • src/inference_endpoint/endpoint_client/http.py — new custom HTTP client implementation with connection pooling
  • src/inference_endpoint/endpoint_client/transport/ — new transport abstraction layer for worker IPC
  • src/inference_endpoint/endpoint_client/worker.py — refactored to use the new HTTP client and transport abstractions
  • src/inference_endpoint/endpoint_client/configs.py — removed AioHttpConfig and ZMQConfig, added new configuration options
  • src/inference_endpoint/utils/cpu_affinity.py — new CPU affinity utilities for NUMA-aware worker pinning
  • tests/unit/endpoint_client/test_http.py — unit tests for the HTTP request template builder
  • tests/unit/endpoint_client/transport/test_zmq.py — unit tests for the ZMQ transport layer
  • multiple test files — updated to use simplified client instantiation


gemini-code-assist bot commented

Summary of Changes

Hello @viraatc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the client-side networking and worker management infrastructure. The core purpose is to achieve higher performance and greater control over HTTP requests and inter-process communication by moving away from aiohttp to a custom httptools-based solution. The changes also introduce sophisticated resource management capabilities, such as NUMA-aware CPU pinning and flexible streaming response handling, to ensure optimal benchmark execution and endpoint interaction.

Highlights

  • Custom HTTP Client Implementation: Replaced the aiohttp library with a custom HTTP/1.1 client built upon httptools. This change provides more granular control over network interactions and aims to improve performance characteristics.
  • Abstracted Inter-Process Communication (IPC): Introduced a new abstraction layer for communication between the main client and its worker processes. This layer is currently implemented using ZeroMQ, enhancing modularity and future flexibility.
  • Enhanced Worker Management and CPU Affinity: Implemented advanced worker management features, including dynamic worker count determination based on NUMA topology, CPU affinity pinning for both the load generator and worker processes, and configurable garbage collection strategies for workers to optimize latency and resource utilization.
  • Server-Sent Events (SSE) Accumulator Abstraction: Introduced a protocol for handling Server-Sent Events (SSE) streaming responses, allowing for flexible accumulation and processing of data from different API types like OpenAI and SGLang.
  • Dependency Streamlining: Removed aiohttp as a core dependency, updated versions of other packages like orjson, pyzmq, uvloop, and msgspec, and added httptools to the project dependencies.
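The SSE accumulator idea mentioned above can be sketched as a tiny byte-feeding parser (the class and method names are assumptions for illustration, not this PR's actual protocol):

```python
class SSEAccumulator:
    """Buffer raw response bytes and emit complete `data:` payloads."""

    def __init__(self) -> None:
        self._buf = b""

    def feed(self, chunk: bytes) -> list[bytes]:
        self._buf += chunk
        payloads: list[bytes] = []
        # An SSE event is terminated by a blank line.
        while b"\n\n" in self._buf:
            event, self._buf = self._buf.split(b"\n\n", 1)
            for line in event.split(b"\n"):
                if line.startswith(b"data: "):
                    payloads.append(line[len(b"data: "):])
        return payloads

acc = SSEAccumulator()
assert acc.feed(b"data: hel") == []  # incomplete event stays buffered
assert acc.feed(b"lo\n\ndata: [DONE]\n\n") == [b"hello", b"[DONE]"]
```

An OpenAI-style accumulator would additionally JSON-decode each payload and stop on the `[DONE]` sentinel; an SGLang-style one would differ only in how payloads are interpreted.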


gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and impressive refactoring by replacing aiohttp with a custom, high-performance HTTP client. The new implementation is well-structured, introducing a clean transport abstraction layer and performance-oriented features like a custom connection pool, CPU affinity management, and GC tuning. The code is modular, and the tests have been diligently updated to reflect the new architecture. My feedback includes a few suggestions for improvement, such as simplifying the async handling in the probe command and preventing potential duplicate HTTP headers. Overall, this is a high-quality contribution that significantly enhances the client's performance and maintainability.

Copilot AI review requested due to automatic review settings January 10, 2026 14:09
Copilot AI left a comment

Pull request overview

Copilot reviewed 40 out of 40 changed files in this pull request and generated no new comments.



@nvzhihanj (Collaborator) commented:

will review after rebase.

@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from 21c0fea to db18136 Compare January 12, 2026 22:11
Copilot AI review requested due to automatic review settings January 12, 2026 22:13
@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from db18136 to be224de Compare January 12, 2026 22:13
Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 39 changed files in this pull request and generated no new comments.



@viraatc viraatc changed the base branch from main to feat/viraatc-zmq-transport January 12, 2026 22:15
@viraatc viraatc changed the base branch from feat/viraatc-zmq-transport to main January 12, 2026 22:16
@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from be224de to 1f33f7d Compare January 12, 2026 23:43
@viraatc viraatc changed the base branch from main to feat/viraatc-zmq-transport January 12, 2026 23:44
@viraatc viraatc changed the base branch from feat/viraatc-zmq-transport to main January 13, 2026 00:14
@viraatc viraatc changed the base branch from main to feat/viraatc-zmq-transport January 13, 2026 02:15
@viraatc viraatc mentioned this pull request Jan 13, 2026
5 tasks
@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from b68d3df to c29c860 Compare January 13, 2026 03:32
@arekay-nv (Collaborator) left a comment

Great work - Thanks!

Copilot AI review requested due to automatic review settings January 17, 2026 02:49
Copilot AI left a comment

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.



Copilot AI review requested due to automatic review settings January 17, 2026 07:07
Copilot AI left a comment

Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated 4 comments.



Copilot AI review requested due to automatic review settings January 17, 2026 08:51
Copilot AI left a comment

Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 4 comments.



@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from e0f8601 to 6069095 Compare January 17, 2026 09:10
Copilot AI review requested due to automatic review settings January 20, 2026 06:52
Copilot AI left a comment

Copilot reviewed 36 out of 36 changed files in this pull request and generated 3 comments.



Copilot AI review requested due to automatic review settings January 21, 2026 12:25
@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from 77c044b to caa3e38 Compare January 21, 2026 12:25
Copilot AI left a comment

Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

tests/integration/endpoint_client/test_worker.py:1

  • The port number changed from 99999 to 59999. Port 99999 is outside the valid port range (1-65535), so 59999 is correct. However, consider using a port that's less likely to be in use, such as a higher ephemeral port or documenting why 59999 was chosen.
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


Copilot AI review requested due to automatic review settings January 21, 2026 12:33
Copilot AI left a comment

Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated no new comments.



@viraatc viraatc merged commit e8178be into main Jan 21, 2026
4 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 21, 2026
@viraatc viraatc deleted the feat/viraatc-http-transport branch February 6, 2026 23:07