
feat: add custom http apis, drop aiohttp #92

Merged
viraatc merged 3 commits into main from feat/viraatc-http-transport on Jan 21, 2026

Conversation

@viraatc (Collaborator) commented Jan 10, 2026

What does this PR do?

This PR requires both:

  1. feat: add zmq transport protocol #90 (zmq transport)
  2. feat: enable cpu affinity, pin loadgen to CPU-0 #69 (CPU affinity)
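Dependency (2) pins the load generator to CPU-0 and spreads workers across the remaining cores. A minimal sketch of that partitioning idea, assuming a simple round-robin scheme (the function and parameter names here are illustrative, not the PR's actual cpu_affinity.py API, which is NUMA-aware):

```python
def plan_affinity(num_workers: int, cpus: list[int]) -> dict[int, list[int]]:
    """Assign CPUs to workers round-robin, reserving CPU-0 for the loadgen."""
    pool = [c for c in cpus if c != 0]  # CPU-0 stays with the load generator
    plan: dict[int, list[int]] = {w: [] for w in range(num_workers)}
    for i, cpu in enumerate(pool):
        plan[i % num_workers].append(cpu)
    return plan

# A pinned worker process would then apply its slice (Linux-only), e.g.:
#   os.sched_setaffinity(0, plan[worker_id])
plan = plan_affinity(2, [0, 1, 2, 3, 4])
assert plan == {0: [1, 3], 1: [2, 4]}
```

`os.sched_setaffinity` is the stdlib call for the actual pinning step; the PR's helpers additionally derive worker counts from NUMA topology.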

This PR includes:

  • http.py (httptools-based HTTP/1.1 protocol, running on uvloop)
  • eager-task-factory (main, http-client, worker)
  • relaxed-gc (worker-procs)
  • optimized worker.py flow
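The eager-task-factory and relaxed-GC items above boil down to two stdlib knobs. A hedged sketch (not the PR's actual code; eager task factories require Python 3.12+, and the threshold values are illustrative):

```python
import asyncio
import gc

def tune_loop_and_gc(loop: asyncio.AbstractEventLoop) -> None:
    # Eager tasks begin running synchronously up to their first await,
    # saving one event-loop scheduling round-trip per task (Python 3.12+).
    factory = getattr(asyncio, "eager_task_factory", None)
    if factory is not None:
        loop.set_task_factory(factory)
    # "Relaxed" GC: freeze long-lived startup objects out of collection
    # and raise the gen-0 allocation threshold so collections interrupt
    # the hot path far less often.
    gc.freeze()
    gc.set_threshold(50_000, 10, 10)

async def main() -> int:
    tune_loop_and_gc(asyncio.get_running_loop())
    task = asyncio.create_task(asyncio.sleep(0, result=42))
    return await task

assert asyncio.run(main()) == 42
```

The trade-off of a high threshold is more memory held between collections, which is usually acceptable for short-lived benchmark worker processes.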

The new implementation (http.py) improves over the old one (aiohttp) by:

  • a custom TCP pool implementation (LIFO reuse vs aiohttp's random-based selection)
  • reduced overhead (minimal implementation, no cookie handling)
  • collections.deque instead of the slower asyncio.Queue
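The LIFO-plus-deque combination can be illustrated with a toy sketch (the class below is a stand-in for illustration, not the PR's actual pool, which manages real TCP transports):

```python
from collections import deque

class LifoPool:
    """Toy LIFO connection pool; object() stands in for a TCP connection."""

    def __init__(self, max_idle: int = 1024) -> None:
        self._idle: deque = deque()  # pop()/append() are O(1), no lock or waiter machinery
        self._max_idle = max_idle

    def acquire(self):
        # LIFO reuse: hand back the most recently released connection,
        # which is the most likely to still be warm (TCP window, kernel caches).
        return self._idle.pop() if self._idle else object()

    def release(self, conn) -> None:
        if len(self._idle) < self._max_idle:
            self._idle.append(conn)
        # else: the surplus connection would be closed

pool = LifoPool(max_idle=2)
first = pool.acquire()
pool.release(first)
assert pool.acquire() is first  # most recently released comes back first
```

A plain deque also avoids asyncio.Queue's per-operation future allocation, which matters at the request rates in the benchmarks below.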

dependencies updated in pyproject.toml (dropped aiohttp, added httptools directly):

new:
    "orjson==3.11.5",
    "pyzmq==27.1.0",
    "uvloop==0.22.1",
    "msgspec==0.20.0",
    "httptools==0.7.1"

old:
    "orjson==3.11.0",
    "aiohttp==3.12.15",
    "pyzmq==27.0.2",
    "uvloop==0.21.0",
    "msgspec==0.19.0",

(1) better error rates when oversubscribing the ephemeral port limit

example: offline mode with 60k queries:

vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 &

inference-endpoint benchmark offline --endpoint http://localhost:8000  --model Qwen/Qwen2.5-0.5B-Instruct --max-output-tokens 2 --num-samples 60000 --streaming on --timeout 900 --dataset tests/datasets/ds_samples.pkl

old:

QPS: 563.80
TPS: 733.76
Errors: 20956 (Cannot Assign Given Address, Connection Timed Out, more)

new: (max-connections = "auto-min", i.e. the default of 1024 TCP connections)

QPS: 721.62
TPS: 1443.23
(No Errors :))

new: (max-connections = "auto", i.e. the ephemeral port limit, 22k in this case)

QPS: 595.75
TPS: 1170.80
Errors: 1042 (Connection timed out; a fresh TCP socket is needed to continue transmission)

better error rates at high issue rates.
example: offline mode with 20k queries (within the ephemeral port limit):

vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 &

inference-endpoint benchmark offline --endpoint http://localhost:8000  --model Qwen/Qwen2.5-0.5B-Instruct --max-output-tokens 2 --num-samples 20000 --streaming on --timeout 900 --dataset tests/datasets/ds_samples.pkl

old:

QPS: 532.04
TPS: 959.06
Errors: 1974

new: (max-connections = "auto-min", i.e. the default of 1024 TCP connections)

QPS: 696.28
TPS: 1392.56
(No Errors :))

new: (max-connections = "auto", i.e. the ephemeral port limit, 22k in this case)

QPS: 648.97  (within run-to-run variance)
TPS: 1297.94
(No Errors :))

(2) higher throughput (microbenchmarks)

Benchmark                   Throughput Speedup      p99 Improvement
----------------------------------------------------------------------
Request Building                         2.20x                2.73x
Pool Acquire/Release                     5.11x                5.79x
Full Request                             7.81x                7.43x
Streaming                                3.19x                4.94x

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@viraatc viraatc requested a review from a team as a code owner January 10, 2026 07:29
Copilot AI review requested due to automatic review settings January 10, 2026 07:29

github-actions bot commented Jan 10, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Copilot AI left a comment
Pull request overview

This PR removes aiohttp as a dependency and replaces it with a custom HTTP client implementation built directly on asyncio's Protocol/Transport API, using httptools for HTTP parsing. The change also removes numerous ZMQ configuration parameters from test fixtures and client instantiation, replacing them with a cleaner transport abstraction.

Changes:

  • Replaced aiohttp with custom HTTP client using httptools (llhttp parser)
  • Refactored ZMQ transport into protocol-based abstraction with factory pattern
  • Added CPU affinity utilities and automatic worker count detection based on NUMA topology

Reviewed changes

Copilot reviewed 40 out of 40 changed files in this pull request and generated no comments.

Summary per file:

  • src/inference_endpoint/endpoint_client/http.py — new custom HTTP client implementation with connection pooling
  • src/inference_endpoint/endpoint_client/transport/ — new transport abstraction layer for worker IPC
  • src/inference_endpoint/endpoint_client/worker.py — refactored to use the new HTTP client and transport abstractions
  • src/inference_endpoint/endpoint_client/configs.py — removed AioHttpConfig and ZMQConfig, added new configuration options
  • src/inference_endpoint/utils/cpu_affinity.py — new CPU affinity utilities for NUMA-aware worker pinning
  • tests/unit/endpoint_client/test_http.py — unit tests for the HTTP request template builder
  • tests/unit/endpoint_client/transport/test_zmq.py — unit tests for the ZMQ transport layer
  • multiple test files — updated to use simplified client instantiation


gemini-code-assist bot commented

Summary of Changes

Hello @viraatc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the client-side networking and worker management infrastructure. The core purpose is to achieve higher performance and greater control over HTTP requests and inter-process communication by moving away from aiohttp to a custom httptools-based solution. The changes also introduce sophisticated resource management capabilities, such as NUMA-aware CPU pinning and flexible streaming response handling, to ensure optimal benchmark execution and endpoint interaction.

Highlights

  • Custom HTTP Client Implementation: Replaced the aiohttp library with a custom HTTP/1.1 client built upon httptools. This change provides more granular control over network interactions and aims to improve performance characteristics.
  • Abstracted Inter-Process Communication (IPC): Introduced a new abstraction layer for communication between the main client and its worker processes. This layer is currently implemented using ZeroMQ, enhancing modularity and future flexibility.
  • Enhanced Worker Management and CPU Affinity: Implemented advanced worker management features, including dynamic worker count determination based on NUMA topology, CPU affinity pinning for both the load generator and worker processes, and configurable garbage collection strategies for workers to optimize latency and resource utilization.
  • Server-Sent Events (SSE) Accumulator Abstraction: Introduced a protocol for handling Server-Sent Events (SSE) streaming responses, allowing for flexible accumulation and processing of data from different API types like OpenAI and SGLang.
  • Dependency Streamlining: Removed aiohttp as a core dependency, updated versions of other packages like orjson, pyzmq, uvloop, and msgspec, and added httptools to the project dependencies.
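The SSE accumulator idea mentioned above can be sketched as a tiny byte-feeding parser (the class and method names are assumptions for illustration, not this PR's actual protocol):

```python
class SSEAccumulator:
    """Buffer raw response bytes and emit complete `data:` payloads."""

    def __init__(self) -> None:
        self._buf = b""

    def feed(self, chunk: bytes) -> list[bytes]:
        self._buf += chunk
        payloads: list[bytes] = []
        # An SSE event is terminated by a blank line.
        while b"\n\n" in self._buf:
            event, self._buf = self._buf.split(b"\n\n", 1)
            for line in event.split(b"\n"):
                if line.startswith(b"data: "):
                    payloads.append(line[len(b"data: "):])
        return payloads

acc = SSEAccumulator()
assert acc.feed(b"data: hel") == []  # incomplete event stays buffered
assert acc.feed(b"lo\n\ndata: [DONE]\n\n") == [b"hello", b"[DONE]"]
```

An OpenAI-style accumulator would additionally JSON-decode each payload and stop on the `[DONE]` sentinel; an SGLang-style one would differ only in how payloads are interpreted.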


gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and impressive refactoring by replacing aiohttp with a custom, high-performance HTTP client. The new implementation is well-structured, introducing a clean transport abstraction layer and performance-oriented features like a custom connection pool, CPU affinity management, and GC tuning. The code is modular, and the tests have been diligently updated to reflect the new architecture. My feedback includes a few suggestions for improvement, such as simplifying the async handling in the probe command and preventing potential duplicate HTTP headers. Overall, this is a high-quality contribution that significantly enhances the client's performance and maintainability.

Copilot AI review requested due to automatic review settings January 10, 2026 14:09
Copilot AI left a comment

Pull request overview

Copilot reviewed 40 out of 40 changed files in this pull request and generated no new comments.



@nvzhihanj (Collaborator) commented:

will review after rebase.

@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from 21c0fea to db18136 Compare January 12, 2026 22:11
Copilot AI review requested due to automatic review settings January 12, 2026 22:13
@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from db18136 to be224de Compare January 12, 2026 22:13
Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 39 changed files in this pull request and generated no new comments.



@viraatc viraatc changed the base branch from main to feat/viraatc-zmq-transport January 12, 2026 22:15
@viraatc viraatc changed the base branch from feat/viraatc-zmq-transport to main January 12, 2026 22:16
@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from be224de to 1f33f7d Compare January 12, 2026 23:43
@viraatc viraatc changed the base branch from main to feat/viraatc-zmq-transport January 12, 2026 23:44
@viraatc viraatc changed the base branch from feat/viraatc-zmq-transport to main January 13, 2026 00:14
@viraatc viraatc changed the base branch from main to feat/viraatc-zmq-transport January 13, 2026 02:15
@viraatc viraatc mentioned this pull request Jan 13, 2026
5 tasks
@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from b68d3df to c29c860 Compare January 13, 2026 03:32
@arekay-nv (Collaborator) left a comment

Great work - Thanks!

Copilot AI review requested due to automatic review settings January 17, 2026 02:49
Copilot AI left a comment

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.



Copilot AI review requested due to automatic review settings January 17, 2026 07:07
Copilot AI left a comment

Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated 4 comments.



Copilot AI review requested due to automatic review settings January 17, 2026 08:51
Copilot AI left a comment

Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 4 comments.



@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from e0f8601 to 6069095 Compare January 17, 2026 09:10
Copilot AI review requested due to automatic review settings January 20, 2026 06:52
Copilot AI left a comment

Copilot reviewed 36 out of 36 changed files in this pull request and generated 3 comments.



Copilot AI review requested due to automatic review settings January 21, 2026 12:25
@viraatc viraatc force-pushed the feat/viraatc-http-transport branch from 77c044b to caa3e38 Compare January 21, 2026 12:25
Copilot AI left a comment

Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

tests/integration/endpoint_client/test_worker.py:1

  • The port number changed from 99999 to 59999. Port 99999 is outside the valid port range (1-65535), so 59999 is correct. However, consider using a port that's less likely to be in use, such as a higher ephemeral port or documenting why 59999 was chosen.
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


Copilot AI review requested due to automatic review settings January 21, 2026 12:33
Copilot AI left a comment

Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated no new comments.



@viraatc viraatc merged commit e8178be into main Jan 21, 2026
4 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 21, 2026
@viraatc viraatc deleted the feat/viraatc-http-transport branch February 6, 2026 23:07