Conversation

@leventov (Contributor) commented Sep 23, 2025

Problem

I have 14 Python threads doing streaming LLM requests concurrently.

Expectation: this should be a mostly network-IO-bound workload, so it should scale reasonably well.

Reality: performance is practically serial, i.e., the batch executes only a little faster than L * N / nCPU, where L is the latency of a single such request (streaming or not), N is the number of requests, and nCPU is the number of (v)CPUs available to the Python process.
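
For reference, the workload has roughly this shape (a minimal sketch; the model name and prompts are placeholders, and the chunk access assumes the OpenAI-style streaming shape that litellm mimics):

import concurrent.futures

import litellm

def stream_one(prompt: str) -> str:
    # stream=True returns an iterator of incremental chunks.
    chunks = litellm.completion(
        model="openrouter/openai/gpt-4o",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return "".join(chunk.choices[0].delta.content or "" for chunk in chunks)

prompts = [f"question {i}" for i in range(14)]
with concurrent.futures.ThreadPoolExecutor(max_workers=14) as executor:
    results = list(executor.map(stream_one, prompts))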

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory (adding at least 1 test is a hard requirement; see details)
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🆕 New Feature
🐛 Bug Fix

Changes

  1. Guard executor.submit() with if not litellm.disable_streaming_logging in the hot path in streaming_handler.py's __next__() (see the sketch after this list). This is a no-brainer change: run_success_logging_and_cache_storage() is exactly a no-op when litellm.disable_streaming_logging is True, so submitting a no-op to an executor makes no sense.

  2. Update the httpcore dependency to pick up "Don't hold lock unless necessary in PoolByteStream.close()" (encode/httpcore#1038).

  3. Make the sync transport configurable via litellm.sync_transport. I could have avoided this by swapping out the client altogether, but this is a quality-of-life change.
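
For change 1, the guard looks roughly like this (a simplified sketch based on the names in this description, not the verbatim LiteLLM source; the argument list is illustrative):

# In streaming_handler.py, inside __next__(): only submit the logging
# task when it would actually do something.
if not litellm.disable_streaming_logging:
    executor.submit(
        self.run_success_logging_and_cache_storage,
        response,
        cache_hit,
    )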

Then, in my code when I use litellm, I pre-configure the HTTPTransport in the following way:

# This block reproduces the default arguments used in
# litellm.llms.custom_httpx.http_handler.HTTPHandler.__init__(),
# plus pre-warmed initial connections.
import os

from httpcore import Origin
from httpx import Limits
from httpx._transports.default import HTTPTransport

import litellm
from litellm.llms.custom_httpx.http_handler import get_ssl_configuration

ssl_config = get_ssl_configuration(ssl_verify=None)
cert = os.getenv("SSL_CERTIFICATE", litellm.ssl_certificate)
limits = Limits(max_connections=1000, max_keepalive_connections=1000)

litellm.sync_transport = HTTPTransport(verify=ssl_config, cert=cert, limits=limits)

# Pre-warm one connection per worker thread for each origin.
# Note: _pool and _connections are private httpx/httpcore internals,
# so this relies on implementation details of the pinned versions.
initial_connections = {
    Origin(b"https", b"generativelanguage.googleapis.com", 443): 14,
    Origin(b"https", b"openrouter.ai", 443): 14,
}
pool = litellm.sync_transport._pool
for origin, num_conn in initial_connections.items():
    for _ in range(num_conn):
        pool._connections.append(pool.create_connection(origin=origin))
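
A quick sanity check after the pre-warm (again relying on httpcore internals; 28 = 2 origins × 14 connections):

# Before any request is issued, the pool should already hold the
# pre-created connections.
assert len(pool._connections) == 28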

@CLAassistant commented Sep 23, 2025

CLA assistant check
All committers have signed the CLA.

vercel bot commented Sep 23, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project   Deployment   Preview   Comments   Updated (UTC)
litellm   Ready        Preview   Comment    Sep 23, 2025 4:07pm

@krrishdholakia merged commit 2e3e7de into BerriAI:main on Sep 24, 2025
4 of 7 checks passed