Profiling concurrency of streaming sync completions and import warm-up question #14852

@leventov

Description

This is a follow-up to #14816.

Here's the code I used to profile LiteLLM and find unnecessary lock contention, load_test_litellm.py:

# HOW TO RUN THIS TEST:
# uv pip install py-spy
# STREAMING=1 sudo uv run py-spy record -o profile.svg --idle -r 15 -- python load_test_litellm.py
#
# --idle shows traces from langfuse that are clearly just waiting, but --idle is necessary to
# understand where the code actually blocks on unnecessary locks. To tell unnecessary locks
# apart from benign waiting, use common sense. Also, compare runs with and without
# STREAMING=1 in the environment.

import logging
import os
import time
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed

import litellm
from litellm import completion
from litellm.types.utils import PromptTokensDetailsWrapper


def _now_tenths() -> str:
    dt = datetime.now()
    return f"{dt.strftime('%H:%M:%S')}.{int(dt.microsecond / 100000)}"


def _send_one_request(index: int, streaming: bool = True) -> None:
    start_wall = _now_tenths()
    start = time.perf_counter()

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Create a short story about the number {index} in the style of Lewis Carroll."},
    ]

    try:
        # Keep it simple; we do not use the response content, only timings
        output = completion(
            model="gemini/gemini-2.5-flash",
            messages=messages,
            stream=streaming,
            max_tokens=1000,
        )
        # Drain the response; with stream=True this consumes every chunk
        for _ in output:
            pass
        status = "ok"
    except Exception as exc:  # noqa: BLE001 - surface any error for visibility
        status = f"error({type(exc).__name__}: {exc})"
        logging.exception(f"Error sending request {index}: {exc}")

    end = time.perf_counter()
    end_wall = _now_tenths()
    duration = round(end - start, 1)
    print(f"[{index:02d}] start={start_wall} end={end_wall} dur={duration:.1f}s status={status}")


def main() -> int:
    # Defaults: 14 concurrent requests, matching router.py's _executor capacity
    num_requests = int(os.environ.get("NUM_REQUESTS") or "14")
    max_workers = int(os.environ.get("MAX_WORKERS") or "14")
    # Any non-empty STREAMING value (even "0") enables streaming
    streaming = bool(os.environ.get("STREAMING"))

    # Force initialize Pydantic models with OpenAI models (heavy)
    _ = PromptTokensDetailsWrapper.model_validate({})
    import litellm.types.utils  # noqa: F401

    litellm.disable_streaming_logging = True
    litellm.disable_end_user_cost_tracking = True

    # This is specific to our deployment, but you can change to
    # your integration or remove the callback.
    litellm.success_callback.append('langfuse')

    overall_start = _now_tenths()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(_send_one_request, i, streaming): i
            for i in range(num_requests)
        }
        for f in as_completed(futures):
            print(f"Future {futures[f]} completed at {_now_tenths()}")
    overall_end = _now_tenths()
    print(f"All done. window_start={overall_start} window_end={overall_end}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

Pydantic and OpenAI import warm-up for _calculate_usage_per_chunk()

I see in the profile a lot of time spent here:

__next__ (streaming_handler.py:1668)
stream_chunk_builder (litellm/main.py:6138)
calculate_usage (streaming_chunk_builder_utils.py:478)
_calculate_usage_per_chunk (streaming_chunk_builder_utils.py:384)
_find_and_load (<frozen importlib._bootstrap>:1357)
...

Adding import litellm.types.utils to the script doesn't seem to help. I don't understand this: if litellm.types.utils is already imported and I have even constructed a dummy PromptTokensDetailsWrapper outside the streaming threads, why do the streaming threads still block in _find_and_load called from streaming_chunk_builder_utils.py? Does anyone understand this, or am I misunderstanding how Python imports work?
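One way to narrow this down (a quick diagnostic sketch, not LiteLLM-specific): wrap builtins.__import__ before starting the thread pool and log every import statement the worker threads still execute. Whatever module names show up there are the ones that actually need pre-importing in the main thread; my working hypothesis is that _calculate_usage_per_chunk() lazily imports something other than litellm.types.utils, so warming up litellm.types.utils alone doesn't cover it.

import builtins
import threading

_original_import = builtins.__import__

def _logging_import(name, globals=None, locals=None, fromlist=(), level=0):
    # Log import statements executed off the main thread (including ones that
    # resolve from sys.modules); these are candidates for warm-up imports
    if threading.current_thread() is not threading.main_thread():
        print(f"[lazy import in {threading.current_thread().name}] {name}")
    return _original_import(name, globals, locals, fromlist, level)

builtins.__import__ = _logging_import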

Avoid Pydantic in the hot streaming path

In any case, I think it would be good to avoid creating Pydantic models in the hot streaming path, sync or async alike. Pydantic is not 'great for performance'; it is grossly suboptimal compared to vanilla Python dataclasses.
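For a rough sense of the gap, here is a micro-benchmark sketch with a hypothetical usage-like object (not LiteLLM's actual types; exact numbers depend on the Pydantic version and hardware):

import timeit
from dataclasses import dataclass

from pydantic import BaseModel


class UsageModel(BaseModel):  # hypothetical stand-in for a per-chunk usage object
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


@dataclass
class UsageDataclass:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


n = 100_000
t_pydantic = timeit.timeit(
    lambda: UsageModel(prompt_tokens=10, completion_tokens=20, total_tokens=30), number=n
)
t_dataclass = timeit.timeit(
    lambda: UsageDataclass(prompt_tokens=10, completion_tokens=20, total_tokens=30), number=n
)
print(f"pydantic: {t_pydantic:.3f}s  dataclass: {t_dataclass:.3f}s  for {n} constructions")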

HTTP/2 bottleneck

I tried switching to HTTP/2 transport in the httpcore dependency by creating HTTPTransport(..., http2=True, ...), but it only made performance worse: I was hitting another lock contention site, different from the one fixed in encode/httpcore#1038, so I backed out of it (roughly what I tried is sketched below). It's also unclear what the HTTP/2 status in httpcore and/or LiteLLM is; it looks somewhat like abandonware that nobody uses, which seems bad given how ancient HTTP/1.1 is.
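For reference, the setup was along these lines, sketched via httpx (which wraps httpcore); http2=True requires the h2 extra (pip install 'httpx[http2]'), and how the transport gets plumbed into LiteLLM's underlying client is deployment-specific and omitted here:

import httpx

# Hand the HTTP/2-enabled transport to the client; whether HTTP/2 is actually
# used still depends on what the server negotiates via ALPN
transport = httpx.HTTPTransport(http2=True)
client = httpx.Client(transport=transport)

resp = client.get("https://www.google.com")
print(resp.http_version)  # "HTTP/2" if negotiated, otherwise "HTTP/1.1"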

It's not an urgent problem for me right now, but it seems like a good problem for people who would like to contribute to LiteLLM, and a fix to httpcore could be delivered along the way.
