Profiling concurrency of streaming sync completions and import warm-up question #14852

@leventov

Description

This is a follow-up to #14816.

Here's the code I used to profile LiteLLM and find unnecessary lock contention, load_test_litellm.py:

# HOW TO RUN THIS TEST:
# uv pip install py-spy
# STREAMING=1 sudo uv run py-spy record -o profile.svg --idle -r 15 -- python load_test_litellm.py
#
# --idle shows traces from langfuse that are clearly just waiting, but --idle is necessary to
# understand where the code actually blocks on unnecessary locks. To tell unnecessary locks
# apart from benign waiting, use common sense. Also, compare runs with and without
# STREAMING=1 in the environment.

import logging
import os
import time
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed

import litellm
from litellm import completion
from litellm.types.utils import PromptTokensDetailsWrapper


def _now_tenths() -> str:
    dt = datetime.now()
    return f"{dt.strftime('%H:%M:%S')}.{int(dt.microsecond / 100000)}"


def _send_one_request(index: int, streaming: bool = True) -> None:
    start_wall = _now_tenths()
    start = time.perf_counter()

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Create a short story about the number {index} in the style of Lewis Carroll."},
    ]

    try:
        # Keep it simple; we do not use the response content, only timings
        output = completion(
            model="gemini/gemini-2.5-flash",
            messages=messages,
            stream=streaming,
            max_tokens=1000,
        )
        # Drain the response; with stream=True this consumes every chunk
        for _ in output:
            pass
        status = "ok"
    except Exception as exc:  # noqa: BLE001 - surface any error for visibility
        status = f"error({type(exc).__name__}: {exc})"
        logging.exception(f"Error sending request {index}: {exc}")

    end = time.perf_counter()
    end_wall = _now_tenths()
    duration = round(end - start, 1)
    print(f"[{index:02d}] start={start_wall} end={end_wall} dur={duration:.1f}s status={status}")


def main() -> int:
    # Defaults: 14 concurrent requests, matching router.py's _executor capacity
    num_requests = int(os.environ.get("NUM_REQUESTS") or "14")
    max_workers = int(os.environ.get("MAX_WORKERS") or "14")
    # Any non-empty STREAMING value (even "0") enables streaming
    streaming = bool(os.environ.get("STREAMING"))

    # Force initialize Pydantic models with OpenAI models (heavy)
    _ = PromptTokensDetailsWrapper.model_validate({})
    import litellm.types.utils  # noqa: F401

    litellm.disable_streaming_logging = True
    litellm.disable_end_user_cost_tracking = True

    # This is specific to our deployment, but you can change to
    # your integration or remove the callback.
    litellm.success_callback.append('langfuse')

    overall_start = _now_tenths()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(_send_one_request, i, streaming): i
            for i in range(num_requests)
        }
        for f in as_completed(futures):
            print(f"Future {futures[f]} completed at {_now_tenths()}")
    overall_end = _now_tenths()
    print(f"All done. window_start={overall_start} window_end={overall_end}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

Pydantic and OpenAI import warm-up for _calculate_usage_per_chunk()

I see in the profile a lot of time spent here:

__next__ (streaming_handler.py:1668)
stream_chunk_builder (litellm/main.py:6138)
calculate_usage (streaming_chunk_builder_utils.py:478)
_calculate_usage_per_chunk (streaming_chunk_builder_utils.py:384)
_find_and_load (<frozen importlib._bootstrap>:1357)
...

Adding import litellm.types.utils to the script doesn't seem to help. I don't understand this: if litellm.types.utils is already imported and I have even constructed a dummy PromptTokensDetailsWrapper outside the streaming threads, why do the streaming threads still block in _find_and_load called from streaming_chunk_builder_utils.py? Does anyone understand this, or am I misunderstanding how Python imports work?
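One way to narrow this down (a quick diagnostic sketch, not LiteLLM-specific): wrap builtins.__import__ before starting the thread pool and log every import statement the worker threads still execute. Whatever module names show up there are the ones that actually need pre-importing in the main thread; my working hypothesis is that _calculate_usage_per_chunk() lazily imports something other than litellm.types.utils, so warming up litellm.types.utils alone doesn't cover it.

import builtins
import threading

_original_import = builtins.__import__

def _logging_import(name, globals=None, locals=None, fromlist=(), level=0):
    # Log import statements executed off the main thread (including ones that
    # resolve from sys.modules); these are candidates for warm-up imports
    if threading.current_thread() is not threading.main_thread():
        print(f"[lazy import in {threading.current_thread().name}] {name}")
    return _original_import(name, globals, locals, fromlist, level)

builtins.__import__ = _logging_import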

Avoid Pydantic in the hot streaming path

In any case, I think it would be good to avoid creating Pydantic models in the hot streaming path, sync or async alike. Pydantic is not 'great for performance'; it is grossly suboptimal compared to vanilla Python dataclasses.
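For a rough sense of the gap, here is a micro-benchmark sketch with a hypothetical usage-like object (not LiteLLM's actual types; exact numbers depend on the Pydantic version and hardware):

import timeit
from dataclasses import dataclass

from pydantic import BaseModel


class UsageModel(BaseModel):  # hypothetical stand-in for a per-chunk usage object
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


@dataclass
class UsageDataclass:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


n = 100_000
t_pydantic = timeit.timeit(
    lambda: UsageModel(prompt_tokens=10, completion_tokens=20, total_tokens=30), number=n
)
t_dataclass = timeit.timeit(
    lambda: UsageDataclass(prompt_tokens=10, completion_tokens=20, total_tokens=30), number=n
)
print(f"pydantic: {t_pydantic:.3f}s  dataclass: {t_dataclass:.3f}s  for {n} constructions")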

HTTP/2 bottleneck

I tried switching to HTTP/2 transport in the httpcore dependency by creating HTTPTransport(..., http2=True, ...), but it only made performance worse: I was hitting another lock contention site, different from the one fixed in encode/httpcore#1038, so I backed out of it (roughly what I tried is sketched below). It's also unclear what the HTTP/2 status in httpcore and/or LiteLLM is; it looks somewhat like abandonware that nobody uses, which seems bad given how ancient HTTP/1.1 is.
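For reference, the setup was along these lines, sketched via httpx (which wraps httpcore); http2=True requires the h2 extra (pip install 'httpx[http2]'), and how the transport gets plumbed into LiteLLM's underlying client is deployment-specific and omitted here:

import httpx

# Hand the HTTP/2-enabled transport to the client; whether HTTP/2 is actually
# used still depends on what the server negotiates via ALPN
transport = httpx.HTTPTransport(http2=True)
client = httpx.Client(transport=transport)

resp = client.get("https://www.google.com")
print(resp.http_version)  # "HTTP/2" if negotiated, otherwise "HTTP/1.1"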

It's not an urgent problem for me right now, but it seems like a good problem for people who would like to contribute to LiteLLM, and a fix to httpcore could be delivered along the way.
