Description
This is a follow-up to #14816.
Here's the code that I used for profiling LiteLLM and finding unnecessary lock contention, load_test_litellm.py:
# HOW TO RUN THIS TEST:
# uv pip install py-spy
# STREAMING=1 sudo uv run py-spy record -o profile.svg --idle -r 15 -- python load_test_litellm.py
#
# --idle shows traces from langfuse that are clearly just waiting, but --idle is necessary to actually
# understand where the code is blocking due to unnecessary locks. To tell apart unnecessary locks
# from benign waiting, use common sense. Also, compare this run with and without
# STREAMING=1 as env.
import logging
import os
import time
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed

from litellm import completion
from litellm.types.utils import PromptTokensDetailsWrapper


def _now_tenths() -> str:
    dt = datetime.now()
    return f"{dt.strftime('%H:%M:%S')}.{int(dt.microsecond / 100000)}"


def _send_one_request(index: int, streaming: bool = True) -> None:
    start_wall = _now_tenths()
    start = time.perf_counter()
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Create a short story about the number {index} in the style of Lewis Carroll."},
    ]
    try:
        # Keep it simple; we do not use the response content, only timings
        output = completion(
            model="gemini/gemini-2.5-flash",
            messages=messages,
            stream=streaming,
            max_tokens=1000,
        )
        for _ in output:
            pass
        status = "ok"
    except Exception as exc:  # noqa: BLE001 - surface any error for visibility
        status = f"error({type(exc).__name__}: {exc})"
        logging.exception(f"Error sending request {index}: {exc}")
    end = time.perf_counter()
    end_wall = _now_tenths()
    duration = round(end - start, 1)
    print(f"[{index:02d}] start={start_wall} end={end_wall} dur={duration:.1f}s status={status}")


def main() -> int:
    # Defaults: 14 concurrent requests, match router.py's _executor capacity
    num_requests = int(os.environ.get("NUM_REQUESTS") or "14")
    max_workers = int(os.environ.get("MAX_WORKERS") or "14")
    streaming = bool(os.environ.get("STREAMING"))

    # Force initialize Pydantic models with OpenAI models (heavy)
    _ = PromptTokensDetailsWrapper.model_validate({})
    import litellm.types.utils  # noqa: F401

    litellm.disable_streaming_logging = True
    litellm.disable_end_user_cost_tracking = True
    # This is specific to our deployment, but you can change to
    # your integration or remove the callback.
    litellm.success_callback.append('langfuse')

    overall_start = _now_tenths()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(_send_one_request, i, streaming): i
            for i in range(num_requests)
        }
        for f in as_completed(futures):
            print(f"Future {futures[f]} completed at {_now_tenths()}")
    overall_end = _now_tenths()
    print(f"All done. window_start={overall_start} window_end={overall_end}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
Pydantic and OpenAI import warm-up for _calculate_usage_per_chunk()
I see in the profile a lot of time spent here:
__next__ (streaming_handler.py:1668)
stream_chunk_builder (litellm/main.py:6138)
calculate_usage (streaming_chunk_builder_utils.py:478)
_calculate_usage_per_chunk (streaming_chunk_builder_utils.py:384)
_find_and_load (<frozen importlib._bootstrap>:1357)
...
Adding import litellm.types.utils to the script doesn't seem to help. I don't understand this: if I have already imported litellm.types.utils and even constructed a dummy PromptTokensDetailsWrapper outside the streaming threads, why do the streaming threads still block in _find_and_load when called from streaming_chunk_builder_utils.py? Does anyone understand this? Maybe I misunderstand how Python imports work.
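To narrow this down, something like the following import hook could show which module the worker threads are actually loading and how long each load takes. This is a rough sketch of mine, not LiteLLM code; the wrapper name and the 10 ms threshold are arbitrary, and it must be installed before the ThreadPoolExecutor starts:

# Sketch: log imports that go through the full import machinery in worker threads.
# Imports already satisfied from sys.modules return almost instantly, so the time
# threshold filters those out and leaves only real _find_and_load work.
import builtins
import threading
import time

_original_import = builtins.__import__

def _traced_import(name, globals=None, locals=None, fromlist=(), level=0):
    start = time.perf_counter()
    module = _original_import(name, globals, locals, fromlist, level)
    elapsed = time.perf_counter() - start
    if threading.current_thread() is not threading.main_thread() and elapsed > 0.01:
        print(f"[import] {name} took {elapsed:.3f}s in {threading.current_thread().name}")
    return module

builtins.__import__ = _traced_import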
Avoid Pydantic in the hot streaming path
In any case, I think it would be good to avoid creating Pydantic models in the hot streaming path, sync and async alike. Pydantic is not 'great for performance'; it is grossly suboptimal compared to vanilla Python dataclasses.
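As a rough illustration of the construction overhead, here is a micro-benchmark sketch. The field set is a stand-in I made up, not LiteLLM's actual usage types, and the numbers will vary by machine and Pydantic version:

# Rough micro-benchmark: constructing a small Pydantic model vs. an equivalent dataclass.
import timeit
from dataclasses import dataclass
from pydantic import BaseModel

class UsageModel(BaseModel):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

@dataclass
class UsageDataclass:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

n = 100_000
pyd = timeit.timeit(lambda: UsageModel(prompt_tokens=1, completion_tokens=2, total_tokens=3), number=n)
dc = timeit.timeit(lambda: UsageDataclass(prompt_tokens=1, completion_tokens=2, total_tokens=3), number=n)
print(f"pydantic: {pyd:.2f}s  dataclass: {dc:.2f}s  ratio: {pyd / dc:.1f}x")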
HTTP/2 bottleneck
I tried switching to the HTTP/2 transport in the httpcore dependency by creating HTTPTransport(..., http2=True, ...), but it only made performance worse: I was hitting another lock contention site, different from the one fixed in encode/httpcore#1038, so I backed out of this. Also, it's unclear what the status of HTTP/2 support is in httpcore and/or LiteLLM; it looks somewhat like abandonware that nobody uses, which seems bad given that HTTP/1.1 is ancient.
It's not an urgent problem for me right now, but it's a good problem to work on for anyone who would like to contribute to LiteLLM, and to deliver a fix to httpcore along the way.