Hi @MadsRC, thanks for this. Curious: do you see this behavior when just using the LiteLLM Python SDK? It might be the way we set the chunk size when calling Bedrock models.
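A quick way to check whether the proxy layer is responsible would be to stream through the SDK alone and watch the inter-chunk latency, e.g. something like this sketch (the model ID's `:0` suffix, the prompt, and the sampling parameters are assumptions):

```python
import time

import litellm

# Stream directly through the LiteLLM Python SDK (no proxy in between)
# and print the delay between consecutive chunks.
response = litellm.completion(
    model="bedrock/us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    messages=[{"role": "user", "content": "Write a short story."}],
    max_tokens=512,
    temperature=0.7,
    stream=True,
)

last = time.monotonic()
for chunk in response:
    now = time.monotonic()
    delta = chunk.choices[0].delta.content or ""
    print(f"+{now - last:.3f}s {delta!r}")
    last = now
```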
---
I am a happy, happy user of LiteLLM, but some of my users have asked me why streaming through LiteLLM seems "slow", or as if tokens are being batched in LiteLLM before being sent over. Not having a great answer (other than that the amount of serialization and deserialization LiteLLM does is bound to hurt in some way...), I decided to look into it a bit.
I wrote a short Python script that requests streamed inference from LiteLLM's OpenAI endpoint and then directly from Bedrock. The prompts are identical for the two requests, and so are the temperature and max tokens. It should be noted that the LiteLLM instance is running locally, from a fresh git clone of main from today (44a69421ead8dcde0f2a5c9be995c49eaa9a1fea). It is configured to forward requests to the same AWS Bedrock Inference Profile that the script also connects to directly. My `config.yaml` is:

Here's a short video of the output:
Screen.Recording.2025-05-30.at.19.59.27.mov
The color signifies the chunk content as it arrives, switching between two tones to mark new chunks. Notice how the OpenAI one (which is LiteLLM in this instance) feels "chunkier" or "laggier", while the Bedrock output feels smoother. Remember, both of these requests are ultimately served by the same model in Bedrock...
The script is written to output the chunks as they arrive.
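For anyone wanting to reproduce the visual effect, the alternating-color trick can be as simple as the following sketch (my own illustration, not the exact code from the script):

```python
import itertools

# Alternate between two ANSI colors on every chunk so chunk
# boundaries become visible in the terminal output.
_colors = itertools.cycle(["\033[36m", "\033[35m"])  # cyan / magenta
RESET = "\033[0m"


def print_chunk(text: str) -> None:
    print(f"{next(_colors)}{text}{RESET}", end="", flush=True)
```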
The model queried is `us.anthropic.claude-3-7-sonnet-20250219-v1`.
I also write the chunks to a log with a timestamp, one chunk per line.
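In rough outline, the script does something like the sketch below (a reconstruction for illustration, not the exact code; the proxy URL and key, the AWS region, the prompt, and the `:0` suffix on the model ID are all assumptions):

```python
import time

import boto3
from openai import OpenAI

PROMPT = "Write a short story about a lighthouse keeper."
MODEL = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"


def log_chunk(log, text):
    # One chunk per line, prefixed with a wall-clock timestamp.
    log.write(f"{time.time():.6f}\t{text!r}\n")


# 1) Through the LiteLLM proxy, via its OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")
with open("openai_chunks.log", "w") as log:
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=512,
        temperature=0.7,
        stream=True,
    )
    for chunk in stream:
        log_chunk(log, chunk.choices[0].delta.content or "")

# 2) Directly against Bedrock, using the Converse streaming API.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
with open("bedrock_chunks.log", "w") as log:
    resp = bedrock.converse_stream(
        modelId=MODEL,
        messages=[{"role": "user", "content": [{"text": PROMPT}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.7},
    )
    for event in resp["stream"]:
        if "contentBlockDelta" in event:
            log_chunk(log, event["contentBlockDelta"]["delta"]["text"])
```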
The line counts (and thus chunk counts) in these logs are:

```
$ wc -l *.log
      57 bedrock_chunks.log
      62 openai_chunks.log
     119 total
```
The content of the `bedrock_chunks.log` file is:

The content of the `openai_chunks.log` file is:

Is anyone else seeing something similar, or does anyone have an explanation for why this is happening?