openai_trtllm return 200 directly to the client when TTFT is greater than 15 seconds #53

@mynameiskeen

Description

Hi @npuichigo, thanks for this wonderful project. I've been hitting an issue that has bugged me for a while; please kindly help, thanks a lot.

I am using llmperf to benchmark tritonserver with the tensorrt-llm backend, with openai_trtllm as the OpenAI-compatible proxy. When the benchmark runs under high concurrency (e.g. > 20 concurrent requests), llmperf fails with this error:

2024-08-12 23:57:47,133 INFO worker.py:1781 -- Started a local Ray instance.
  0%|          | 0/200 [00:00<?, ?it/s](OpenAIChatCompletionsClient pid=1151673) Warning Or Error: Expecting value: line 1 column 1 (char 0)
(OpenAIChatCompletionsClient pid=1151673) -1
Traceback (most recent call last):
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 462, in <module>
    run_token_benchmark(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 303, in run_token_benchmark
    summary, individual_responses = get_token_throughput_latencies(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 122, in get_token_throughput_latencies
    request_metrics[common_metrics.REQ_OUTPUT_THROUGHPUT] = num_output_tokens / request_metrics[common_metrics.E2E_LAT]
ZeroDivisionError: division by zero
 20%|██        | 40/200 [00:49<03:16,  1.23s/it]
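To illustrate where the crash comes from: for the failed request, llmperf records `error_code: -1` and leaves `end_to_end_latency_s` at 0, so the throughput division at line 122 divides by zero. Below is a minimal sketch (not llmperf's actual code; `output_throughput` is a hypothetical helper) of the failure and a possible guard:

```python
# Sketch of the throughput computation that crashes in
# token_benchmark_ray.py when a request fails, plus a guard.
def output_throughput(num_output_tokens: int, e2e_latency_s: float) -> float:
    # Failed requests (error_code -1, empty body) leave the end-to-end
    # latency at 0, which makes the raw division raise ZeroDivisionError.
    if e2e_latency_s <= 0:
        return 0.0
    return num_output_tokens / e2e_latency_s

# Metrics copied from the failing request shown below.
failed = {"error_code": -1, "end_to_end_latency_s": 0, "number_output_tokens": 1}
print(output_throughput(failed["number_output_tokens"],
                        failed["end_to_end_latency_s"]))  # 0.0 instead of a crash
```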

I added print statements in token_benchmark_ray.py around lines 113/114 to print the response from openai_trtllm:

        if not (iter % num_concurrent_requests):
            outs = req_launcher.get_next_ready()
            all_metrics = []
            for out in outs:
                print("-----------------out is :", out)
                request_metrics, gen_text, _ = out
                print("-----------------Gen text is :", gen_text)
                num_output_tokens = get_token_length(gen_text)
                if num_output_tokens: 

It seems that the failed request returned an empty response body:

# error_code is "-1"
-----------------out is : ({'error_code': -1, 'error_msg': '', 'inter_token_latency_s': 0, 'ttft_s': 0, 'end_to_end_latency_s': 0, 'request_output_throughput_token_per_s': 0, 'number_total_tokens': 6001, 'number_output_tokens': 1, 'number_input_tokens': 6000}, 

'... which I will keep so chary\nThe ", 6000), sampling_params={'max_tokens': 500}, llm_api='openai', metadata=None))'

# gen_text was empty
-----------------Gen text is : 

So I used tcpdump to capture all the requests to openai_trtllm, and found that the failed request got a 200 response after exactly 15 seconds with maybe only one or two tokens (I don't know whether they came from tritonserver or not). Meanwhile, tritonserver reported no errors. Please see the screenshots:
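For reference, this is roughly how I think about the TTFT number in the title: the elapsed time from sending the request until the first streamed chunk arrives. A small self-contained sketch (the `slow_stream` generator simulates the proxy; it is not the real client):

```python
import time

def time_to_first_token(stream):
    """Measure TTFT for an iterable of streamed chunks.

    A sketch: the real llmperf client streams SSE chunks from the
    OpenAI-compatible proxy instead of a local generator.
    """
    start = time.monotonic()
    for chunk in stream:
        return time.monotonic() - start, chunk
    return None, None  # stream closed without producing any token

def slow_stream(delay_s):
    # Simulates a backend whose first token takes delay_s seconds; if the
    # proxy cuts the connection at ~15 s, a slower first token means the
    # client sees a 200 with an empty (or near-empty) body.
    time.sleep(delay_s)
    yield "Hello"

ttft, first = time_to_first_token(slow_stream(0.01))
```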

[screenshot 1]

[screenshot 2]
