openai_trtllm return 200 directly to the client when TTFT is greater than 15 seconds #53

@mynameiskeen

Description

Hi @npuichigo, thanks for this wonderful project. I've been hitting an issue that has bugged me for a while; please kindly help, thanks a lot.

I am using llmperf to benchmark tritonserver with the tensorrt-llm backend, with openai_trtllm as the OpenAI-compatible proxy. When the benchmark runs under high concurrency (e.g. > 20 concurrent requests), llmperf fails with this error:

2024-08-12 23:57:47,133 INFO worker.py:1781 -- Started a local Ray instance.
  0%|          | 0/200 [00:00<?, ?it/s](OpenAIChatCompletionsClient pid=1151673) Warning Or Error: Expecting value: line 1 column 1 (char 0)
(OpenAIChatCompletionsClient pid=1151673) -1
Traceback (most recent call last):
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 462, in <module>
    run_token_benchmark(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 303, in run_token_benchmark
    summary, individual_responses = get_token_throughput_latencies(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 122, in get_token_throughput_latencies
    request_metrics[common_metrics.REQ_OUTPUT_THROUGHPUT] = num_output_tokens / request_metrics[common_metrics.E2E_LAT]
ZeroDivisionError: division by zero
 20%|██        | 40/200 [00:49<03:16,  1.23s/it]
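To illustrate where the crash comes from: for the failed request, llmperf records `error_code: -1` and leaves `end_to_end_latency_s` at 0, so the throughput division at line 122 divides by zero. Below is a minimal sketch (not llmperf's actual code; `output_throughput` is a hypothetical helper) of the failure and a possible guard:

```python
# Sketch of the throughput computation that crashes in
# token_benchmark_ray.py when a request fails, plus a guard.
def output_throughput(num_output_tokens: int, e2e_latency_s: float) -> float:
    # Failed requests (error_code -1, empty body) leave the end-to-end
    # latency at 0, which makes the raw division raise ZeroDivisionError.
    if e2e_latency_s <= 0:
        return 0.0
    return num_output_tokens / e2e_latency_s

# Metrics copied from the failing request shown below.
failed = {"error_code": -1, "end_to_end_latency_s": 0, "number_output_tokens": 1}
print(output_throughput(failed["number_output_tokens"],
                        failed["end_to_end_latency_s"]))  # 0.0 instead of a crash
```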

I added print statements in token_benchmark_ray.py around lines 113/114 to print the response from openai_trtllm:

        if not (iter % num_concurrent_requests):
            outs = req_launcher.get_next_ready()
            all_metrics = []
            for out in outs:
                print("-----------------out is :", out)
                request_metrics, gen_text, _ = out
                print("-----------------Gen text is :", gen_text)
                num_output_tokens = get_token_length(gen_text)
                if num_output_tokens: 

It seems that the failed request returned an empty response body:

# error_code is "-1"
-----------------out is : ({'error_code': -1, 'error_msg': '', 'inter_token_latency_s': 0, 'ttft_s': 0, 'end_to_end_latency_s': 0, 'request_output_throughput_token_per_s': 0, 'number_total_tokens': 6001, 'number_output_tokens': 1, 'number_input_tokens': 6000}, 

'... which I will keep so chary\nThe ", 6000), sampling_params={'max_tokens': 500}, llm_api='openai', metadata=None))'

# gen_text was empty
-----------------Gen text is : 

So I used tcpdump to capture all the requests to openai_trtllm, and found that the failed request got a 200 response after exactly 15 seconds with maybe only one or two tokens (I don't know whether they came from tritonserver or not). Meanwhile, tritonserver reported no errors. Please see the screenshots:
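For reference, this is roughly how I think about the TTFT number in the title: the elapsed time from sending the request until the first streamed chunk arrives. A small self-contained sketch (the `slow_stream` generator simulates the proxy; it is not the real client):

```python
import time

def time_to_first_token(stream):
    """Measure TTFT for an iterable of streamed chunks.

    A sketch: the real llmperf client streams SSE chunks from the
    OpenAI-compatible proxy instead of a local generator.
    """
    start = time.monotonic()
    for chunk in stream:
        return time.monotonic() - start, chunk
    return None, None  # stream closed without producing any token

def slow_stream(delay_s):
    # Simulates a backend whose first token takes delay_s seconds; if the
    # proxy cuts the connection at ~15 s, a slower first token means the
    # client sees a 200 with an empty (or near-empty) body.
    time.sleep(delay_s)
    yield "Hello"

ttft, first = time_to_first_token(slow_stream(0.01))
```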

[screenshot 1]

[screenshot 2]
