Description
Hi @npuichigo, thanks for this wonderful project. I'm having an issue that has been bugging me for a while; please kindly help, thanks a lot.
I am using llmperf to benchmark tritonserver with the tensorrt-llm backend, using openai_trtllm as the OpenAI-compatible proxy. When the benchmark runs under high concurrency (e.g., more than 20 concurrent requests), it fails with the following error:
```
2024-08-12 23:57:47,133 INFO worker.py:1781 -- Started a local Ray instance.
  0%|          | 0/200 [00:00<?, ?it/s](OpenAIChatCompletionsClient pid=1151673) Warning Or Error: Expecting value: line 1 column 1 (char 0)
(OpenAIChatCompletionsClient pid=1151673) -1
Traceback (most recent call last):
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 462, in <module>
    run_token_benchmark(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 303, in run_token_benchmark
    summary, individual_responses = get_token_throughput_latencies(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 122, in get_token_throughput_latencies
    request_metrics[common_metrics.REQ_OUTPUT_THROUGHPUT] = num_output_tokens / request_metrics[common_metrics.E2E_LAT]
ZeroDivisionError: division by zero
 20%|██        | 40/200 [00:49<03:16,  1.23s/it]
```
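The ZeroDivisionError itself is easy to trace: the failed request comes back with end_to_end_latency_s equal to 0, so the throughput division at line 122 blows up. As an aside, a guard along the following lines keeps the benchmark running (a minimal sketch against the loop shown in the traceback; I'm assuming common_metrics.ERROR_CODE maps to the 'error_code' key visible in the printed metrics further down), though it only skips the failed requests rather than explaining them:

```python
# Sketch only: skip requests that came back with an error before the
# throughput division. common_metrics.ERROR_CODE is my assumption for
# the 'error_code' key seen in the printed metrics below.
for out in outs:
    request_metrics, gen_text, _ = out
    err = request_metrics.get(common_metrics.ERROR_CODE)
    if err or not request_metrics[common_metrics.E2E_LAT]:
        continue  # failed request: latency is 0, throughput is undefined
    num_output_tokens = get_token_length(gen_text)
    request_metrics[common_metrics.REQ_OUTPUT_THROUGHPUT] = (
        num_output_tokens / request_metrics[common_metrics.E2E_LAT]
    )
    all_metrics.append(request_metrics)
```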
To dig further, I added print statements in token_benchmark_ray.py around lines 113/114 to print the response from openai_trtllm:

```python
if not (iter % num_concurrent_requests):
    outs = req_launcher.get_next_ready()
    all_metrics = []
    for out in outs:
        print("-----------------out is :", out)
        request_metrics, gen_text, _ = out
        print("-----------------Gen text is :", gen_text)
        num_output_tokens = get_token_length(gen_text)
        if num_output_tokens:
```

It seems that the failed request returned an empty response body:
```
# error_code is "-1"
-----------------out is : ({'error_code': -1, 'error_msg': '', 'inter_token_latency_s': 0, 'ttft_s': 0, 'end_to_end_latency_s': 0, 'request_output_throughput_token_per_s': 0, 'number_total_tokens': 6001, 'number_output_tokens': 1, 'number_input_tokens': 6000},
'... which I will keep so chary\nThe ", 6000), sampling_params={'max_tokens': 500}, llm_api='openai', metadata=None))'
# gen_text was empty
-----------------Gen text is :
```

So I used tcpdump to capture all the requests to openai_trtllm, and found that the failed request got a 200 response after exactly 15 seconds, carrying maybe only one or two tokens (I don't know whether they are tokens from tritonserver or not). Meanwhile, tritonserver reported no errors. Please see the screenshots:
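For what it's worth, a single timed request against the proxy might help isolate whether the 15-second, near-empty 200 response comes from openai_trtllm or from tritonserver. A minimal probe sketch (the base URL, model name, and prompt are my assumptions; adjust them to the actual deployment):

```python
# Sketch: send one OpenAI-style chat completion to the openai_trtllm
# proxy and time the response end to end. BASE_URL and the model name
# are assumptions -- adjust them to your deployment.
import time

import requests

BASE_URL = "http://localhost:3000/v1"  # assumed openai_trtllm address

payload = {
    "model": "ensemble",  # assumed Triton model name
    "messages": [{"role": "user", "content": "long prompt here " * 1000}],
    "max_tokens": 500,
}

start = time.time()
resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=120)
elapsed = time.time() - start

# A healthy response should carry choices with non-empty content; the
# failing case in this report came back 200 but nearly empty.
print(f"status={resp.status_code} elapsed={elapsed:.1f}s")
print(resp.text[:500])
```

If such a probe also returns after exactly 15 seconds with an almost empty body, that would point at a timeout inside the proxy rather than at llmperf.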

