Description
Proposal to improve performance
No response
Report of performance regression
I am testing UCM on a 910B with vllm-ascend 0.9.2rc1 and UCM v0.1.0, using vLLM's benchmark script benchmark_serving.py. I tried several configurations: with/without the UCM nfsstore backend and with/without --enforce-eager. The test model is DeepSeek-R1-Distill-Qwen-7B.
Briefly, if I use nfsstore without --enforce-eager, throughput is better than in the other configurations, but the content of the reply is completely garbled text.
Here is the benchmark command I used for all the tests:
python /vllm-workspace/vllm/benchmarks/benchmark_serving.py --trust-remote-code --model /DeepSeek_Series/DeepSeek-R1-Distill-Qwen-7B --dataset-name random --random-input-len 1024 --random-output-len 1024 --num-prompts 10 --request-rate 0.26 --max-concurrency 1 --metric-percentiles 90 --base-url http://localhost:8848
Here is the command I used to start the server:
vllm serve ${MODEL_PATH} \
  --max-model-len 5000 \
  --host localhost \
  --port 8848 \
  -tp 1 \
  --gpu-memory-utilization 0.8 \
  --max-num-batched-tokens 30000 \
  --block-size 128 \
  --trust-remote-code \
  --enable-prefix-caching \
  --kv-transfer-config '{
    "kv_connector": "UCMConnector",
    "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
      "UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"
    }
  }'
Performance results for the config with nfsstore and no --enforce-eager:
Successful requests: 10
Benchmark duration (s): 145.01
Total input tokens: 10229
Total generated tokens: 10240
Request throughput (req/s): 0.07
Output token throughput (tok/s): 70.61
Total Token throughput (tok/s): 141.15
---------------Time to First Token----------------
Mean TTFT (ms): 75.14
Median TTFT (ms): 74.80
P90 TTFT (ms): 82.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 14.10
Median TPOT (ms): 14.10
P90 TPOT (ms): 14.11
---------------Inter-token Latency----------------
Mean ITL (ms): 14.10
Median ITL (ms): 14.07
P90 ITL (ms): 14.20
The curl command I sent to the running server (the prompt is a Chinese essay question asking for a discussion of 认可度, i.e. 'recognition/acceptance'):
curl http://localhost:8848/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/DeepSeek_Series/DeepSeek-R1-Distill-Qwen-7B",
    "prompt": "生活中,人们常用认可度判别事物,区分高下。请写一篇字文章,谈谈你对“认可度的认识和思考。",
    "max_tokens": 900,
    "temperature": 0
  }'
And the result I got from the LLM:
{"id":"cmpl-6252a20a49284a3da43713179a637af3","object":"text_completion","created":1764915145,"model":"/DeepSeek_Series/DeepSeek-R1-Distill-Qwen-7B","choices":[{"index":0,"text":" underside implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 
implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign0 implode insign","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":31,"total_tokens":931,"completion_tokens":900,"prompt_tokens_details":null},"kv_transfer_params":null}
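For anyone reproducing this, a small script along the following lines is roughly what I use to compare the two cases: it sends the same prompt to two server instances and prints the start of each completion. The second port (8849) and running a second instance with --enforce-eager in parallel are assumptions for illustration; only the 8848 instance above is my actual setup.

```python
# Sketch: send the same /v1/completions request to two vLLM servers
# (one started without --enforce-eager, one with it) and compare the replies.
# Port 8849 for the --enforce-eager instance is hypothetical.
import requests

PROMPT = "生活中,人们常用认可度判别事物,区分高下。请写一篇字文章,谈谈你对“认可度的认识和思考。"

def complete(base_url: str) -> str:
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={
            "model": "/DeepSeek_Series/DeepSeek-R1-Distill-Qwen-7B",
            "prompt": PROMPT,
            "max_tokens": 64,
            "temperature": 0,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

if __name__ == "__main__":
    for name, url in [("nfsstore, graph mode", "http://localhost:8848"),
                      ("nfsstore, --enforce-eager", "http://localhost:8849")]:
        print(f"--- {name} ---")
        print(complete(url)[:200])
```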
If I use the config with nfsstore and add --enforce-eager, everything works correctly, but the speed is no better than the config with --enforce-eager and no UCM.
Here are the performance results for the config with nfsstore and --enforce-eager:
Successful requests: 10
Benchmark duration (s): 254.54
Total input tokens: 10229
Total generated tokens: 7360
Request throughput (req/s): 0.04
Output token throughput (tok/s): 28.92
Total Token throughput (tok/s): 69.10
---------------Time to First Token----------------
Mean TTFT (ms): 150.19
Median TTFT (ms): 141.96
P90 TTFT (ms): 182.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 34.50
Median TPOT (ms): 34.56
P90 TPOT (ms): 34.79
---------------Inter-token Latency----------------
Mean ITL (ms): 34.42
Median ITL (ms): 34.33
P90 ITL (ms): 34.98
And the performance results for the config with --enforce-eager and no UCM:
Successful requests: 10
Benchmark duration (s): 296.11
Total input tokens: 10229
Total generated tokens: 8495
Request throughput (req/s): 0.03
Output token throughput (tok/s): 28.69
Total Token throughput (tok/s): 63.23
---------------Time to First Token----------------
Mean TTFT (ms): 77.79
Median TTFT (ms): 79.92
P90 TTFT (ms): 80.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 34.76
Median TPOT (ms): 34.73
P90 TPOT (ms): 34.96
---------------Inter-token Latency----------------
Mean ITL (ms): 34.81
Median ITL (ms): 34.75
P90 ITL (ms): 35.22
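To put the three runs side by side, here is the quick arithmetic on the numbers reported above (just a small helper to compute the ratios, not part of the benchmark itself):

```python
# Compare output-token throughput and mean TPOT across the three runs above,
# using the "no UCM, --enforce-eager" run as the baseline.
results = {
    "nfsstore, graph mode":      {"out_tok_s": 70.61, "mean_tpot_ms": 14.10},
    "nfsstore, --enforce-eager": {"out_tok_s": 28.92, "mean_tpot_ms": 34.50},
    "no UCM, --enforce-eager":   {"out_tok_s": 28.69, "mean_tpot_ms": 34.76},
}
base = results["no UCM, --enforce-eager"]
for name, r in results.items():
    print(f"{name}: {r['out_tok_s'] / base['out_tok_s']:.2f}x output throughput, "
          f"{base['mean_tpot_ms'] / r['mean_tpot_ms']:.2f}x TPOT speedup vs no-UCM baseline")
```

In short, the only configuration that is meaningfully faster (roughly 2.4x on output throughput and TPOT) is the one that returns garbled text; with --enforce-eager, nfsstore is essentially on par with running without UCM.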
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`