
[Performance]: throughput drops with --enforce-eager, and garbled text is returned without it #473

@bonmorz

Description


Proposal to improve performance

No response

Report of performance regression

I am testing UCM on a 910B with vllm-ascend 0.9.2rc1 and UCM v0.1.0, using vLLM's benchmark script benchmark_serving.py. I tried several configs: with/without the UCM NFSStore and with/without --enforce-eager. The test model is DeepSeek-R1-Distill-Qwen-7B.

Briefly: if I use the config with NFSStore and no --enforce-eager, throughput is better than in every other configuration, but the content of the reply is completely garbled text.

Here is the benchmark command I used for all the tests:
python /vllm-workspace/vllm/benchmarks/benchmark_serving.py --trust-remote-code --model /DeepSeek_Series/DeepSeek-R1-Distill-Qwen-7B --dataset-name random --random-input-len 1024 --random-output-len 1024 --num-prompts 10 --request-rate 0.26 --max-concurrency 1 --metric-percentiles 90 --base-url http://localhost:8848

Here is the command I used to start the server:

vllm serve ${MODEL_PATH} \
  --max-model-len 5000 \
  --host localhost \
  --port 8848 \
  -tp 1 \
  --gpu-memory-utilization 0.8 \
  --max-num-batched-tokens 30000 \
  --block-size 128 \
  --trust-remote-code \
  --enable-prefix-caching \
  --kv-transfer-config '{
    "kv_connector": "UCMConnector",
    "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
      "UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"
    }
  }'

The performance results for the config with NFSStore and no --enforce-eager:
Successful requests: 10
Benchmark duration (s): 145.01
Total input tokens: 10229
Total generated tokens: 10240
Request throughput (req/s): 0.07
Output token throughput (tok/s): 70.61
Total Token throughput (tok/s): 141.15
---------------Time to First Token----------------
Mean TTFT (ms): 75.14
Median TTFT (ms): 74.80
P90 TTFT (ms): 82.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 14.10
Median TPOT (ms): 14.10
P90 TPOT (ms): 14.11
---------------Inter-token Latency----------------
Mean ITL (ms): 14.10
Median ITL (ms): 14.07
P90 ITL (ms): 14.20

The curl command I sent to the running server (the prompt is a Chinese essay question, roughly: "In everyday life, people often use recognition to judge things and rank them. Write an essay on your understanding of 'recognition'."):
curl http://localhost:8848/v1/completions -H "Content-Type: application/json" -d '{ "model": "/DeepSeek_Series/DeepSeek-R1-Distill-Qwen-7B", "prompt": "生活中,人们常用认可度判别事物,区分高下。请写一篇字文章,谈谈你对“认可度的认识和思考。", "max_tokens": 900, "temperature": 0 }'

And the result I got back from the model:
{"id":"cmpl-6252a20a49284a3da43713179a637af3","object":"text_completion","created":1764915145,"model":"/DeepSeek_Series/DeepSeek-R1-Distill-Qwen-7B","choices":[{"index":0,"text":" underside implode insign0 implode insign0 implode insign0 [... "implode insign0" repeats for the remainder of the 900 generated tokens ...] implode insign","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":31,"total_tokens":931,"completion_tokens":900,"prompt_tokens_details":null},"kv_transfer_params":null}
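The looping output above can be flagged programmatically. Below is a minimal sketch (my own helper, not part of vLLM or UCM) that measures what fraction of bigrams in a completion occur more than once; a value near 1.0 means the text is stuck in a repetition loop like the reply above:

```python
from collections import Counter

def repeated_fraction(text: str, n: int = 2) -> float:
    """Fraction of n-gram occurrences whose n-gram appears more than once.
    Degenerate, looping output scores near 1.0; normal prose scores low."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

garbled = "implode insign0 " * 200  # same shape as the reply above
healthy = "the quick brown fox jumps over the lazy dog"
print(repeated_fraction(garbled))   # 1.0 -- every bigram repeats
print(repeated_fraction(healthy))   # 0.0 -- all bigrams unique
```

A check like this could be dropped into a benchmark harness so that a "fast" run with garbled output is not mistaken for a genuine speedup.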

If I use the config with NFSStore and add --enforce-eager, everything works correctly, but the speed is no better than the config with --enforce-eager and no UCM.

Here are the performance results for the config with NFSStore and --enforce-eager:

Successful requests: 10
Benchmark duration (s): 254.54
Total input tokens: 10229
Total generated tokens: 7360
Request throughput (req/s): 0.04
Output token throughput (tok/s): 28.92
Total Token throughput (tok/s): 69.10
---------------Time to First Token----------------
Mean TTFT (ms): 150.19
Median TTFT (ms): 141.96
P90 TTFT (ms): 182.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 34.50
Median TPOT (ms): 34.56
P90 TPOT (ms): 34.79
---------------Inter-token Latency----------------
Mean ITL (ms): 34.42
Median ITL (ms): 34.33
P90 ITL (ms): 34.98

And the performance of the config with --enforce-eager and no UCM:

Successful requests: 10
Benchmark duration (s): 296.11
Total input tokens: 10229
Total generated tokens: 8495
Request throughput (req/s): 0.03
Output token throughput (tok/s): 28.69
Total Token throughput (tok/s): 63.23
---------------Time to First Token----------------
Mean TTFT (ms): 77.79
Median TTFT (ms): 79.92
P90 TTFT (ms): 80.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 34.76
Median TPOT (ms): 34.73
P90 TPOT (ms): 34.96
---------------Inter-token Latency----------------
Mean ITL (ms): 34.81
Median ITL (ms): 34.75
P90 ITL (ms): 35.22
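To summarize the three runs, the gap can be quantified directly from the Total Token throughput numbers reported above (this is just arithmetic on the reported values, taking the no-UCM run as the baseline):

```python
# Total Token throughput (tok/s) from the three benchmark runs above
runs = {
    "nfsstore, no --enforce-eager (garbled output)": 141.15,
    "nfsstore + --enforce-eager":                     69.10,
    "no ucm + --enforce-eager":                       63.23,
}

baseline = runs["no ucm + --enforce-eager"]
for name, tput in runs.items():
    print(f"{name}: {tput / baseline:.2f}x baseline")
```

In other words, the only run that beats the baseline by more than ~9% is the one that produces garbled text, which suggests the apparent speedup without --enforce-eager is not a real win.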

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
