Proposal to improve performance
No response
Report of performance regression
I am testing the vLLM online service on a 910b3 with vllm-ascend 0.9.2 rc1 and ucm v0.1.0. In offline tests, BLOCK files are generated. In online tests, however, no BLOCK files are generated, even though the prefix cache hit rate changes.
The following is my startup command:
ASCEND_RT_VISIBLE_DEVICES=6 vllm serve /app/model/Qwen2.5-14B-Instruct \
--served-model-name vllm_nfs_offload \
--max-model-len 4096 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.8 \
--trust-remote-code \
--port 5006 \
--kv-transfer-config '{
  "kv_connector": "UCMConnector",
  "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "ucm_connectors": [
      {
        "ucm_connector_name": "UcmNfsStore",
        "ucm_connector_config": {
          "storage_backends": "/app/storage",
          "use_direct": false
        }
      }
    ],
    "ucm_sparse_config": {
      "KVStarMultiStep": {
        "init_window_sz": 1,
        "local_window_sz": 2,
        "sparse_ratio": 0.25,
        "retrieval_stride": 8,
        "blk_repre_dim_prune_ratio": 0.25,
        "blk_repre_inner_token_merge": 2
      }
    }
  }
}'
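To rule out a malformed config, the JSON passed to --kv-transfer-config can be parsed outside vLLM. This is only a sketch: `summarize_kv_config` is a hypothetical helper, not part of vLLM or UCM, and it only checks that the JSON is well formed and reports which store path is configured. It does not check whether the connector accepts these keys.

```python
import json

# Same JSON as passed to --kv-transfer-config in the startup command.
KV_TRANSFER_CONFIG = """
{
  "kv_connector": "UCMConnector",
  "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "ucm_connectors": [
      {
        "ucm_connector_name": "UcmNfsStore",
        "ucm_connector_config": {
          "storage_backends": "/app/storage",
          "use_direct": false
        }
      }
    ],
    "ucm_sparse_config": {
      "KVStarMultiStep": {
        "init_window_sz": 1,
        "local_window_sz": 2,
        "sparse_ratio": 0.25,
        "retrieval_stride": 8,
        "blk_repre_dim_prune_ratio": 0.25,
        "blk_repre_inner_token_merge": 2
      }
    }
  }
}
"""


def summarize_kv_config(raw: str) -> dict:
    """Parse the config (hypothetical helper) and pull out the fields of interest."""
    cfg = json.loads(raw)  # raises json.JSONDecodeError if the JSON is malformed
    extra = cfg["kv_connector_extra_config"]
    # One (connector name, storage path) pair per configured store.
    stores = [
        (c["ucm_connector_name"], c["ucm_connector_config"]["storage_backends"])
        for c in extra["ucm_connectors"]
    ]
    return {
        "connector": cfg["kv_connector"],
        "role": cfg["kv_role"],
        "stores": stores,
        "sparse_methods": sorted(extra.get("ucm_sparse_config", {})),
    }


if __name__ == "__main__":
    print(summarize_kv_config(KV_TRANSFER_CONFIG))
```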
The following is part of the log output of the first request:
The following is part of the log output of the second request:
The content under storage_backends:
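For anyone reproducing this, one way to check whether anything was written under the storage path after each request is a recursive listing. This is a minimal sketch using only the standard library. `list_storage_files` is a hypothetical helper, and `/app/storage` is the `storage_backends` value from the startup command. It makes no assumption about how UCM names or lays out its BLOCK files.

```python
import os


def list_storage_files(root: str) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under root, sorted."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            entries.append((os.path.relpath(path, root), os.path.getsize(path)))
    return sorted(entries)


if __name__ == "__main__":
    # Path taken from "storage_backends" in the startup command.
    files = list_storage_files("/app/storage")
    if not files:
        print("no files found under /app/storage")
    for rel_path, size in files:
        print(f"{size:>12}  {rel_path}")
```

Running it once after the first request and again after the second shows whether the online service wrote any new files between the two.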

Misc discussion on performance
No response
Your current environment (if you think it is necessary)
