@wangwenxin0312 (Contributor)
Purpose

What this PR does / why we need it?
KVComp in NPU -- HBM version in 0.2.0-release

Modifications

Does this PR introduce any user-facing change?

Test

How was this patch tested?

@wangwenxin0312 wangwenxin0312 force-pushed the dev_wwx_020 branch 2 times, most recently from ccc3670 to ad9f472 Compare January 4, 2026 02:32
@yuanzhg078 yuanzhg078 requested review from yuanzhg078 and removed request for yuanzhg078 January 4, 2026 02:36
qyh111 previously approved these changes Jan 4, 2026
leideng and others added 2 commits January 5, 2026 11:09

This PR adds NPU (Ascend) device support for `KvCompOnDevice`
functionality (HBM version) in unified-cache-management. The
implementation enables efficient KV cache compression and sparse
attention on Ascend NPU hardware, complementing the existing CUDA
support.

**Key features:**
- NPU device detection and initialization for `KvCompOnDevice`
- NPU-optimized hash computation and caching using
`ucm_custom_ops.reshape_and_cache_bnsd`
- NPU-specific hamming distance top-k computation using
`ucm_custom_ops.hamming_dist_top_k` for efficient KV block selection
- NPU-specific metadata building (`build_decode_attention_meta_npu`)
that handles NPU tensor layouts and memory pinning
- NPU-specific KV hash cache initialization
(`initialize_kv_hash_cache_tensors_npu`) with proper shape handling for
NPU
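The block-selection step above can be illustrated in plain Python. This is a minimal sketch only: the PR itself calls the NPU kernel `ucm_custom_ops.hamming_dist_top_k`, and the integer hash widths and block counts here are hypothetical.

```python
# Plain-Python illustration of hamming-distance top-k block selection.
# The real implementation uses ucm_custom_ops.hamming_dist_top_k on NPU.

def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")

def top_k_blocks(query_hash: int, block_hashes: list[int], k: int) -> list[int]:
    """Indices of the k cached blocks whose hashes are closest to the query."""
    ranked = sorted(range(len(block_hashes)),
                    key=lambda i: hamming_distance(query_hash, block_hashes[i]))
    return ranked[:k]
```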

**Why we need it:**
- Speeds up attention computation during the decode phase on Ascend NPU
platforms (Atlas 800I A2)
- Delivers performance gains through hardware-accelerated operations on
the NPU
- Maintains feature parity with the CUDA implementation while leveraging
NPU-specific optimizations

**Dependencies:**
- Requires `ucm_custom_ops` module for NPU-specific kernel operations


**Yes, this PR introduces user-facing changes:**

1. **New NPU device support**: Users can now use `KvCompOnDevice` with
`device_type="npu"` in their vLLM configuration
2. **New initialization method**: Added
`initialize_kv_hash_cache_tensors_npu()` method for NPU-specific KV hash
cache initialization
3. **New metadata building method**: Added
`build_decode_attention_meta_npu()` method that handles NPU-specific
tensor operations and memory management
4. **NPU-specific configuration**: Added NPU-specific configuration
parameters:
   - `max_batch_size` for NPU batch size limits
   - `hamming_keep_chunks_head` and `hamming_keep_chunks_tail` for
hamming distance computation
   - `seq_len_threshhold`, a sequence length threshold for decode phase
detection
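As a usage sketch, the config passed at launch might look like the following, mirroring the structure visible in the test log; the JSON file path is a placeholder.

```python
# Sketch of the kv_connector_extra_config entry enabling KvCompOnDevice.
# Key names mirror the config printed in the test log; the path is a
# placeholder.
kv_connector_extra_config = {
    "ucm_sparse_config": {
        "KvCompOnDevice": {
            "kvcompOnDevice_config_path": "/path/to/kvcomp_qwen3_32B_config.json",
            "max_batch_size": 30,
        }
    }
}
```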

**Implementation details:**
- NPU code paths are gated at runtime on `torch.npu.is_available()`
- Uses `ucm_custom_ops` for NPU-optimized operations:
  - `reshape_and_cache_bnsd()` for hash cache storage
  - `hamming_dist_top_k()` for efficient top-k block selection
- Properly handles NPU tensor layouts (BNSD format) and memory pinning
for CPU-NPU transfers
- Maintains backward compatibility with existing CUDA implementation
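The runtime gating above can be sketched as follows. The real code probes `torch.npu.is_available()` (and the existing CUDA path); here the probe results are passed in so the logic runs anywhere, and the NPU-first ordering is an assumption.

```python
# Sketch of the runtime device dispatch described above. Probe results
# are injected as booleans; the NPU-first priority is an assumption.

def select_device_backend(npu_available: bool, cuda_available: bool) -> str:
    """Choose which KvCompOnDevice code path to take."""
    if npu_available:
        return "npu"   # ucm_custom_ops NPU kernels
    if cuda_available:
        return "cuda"  # existing CUDA implementation
    raise RuntimeError("KvCompOnDevice requires an NPU or CUDA device")
```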

Here is the running result:
```bash
(VllmWorker rank=5 pid=112176) [NPU KVComp Debug] layer_name: model.layers.45.self_attn.attn, khash_cache is None
(VllmWorker rank=5 pid=112176) [NPU KVComp Debug] layer_name: model.layers.46.self_attn.attn, is_rollback_layer=False, is_skip_hash_layer=True, k_cache_shape: torch.Size([9081, 128, 1, 128])
(VllmWorker rank=5 pid=112176) [NPU KVComp Debug] layer_name: model.layers.46.self_attn.attn, khash_cache is None
(VllmWorker rank=5 pid=112176) [NPU KVComp Debug] layer_name: model.layers.47.self_attn.attn, is_rollback_layer=False, is_skip_hash_layer=True, k_cache_shape: torch.Size([9081, 128, 1, 128])
(VllmWorker rank=5 pid=112176) INFO 12-28 22:33:57 [core.py:172] init engine (profile, create kv cache, warmup model) took 25.74 seconds
INFO 12-28 22:33:59 [factory.py:74] Creating v1 connector with name: UCMConnector and engine_id: ae921cd1-63c3-4d1e-bfa1-d65f0c688ffa
WARNING 12-28 22:33:59 [base.py:71] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
WARNING 12-28 22:33:59 [base.py:71] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
[2025-12-28 22:33:59] - ucm.integration.vllm.ucm_connector - INFO [ucm_connector.py:106] NPU device is available.
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:81] Using kv_connector_extra_config from terminal input
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:89] Using UCM with config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.integration.vllm.ucm_connector - INFO [ucm_connector.py:124] self.launch_config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.store.factory_v1 - INFO [factory_v1.py:56] Creating connector with name: UcmNfsStore
[2025-12-28 22:33:59.231572][UC][I] PcStore-(Release). [112089,112089][pcstore.cc:89,ShowConfig]
[2025-12-28 22:33:59.231620][UC][I] Set UC::StorageBackends to ["/docker/d00808955/kv_cache/kv"]. [112089,112089][pcstore.cc:90,ShowConfig]
[2025-12-28 22:33:59.231625][UC][I] Set UC::BlockSize to 0. [112089,112089][pcstore.cc:91,ShowConfig]
[2025-12-28 22:33:59.231628][UC][I] Set UC::TransferEnable to false. [112089,112089][pcstore.cc:92,ShowConfig]
[2025-12-28 22:33:59.231631][UC][I] Set UC::UniqueId to ae921cd1-63c3-4d1e-bfa1-d65f0c688ffa. [112089,112089][pcstore.cc:93,ShowConfig]
[2025-12-28 22:33:59.231633][UC][I] Set UC::IoSize to 262144. [112089,112089][pcstore.cc:94,ShowConfig]
[2025-12-28 22:33:59.231635][UC][I] Set UC::IoDirect to false. [112089,112089][pcstore.cc:95,ShowConfig]
[2025-12-28 22:33:59.231638][UC][I] Set UC::LocalRankSize to 1. [112089,112089][pcstore.cc:96,ShowConfig]
[2025-12-28 22:33:59.231641][UC][I] Set UC::DeviceId to -1. [112089,112089][pcstore.cc:97,ShowConfig]
[2025-12-28 22:33:59.231643][UC][I] Set UC::StreamNumber to 8. [112089,112089][pcstore.cc:98,ShowConfig]
[2025-12-28 22:33:59.231646][UC][I] Set UC::BufferNumber to 4096. [112089,112089][pcstore.cc:99,ShowConfig]
[2025-12-28 22:33:59.231648][UC][I] Set UC::TimeoutMs to 30000. [112089,112089][pcstore.cc:100,ShowConfig]
[2025-12-28 22:33:59.231650][UC][I] Set UC::ScatterGatherEnable to false. [112089,112089][pcstore.cc:101,ShowConfig]
[2025-12-28 22:33:59.231652][UC][I] Set UC::ShardDataDir to true. [112089,112089][pcstore.cc:102,ShowConfig]
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:81] Using kv_connector_extra_config from terminal input
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:89] Using UCM with config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:81] Using kv_connector_extra_config from terminal input
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:89] Using UCM with config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.sparse.state - INFO [state.py:51] Initializing UCM sparse agent with method: {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:81] Using kv_connector_extra_config from terminal input
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:89] Using UCM with config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.sparse.factory - INFO [factory.py:43] Creating sparse method with name: KvCompOnDevice
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:81] Using kv_connector_extra_config from terminal input
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:89] Using UCM with config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.sparse.kvcomp.kvcomp_hbm - INFO [kvcomp_hbm.py:122] read kvcomp config file : /docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json
INFO 12-28 22:33:59 [scheduler.py:99] UCM Sparse initialized successfully: <ucm.sparse.kvcomp.kvcomp_hbm.KvCompOnDevice object at 0xfffd90d9ad40>
INFO 12-28 22:33:59 [platform.py:161] Compilation disabled, using eager mode by default
Loaded prompt from: prompts/batch-10k/longprompt1-1.txt
Loaded prompt from: prompts/batch-10k/longprompt1-2.txt
Loaded prompt from: prompts/batch-10k/longprompt1-3.txt
Loaded prompt from: prompts/batch-10k/longprompt1-4.txt
Loaded prompt from: prompts/batch-10k/longprompt1-5.txt
num_requests: 5
length of prompt 0 is: 10756
length of prompt 1 is: 10756
length of prompt 2 is: 10756
length of prompt 3 is: 10756
length of prompt 4 is: 10756
INFO 12-28 22:34:00 [chat_utils.py:444] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests: 100%|██████████| 5/5 [00:00<00:00, 876.08it/s]
Processed prompts: 100%|██████████| 5/5 [01:47<00:00, 21.49s/it, est. speed input: 449.81 toks/s, output: 18.69 toks/s]
Prompt(short): '1\n请阅读以上小说,并回答问题:桃园三结义的三个人分别是谁?他们结义的誓词主要内容是什么?\n不要重复问题,不要重复输出,用简短的语句给出答案。\n小说如下:\n\n    滚滚长江东逝水,浪花淘尽英雄。是非......授人以柄,功必不成,反生乱矣。”何进笑曰:“此懦夫之见也!”傍边一人鼓掌大笑曰:“此事易如反掌,何必多议!”视之,乃曹操也。正是:欲除君侧宵人乱,须听朝中智士谋。不知曹操说出甚话来,且听下文分解。\n\n' with words 10756
Generated text: '<think>\n好的,用户问的是《三国演义》中桃园三结义的三个人分别是谁,以及他们的结义誓词主要内容是什么。首先,我需要确认用户提供的小说内容是否包含相关段落。用户给出的小说开头是《三国演义》的开篇词,接着是关于桃园三结义的描述。不过,用户提供的文本似乎不完整,可能只截取了部分段落。\n\n根据我已有的知识,桃园三结义的三个人是刘备、关羽和张飞。他们结义的誓词通常包括同生共死、共同进退、不求同年同月同日生,但求同年同月同日死。不过,用户提供的文本中可能没有详细描述誓词,所以需要仔细检查是否有遗漏的信息。\n\n用户提供的文本中提到“桃园三结义的三个人分别是谁?他们结义的誓言主要内容是什么?”,但给出的小说内容并没有详细描述结义的具体过程或誓词。因此,我需要依赖已有的知识来回答,但必须确保答案准确,并且不超出小说内容的范围。\n\n可能用户提供的文本是《三国演义》的开头部分,而桃园三结义的故事在后续章节中。因此,正确的做法是根据原著内容回答,但需要指出用户提供的文本中没有详细描述誓词,所以答案可能基于一般常识。不过用户可能希望得到基于他们提供的文本的答案,所以需要确认文本中是否有相关信息。\n\n检查用户提供的文本,发现其中并没有详细描述三人的结义过程或誓词。因此,可能需要依赖外部知识,但用户可能希望答案仅基于提供的文本。不过,根据常见的《三国演义》内容,桃园三结义是刘备、关羽、张飞,誓词是“不求同年同月同日生,但求同年同月同日死”。但用户提供的文本中没有提到这些,所以可能需要指出这一点,但用户可能希望答案基于小说内容,而小说内容可能不完整。\n\n因此,正确的做法是根据用户提供的文本中是否有相关信息,如果没有,可能需要说明无法从提供的文本中找到答案,但根据常识回答。不过用户可能希望直接回答,所以需要结合常识和文本中的信息。\n</think>\n\n桃园三结义的三人是刘备、关羽、张飞。  \n誓词主要内容为:不求同年同月同日生,但愿同年同月同日死,同心协力,救困扶危。' with words 834
Prompt(short): '2\n请阅读以上小说,并回答问题:桃园三结义的三个人分别是谁?他们结义的誓词主要内容是什么?\n不要重复问题,不要重复输出,用简短的语句给出答案。\n小说如下:\n\n    滚滚长江东逝水,浪花淘尽英雄。是非......授人以柄,功必不成,反生乱矣。”何进笑曰:“此懦夫之见也!”傍边一人鼓掌大笑曰:“此事易如反掌,何必多议!”视之,乃曹操也。正是:欲除君侧宵人乱,须听朝中智士谋。不知曹操说出甚话来,且听下文分解。\n\n' with words 10756
Generated text: '<think>\n好的,我需要回答用户关于《三国演义》中桃园三结义的三个人是谁以及他们的誓词主要内容的问题。首先,我得回忆一下小说中的相关情节。桃园三结义是刘备、关羽和张飞,他们是在涿郡张飞的桃园中结义。誓词部分,我记得他们誓言是“不求同年同月同日生,但求同年同月同日死”,强调同生共死的兄弟情谊。需要确认是否有其他内容,比如是否提到共同匡扶汉室,但根据常见版本,核心是同生死。所以答案应该是这三个人的名字和誓词的主要部分。\n</think>\n\n桃园三结义的三人是刘备、关羽、张飞。  \n誓词主要内容为“不求同年同月同日生,但求同年同月同日死”,誓言同心协力,生死相依。' with words 284
```
The pre-commit check also passes:

<img width="1128" height="143" alt="image"
src="https://github.com/user-attachments/assets/1e003294-6518-4344-aad6-705ee87ee2ec"
/>


---------
@mag1c-h mag1c-h merged commit be5bcbc into ModelEngine-Group:0.2.0-release Jan 5, 2026
6 checks passed
@wangwenxin0312 wangwenxin0312 deleted the dev_wwx_020 branch January 5, 2026 11:11