@wangwenxin0312 (Contributor)
Purpose

What this PR does / why we need it?
KVComp in NPU -- HBM version in 0.2.0-release

Modifications

Does this PR introduce any user-facing change?

Test

How was this patch tested?

@wangwenxin0312 wangwenxin0312 force-pushed the dev_wwx_020 branch 2 times, most recently from ccc3670 to ad9f472 Compare January 4, 2026 02:32
@yuanzhg078 yuanzhg078 requested review from yuanzhg078 and removed request for yuanzhg078 January 4, 2026 02:36
qyh111 previously approved these changes Jan 4, 2026
leideng and others added 2 commits January 5, 2026 11:09

This PR adds NPU (Ascend) device support for `KvCompOnDevice`
functionality (HBM version) in unified-cache-management. The
implementation enables efficient KV cache compression and sparse
attention on Ascend NPU hardware, complementing the existing CUDA
support.

**Key features:**
- NPU device detection and initialization for `KvCompOnDevice`
- NPU-optimized hash computation and caching using
`ucm_custom_ops.reshape_and_cache_bnsd`
- NPU-specific hamming distance top-k computation using
`ucm_custom_ops.hamming_dist_top_k` for efficient KV block selection
- NPU-specific metadata building (`build_decode_attention_meta_npu`)
that handles NPU tensor layouts and memory pinning
- NPU-specific KV hash cache initialization
(`initialize_kv_hash_cache_tensors_npu`) with proper shape handling for
NPU
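The block-selection step above can be illustrated in plain Python. This is a minimal sketch only: the PR itself calls the NPU kernel `ucm_custom_ops.hamming_dist_top_k`, and the integer hash widths and block counts here are hypothetical.

```python
# Plain-Python illustration of hamming-distance top-k block selection.
# The real implementation uses ucm_custom_ops.hamming_dist_top_k on NPU.

def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")

def top_k_blocks(query_hash: int, block_hashes: list[int], k: int) -> list[int]:
    """Indices of the k cached blocks whose hashes are closest to the query."""
    ranked = sorted(range(len(block_hashes)),
                    key=lambda i: hamming_distance(query_hash, block_hashes[i]))
    return ranked[:k]
```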

**Why we need it:**
- Speeds up attention computation during the decode phase on Ascend NPU
platforms (Atlas 800I A2)
- Delivers performance gains through hardware-accelerated operations on
the NPU
- Maintains feature parity with the CUDA implementation while leveraging
NPU-specific optimizations

**Dependencies:**
- Requires `ucm_custom_ops` module for NPU-specific kernel operations


**Yes, this PR introduces user-facing changes:**

1. **New NPU device support**: Users can now use `KvCompOnDevice` with
`device_type="npu"` in their vLLM configuration
2. **New initialization method**: Added
`initialize_kv_hash_cache_tensors_npu()` method for NPU-specific KV hash
cache initialization
3. **New metadata building method**: Added
`build_decode_attention_meta_npu()` method that handles NPU-specific
tensor operations and memory management
4. **NPU-specific configuration**: Added NPU-specific configuration
parameters:
   - `max_batch_size` for NPU batch size limits
   - `hamming_keep_chunks_head` and `hamming_keep_chunks_tail` for
hamming distance computation
   - `seq_len_threshhold`, a sequence length threshold for decode phase
detection
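As a usage sketch, the config passed at launch might look like the following, mirroring the structure visible in the test log; the JSON file path is a placeholder.

```python
# Sketch of the kv_connector_extra_config entry enabling KvCompOnDevice.
# Key names mirror the config printed in the test log; the path is a
# placeholder.
kv_connector_extra_config = {
    "ucm_sparse_config": {
        "KvCompOnDevice": {
            "kvcompOnDevice_config_path": "/path/to/kvcomp_qwen3_32B_config.json",
            "max_batch_size": 30,
        }
    }
}
```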

**Implementation details:**
- NPU code paths are gated at runtime on `torch.npu.is_available()`
- Uses `ucm_custom_ops` for NPU-optimized operations:
  - `reshape_and_cache_bnsd()` for hash cache storage
  - `hamming_dist_top_k()` for efficient top-k block selection
- Properly handles NPU tensor layouts (BNSD format) and memory pinning
for CPU-NPU transfers
- Maintains backward compatibility with existing CUDA implementation
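The runtime gating above can be sketched as follows. The real code probes `torch.npu.is_available()` (and the existing CUDA path); here the probe results are passed in so the logic runs anywhere, and the NPU-first ordering is an assumption.

```python
# Sketch of the runtime device dispatch described above. Probe results
# are injected as booleans; the NPU-first priority is an assumption.

def select_device_backend(npu_available: bool, cuda_available: bool) -> str:
    """Choose which KvCompOnDevice code path to take."""
    if npu_available:
        return "npu"   # ucm_custom_ops NPU kernels
    if cuda_available:
        return "cuda"  # existing CUDA implementation
    raise RuntimeError("KvCompOnDevice requires an NPU or CUDA device")
```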

Here is the running result:
```bash
(VllmWorker rank=5 pid=112176) [NPU KVComp Debug] layer_name: model.layers.45.self_attn.attn, khash_cache is None
(VllmWorker rank=5 pid=112176) [NPU KVComp Debug] layer_name: model.layers.46.self_attn.attn, is_rollback_layer=False, is_skip_hash_layer=True, k_cache_shape: torch.Size([9081, 128, 1, 128])
(VllmWorker rank=5 pid=112176) [NPU KVComp Debug] layer_name: model.layers.46.self_attn.attn, khash_cache is None
(VllmWorker rank=5 pid=112176) [NPU KVComp Debug] layer_name: model.layers.47.self_attn.attn, is_rollback_layer=False, is_skip_hash_layer=True, k_cache_shape: torch.Size([9081, 128, 1, 128])
(VllmWorker rank=5 pid=112176) INFO 12-28 22:33:57 [core.py:172] init engine (profile, create kv cache, warmup model) took 25.74 seconds
INFO 12-28 22:33:59 [factory.py:74] Creating v1 connector with name: UCMConnector and engine_id: ae921cd1-63c3-4d1e-bfa1-d65f0c688ffa
WARNING 12-28 22:33:59 [base.py:71] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
WARNING 12-28 22:33:59 [base.py:71] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
[2025-12-28 22:33:59] - ucm.integration.vllm.ucm_connector - INFO [ucm_connector.py:106] NPU device is available.
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:81] Using kv_connector_extra_config from terminal input
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:89] Using UCM with config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.integration.vllm.ucm_connector - INFO [ucm_connector.py:124] self.launch_config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.store.factory_v1 - INFO [factory_v1.py:56] Creating connector with name: UcmNfsStore
[2025-12-28 22:33:59.231572][UC][I] PcStore-(Release). [112089,112089][pcstore.cc:89,ShowConfig]
[2025-12-28 22:33:59.231620][UC][I] Set UC::StorageBackends to ["/docker/d00808955/kv_cache/kv"]. [112089,112089][pcstore.cc:90,ShowConfig]
[2025-12-28 22:33:59.231625][UC][I] Set UC::BlockSize to 0. [112089,112089][pcstore.cc:91,ShowConfig]
[2025-12-28 22:33:59.231628][UC][I] Set UC::TransferEnable to false. [112089,112089][pcstore.cc:92,ShowConfig]
[2025-12-28 22:33:59.231631][UC][I] Set UC::UniqueId to ae921cd1-63c3-4d1e-bfa1-d65f0c688ffa. [112089,112089][pcstore.cc:93,ShowConfig]
[2025-12-28 22:33:59.231633][UC][I] Set UC::IoSize to 262144. [112089,112089][pcstore.cc:94,ShowConfig]
[2025-12-28 22:33:59.231635][UC][I] Set UC::IoDirect to false. [112089,112089][pcstore.cc:95,ShowConfig]
[2025-12-28 22:33:59.231638][UC][I] Set UC::LocalRankSize to 1. [112089,112089][pcstore.cc:96,ShowConfig]
[2025-12-28 22:33:59.231641][UC][I] Set UC::DeviceId to -1. [112089,112089][pcstore.cc:97,ShowConfig]
[2025-12-28 22:33:59.231643][UC][I] Set UC::StreamNumber to 8. [112089,112089][pcstore.cc:98,ShowConfig]
[2025-12-28 22:33:59.231646][UC][I] Set UC::BufferNumber to 4096. [112089,112089][pcstore.cc:99,ShowConfig]
[2025-12-28 22:33:59.231648][UC][I] Set UC::TimeoutMs to 30000. [112089,112089][pcstore.cc:100,ShowConfig]
[2025-12-28 22:33:59.231650][UC][I] Set UC::ScatterGatherEnable to false. [112089,112089][pcstore.cc:101,ShowConfig]
[2025-12-28 22:33:59.231652][UC][I] Set UC::ShardDataDir to true. [112089,112089][pcstore.cc:102,ShowConfig]
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:81] Using kv_connector_extra_config from terminal input
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:89] Using UCM with config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:81] Using kv_connector_extra_config from terminal input
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:89] Using UCM with config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.sparse.state - INFO [state.py:51] Initializing UCM sparse agent with method: {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:81] Using kv_connector_extra_config from terminal input
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:89] Using UCM with config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.sparse.factory - INFO [factory.py:43] Creating sparse method with name: KvCompOnDevice
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:81] Using kv_connector_extra_config from terminal input
[2025-12-28 22:33:59] - ucm.utils - INFO [utils.py:89] Using UCM with config: {'ucm_connectors': [{'ucm_connector_name': 'UcmNfsStore', 'ucm_connector_config': {'storage_backends': '/docker/d00808955/kv_cache', 'use_direct': False}}], 'ucm_sparse_config': {'KvCompOnDevice': {'kvcompOnDevice_config_path': '/docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json', 'max_batch_size': 30}}}
[2025-12-28 22:33:59] - ucm.sparse.kvcomp.kvcomp_hbm - INFO [kvcomp_hbm.py:122] read kvcomp config file : /docker/d00808955/ucm-github/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json
INFO 12-28 22:33:59 [scheduler.py:99] UCM Sparse initialized successfully: <ucm.sparse.kvcomp.kvcomp_hbm.KvCompOnDevice object at 0xfffd90d9ad40>
INFO 12-28 22:33:59 [platform.py:161] Compilation disabled, using eager mode by default
Loaded prompt from: prompts/batch-10k/longprompt1-1.txt
Loaded prompt from: prompts/batch-10k/longprompt1-2.txt
Loaded prompt from: prompts/batch-10k/longprompt1-3.txt
Loaded prompt from: prompts/batch-10k/longprompt1-4.txt
Loaded prompt from: prompts/batch-10k/longprompt1-5.txt
num_requests: 5
length of prompt 0 is: 10756
length of prompt 1 is: 10756
length of prompt 2 is: 10756
length of prompt 3 is: 10756
length of prompt 4 is: 10756
INFO 12-28 22:34:00 [chat_utils.py:444] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests: 100%|██████████| 5/5 [00:00<00:00, 876.08it/s]
Processed prompts: 100%|██████████| 5/5 [01:47<00:00, 21.49s/it, est. speed input: 449.81 toks/s, output: 18.69 toks/s]
Prompt(short): '1\n请阅读以上小说,并回答问题:桃园三结义的三个人分别是谁?他们结义的誓词主要内容是什么?\n不要重复问题,不要重复输出,用简短的语句给出答案。\n小说如下:\n\n    滚滚长江东逝水,浪花淘尽英雄。是非......授人以柄,功必不成,反生乱矣。”何进笑曰:“此懦夫之见也!”傍边一人鼓掌大笑曰:“此事易如反掌,何必多议!”视之,乃曹操也。正是:欲除君侧宵人乱,须听朝中智士谋。不知曹操说出甚话来,且听下文分解。\n\n' with words 10756
Generated text: '<think>\n好的,用户问的是《三国演义》中桃园三结义的三个人分别是谁,以及他们的结义誓词主要内容是什么。首先,我需要确认用户提供的小说内容是否包含相关段落。用户给出的小说开头是《三国演义》的开篇词,接着是关于桃园三结义的描述。不过,用户提供的文本似乎不完整,可能只截取了部分段落。\n\n根据我已有的知识,桃园三结义的三个人是刘备、关羽和张飞。他们结义的誓词通常包括同生共死、共同进退、不求同年同月同日生,但求同年同月同日死。不过,用户提供的文本中可能没有详细描述誓词,所以需要仔细检查是否有遗漏的信息。\n\n用户提供的文本中提到“桃园三结义的三个人分别是谁?他们结义的誓言主要内容是什么?”,但给出的小说内容并没有详细描述结义的具体过程或誓词。因此,我需要依赖已有的知识来回答,但必须确保答案准确,并且不超出小说内容的范围。\n\n可能用户提供的文本是《三国演义》的开头部分,而桃园三结义的故事在后续章节中。因此,正确的做法是根据原著内容回答,但需要指出用户提供的文本中没有详细描述誓词,所以答案可能基于一般常识。不过用户可能希望得到基于他们提供的文本的答案,所以需要确认文本中是否有相关信息。\n\n检查用户提供的文本,发现其中并没有详细描述三人的结义过程或誓词。因此,可能需要依赖外部知识,但用户可能希望答案仅基于提供的文本。不过,根据常见的《三国演义》内容,桃园三结义是刘备、关羽、张飞,誓词是“不求同年同月同日生,但求同年同月同日死”。但用户提供的文本中没有提到这些,所以可能需要指出这一点,但用户可能希望答案基于小说内容,而小说内容可能不完整。\n\n因此,正确的做法是根据用户提供的文本中是否有相关信息,如果没有,可能需要说明无法从提供的文本中找到答案,但根据常识回答。不过用户可能希望直接回答,所以需要结合常识和文本中的信息。\n</think>\n\n桃园三结义的三人是刘备、关羽、张飞。  \n誓词主要内容为:不求同年同月同日生,但愿同年同月同日死,同心协力,救困扶危。' with words 834
Prompt(short): '2\n请阅读以上小说,并回答问题:桃园三结义的三个人分别是谁?他们结义的誓词主要内容是什么?\n不要重复问题,不要重复输出,用简短的语句给出答案。\n小说如下:\n\n    滚滚长江东逝水,浪花淘尽英雄。是非......授人以柄,功必不成,反生乱矣。”何进笑曰:“此懦夫之见也!”傍边一人鼓掌大笑曰:“此事易如反掌,何必多议!”视之,乃曹操也。正是:欲除君侧宵人乱,须听朝中智士谋。不知曹操说出甚话来,且听下文分解。\n\n' with words 10756
Generated text: '<think>\n好的,我需要回答用户关于《三国演义》中桃园三结义的三个人是谁以及他们的誓词主要内容的问题。首先,我得回忆一下小说中的相关情节。桃园三结义是刘备、关羽和张飞,他们是在涿郡张飞的桃园中结义。誓词部分,我记得他们誓言是“不求同年同月同日生,但求同年同月同日死”,强调同生共死的兄弟情谊。需要确认是否有其他内容,比如是否提到共同匡扶汉室,但根据常见版本,核心是同生死。所以答案应该是这三个人的名字和誓词的主要部分。\n</think>\n\n桃园三结义的三人是刘备、关羽、张飞。  \n誓词主要内容为“不求同年同月同日生,但求同年同月同日死”,誓言同心协力,生死相依。' with words 284
```
The pre-commit check also passes:

<img width="1128" height="143" alt="image"
src="https://github.com/user-attachments/assets/1e003294-6518-4344-aad6-705ee87ee2ec"
/>


---------
@mag1c-h mag1c-h merged commit be5bcbc into ModelEngine-Group:0.2.0-release Jan 5, 2026
6 checks passed
@wangwenxin0312 wangwenxin0312 deleted the dev_wwx_020 branch January 5, 2026 11:11