[Perf] kv_transfer/kv_connector: aibrix_pd_reuse_connector: Reuse unaligned tail kvcache #1900
base: main
Conversation
Code Review
This pull request introduces a performance optimization by enabling the reuse of KV cache for unaligned tail parts between prefill and decode stages. The core change involves modifying the KV connector to process the entire prompt, including non-block-aligned tails, rather than just aligned chunks. This is implemented by adding logic to pad the unaligned tail to a full block on the sending side before caching. While the changes on the sending side seem correct, I've identified a critical issue where the receiving side does not appear to handle this new logic for unaligned tails, which would lead to incomplete cache loading. I've also included a couple of suggestions to improve code quality and maintainability.
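For orientation before the diff hunks below, here is a minimal, self-contained sketch of the sending-side idea (padding the non-block-aligned tail up to a full cache block before caching it). All names here (round_up, pad_tail_to_block, block_ntokens) and the 16-token block size are illustrative assumptions, not the connector's actual API.

# Sketch only: pad the unaligned tail of a prompt to a full cache block.
from typing import List, Tuple

def round_up(n: int, multiple: int) -> int:
    # Round n up to the nearest multiple of `multiple`.
    return ((n + multiple - 1) // multiple) * multiple

def pad_tail_to_block(
    tokens: List[int], aligned_len: int, block_ntokens: int, pad_id: int = 0
) -> Tuple[List[int], int]:
    # `aligned_len` is the number of tokens already covered by full blocks;
    # everything after it is the unaligned tail that previously was not cached.
    tail = tokens[aligned_len:]
    real_len = len(tail)
    padded_len = round_up(real_len, block_ntokens)
    return tail + [pad_id] * (padded_len - real_len), real_len

prompt = list(range(100))                  # 100-token prompt
block_ntokens = 16                         # hypothetical cache block size
aligned_len = (len(prompt) // block_ntokens) * block_ntokens  # 96 tokens in full blocks
padded_tail, real_len = pad_tail_to_block(prompt, aligned_len, block_ntokens)
print(real_len, len(padded_tail))          # 4 real tokens, padded to a 16-token block

In the actual hunk below, only the real (unpadded) tail tokens are slot-mapped and offloaded; the padding exists so the cache allocator sees a full block.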
+        # Handle remaining unaligned tokens
+        remaining_tokens_start = aligned_context_len + total_sent
+        if remaining_tokens_start < prompt_len:
+            remaining_tokens = seq_all_tokens[remaining_tokens_start:prompt_len]
+            remaining_len = len(remaining_tokens)
+
+            # For alloc
+            remaining_tokens_rounded = round_up(
+                remaining_len,
+                self.cache_block_ntokens
+            )
+            # Extend tokens to full block
+            remaining_tokens_extended = remaining_tokens + [0] * (
+                remaining_tokens_rounded - remaining_len
+            )
+
+            remaining_prefix = seq_all_tokens[:remaining_tokens_start]
+
+            exists_status = self.cache.exists(
+                remaining_prefix, remaining_tokens_extended
+            )
+            if exists_status.is_ok():
+                num_existing_tokens = exists_status.value
+                if num_existing_tokens >= remaining_len:
+                    return
+
+            # For remaining unaligned KV cache (rounded up to block size)
+            status = self.cache.allocate_for(
+                remaining_prefix, remaining_tokens_extended
+            )
+            if not status.is_ok():
+                log_every_n_seconds(
+                    logger,
+                    logging.ERROR,
+                    "Failed to allocate for remaining unaligned part: %s",
+                    3,
+                    str(status)
+                )
+                return
+
+            handle = status.value
+            tensors = handle.to_tensors()
+            allocated_length = len(tensors) * self.cache_block_ntokens
+
+            # Slot mapping for remaining unaligned part (no padding)
+            remaining_slot_mapping = seq_cached_meta.context_slot_mapping[
+                remaining_tokens_start:remaining_tokens_start + remaining_len
+            ]
+
+            with perf_timer() as get_kernel_offload_dur_ms:
+                reshape_and_offload_multi_layer(
+                    tensors,
+                    self.layers_kv_caches,
+                    remaining_slot_mapping,
+                    self.engine_block_ntokens,
+                    self.kv_cache_dtype,
+                    self.k_scales,
+                    self.v_scales,
+                    self.block_layout.name,
+                )
+
+            # Full block with padding
+            status = self.cache.put(
+                remaining_prefix,
+                remaining_tokens_extended[:allocated_length],
+                handle
+            )
+            if not status.is_ok():
+                log_every_n_seconds(
+                    logger,
+                    logging.ERROR,
+                    "Failed to put remaining unaligned part to cache: %s",
+                    3,
+                    str(status)
+                )
+                return
+
+            put_ntokens = status.get()
+            total_sent += min(put_ntokens, remaining_len)
The new block of code to handle remaining unaligned tokens is quite large (around 80 lines) and is nested inside the _send_kv_to_cache_impl method. To improve readability and maintainability of this function, consider extracting this logic into a separate private helper method, for example, _send_unaligned_tail_to_cache.
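A possible shape for that extraction, as a rough sketch only (the parameter list is inferred from the variables visible in the hunk above and may not match the actual connector code):

# Hypothetical skeleton of the suggested helper; not the real signature.
def _send_unaligned_tail_to_cache(
    self,
    seq_all_tokens,
    seq_cached_meta,
    remaining_tokens_start: int,
    prompt_len: int,
) -> int:
    # Offload the non-block-aligned tail; return the number of real tokens sent.
    if remaining_tokens_start >= prompt_len:
        return 0
    # ... the ~80-line block above moves here unchanged ...
    return 0  # placeholder; the real body would return min(put_ntokens, remaining_len)

# Call site inside _send_kv_to_cache_impl (sketch):
#     total_sent += self._send_unaligned_tail_to_cache(
#         seq_all_tokens, seq_cached_meta,
#         aligned_context_len + total_sent, prompt_len,
#     )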
+                chunk_len - num_existing_tokens
+                < self.cache_block_ntokens
+            ):
+                continue
which case will get into this if branch?
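For a purely illustrative reading of the condition (numbers not taken from the patch): with cache_block_ntokens = 16, a chunk of chunk_len = 20 that already has num_existing_tokens = 10 cached leaves 20 - 10 = 10 tokens to write, which is less than one full 16-token block, so the chunk is skipped with continue.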
+                : new_chunk_prefix_len + chunk_len
+            ]
+            # Re-calc tokens_to_alloc after adjusting chunk_tokens
+            if is_unaligned:
Seems like an unreachable branch, right? If is_unaligned is true, then it will always hit L1061.
+                continue
+            if is_unaligned:
+                # For unaligned, check if we have enough existing tokens
+                if num_existing_tokens >= chunk_len:
What if num_existing_tokens < chunk_len?
Pull Request Description
Save and reuse the KV cache of the unaligned tail part between the prefill (P) and decode (D) stages.
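For example, assuming a cache block size of 16 tokens: a 70-token prompt fills four full blocks (64 tokens), leaving a 6-token tail that is not block-aligned. Previously only the aligned 64 tokens were transferred; with this change the tail is padded up to a full block on the prefill side, written to the cache, and reused on the decode side (the block size and token counts here are illustrative).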
Related Issues
Resolves: #[Insert issue number(s)]
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines (Expand for Details)
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
[Bug]: Corrections to existing functionality
[CI]: Changes to build process or CI pipeline
[Docs]: Updates or additions to documentation
[API]: Modifications to aibrix's API or interface
[CLI]: Changes or additions to the Command Line Interface
[Misc]: For changes not covered above (use sparingly)
Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.