
Conversation

@jmkuebler (Contributor) commented Sep 15, 2025

[WIP] see #24916 for context

Known limitations

Purpose

This PR adds a flag to skip the sliding-window (SW) layers when quantizing the attention KV cache to FP8. The idea is that we cannot save much memory or latency in those layers anyway, yet quantizing them adds overhead. Furthermore, quantization always carries additional quality risk. Not quantizing the SW layers leads to better performance, as outlined in #24916.
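A minimal sketch of the layer-selection idea, assuming a hypothetical `skip_sliding_window_quant` flag and a simplified `LayerSpec` descriptor (the actual flag name and config plumbing in this PR may differ):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerSpec:
    # Hypothetical descriptor; in vLLM the sliding-window size lives on the
    # attention layer / model config rather than on a standalone dataclass.
    name: str
    sliding_window: Optional[int]  # None => full attention


def select_kv_cache_dtypes(layers, kv_cache_dtype="fp8",
                           skip_sliding_window_quant=True):
    """Return {layer name: KV-cache dtype}, leaving SW layers unquantized."""
    dtypes = {}
    for layer in layers:
        is_sw = layer.sliding_window is not None
        if skip_sliding_window_quant and is_sw:
            dtypes[layer.name] = "bfloat16"      # keep SW layers in BF16
        else:
            dtypes[layer.name] = kv_cache_dtype  # quantize full-attn layers
    return dtypes


layers = [LayerSpec("layers.0.attn", sliding_window=128),
          LayerSpec("layers.1.attn", sliding_window=None)]
print(select_kv_cache_dtypes(layers))
# {'layers.0.attn': 'bfloat16', 'layers.1.attn': 'fp8'}
```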

Turning off quantization for all SW layers gives us two groups of attention configs:

  • Full attention layers in FP8
  • SW attention layers in BF16

We currently use 2x the block size for the FP8 full-attention layers so that the page size is equal between the two groups. However, as pointed out by @heheda12345, this currently makes it incompatible with prefix caching (which requires a uniform block size).
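A quick sketch of the page-size arithmetic, with illustrative shapes (the real num_kv_heads / head_size come from the model config): FP8 stores one byte per element versus two for BF16, so doubling the FP8 block size keeps the per-block page size equal:

```python
def page_size_bytes(block_size, num_kv_heads, head_size, dtype_bytes):
    # Bytes per KV-cache block: tokens * (K and V) * heads * head dim * element size.
    return block_size * 2 * num_kv_heads * head_size * dtype_bytes


# Illustrative shapes only.
num_kv_heads, head_size = 8, 64

bf16_sw_page = page_size_bytes(block_size=16, num_kv_heads=num_kv_heads,
                               head_size=head_size, dtype_bytes=2)   # BF16 SW layers
fp8_full_page = page_size_bytes(block_size=32, num_kv_heads=num_kv_heads,
                                head_size=head_size, dtype_bytes=1)  # FP8 full attn, 2x block size
assert bf16_sw_page == fp8_full_page  # equal page sizes across both groups
```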

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify[bot] added the gpt-oss (Related to GPT-OSS models) and v1 labels Sep 15, 2025

mergify bot commented Sep 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jmkuebler.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@heheda12345 (Collaborator) left a comment

Is the different block_size of SWA and full attn compatible with prefix caching? We only have the BlockHash for one block_size. My plan is to combine the KV cache of two full-attn layers so that it has the same size as one sliding-window layer. I'm prototyping this.
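To illustrate the concern, a toy sketch (not vLLM's actual BlockHash implementation) of why chained block hashes computed with two different block sizes never line up, so layer groups with different block sizes could not share one prefix-cache table:

```python
import hashlib


def block_hashes(token_ids, block_size):
    """Toy prefix-cache keys: hash each full block chained with its parent."""
    hashes, parent = [], b""
    num_full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, num_full, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(parent + str(block).encode("utf-8")).digest()
        hashes.append(parent)
    return hashes


tokens = list(range(64))
# Keys computed with block_size=16 and block_size=32 never match, so the two
# groups cannot look each other's blocks up in a single prefix-cache table.
assert block_hashes(tokens, 16) != block_hashes(tokens, 32)
```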

@robertgshaw2-redhat changed the title from "enable skipping of SW layers for attn quant" to "[HMA] Enabling Skipping SWA for Fp8 Quant" Sep 16, 2025
@jmkuebler (Contributor, Author) commented

> Is the different block_size of SWA and full attn compatible with prefix caching? We only have the BlockHash for one block_size. My plan is to combine the KV cache of two full-attn layers so that it has the same size as one sliding-window layer. I'm prototyping this.

@heheda12345 Thanks a lot for looking into this. Great catch. I overlooked that, since all the perf benchmarking I did had prefix caching disabled. But yeah, with prefix caching enabled it breaks because of the non-uniform block sizes.

I was focusing on ensuring that the page size is uniform. If we also need the block size to be uniform, I cannot think of a way other than combining blocks either...

@heheda12345 (Collaborator) commented

Here is my PR for supporting different data types by adjusting block_size. Can you give it a try? #24949

…_quant

# Conflicts:
#	vllm/v1/core/kv_cache_utils.py
@mergify[bot] removed the needs-rebase label Sep 17, 2025
@jmkuebler (Contributor, Author) commented Sep 17, 2025

> Here is my PR for supporting different data types by adjusting block_size. Can you give it a try? #24949

@heheda12345
This is awesome. I merged your PR locally and it enables prefix caching. When sending the same request twice, it correctly registers a prefix-cache hit and outputs the same generation 🙌.
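For reference, the check was roughly the following (the model name and sampling settings are illustrative, and the new flag from this PR is omitted here):

```python
from vllm import LLM, SamplingParams

# Prefix caching on, FP8 KV cache; the PR's SW-skip flag is intentionally left out.
llm = LLM(model="openai/gpt-oss-20b",
          kv_cache_dtype="fp8",
          enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy, for comparability

prompt = "Explain sliding-window attention in one paragraph."
first = llm.generate([prompt], params)[0].outputs[0].text
second = llm.generate([prompt], params)[0].outputs[0].text  # should hit the prefix cache

assert first == second
```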

Probably makes sense to update my PR only once yours is merged?

@heheda12345 (Collaborator) commented

> Probably makes sense to update my PR only once yours is merged?

Sounds good.


mergify bot commented Sep 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jmkuebler.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify[bot] added the needs-rebase label Sep 21, 2025