
Conversation

@LuminolT LuminolT commented Nov 21, 2025

Purpose

This PR introduces xxHash-based hash algorithms to vLLM’s prefix caching, providing a faster and collision-resistant alternative to the SHA-256 variants added in #23673.

Why xxHash? Prefix caching does not require cryptographic security:

  • It never stores secret data.
  • It never leaves process boundaries.
  • It is used exclusively to identify structural equivalence of prefix blocks.

Therefore, cryptographic collision resistance is unnecessary.

xxHash strikes the right balance:

  • Much better collision resistance than Python's builtin hash(), which is salted and unstable across processes.
  • Much faster than the SHA-256 and SHA-256+CBOR hashes used in current prefix caching.

This change is lightweight: it only introduces the xxhash dependency and adds minimal adapter code to integrate xxHash into the existing hashing interface. We added two small benchmarking scripts to validate correctness and performance, but these can be removed in follow-up cleanup if needed.
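The adapter is conceptually small. A minimal sketch of the pattern (function names here are illustrative, not vLLM's exact interface):

```python
import hashlib
import pickle
from typing import Any

try:
    # Optional dependency: install with `pip install xxhash`.
    import xxhash
except ImportError:
    xxhash = None


def sha256_pickle(obj: Any) -> bytes:
    """Current default: SHA-256 over the pickled object (32-byte digest)."""
    return hashlib.sha256(pickle.dumps(obj)).digest()


def xxhash_pickle(obj: Any) -> bytes:
    """Opt-in fast path: xxHash3-128 over the pickled object (16-byte digest)."""
    if xxhash is None:
        raise ImportError(
            "xxhash is required for --prefix-caching-hash-algo xxhash"
        )
    return xxhash.xxh3_128_digest(pickle.dumps(obj))
```

Both variants return stable digests across processes, unlike the builtin hash(), whose output is randomized per interpreter via PYTHONHASHSEED.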

⚠️ The only drawback is that xxHash introduces a third-party dependency: SHA-256 is available via the standard library's hashlib, while xxHash requires the additional xxhash package.

As a result, this PR adds xxHash as an optional high-performance hash. Users can opt in with:

--prefix-caching-hash-algo xxhash
--prefix-caching-hash-algo xxhash-cbor

Test Plan

1. Micro benchmark of hash functions

benchmarks/benchmark_hash.py is added to benchmark the hash functions. It can be removed in follow-up cleanup once the PR is merged.

This evaluation is the same as in #23673, except for some xxHash-specific configurations.

python benchmarks/benchmark_hash.py --iterations 10000

This compares Python's builtin hash(), sha256 and xxHash.
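The script's core measurement can be reproduced with a few lines of stdlib timeit (a simplified sketch, not the script itself; it omits the xxHash case since that package is optional):

```python
import hashlib
import pickle
import timeit

# Payload shaped like the benchmark's test data: a 32-byte bytes object
# plus a 32-int tuple.
payload = (b"\x00" * 32, tuple(range(32)))
pickled = pickle.dumps(payload)

n = 10_000
sha_s = timeit.timeit(lambda: hashlib.sha256(pickled).digest(), number=n)
builtin_s = timeit.timeit(lambda: hash(payload), number=n)

print(f"SHA256 (pickle): {sha_s / n * 1e6:.2f} us/op")
print(f"built-in hash(): {builtin_s / n * 1e6:.2f} us/op")
```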

2. Block-level hash throughput

benchmarks/benchmark_prefix_block_hash.py is added for block-level hash throughput evaluation. It can be removed in follow-up cleanup once the PR is merged.

This test is model-agnostic, so it reflects performance on larger models as well, rather than only on the small OPT-125M model evaluated below.

python benchmarks/benchmark_prefix_block_hash.py --num-blocks 20000 --block-size 64
python benchmarks/benchmark_prefix_block_hash.py --num-blocks 50000 --block-size 16

This measures token hashing throughput for: sha256, sha256_cbor, xxhash and xxhash_cbor.
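For context, prefix caching hashes fixed-size token blocks, chaining each block's hash to its parent so that identical blocks at different prefix positions hash differently. A simplified model of that loop (illustrative, not vLLM's actual code; shown with the sha256 variant):

```python
import hashlib
import pickle
import time


def hash_blocks(token_ids: list[int], block_size: int) -> list[bytes]:
    """Hash full token blocks, folding each parent hash into the next block."""
    parent = b""
    hashes = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = tuple(token_ids[start:start + block_size])
        # The (parent, block) pair makes the hash position-dependent.
        parent = hashlib.sha256(pickle.dumps((parent, block))).digest()
        hashes.append(parent)
    return hashes


tokens = list(range(64 * 1000))  # 1000 blocks of 64 tokens
t0 = time.perf_counter()
hashes = hash_blocks(tokens, block_size=64)
dt = time.perf_counter() - t0
print(f"{len(hashes)} blocks, {len(tokens) / dt / 1e6:.2f}M tokens/s")
```

Swapping hashlib.sha256 for an xxHash digest in this loop is essentially what the xxhash variant measures.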

3. Prefix caching evaluation

This is the benchmark currently provided by vLLM. We tested it using different hash algorithms and compared their results.

The following evaluations were conducted on a single-A100 server.

python benchmarks/benchmark_prefix_caching.py \
  --model facebook/opt-125m \
  --enable-prefix-caching \
  --prefix-caching-hash-algo <algo> \
  --input-length-range "128:256" \
  --output-len 128 \
  --num-prompts 200 \
  --repeat-count 5 \
  --seed 42

4. Long-running throughput & GC behavior

Same as #23673. @Jialin's Jialin#4 is used for GC analysis. Since SHA-256 already significantly reduced GC overhead, our goal is to match or further reduce GC time compared to SHA-256.

Input/Output configurations:

  • input_len=1500, output_len=1
  • input_len=1500, output_len=150

Each run generates logs:

throughput_sha256_out1_*.log
throughput_sha256_cbor_out1_*.log
throughput_xxhash_out1_*.log
throughput_xxhash_cbor_out1_*.log
...

Test Result

1. Micro benchmark of hash functions

# python benchmarks/benchmark_hash.py --iterations 10000
============================================================
HASH FUNCTION MICRO BENCHMARK
============================================================
Test data: (32-byte bytes object, 32-int tuple)
Iterations: 10,000
============================================================

Results:
  SHA256 (pickle) :     1.45 ±   0.24 μs
  xxHash (pickle) :     0.99 ±   0.16 μs
  built-in hash() :     0.19 ±   0.07 μs

============================================================
SUMMARY (relative to built-in hash())
============================================================
• SHA256 (pickle) is 7.8x slower than built-in hash()
• xxHash (pickle) is 5.3x slower than built-in hash()

👉 Among non-builtin hashers, xxHash is ~1.5× faster than SHA-256.

2. Block-level hash throughput

20k × 64 tokens and 50k × 16 tokens.

# python benchmarks/benchmark_prefix_block_hash.py --num-blocks 20000 --block-size 64
Benchmarking 4 algorithms on 20000 blocks (block size=64).
sha256         avg: 0.044319s  best: 0.043920s  throughput: 29.14M tokens/s
sha256_cbor    avg: 0.195202s  best: 0.194645s  throughput: 6.58M tokens/s
xxhash         avg: 0.035225s  best: 0.035059s  throughput: 36.51M tokens/s
xxhash_cbor    avg: 0.184370s  best: 0.183651s  throughput: 6.97M tokens/s

👉 xxHash improves block hashing throughput by 25–40%.

3. Prefix caching evaluation

Algo          Time (s)
sha256        ~4.17
sha256_cbor   ~4.25
xxhash        ~4.12
xxhash_cbor   ~4.23

👉 Performance improves 1–2% with xxHash.

Note that our experiments were conducted only on the small OPT-125M model. In larger production services, where prefix-hit rates are substantially higher, xxHash is expected to show even greater performance advantages.

4. Long-running throughput & GC behavior

output_len = 1 (prefill-dominant)

  • sha256: 219.9 req/s
  • xxhash: 223.4 req/s (+1.6%)

GC:

  • Both ~300 events
  • GC runtime ≈ 1.3–1.4% of wall clock
  • Main GC distribution almost identical
  • No new long-tail events from xxHash

output_len = 150 (decode-heavy)

  • sha256: 77.9 req/s
  • xxhash: 78.9 req/s (+1.2%)

GC:

  • sha256: total GC ~1.92 s
  • xxhash: total GC ~1.68 s (−12.8%)
  • xxHash shows fewer and shorter >400ms GC outliers

CBOR variants


Similar behavior:

  • xxHash_CBOR slightly faster than SHA256_CBOR
  • GC profile remains ~1–2% of wall clock
  • No regressions or long-tail amplification

👉 xxHash does not increase GC pressure and may even slightly reduce tail events.



@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces xxHash as a high-performance, non-cryptographic hashing option for prefix caching, which is a sensible optimization. The implementation is well-structured and includes comprehensive benchmarks to validate the performance gains. My review focuses on improving the robustness of handling the new optional dependency, xxhash. I've suggested a change to ensure compatibility with older versions of the xxhash library and another to make the test suite resilient to its absence. Overall, this is a solid contribution.

@mergify mergify bot commented Nov 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LuminolT.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 21, 2025
@mergify mergify bot removed the tpu and needs-rebase labels Nov 21, 2025
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small and essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

…dling for invalid hash algorithms and introduce xxhash support

Signed-off-by: LuminolT <[email protected]>

@ywang96 ywang96 (Member) left a comment

@LuminolT Very much appreciate the contribution, especially all the detailed analysis you put out in this PR!

cc @russellb regarding any security concerns.

Member

Maybe it's worth mentioning these scripts in the documentation.

Author

Thanks for pointing this out! It will be helpful for evaluating the hash algorithms used in prefix caching. I will update the relevant documentation accordingly.

@mergify mergify bot commented Nov 21, 2025

Documentation preview: https://vllm--29163.org.readthedocs.build/en/29163/

@russellb russellb (Member) left a comment

First, I appreciate the PR, including the detailed description and performance numbers.

Why xxHash? Prefix Caching does not require cryptographic security:

It never stores secret data.
It never leaves process boundaries.
It is used exclusively to identify structural equivalence of prefix blocks.

Therefore, cryptographic collision resistance is unnecessary.

I disagree with this premise.

Prefix cache contents are secret data in a multi-tenant environment. A hash collision would cause undefined behavior and, at worst, leak private information. Consider this scenario:

  • Company A sends a request
  • Company B sends a request that encounters a hash collision, causing Company B's request to use cache based on Company A's request.

We had one security vulnerability where it was possible to get this to occur predictably: GHSA-rm76-4mrf-v9r8

This is a bit different, but it is another example of prefix cache security concerns. We had another case where someone demonstrated how to infer information about the contents of the prefix cache by using a timing attack to detect prefix cache hits. See this paper: https://arxiv.org/html/2411.18191v1, this PR: #17045, and this advisory: GHSA-4qjh-9fv9-r85r.

I'm not opposed to supporting this if it demonstrates a meaningful performance benefit, but I would also like to enhance our help text for the configuration to say that choosing something other than our current default of SHA256 is theoretically less secure in a multi-tenant environment and has a slightly increased risk of undefined behavior from hash collisions otherwise. Further, the code should include an additional comment clarifying the importance of the current default. I can post a separate PR adding this text, but I'd prefer it to be in place before we add other algorithms, as the tradeoffs are important.
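For scale, the accidental-collision side of this trade-off can be estimated with a standard birthday bound; note that it says nothing about the adversarial collisions described above, which non-cryptographic hashes like xxHash do not resist:

```python
def collision_prob(n_hashes: int, bits: int) -> float:
    # Birthday bound: P(any collision among n random b-bit hashes) ~ n^2 / 2^(b+1).
    return n_hashes ** 2 / 2 ** (bits + 1)


# For a billion cached blocks with a 128-bit digest, accidental collisions
# are negligible; the risk is an attacker deliberately engineering
# collisions, which only a cryptographic hash like SHA-256 rules out.
print(f"{collision_prob(10**9, 128):.1e}")  # ~1.5e-21
```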

@github-project-automation github-project-automation bot moved this from To Triage to In progress in gpt-oss Issues & Enhancements Nov 21, 2025
@NickLucche (Collaborator)

Good point. I suppose an experimental flag should've been the prerequisite for this kind of security trade-off change in any case.

ninja # Required for xgrammar, rocm, tpu, xpu
pybase64 # fast base64 implementation
cbor2 # Required for cross-language serialization of hashable objects
xxhash # Required for fast hashing for prefix caching
Member

I would prefer this not be included in our main requirements file, given that it's optional. In fact, it MUST remain optional to avoid getting flagged in environments with strict security controls.

import cbor2

try:
import xxhash as _xxhash
Member

It would be helpful to add a comment here that it is important for this to remain an optional dependency, as it would be considered problematic to include at all in environments with strict security controls.

Member

Suggested change
import xxhash as _xxhash
# It is important that this remains an optional dependency.
# It would not be allowed in environments with strict security controls,
# so it's best not to have it installed when not in use.
import xxhash as _xxhash

@russellb russellb (Member) left a comment

Instead of a separate PR, I just included some suggested text additions here.

ninja # Required for xgrammar, rocm, tpu, xpu
pybase64 # fast base64 implementation
cbor2 # Required for cross-language serialization of hashable objects
xxhash # Required for fast hashing for prefix caching
Member

Suggested change
xxhash # Required for fast hashing for prefix caching

import cbor2

try:
import xxhash as _xxhash
Member

Suggested change
import xxhash as _xxhash
# It is important that this remains an optional dependency.
# It would not be allowed in environments with strict security controls,
# so it's best not to have it installed when not in use.
import xxhash as _xxhash

- "xxhash" uses Pickle serialization with xxHash (128-bit) for faster,
non-cryptographic hashing. Requires the optional ``xxhash`` package.\n
- "xxhash_cbor" combines canonical CBOR serialization with xxHash for
reproducible hashing. Requires the optional ``xxhash`` package."""
Member

Suggested change
reproducible hashing. Requires the optional ``xxhash`` package."""
"""Set the hash algorithm for prefix caching:\n
- "sha256" uses Pickle for object serialization before hashing. This is the
current default, as SHA256 is the most secure choice to avoid potential
hash collisions.\n
- "sha256_cbor" provides a reproducible, cross-language compatible hash. It
serializes objects using canonical CBOR and hashes them with SHA-256.\n
- "xxhash" uses Pickle serialization with xxHash (128-bit) for faster,
non-cryptographic hashing. Requires the optional ``xxhash`` package.
IMPORTANT: Use of a hashing algorithm that is not considered
cryptographically secure theoretically increases the risk of hash collisions,
which can cause undefined behavior or even leak private information in
multi-tenant environments. Even if collisions are still very unlikely, it is
important to consider your security risk tolerance against the performance
benefits before turning this on.\n
- "xxhash_cbor" combines canonical CBOR serialization with xxHash for
reproducible hashing. Requires the optional ``xxhash`` package."""
