[Feature] Add per-request attention capture to the OpenAI-compatible API #35014
Parkprogrammer wants to merge 3 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--35014.org.readthedocs.build/en/35014/
This pull request has merge conflicts that must be resolved before it can be merged.
…roject#11365)
- Query buffering at request processing time
- Post-request Q×K^T score computation with causal mask
- Per-layer and layer range selection via CLI flags
- Multimodal token range support (vision/language)
- Shared memory IPC for efficient cross-process transfer
- Integration with GPU model runner and attention layers
Signed-off-by: Jehyun Park <jaheon555@g.skku.edu>
…ect#11365)
- Add attn_capture and attn_capture_layers parameters to chat completions API
- Return attention scores in response via attn_capture_data field
- Support per-request capture control and layer selection
- Add CLI flags: --enable-attention-instrumentation and --attention-instrumentation-layers
- Load attention snapshots from shared memory in output processor
- Include attention data in RequestOutput for client delivery
Signed-off-by: Jehyun Park <jaheon555@g.skku.edu>
…ct#11365)
- Python utilities for extracting and analyzing attention scores
- Multimodal token classification (vision/language/generated)
- Cross-modal attention measurement
- OpenAI SDK and cURL usage examples
- Comprehensive guide with quick start and API reference
Signed-off-by: Jehyun Park <jaheon555@g.skku.edu>
d9032b6 to 76eaeea
Hi @Parkprogrammer, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
We have previously rejected returning hidden states directly in the request output because it adds too much IPC overhead (see #15434 (comment)); the same could be said for this PR. I suggest following a similar approach to #33118.
Code Review
The pull request introduces a per-request attention capture mechanism for the OpenAI-compatible API. While this is a valuable feature for interpretability and debugging, the current implementation has several critical performance and security issues. The most significant concerns are the introduction of GPU-CPU synchronization points during the forward pass, blocking I/O operations in the main engine loop that can hang the server for up to 30 seconds, and the use of pickle for inter-process communication, which presents a security risk. Additionally, the full attention matrix computation lacks memory safeguards, potentially leading to OOM crashes on long sequences.
prompt_token_ids = [0] * len(self.prompt_embeds)

# Load attention capture data if capture was requested
attn_capture_data = None
if finished and getattr(self, "attn_capture_enabled", False):
    from vllm.model_executor.layers.attention.attn_capture import (
The call to load_attn_snapshot is a blocking operation that polls shared memory for up to 30 seconds with time.sleep. Since this is executed within process_outputs, which runs in the main engine loop (or the output_handler task in AsyncLLM), it will block the entire engine from processing any other requests. This is a critical issue that can cause severe latency spikes or hang the server. The data should be loaded in a non-blocking manner or offloaded to a background thread.
The blocking load_attn_snapshot() call will be removed from output_processor. In the rework, the engine returns only an out-of-band handle (via kv_transfer_params), and no polling happens on the engine loop.
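For concreteness, a minimal sketch of what such an out-of-band handle could look like; the AttnCaptureHandle type and the field names below are purely illustrative and not part of vLLM's API:

```python
# Illustrative sketch only: the engine publishes a small handle describing where
# the snapshot lives instead of the snapshot bytes themselves. Field names here
# (shm_name, num_bytes, layer_indices) are hypothetical.
from dataclasses import dataclass


@dataclass
class AttnCaptureHandle:
    request_id: str
    shm_name: str          # name of the shared-memory segment holding the snapshot
    num_bytes: int         # size of the serialized snapshot
    layer_indices: list[int]


def attach_handle(kv_transfer_params: dict | None, handle: AttnCaptureHandle) -> dict:
    """Attach only the lightweight handle to per-request metadata; the client
    (or a sidecar) fetches the tensor bytes out-of-band, never the engine loop."""
    params = dict(kv_transfer_params or {})
    params["attn_capture_handle"] = {
        "request_id": handle.request_id,
        "shm_name": handle.shm_name,
        "num_bytes": handle.num_bytes,
        "layers": handle.layer_indices,
    }
    return params
```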
# Detaching the query tensor for buffering. Severing CUDA computation trace
# Query tensor stays in the dict for all requests
try:
    query_cpu = query.detach().cpu().clone()
query.detach().cpu().clone() is a blocking operation that forces a GPU-CPU synchronization during the model's forward pass. This is called for every attention layer being captured, which will significantly degrade the throughput of the entire batch whenever a capture request is present. Consider keeping the tensors on the GPU and performing the attention computation there, or using asynchronous memory copies to avoid blocking the forward pass.
The query.detach().cpu().clone() path introduces an implicit GPU↔CPU sync. I will replace it with an async D2H copy into pinned memory on a dedicated stream to avoid blocking the forward hot path.
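As a rough sketch of that direction (assuming a pinned host buffer and a dedicated copy stream; this is not the PR's current code):

```python
# Stage a query tensor to pinned host memory without blocking the forward pass.
import torch

_copy_stream = torch.cuda.Stream()


def stage_query_async(query: torch.Tensor):
    # Pinned host memory enables a true asynchronous device-to-host copy.
    host_buf = torch.empty(query.shape, dtype=query.dtype, device="cpu", pin_memory=True)
    done = torch.cuda.Event()
    compute_stream = torch.cuda.current_stream()
    with torch.cuda.stream(_copy_stream):
        # Order the copy after the kernels that produced `query`.
        _copy_stream.wait_stream(compute_stream)
        # Keep the allocator from reusing `query`'s memory while the copy is in flight.
        query.record_stream(_copy_stream)
        host_buf.copy_(query.detach(), non_blocking=True)
        done.record(_copy_stream)
    return host_buf, done


# Later, in the post-request capture path (off the hot loop):
#   done.synchronize()   # pay for the D2H transfer here, not inside forward()
#   q_cpu = host_buf
```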
time.sleep(0.01)
continue
try:
    data = pickle.loads(bytes(mem.buf[8 : 8 + size]))
I will remove pickle from the IPC path and replace it with a safe binary format (fixed struct header + raw tensor bytes). No object deserialization.
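A possible shape for that encoding (a sketch assuming one float16 score tensor per layer; the header layout and dtype codes are illustrative):

```python
# Pickle-free snapshot encoding: fixed struct header followed by raw tensor bytes.
import struct

import numpy as np

_HEADER_FMT = "<iii3i"            # layer_idx, dtype_code, ndim, dims[3]
_DTYPES = {0: np.float16, 1: np.float32}


def encode_snapshot(layer_idx: int, scores: np.ndarray) -> bytes:
    dims = list(scores.shape) + [0] * (3 - scores.ndim)
    header = struct.pack(_HEADER_FMT, layer_idx, 0, scores.ndim, *dims)
    return header + scores.astype(np.float16).tobytes()


def decode_snapshot(buf: bytes) -> tuple[int, np.ndarray]:
    hdr_size = struct.calcsize(_HEADER_FMT)
    layer_idx, dtype_code, ndim, *dims = struct.unpack(_HEADER_FMT, buf[:hdr_size])
    shape = tuple(dims[:ndim])
    arr = np.frombuffer(buf[hdr_size:], dtype=_DTYPES[dtype_code]).reshape(shape)
    return layer_idx, arr
```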
| """Capture attention and clean up buffers for a finished request.""" | ||
| req_state = self.requests.get(req_id) | ||
| if not req_state: | ||
| return |
self.attn_capture.capture(...) is called synchronously within the worker's execution loop for finished requests. This function performs heavy GPU computations (full NxN attention) and CPU-intensive tasks (gzip compression). For long sequences, this will introduce a significant delay before the next batch can start executing, impacting overall engine throughput. This processing should be offloaded to a background thread.
capture() will be offloaded so the worker execution loop is non-blocking. The worker will enqueue a capture job and proceed; the capture task performs post-forward compute + compression asynchronously and writes the result out-of-band.
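A minimal sketch of such an offload (class and method names are hypothetical):

```python
# The execution loop only enqueues a lightweight job; a daemon thread performs
# the heavy Q*K^T compute and compression and writes the result out-of-band.
import queue
import threading


class AttnCaptureWorker:
    def __init__(self, capture_fn):
        self._jobs: queue.Queue = queue.Queue()
        self._capture_fn = capture_fn  # does the NxN compute + compression
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, req_id: str, req_state) -> None:
        # Called from the worker execution loop: O(1), never blocks on capture work.
        self._jobs.put((req_id, req_state))

    def _run(self) -> None:
        while True:
            req_id, req_state = self._jobs.get()
            try:
                self._capture_fn(req_id, req_state)
            except Exception:
                # A failed capture must never take down the worker process.
                pass
```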
return k.to(dtype)  # NOTE(jehyun): Uniform dtype for downstream float16 computation


def compute_qk_attention(
compute_qk_attention computes a full [T, H, T] attention matrix. For long sequences, this can consume a massive amount of GPU memory (e.g., ~8GB for T=8192, H=32 in float32). The current implementation lacks safeguards or limits on the sequence length or the number of layers, making the server vulnerable to OOM crashes via the public API. A hard limit on the total number of elements captured per request should be enforced.
I will add hard per-request resource caps (e.g., --max-attn-capture-tokens, --max-attn-capture-bytes, and/or a maximum number of captured layers). Requests exceeding these limits will be rejected with an explicit status instead of attempting allocation.
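A sketch of what such a pre-allocation check could look like; the flag names mirror the proposal above but do not exist in vLLM today, and the default limits are placeholders:

```python
# Reject oversized capture requests before allocating the full [T, H, T] matrix.
def validate_capture_request(
    seq_len: int,
    num_heads: int,
    num_layers: int,
    elem_size: int = 2,                # float16
    max_capture_tokens: int = 4096,    # e.g. --max-attn-capture-tokens
    max_capture_bytes: int = 1 << 30,  # e.g. --max-attn-capture-bytes (1 GiB)
) -> None:
    if seq_len > max_capture_tokens:
        raise ValueError(
            f"attn_capture rejected: sequence length {seq_len} exceeds "
            f"max_capture_tokens={max_capture_tokens}"
        )
    # Full score matrix is [T, H, T] per captured layer.
    total_bytes = seq_len * num_heads * seq_len * elem_size * num_layers
    if total_bytes > max_capture_bytes:
        raise ValueError(
            f"attn_capture rejected: estimated {total_bytes / 1e9:.2f} GB exceeds "
            f"max_capture_bytes={max_capture_bytes}"
        )
```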
Thanks for the pointer to #33118 and the earlier discussion on tensor outputs. I’m going to rework this PR to avoid returning large tensors through the request output / engine IPC:
Does this direction align with what you had in mind with #33118? If there are additional constraints you want me to follow, I'm happy to incorporate them before pushing the rework.
I think this is a relatively niche feature, so I'd prefer to minimize intrusion into our existing code. I am not sure whether we really want to return the tensors via the API either; perhaps it's better for the client to connect to the KV cache directly using the returned handle?
Closes #11365
Motivation
Interpretability researchers and multimodal debugging workflows need access to raw attention scores at inference time without patching model code or writing custom inference loops.
This PR exposes per-request attention instrumentation through the existing OpenAI-compatible API. When not requested, there is zero additional compute cost — no forward-pass modification, no graph break, no torch.compile impact.
What This PR Does
Adds an opt-in, per-request attention capture mechanism to vLLM's OpenAI API server. When the server is started with instrumentation enabled and a client explicitly requests capture, Q×Kᵀ softmax attention scores are computed
post-forward, serialised via shared memory, and attached to the chat completion response.
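For illustration, a minimal sketch of the post-forward score computation (not the PR's actual compute_qk_attention); it assumes buffered queries of shape [T, H, D] and keys already gathered from the KV cache and expanded to the same head count:

```python
import math

import torch


def qk_softmax_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: [T, H, D]. Returns causal softmax attention scores of shape [T, H, T]."""
    T, H, D = q.shape
    # [H, T, T] raw scores, computed in float32 for stability.
    scores = torch.einsum("qhd,khd->hqk", q.float(), k.float()) / math.sqrt(D)
    causal = torch.ones(T, T, dtype=torch.bool, device=q.device).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    probs = torch.softmax(scores, dim=-1)              # softmax over key positions
    return probs.permute(1, 0, 2).to(torch.float16)    # back to [T, H, T]
```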
New Server Flags
- --enable-attention-instrumentation: enable the capture machinery server-wide (capture remains opt-in per request)
- --attention-instrumentation-layers: select which layers may be captured
New Request Fields (Chat Completions)
- attn_capture: int (0/1)
- attn_capture_layers: str, default "all"

New Response Field
attn_capture_data: a list of per-layer objects:

[
  {
    "layer_idx": 8,
    "data": "<base64-gzipped tensor>",
    "shape": [T, H, T],
    "dtype": "float16",
    "token_meta": {
      "token_idx_basis": 0,
      "prompt_len": 42,
      "total_len": 50,
      "vision_ranges": [[0, 35]]
    }
  }
]

Shape is [T, H, T] where T = prompt_len + generated_len - 1 (all tokens visible to the attention kernel; the final sampled token has no subsequent Q step). H = query heads. token_meta.vision_ranges maps image token spans for cross-modal analysis.
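Given that format, a client can decode one entry roughly as follows (a sketch based only on the documented fields):

```python
# Decode one attn_capture_data entry: base64 -> gzip -> raw float16 -> [T, H, T].
import base64
import gzip

import numpy as np


def decode_entry(entry: dict) -> np.ndarray:
    raw = gzip.decompress(base64.b64decode(entry["data"]))
    scores = np.frombuffer(raw, dtype=entry["dtype"]).reshape(entry["shape"])
    # scores[t, h, k] = attention from query token t, head h, to key token k.
    return scores
```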
Design Notes

Why post-forward computation?
No torch.compile side effect is introduced, because scores are computed only after the forward pass has completed.

Why Q buffering?
K is read from the KV cache and Q×Kᵀ is computed once. This avoids any impact on the hot path.
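A minimal sketch of such a query buffer with FIFO eviction (the 32 768-slot cap comes from the prefix-caching note below; class and key names are illustrative):

```python
from collections import OrderedDict

import torch


class QueryCache:
    """FIFO-capped buffer of per-token query vectors kept across requests."""

    def __init__(self, max_slots: int = 32_768):
        self._slots: OrderedDict = OrderedDict()  # slot key -> Q vector [H, D]
        self._max_slots = max_slots

    def put(self, slot_key, q_vec: torch.Tensor) -> None:
        if slot_key in self._slots:
            return  # already buffered (e.g. a prefix-cached token)
        self._slots[slot_key] = q_vec
        if len(self._slots) > self._max_slots:
            self._slots.popitem(last=False)  # evict the oldest slot (FIFO)

    def get(self, slot_key):
        return self._slots.get(slot_key)
```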
Prefix caching interaction
q_cache (FIFO-capped at 32 768 slots, ~128 MB at float16/128d/16h) stores Q vectors across requests so prefix-cached tokens are still included in the captured score matrix.

Why shared memory?
Each snapshot is keyed by request_id and cleaned up immediately after the output processor reads it.

Concurrency and isolation
Snapshots are keyed by request_id, so concurrent requests are fully isolated. Segments are unlinked after reading; a 30-second read timeout is used to handle slow capture paths gracefully. If the engine crashes before writing, the segment simply times out and attn_capture_data is absent from the response (no server crash).

Streaming compatibility
Capture is only returned for non-streaming requests (stream=False). Streaming responses do not include attn_capture_data because the snapshot is only available after the full sequence is generated.

Memory Impact
Per captured request (non-streaming only):
Tensors are gzip-compressed before serialisation, typically achieving 3–5×
reduction. The persistent Q cache is FIFO-capped at 32 768 slots (~128 MB
worst-case) to bound memory growth across requests.
Capture is disabled by default. No automatic hard cap on response size is enforced; for very long sequences, callers are expected to limit scope via attn_capture_layers.

Implementation
- attn_capture.py (new, ~600 LOC): core capture logic and IPC utilities
- gpu_model_runner.py: post-request capture() hook
- attention.py: query buffering
- output_processor.py, RequestOutput: attach capture data to outputs
- protocol.py: new request/response fields
- arg_utils.py, cache.py: new CLI flags and config

The attn_capture.py core logic is ~350 lines; the remainder is IPC utilities and validation helpers. Can be split if preferred.
Example Usage
See examples/offline_inference/attention_instrumentation/ for full examples including multimodal token classification and cross-modal attention analysis.
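For reference, a minimal client-side sketch with the OpenAI SDK (the field names follow the request/response fields above; passing them via extra_body and reading the extra response field via model_extra are assumptions about the final plumbing):

```python
# Request attention capture for a single chat completion against a vLLM server
# started with --enable-attention-instrumentation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    messages=[{"role": "user", "content": "Describe the image in one sentence."}],
    extra_body={
        "attn_capture": 1,               # opt in for this request only
        "attn_capture_layers": "2,8,15"  # or "all"
    },
)

# attn_capture_data arrives alongside the normal chat completion payload.
capture = resp.model_extra.get("attn_capture_data") if resp.model_extra else None
print(resp.choices[0].message.content)
print("captured layers:", [e["layer_idx"] for e in (capture or [])])
```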
Test Results
Tested on Qwen2.5-VL-3B-Instruct and Gemma-3-4B-IT across 9 input
groups × 6 capture configurations = 54 cases per model, 108 total.
All 108/108 cases passed.
Test script: Parkprogrammer/vllm_capture
Summary:
- attn_capture=0 (instrumentation-enabled server, capture disabled per-request)

Full per-group breakdown
Input groups:
text·{short,medium,long}, text+image·{short,medium,long}, image-only·{short,medium,long}

Capture configs: off (attn_capture=0), early (layer 2), mid (layer 8), late (layer 15), multi (2,8,15), all

Qwen2.5-VL-3B-Instruct
Gemma-3-4B-IT
Column legend:
- all
- attn_capture=0 on instrumentation-enabled server
- --enable-attention-instrumentation flag (2 warm-up requests; indicative, single GPU)