
Conversation

@MatthewBonanni MatthewBonanni commented Oct 6, 2025

Purpose

As a first step in the effort to reduce environment variables, this PR introduces AttentionConfig and the attention_config CLI argument group. For now, only VLLM_ATTENTION_BACKEND is migrated to a CLI argument (--attention-backend), which falls back to the environment variable and emits a deprecation warning when the variable is used.
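
For illustration, a rough sketch of the intended usage, assuming the new flag maps to an attention_backend field on EngineArgs and that "FLASH_ATTN" remains a valid backend name (both are assumptions inferred from the diff excerpts further down, not verified here):

```python
# Hypothetical usage sketch; the attention_backend field name and the
# backend string are assumptions inferred from this PR's diff excerpts.
from vllm.engine.arg_utils import EngineArgs

# New style: select the attention backend via the engine argument that
# backs the --attention-backend CLI flag.
engine_args = EngineArgs(model="facebook/opt-125m",
                         attention_backend="FLASH_ATTN")

# Legacy style (still honored, but now expected to emit a deprecation warning):
#   VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve facebook/opt-125m
```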

Test Plan

CI should suffice

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

This commit consolidates attention-related configuration into a new
AttentionConfig class and exposes the attention backend as a CLI argument.

Changes:
- Created new AttentionConfig class in vllm/config/attention.py
- Added AttentionConfig to VllmConfig
- Added --attention-backend CLI argument in dedicated argument group
- Updated imports and exports

This sets the foundation for making more attention-related settings configurable via CLI arguments in future work.

Signed-off-by: Matthew Bonanni <[email protected]>
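
For orientation, a minimal sketch of what such a config class could look like; the field names mirror those visible in the review excerpts below, while the types and defaults are placeholders rather than the values used by the PR:

```python
# Sketch only: field names follow the review diff below; defaults are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttentionConfig:
    """Attention-related settings, grouped in one place."""
    backend: Optional[str] = None          # e.g. "FLASH_ATTN", "FLASHINFER"
    use_triton_flash_attn: bool = True
    flash_attn_version: Optional[int] = None
    use_cudnn_prefill: bool = False
    use_trtllm_attention: Optional[bool] = None
```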
Added detailed documentation to clarify that all attention-related
environment variables are still respected for backward compatibility.

This ensures users can continue using environment variables while
also having the option to use the new --attention-backend CLI argument.

Signed-off-by: Matthew Bonanni <[email protected]>
Added comprehensive deprecation warnings to guide users toward using
CLI arguments instead of environment variables for attention configuration.

Changes:
- Added deprecation warning in create_attention_config() that lists all
  attention-related environment variables that are set
- Warning directs users to use --attention-backend and other future CLI args
- Consolidated warning logic to avoid duplicate warnings
- Maintains full backward compatibility while encouraging migration

The warning will only show if one or more attention environment variables
are explicitly set, making it non-intrusive for users who don't use them.

Signed-off-by: Matthew Bonanni <[email protected]>
Updated the deprecation warning to only apply to VLLM_ATTENTION_BACKEND
since that's the only attention env var with a CLI alternative (--attention-backend).

The other attention-related environment variables remain fully supported
without deprecation warnings, as they don't have CLI argument alternatives yet.

This makes the warning more accurate and less alarming for users who rely
on other attention env vars that aren't being deprecated.

Also removed unused envs import from attention.py after removing __post_init__.

Signed-off-by: Matthew Bonanni <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a dedicated AttentionConfig to centralize attention-related settings and moves the attention backend selection from an environment variable to a CLI argument, which is a good step towards reducing reliance on environment variables. My review focuses on improving the robustness of the new configuration hashing, clarifying deprecation warnings for better user experience, and cleaning up some redundant logic. Overall, the changes are well-structured, and with a few adjustments, the implementation can be made more maintainable and user-friendly.

Comment on lines +64 to +75
        factors: list[Any] = [
            self.backend,
            self.use_triton_flash_attn,
            self.flash_attn_version,
            self.v1_use_prefill_decode_attention,
            self.use_aiter_unified_attention,
            self.flash_attn_max_num_splits_for_cuda_graph,
            self.use_cudnn_prefill,
            self.use_trtllm_attention,
            self.disable_flashinfer_prefill,
            self.flashinfer_disable_q_quantization,
        ]

high

The manual listing of fields in compute_hash is fragile. If a new field is added to AttentionConfig in the future, a developer might forget to update this list, leading to incorrect cache hashes. This can cause subtle and hard-to-debug issues.

To make this more robust, you could programmatically collect the fields. This ensures that any new field is automatically included. Since all current fields seem to affect the computation graph, iterating over all fields is appropriate. This can be done without extra imports by using the __dataclass_fields__ attribute.

        factors = [getattr(self, f) for f in self.__dataclass_fields__]
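
Expanded into a self-contained sketch (the hashlib-based digest is an assumption about how the cache key is built, not the exact vLLM implementation):

```python
# Sketch of the suggested approach: collect every dataclass field
# automatically so a newly added field cannot be silently left out.
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttentionConfig:
    backend: Optional[str] = None
    use_triton_flash_attn: bool = True
    flash_attn_version: Optional[int] = None

    def compute_hash(self) -> str:
        # __dataclass_fields__ lists all declared fields in definition order.
        factors = [getattr(self, f) for f in self.__dataclass_fields__]
        return hashlib.sha256(str(factors).encode()).hexdigest()
```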

Comment on lines +159 to +162
        if self.attention_config:
            vllm_factors.append(self.attention_config.compute_hash())
        else:
            vllm_factors.append("None")

high

The attention_config field in VllmConfig is initialized with a default_factory and is not Optional. Therefore, self.attention_config will always be an instance of AttentionConfig and the if self.attention_config: check will always evaluate to true. This makes the else branch unreachable (dead code) and the conditional check redundant. You can simplify this by directly appending the hash.

        vllm_factors.append(self.attention_config.compute_hash())

"""

# Warn if VLLM_ATTENTION_BACKEND env var is used instead of CLI arg
if envs.is_set("VLLM_ATTENTION_BACKEND") and self.attention_backend is None:

high

The deprecation warning for VLLM_ATTENTION_BACKEND is only issued when the --attention-backend CLI argument is not provided. If a user sets both, the CLI argument silently overrides the environment variable. This can be confusing, as the user might not realize their environment variable is being ignored. It's better to always warn when the deprecated environment variable is set to avoid this confusion.

Suggested change
if envs.is_set("VLLM_ATTENTION_BACKEND") and self.attention_backend is None:
if envs.is_set("VLLM_ATTENTION_BACKEND"):
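
Going one small step beyond the one-line suggestion, a hedged sketch that also tells the user which value wins when both sources are set (the helper name is hypothetical; envs.is_set and the logger pattern follow the diff shown further down):

```python
# Hypothetical helper illustrating the reviewer's point; not vLLM code.
import vllm.envs as envs
from vllm.logger import init_logger

logger = init_logger(__name__)

def warn_deprecated_backend_env(cli_backend):
    # Warn whenever the deprecated env var is set, even if the CLI
    # argument overrides it, and say explicitly which value wins.
    if envs.is_set("VLLM_ATTENTION_BACKEND"):
        logger.warning(
            "VLLM_ATTENTION_BACKEND is deprecated and will be removed in a "
            "future release; use --attention-backend instead.")
        if cli_backend is not None:
            logger.warning(
                "Both VLLM_ATTENTION_BACKEND and --attention-backend are "
                "set; the CLI value %s takes precedence.", cli_backend)
```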


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +1238 to +1252
        # Warn if VLLM_ATTENTION_BACKEND env var is used instead of CLI arg
        if envs.is_set("VLLM_ATTENTION_BACKEND") and self.attention_backend is None:
            logger.warning(
                "Using VLLM_ATTENTION_BACKEND environment variable is deprecated "
                "and will be removed in a future release. "
                "Please use --attention-backend CLI argument instead."
            )

        # Handle backend: prefer CLI arg, fall back to env var
        backend = self.attention_backend
        if backend is None:
            backend = envs.VLLM_ATTENTION_BACKEND

        return AttentionConfig(
            backend=backend,


P1: Propagate --attention-backend CLI value to runtime

The new create_attention_config reads the --attention-backend CLI argument but only returns an AttentionConfig object; it never updates envs.VLLM_ATTENTION_BACKEND. The rest of the codebase (e.g. attention selector and platform-specific backends) still relies on the global env variable to decide which backend to load. As a result, specifying --attention-backend has no effect: all backend selection logic still sees None (or the old env value) and behaves as if the argument was never given. This makes the advertised CLI option a no-op and bypasses existing compatibility checks tied to the env variable. The CLI value should be written back into envs.VLLM_ATTENTION_BACKEND or the downstream logic should read from AttentionConfig instead.
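
One way to address this, and roughly the direction the later commits in this PR take, is for downstream selection code to read the backend from the config instead of the env var; a sketch, with the fallback ordering as an assumption:

```python
# Sketch: prefer the configured backend, fall back to the legacy env var.
# get_current_vllm_config and attention_config follow the names used later
# in this PR; the helper itself is hypothetical.
import vllm.envs as envs
from vllm.config import get_current_vllm_config

def resolve_attention_backend():
    vllm_config = get_current_vllm_config()
    backend = vllm_config.attention_config.backend
    if backend is None:
        backend = envs.VLLM_ATTENTION_BACKEND
    return backend
```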


Updated code throughout the codebase to use AttentionConfig instead of
directly reading from environment variables. This consolidates attention
configuration and makes it easier to manage and test.

Changes:
- Updated V1 attention backends to read from vllm_config.attention_config
- Used get_current_vllm_config() in impl classes that don't have VllmConfig
- Replaced envs.VLLM_* references with config.attention_config.* where used

Files modified:
- vllm/v1/attention/backends/flash_attn.py
- vllm/v1/attention/backends/mla/flashattn_mla.py
- vllm/v1/attention/backends/mla/triton_mla.py
- vllm/v1/attention/backends/mla/common.py
- vllm/v1/attention/backends/rocm_attn.py
- vllm/v1/attention/backends/utils.py
- vllm/attention/utils/fa_utils.py
- vllm/utils/flashinfer.py

Note: Some cached functions still read the env vars directly to avoid
breaking their cache keys; these can be refactored in a future PR
(see the cache-key sketch below).

Signed-off-by: Matthew Bonanni <[email protected]>
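
To illustrate the cache-key caveat in the note above: a function cached on its arguments only distinguishes calls by those arguments, so moving a setting from an env-var read inside the body to a config read requires passing the value in explicitly (or dropping the cache). A generic sketch, not the actual vLLM helpers:

```python
# Hypothetical example of the cache-key concern; not real vLLM code.
from functools import lru_cache

@lru_cache(maxsize=None)
def backend_supports_feature(backend):
    # Because `backend` is an argument, it is part of the cache key.
    # Reading a global (env var or config object) inside the body instead
    # would freeze the first value seen for the lifetime of the cache.
    return backend == "FLASH_ATTN"  # placeholder check
```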
Updated remaining files to use AttentionConfig.backend instead of reading
directly from envs.VLLM_ATTENTION_BACKEND. This completes the migration
to using AttentionConfig as the single source of truth.

Changes:
- vllm/platforms/cuda.py: Use vllm_config.attention_config.backend
- vllm/platforms/xpu.py: Use get_current_vllm_config()
- vllm/attention/selector.py: Read backend from AttentionConfig
- vllm/config/model.py: Added comments explaining why envs is still used
  (ModelConfig is created before AttentionConfig, so can't use it yet)
- vllm/engine/arg_utils.py: Define backend locally from CLI/env in _is_v1_supported_oracle

Note: ModelConfig.__post_init__ still reads from envs because it's created
before VllmConfig/AttentionConfig exists. This is only for early validation.

Signed-off-by: Matthew Bonanni <[email protected]>
Restructured the config creation order so that AttentionConfig is created
first, then passed to ModelConfig. This allows ModelConfig to use
attention_config.backend instead of reading directly from envs.

Changes:
- create_engine_config(): Create AttentionConfig before ModelConfig
- create_model_config(): Accept optional attention_config parameter
- ModelConfig: Added attention_config field
- ModelConfig.__post_init__: Use self.attention_config.backend instead of
  envs.VLLM_ATTENTION_BACKEND

Benefits:
- Eliminated 2 more env var usages from ModelConfig
- AttentionConfig is now truly the single source of truth for attention backend
- Cleaner dependency flow: AttentionConfig -> ModelConfig -> VllmConfig

This completes the migration away from reading VLLM_ATTENTION_BACKEND
directly from environment variables in core config classes.

Signed-off-by: Matthew Bonanni <[email protected]>
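
A schematic of the creation order described above; the function names follow the commit message, while the argument lists are simplified assumptions:

```python
# Simplified sketch of the described ordering; not the real signatures.
from vllm.config import VllmConfig

def create_engine_config(engine_args):
    attention_config = engine_args.create_attention_config()  # built first
    model_config = engine_args.create_model_config(
        attention_config=attention_config)  # can now read .backend directly
    return VllmConfig(model_config=model_config,
                      attention_config=attention_config)
```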
Eliminated the separate forced_attn_backend global variable mechanism
in favor of directly modifying vllm_config.attention_config.backend.
This simplifies the code and makes AttentionConfig the true single
source of truth.

Changes:
- Removed forced_attn_backend global variable
- Removed global_force_attn_backend() and get_global_forced_attn_backend()
- Updated global_force_attn_backend_context_manager() to modify
  vllm_config.attention_config.backend directly
- Updated attention selector to only check AttentionConfig.backend
- Updated cuda.py TODO comment to reflect the new approach

Benefits:
- Simpler architecture: one source of truth instead of two
- No more global state to manage
- Runtime overrides now just modify AttentionConfig.backend
- Context manager still works for tests that need temporary overrides

The context manager is preserved for backward compatibility with tests.

Signed-off-by: Matthew Bonanni <[email protected]>
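
A minimal, standalone sketch of how such an override context manager can work by mutating AttentionConfig.backend and restoring it afterwards (the real helper is global_force_attn_backend_context_manager; this version is illustrative):

```python
# Illustrative stand-in for the override context manager described above.
from contextlib import contextmanager

@contextmanager
def force_attn_backend(vllm_config, backend):
    # Temporarily point AttentionConfig.backend at the forced value and
    # restore the original on exit, including when an exception is raised.
    original = vllm_config.attention_config.backend
    vllm_config.attention_config.backend = backend
    try:
        yield
    finally:
        vllm_config.attention_config.backend = original
```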
- Updated flashinfer.py to use attention_config.flashinfer_disable_q_quantization
- Updated platforms/rocm.py to use attention_config for v1_use_prefill_decode_attention and use_aiter_unified_attention
- Updated rocm_attn.py to use attention_config.use_aiter_unified_attention
- Only arg_utils.py now reads env vars (at initialization point)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Signed-off-by: Matthew Bonanni <[email protected]>
@mergify mergify bot added the rocm Related to AMD ROCm label Oct 6, 2025