[Attention][UX] Add AttentionConfig and change attention backend to CLI argument #26315
Conversation
This commit consolidates attention-related configuration into a new AttentionConfig class and exposes the attention backend as a CLI argument.
Changes:
- Created new AttentionConfig class in vllm/config/attention.py
- Added AttentionConfig to VllmConfig
- Added --attention-backend CLI argument in a dedicated argument group
- Updated imports and exports
This sets the foundation for making more attention-related settings configurable via CLI arguments in future work.
Signed-off-by: Matthew Bonanni <[email protected]>
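To make the shape of the change concrete, here is a minimal sketch of an attention config plus a dedicated CLI argument group. This is illustrative only: the dataclass is trimmed to the single backend field, and add_attention_cli_args is a hypothetical helper name, not the function the PR actually adds.

import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttentionConfig:
    """Configuration for attention backend selection (trimmed for illustration)."""
    # None means "auto-select an appropriate backend for the current platform".
    backend: Optional[str] = None


def add_attention_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Hypothetical helper: register the flag in its own argument group, as the PR does.
    group = parser.add_argument_group("AttentionConfig")
    group.add_argument(
        "--attention-backend",
        type=str,
        default=None,
        help="Attention backend to use; defaults to automatic selection.",
    )
    return parser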
Added detailed documentation to clarify that all attention-related environment variables are still respected for backward compatibility. This ensures users can continue using environment variables while also having the option to use the new --attention-backend CLI argument. Signed-off-by: Matthew Bonanni <[email protected]>
Added comprehensive deprecation warnings to guide users toward using CLI arguments instead of environment variables for attention configuration.
Changes:
- Added deprecation warning in create_attention_config() that lists all attention-related environment variables that are set
- Warning directs users to use --attention-backend and other future CLI args
- Consolidated warning logic to avoid duplicate warnings
- Maintains full backward compatibility while encouraging migration
The warning will only show if one or more attention environment variables are explicitly set, making it non-intrusive for users who don't use them.
Signed-off-by: Matthew Bonanni <[email protected]>
Updated the deprecation warning to only apply to VLLM_ATTENTION_BACKEND since that's the only attention env var with a CLI alternative (--attention-backend). The other attention-related environment variables remain fully supported without deprecation warnings, as they don't have CLI argument alternatives yet. This makes the warning more accurate and less alarming for users who rely on other attention env vars that aren't being deprecated. Also removed unused envs import from attention.py after removing __post_init__. Signed-off-by: Matthew Bonanni <[email protected]>
Code Review
This pull request introduces a dedicated AttentionConfig to centralize attention-related settings and moves the attention backend selection from an environment variable to a CLI argument, which is a good step towards reducing reliance on environment variables. My review focuses on improving the robustness of the new configuration hashing, clarifying deprecation warnings for better user experience, and cleaning up some redundant logic. Overall, the changes are well-structured, and with a few adjustments, the implementation can be made more maintainable and user-friendly.
        factors: list[Any] = [
            self.backend,
            self.use_triton_flash_attn,
            self.flash_attn_version,
            self.v1_use_prefill_decode_attention,
            self.use_aiter_unified_attention,
            self.flash_attn_max_num_splits_for_cuda_graph,
            self.use_cudnn_prefill,
            self.use_trtllm_attention,
            self.disable_flashinfer_prefill,
            self.flashinfer_disable_q_quantization,
        ]
The manual listing of fields in compute_hash is fragile. If a new field is added to AttentionConfig in the future, a developer might forget to update this list, leading to incorrect cache hashes. This can cause subtle and hard-to-debug issues.
To make this more robust, you could programmatically collect the fields, which ensures that any new field is automatically included. Since all current fields seem to affect the computation graph, iterating over all fields is appropriate. This can be done without extra imports by using the __dataclass_fields__ attribute.

        factors = [getattr(self, f) for f in self.__dataclass_fields__]
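Put together, the suggested compute_hash could look roughly like the sketch below. It assumes AttentionConfig remains a dataclass and keeps an md5-over-stringified-factors pattern; the field list is trimmed to two entries for brevity, and the exact hashing helper in vLLM may differ.

import hashlib
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class AttentionConfig:
    # Trimmed to two fields purely for illustration.
    backend: Optional[str] = None
    flash_attn_version: Optional[int] = None

    def compute_hash(self) -> str:
        # __dataclass_fields__ enumerates every declared field, so a newly
        # added field is hashed automatically instead of being forgotten.
        factors: list[Any] = [getattr(self, f) for f in self.__dataclass_fields__]
        return hashlib.md5(str(factors).encode()).hexdigest()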
        if self.attention_config:
            vllm_factors.append(self.attention_config.compute_hash())
        else:
            vllm_factors.append("None")
The attention_config field in VllmConfig is initialized with a default_factory and is not Optional. Therefore, self.attention_config will always be an instance of AttentionConfig, and the if self.attention_config: check will always evaluate to true. This makes the else branch unreachable (dead code) and the conditional check redundant. You can simplify this by directly appending the hash.

        vllm_factors.append(self.attention_config.compute_hash())
""" | ||
|
||
# Warn if VLLM_ATTENTION_BACKEND env var is used instead of CLI arg | ||
if envs.is_set("VLLM_ATTENTION_BACKEND") and self.attention_backend is None: |
The deprecation warning for VLLM_ATTENTION_BACKEND is only issued when the --attention-backend CLI argument is not provided. If a user sets both, the CLI argument silently overrides the environment variable. This can be confusing, as the user might not realize their environment variable is being ignored. It's better to always warn when the deprecated environment variable is set to avoid this confusion.

Suggested change:
-        if envs.is_set("VLLM_ATTENTION_BACKEND") and self.attention_backend is None:
+        if envs.is_set("VLLM_ATTENTION_BACKEND"):
💡 Codex Review
Here are some automated review suggestions for this pull request.
        # Warn if VLLM_ATTENTION_BACKEND env var is used instead of CLI arg
        if envs.is_set("VLLM_ATTENTION_BACKEND") and self.attention_backend is None:
            logger.warning(
                "Using VLLM_ATTENTION_BACKEND environment variable is deprecated "
                "and will be removed in a future release. "
                "Please use --attention-backend CLI argument instead."
            )

        # Handle backend: prefer CLI arg, fall back to env var
        backend = self.attention_backend
        if backend is None:
            backend = envs.VLLM_ATTENTION_BACKEND

        return AttentionConfig(
            backend=backend,
Propagate --attention-backend CLI value to runtime
The new create_attention_config reads the --attention-backend CLI argument but only returns an AttentionConfig object; it never updates envs.VLLM_ATTENTION_BACKEND. The rest of the codebase (e.g. the attention selector and platform-specific backends) still relies on the global env variable to decide which backend to load. As a result, specifying --attention-backend has no effect: all backend selection logic still sees None (or the old env value) and behaves as if the argument was never given. This makes the advertised CLI option a no-op and bypasses existing compatibility checks tied to the env variable. The CLI value should be written back into envs.VLLM_ATTENTION_BACKEND, or the downstream logic should read from AttentionConfig instead.
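One possible fix, sketched under the assumption that the env-var path is kept for legacy readers, is to mirror the CLI value back into the process environment when it is given. The helper name below is hypothetical; the PR's later commits take the other route and make downstream code read AttentionConfig directly.

import os
from typing import Optional


def propagate_attention_backend(cli_backend: Optional[str]) -> None:
    # Hypothetical helper: if --attention-backend was passed, write it back so
    # code paths that still consult VLLM_ATTENTION_BACKEND see the same value.
    if cli_backend is not None:
        os.environ["VLLM_ATTENTION_BACKEND"] = cli_backend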
Updated code throughout the codebase to use AttentionConfig instead of directly reading from environment variables. This consolidates attention configuration and makes it easier to manage and test.
Changes:
- Updated V1 attention backends to read from vllm_config.attention_config
- Used get_current_vllm_config() in impl classes that don't have VllmConfig
- Replaced envs.VLLM_* references with config.attention_config.* where used
Files modified:
- vllm/v1/attention/backends/flash_attn.py
- vllm/v1/attention/backends/mla/flashattn_mla.py
- vllm/v1/attention/backends/mla/triton_mla.py
- vllm/v1/attention/backends/mla/common.py
- vllm/v1/attention/backends/rocm_attn.py
- vllm/v1/attention/backends/utils.py
- vllm/attention/utils/fa_utils.py
- vllm/utils/flashinfer.py
Note: Some cached functions still use env vars directly to avoid breaking the cache key. These can be refactored in a future PR.
Signed-off-by: Matthew Bonanni <[email protected]>
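For impl classes that are not handed a VllmConfig, the get_current_vllm_config() pattern mentioned above looks roughly like the sketch below. The surrounding function name is illustrative, and the attention_config attribute assumes this PR's AttentionConfig is in place; only the lookup pattern is the point.

from vllm.config import get_current_vllm_config


def _use_cudnn_prefill() -> bool:
    # Illustrative helper: instead of reading an environment variable directly,
    # consult the attention config attached to the currently active VllmConfig.
    attention_config = get_current_vllm_config().attention_config
    return attention_config.use_cudnn_prefill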
Updated remaining files to use AttentionConfig.backend instead of reading directly from envs.VLLM_ATTENTION_BACKEND. This completes the migration to using AttentionConfig as the single source of truth.
Changes:
- vllm/platforms/cuda.py: Use vllm_config.attention_config.backend
- vllm/platforms/xpu.py: Use get_current_vllm_config()
- vllm/attention/selector.py: Read backend from AttentionConfig
- vllm/config/model.py: Added comments explaining why envs is still used (ModelConfig is created before AttentionConfig, so can't use it yet)
- vllm/engine/arg_utils.py: Define backend locally from CLI/env in _is_v1_supported_oracle
Note: ModelConfig.__post_init__ still reads from envs because it's created before VllmConfig/AttentionConfig exists. This is only for early validation.
Signed-off-by: Matthew Bonanni <[email protected]>
Restructured the config creation order so that AttentionConfig is created first, then passed to ModelConfig. This allows ModelConfig to use attention_config.backend instead of reading directly from envs.
Changes:
- create_engine_config(): Create AttentionConfig before ModelConfig
- create_model_config(): Accept optional attention_config parameter
- ModelConfig: Added attention_config field
- ModelConfig.__post_init__: Use self.attention_config.backend instead of envs.VLLM_ATTENTION_BACKEND
Benefits:
- Eliminated 2 more env var usages from ModelConfig
- AttentionConfig is now truly the single source of truth for attention backend
- Cleaner dependency flow: AttentionConfig -> ModelConfig -> VllmConfig
This completes the migration away from reading VLLM_ATTENTION_BACKEND directly from environment variables in core config classes.
Signed-off-by: Matthew Bonanni <[email protected]>
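The reordering amounts to roughly the flow below. The dataclasses are bare stand-ins for the real vLLM config classes and the function signature is simplified; only the construction order is the point.

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AttentionConfig:
    backend: Optional[str] = None


@dataclass
class ModelConfig:
    # ModelConfig now carries the attention config so __post_init__ can
    # validate against attention_config.backend instead of an env var.
    attention_config: AttentionConfig = field(default_factory=AttentionConfig)


@dataclass
class VllmConfig:
    model_config: ModelConfig
    attention_config: AttentionConfig


def create_engine_config(attention_backend: Optional[str]) -> VllmConfig:
    # AttentionConfig is built first, then handed to ModelConfig, and both
    # finally land on the top-level VllmConfig.
    attention_config = AttentionConfig(backend=attention_backend)
    model_config = ModelConfig(attention_config=attention_config)
    return VllmConfig(model_config=model_config, attention_config=attention_config)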
Eliminated the separate forced_attn_backend global variable mechanism in favor of directly modifying vllm_config.attention_config.backend. This simplifies the code and makes AttentionConfig the true single source of truth.
Changes:
- Removed forced_attn_backend global variable
- Removed global_force_attn_backend() and get_global_forced_attn_backend()
- Updated global_force_attn_backend_context_manager() to modify vllm_config.attention_config.backend directly
- Updated attention selector to only check AttentionConfig.backend
- Updated cuda.py TODO comment to reflect the new approach
Benefits:
- Simpler architecture: one source of truth instead of two
- No more global state to manage
- Runtime overrides now just modify AttentionConfig.backend
- Context manager still works for tests that need temporary overrides
The context manager is preserved for backward compatibility with tests.
Signed-off-by: Matthew Bonanni <[email protected]>
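With the global removed, the preserved context manager can be implemented roughly as below. This is a sketch under the assumption that it receives the active vllm_config; the actual signature and backend type in the PR may differ.

from contextlib import contextmanager


@contextmanager
def global_force_attn_backend_context_manager(vllm_config, backend_name):
    # Temporarily override the configured attention backend and restore the
    # previous value on exit, so tests can scope their overrides.
    original = vllm_config.attention_config.backend
    vllm_config.attention_config.backend = backend_name
    try:
        yield
    finally:
        vllm_config.attention_config.backend = original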
- Updated flashinfer.py to use attention_config.flashinfer_disable_q_quantization
- Updated platforms/rocm.py to use attention_config for v1_use_prefill_decode_attention and use_aiter_unified_attention
- Updated rocm_attn.py to use attention_config.use_aiter_unified_attention
- Only arg_utils.py now reads env vars (at initialization point)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Purpose
As a first step in the effort to reduce env variables, this PR introduces AttentionConfig and the attention_config CLI argument group. For now, only VLLM_ATTENTION_BACKEND is changed to a CLI argument, with a fallback and a deprecation warning.
Test Plan
CI should suffice
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.