
Conversation

@MatthewBonanni MatthewBonanni commented Oct 6, 2025

Purpose

As a first step in the effort to reduce environment variables, this PR introduces AttentionConfig and the attention_config CLI argument group. For now, only VLLM_ATTENTION_BACKEND is migrated to a CLI argument (--attention-backend), which falls back to the environment variable and emits a deprecation warning when the variable is used.
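
For illustration, a rough sketch of the intended usage, assuming the new flag maps to an attention_backend field on EngineArgs and that "FLASH_ATTN" remains a valid backend name (both are assumptions inferred from the diff excerpts further down, not verified here):

```python
# Hypothetical usage sketch; the attention_backend field name and the
# backend string are assumptions inferred from this PR's diff excerpts.
from vllm.engine.arg_utils import EngineArgs

# New style: select the attention backend via the engine argument that
# backs the --attention-backend CLI flag.
engine_args = EngineArgs(model="facebook/opt-125m",
                         attention_backend="FLASH_ATTN")

# Legacy style (still honored, but now expected to emit a deprecation warning):
#   VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve facebook/opt-125m
```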

Test Plan

CI should suffice

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

This commit consolidates attention-related configuration into a new
AttentionConfig class and exposes the attention backend as a CLI argument.

Changes:
- Created new AttentionConfig class in vllm/config/attention.py
- Added AttentionConfig to VllmConfig
- Added --attention-backend CLI argument in dedicated argument group
- Updated imports and exports

This sets the foundation for making more attention-related settings configurable via CLI arguments in future work.

Signed-off-by: Matthew Bonanni <[email protected]>
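
For orientation, a minimal sketch of what such a config class could look like; the field names mirror those visible in the review excerpts below, while the types and defaults are placeholders rather than the values used by the PR:

```python
# Sketch only: field names follow the review diff below; defaults are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttentionConfig:
    """Attention-related settings, grouped in one place."""
    backend: Optional[str] = None          # e.g. "FLASH_ATTN", "FLASHINFER"
    use_triton_flash_attn: bool = True
    flash_attn_version: Optional[int] = None
    use_cudnn_prefill: bool = False
    use_trtllm_attention: Optional[bool] = None
```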
Added detailed documentation to clarify that all attention-related
environment variables are still respected for backward compatibility.

This ensures users can continue using environment variables while
also having the option to use the new --attention-backend CLI argument.

Signed-off-by: Matthew Bonanni <[email protected]>
Added comprehensive deprecation warnings to guide users toward using
CLI arguments instead of environment variables for attention configuration.

Changes:
- Added deprecation warning in create_attention_config() that lists all
  attention-related environment variables that are set
- Warning directs users to use --attention-backend and other future CLI args
- Consolidated warning logic to avoid duplicate warnings
- Maintains full backward compatibility while encouraging migration

The warning will only show if one or more attention environment variables
are explicitly set, making it non-intrusive for users who don't use them.

Signed-off-by: Matthew Bonanni <[email protected]>
Updated the deprecation warning to only apply to VLLM_ATTENTION_BACKEND
since that's the only attention env var with a CLI alternative (--attention-backend).

The other attention-related environment variables remain fully supported
without deprecation warnings, as they don't have CLI argument alternatives yet.

This makes the warning more accurate and less alarming for users who rely
on other attention env vars that aren't being deprecated.

Also removed unused envs import from attention.py after removing __post_init__.

Signed-off-by: Matthew Bonanni <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a dedicated AttentionConfig to centralize attention-related settings and moves the attention backend selection from an environment variable to a CLI argument, which is a good step towards reducing reliance on environment variables. My review focuses on improving the robustness of the new configuration hashing, clarifying deprecation warnings for better user experience, and cleaning up some redundant logic. Overall, the changes are well-structured, and with a few adjustments, the implementation can be made more maintainable and user-friendly.

Comment on lines +64 to +75
        factors: list[Any] = [
            self.backend,
            self.use_triton_flash_attn,
            self.flash_attn_version,
            self.v1_use_prefill_decode_attention,
            self.use_aiter_unified_attention,
            self.flash_attn_max_num_splits_for_cuda_graph,
            self.use_cudnn_prefill,
            self.use_trtllm_attention,
            self.disable_flashinfer_prefill,
            self.flashinfer_disable_q_quantization,
        ]

high

The manual listing of fields in compute_hash is fragile. If a new field is added to AttentionConfig in the future, a developer might forget to update this list, leading to incorrect cache hashes. This can cause subtle and hard-to-debug issues.

To make this more robust, you could programmatically collect the fields. This ensures that any new field is automatically included. Since all current fields seem to affect the computation graph, iterating over all fields is appropriate. This can be done without extra imports by using the __dataclass_fields__ attribute.

        factors = [getattr(self, f) for f in self.__dataclass_fields__]
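
Expanded into a self-contained sketch (the hashlib-based digest is an assumption about how the cache key is built, not the exact vLLM implementation):

```python
# Sketch of the suggested approach: collect every dataclass field
# automatically so a newly added field cannot be silently left out.
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttentionConfig:
    backend: Optional[str] = None
    use_triton_flash_attn: bool = True
    flash_attn_version: Optional[int] = None

    def compute_hash(self) -> str:
        # __dataclass_fields__ lists all declared fields in definition order.
        factors = [getattr(self, f) for f in self.__dataclass_fields__]
        return hashlib.sha256(str(factors).encode()).hexdigest()
```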

Comment on lines +159 to +162
        if self.attention_config:
            vllm_factors.append(self.attention_config.compute_hash())
        else:
            vllm_factors.append("None")

high

The attention_config field in VllmConfig is initialized with a default_factory and is not Optional. Therefore, self.attention_config will always be an instance of AttentionConfig and the if self.attention_config: check will always evaluate to true. This makes the else branch unreachable (dead code) and the conditional check redundant. You can simplify this by directly appending the hash.

        vllm_factors.append(self.attention_config.compute_hash())

"""

# Warn if VLLM_ATTENTION_BACKEND env var is used instead of CLI arg
if envs.is_set("VLLM_ATTENTION_BACKEND") and self.attention_backend is None:

high

The deprecation warning for VLLM_ATTENTION_BACKEND is only issued when the --attention-backend CLI argument is not provided. If a user sets both, the CLI argument silently overrides the environment variable. This can be confusing, as the user might not realize their environment variable is being ignored. It's better to always warn when the deprecated environment variable is set to avoid this confusion.

Suggested change
if envs.is_set("VLLM_ATTENTION_BACKEND") and self.attention_backend is None:
if envs.is_set("VLLM_ATTENTION_BACKEND"):
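
Going one small step beyond the one-line suggestion, a hedged sketch that also tells the user which value wins when both sources are set (the helper name is hypothetical; envs.is_set and the logger pattern follow the diff shown further down):

```python
# Hypothetical helper illustrating the reviewer's point; not vLLM code.
import vllm.envs as envs
from vllm.logger import init_logger

logger = init_logger(__name__)

def warn_deprecated_backend_env(cli_backend):
    # Warn whenever the deprecated env var is set, even if the CLI
    # argument overrides it, and say explicitly which value wins.
    if envs.is_set("VLLM_ATTENTION_BACKEND"):
        logger.warning(
            "VLLM_ATTENTION_BACKEND is deprecated and will be removed in a "
            "future release; use --attention-backend instead.")
        if cli_backend is not None:
            logger.warning(
                "Both VLLM_ATTENTION_BACKEND and --attention-backend are "
                "set; the CLI value %s takes precedence.", cli_backend)
```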


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +1238 to +1252
        # Warn if VLLM_ATTENTION_BACKEND env var is used instead of CLI arg
        if envs.is_set("VLLM_ATTENTION_BACKEND") and self.attention_backend is None:
            logger.warning(
                "Using VLLM_ATTENTION_BACKEND environment variable is deprecated "
                "and will be removed in a future release. "
                "Please use --attention-backend CLI argument instead."
            )

        # Handle backend: prefer CLI arg, fall back to env var
        backend = self.attention_backend
        if backend is None:
            backend = envs.VLLM_ATTENTION_BACKEND

        return AttentionConfig(
            backend=backend,


P1: Propagate --attention-backend CLI value to runtime

The new create_attention_config reads the --attention-backend CLI argument but only returns an AttentionConfig object; it never updates envs.VLLM_ATTENTION_BACKEND. The rest of the codebase (e.g. attention selector and platform-specific backends) still relies on the global env variable to decide which backend to load. As a result, specifying --attention-backend has no effect: all backend selection logic still sees None (or the old env value) and behaves as if the argument was never given. This makes the advertised CLI option a no-op and bypasses existing compatibility checks tied to the env variable. The CLI value should be written back into envs.VLLM_ATTENTION_BACKEND or the downstream logic should read from AttentionConfig instead.
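
One way to address this, and roughly the direction the later commits in this PR take, is for downstream selection code to read the backend from the config instead of the env var; a sketch, with the fallback ordering as an assumption:

```python
# Sketch: prefer the configured backend, fall back to the legacy env var.
# get_current_vllm_config and attention_config follow the names used later
# in this PR; the helper itself is hypothetical.
import vllm.envs as envs
from vllm.config import get_current_vllm_config

def resolve_attention_backend():
    vllm_config = get_current_vllm_config()
    backend = vllm_config.attention_config.backend
    if backend is None:
        backend = envs.VLLM_ATTENTION_BACKEND
    return backend
```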


Updated code throughout the codebase to use AttentionConfig instead of
directly reading from environment variables. This consolidates attention
configuration and makes it easier to manage and test.

Changes:
- Updated V1 attention backends to read from vllm_config.attention_config
- Used get_current_vllm_config() in impl classes that don't have VllmConfig
- Replaced envs.VLLM_* references with config.attention_config.* where used

Files modified:
- vllm/v1/attention/backends/flash_attn.py
- vllm/v1/attention/backends/mla/flashattn_mla.py
- vllm/v1/attention/backends/mla/triton_mla.py
- vllm/v1/attention/backends/mla/common.py
- vllm/v1/attention/backends/rocm_attn.py
- vllm/v1/attention/backends/utils.py
- vllm/attention/utils/fa_utils.py
- vllm/utils/flashinfer.py

Note: Some cached functions still read the env vars directly to avoid
breaking their cache keys; these can be refactored in a future PR
(see the cache-key sketch below).

Signed-off-by: Matthew Bonanni <[email protected]>
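
To illustrate the cache-key caveat in the note above: a function cached on its arguments only distinguishes calls by those arguments, so moving a setting from an env-var read inside the body to a config read requires passing the value in explicitly (or dropping the cache). A generic sketch, not the actual vLLM helpers:

```python
# Hypothetical example of the cache-key concern; not real vLLM code.
from functools import lru_cache

@lru_cache(maxsize=None)
def backend_supports_feature(backend):
    # Because `backend` is an argument, it is part of the cache key.
    # Reading a global (env var or config object) inside the body instead
    # would freeze the first value seen for the lifetime of the cache.
    return backend == "FLASH_ATTN"  # placeholder check
```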
Updated remaining files to use AttentionConfig.backend instead of reading
directly from envs.VLLM_ATTENTION_BACKEND. This completes the migration
to using AttentionConfig as the single source of truth.

Changes:
- vllm/platforms/cuda.py: Use vllm_config.attention_config.backend
- vllm/platforms/xpu.py: Use get_current_vllm_config()
- vllm/attention/selector.py: Read backend from AttentionConfig
- vllm/config/model.py: Added comments explaining why envs is still used
  (ModelConfig is created before AttentionConfig, so can't use it yet)
- vllm/engine/arg_utils.py: Define backend locally from CLI/env in _is_v1_supported_oracle

Note: ModelConfig.__post_init__ still reads from envs because it's created
before VllmConfig/AttentionConfig exists. This is only for early validation.

Signed-off-by: Matthew Bonanni <[email protected]>
Restructured the config creation order so that AttentionConfig is created
first, then passed to ModelConfig. This allows ModelConfig to use
attention_config.backend instead of reading directly from envs.

Changes:
- create_engine_config(): Create AttentionConfig before ModelConfig
- create_model_config(): Accept optional attention_config parameter
- ModelConfig: Added attention_config field
- ModelConfig.__post_init__: Use self.attention_config.backend instead of
  envs.VLLM_ATTENTION_BACKEND

Benefits:
- Eliminated 2 more env var usages from ModelConfig
- AttentionConfig is now truly the single source of truth for attention backend
- Cleaner dependency flow: AttentionConfig -> ModelConfig -> VllmConfig

This completes the migration away from reading VLLM_ATTENTION_BACKEND
directly from environment variables in core config classes.

Signed-off-by: Matthew Bonanni <[email protected]>
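
A schematic of the creation order described above; the function names follow the commit message, while the argument lists are simplified assumptions:

```python
# Simplified sketch of the described ordering; not the real signatures.
from vllm.config import VllmConfig

def create_engine_config(engine_args):
    attention_config = engine_args.create_attention_config()  # built first
    model_config = engine_args.create_model_config(
        attention_config=attention_config)  # can now read .backend directly
    return VllmConfig(model_config=model_config,
                      attention_config=attention_config)
```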
Eliminated the separate forced_attn_backend global variable mechanism
in favor of directly modifying vllm_config.attention_config.backend.
This simplifies the code and makes AttentionConfig the true single
source of truth.

Changes:
- Removed forced_attn_backend global variable
- Removed global_force_attn_backend() and get_global_forced_attn_backend()
- Updated global_force_attn_backend_context_manager() to modify
  vllm_config.attention_config.backend directly
- Updated attention selector to only check AttentionConfig.backend
- Updated cuda.py TODO comment to reflect the new approach

Benefits:
- Simpler architecture: one source of truth instead of two
- No more global state to manage
- Runtime overrides now just modify AttentionConfig.backend
- Context manager still works for tests that need temporary overrides

The context manager is preserved for backward compatibility with tests.

Signed-off-by: Matthew Bonanni <[email protected]>
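
A minimal, standalone sketch of how such an override context manager can work by mutating AttentionConfig.backend and restoring it afterwards (the real helper is global_force_attn_backend_context_manager; this version is illustrative):

```python
# Illustrative stand-in for the override context manager described above.
from contextlib import contextmanager

@contextmanager
def force_attn_backend(vllm_config, backend):
    # Temporarily point AttentionConfig.backend at the forced value and
    # restore the original on exit, including when an exception is raised.
    original = vllm_config.attention_config.backend
    vllm_config.attention_config.backend = backend
    try:
        yield
    finally:
        vllm_config.attention_config.backend = original
```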
- Updated flashinfer.py to use attention_config.flashinfer_disable_q_quantization
- Updated platforms/rocm.py to use attention_config for v1_use_prefill_decode_attention and use_aiter_unified_attention
- Updated rocm_attn.py to use attention_config.use_aiter_unified_attention
- Only arg_utils.py now reads env vars (at initialization point)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Signed-off-by: Matthew Bonanni <[email protected]>
@mergify mergify bot added the rocm Related to AMD ROCm label Oct 6, 2025