2 changes: 2 additions & 0 deletions vllm/model_executor/layers/attention/mla_attention.py
@@ -188,6 +188,7 @@
"""

import functools
import os
from abc import abstractmethod
from dataclasses import dataclass, field
from enum import Enum
@@ -1971,6 +1972,7 @@ def __init__(
# num_heads=128, nope_dim=128, rope_dim=64
self._use_flashinfer_concat_mla_k = (
has_flashinfer()
and os.environ.get("VLLM_DISABLE_FLASHINFER_CONCAT_MLA_K", "0") != "1"
Contributor comment (severity: high):

For consistency with how other vLLM-specific environment variables are handled, it would be better to manage VLLM_DISABLE_FLASHINFER_CONCAT_MLA_K through the vllm.envs module. This centralizes environment variable management and makes the code cleaner.

You can add the new environment variable to vllm/envs.py like this:

# In vllm/envs.py, inside the environment_variables dict of lazy callables
"VLLM_DISABLE_FLASHINFER_CONCAT_MLA_K": lambda: os.getenv("VLLM_DISABLE_FLASHINFER_CONCAT_MLA_K", "0") == "1",

Then, you can use it here as suggested. With this change, the import os at the top of this file is no longer needed and can be removed.

Suggested change:
-            and os.environ.get("VLLM_DISABLE_FLASHINFER_CONCAT_MLA_K", "0") != "1"
+            and not envs.VLLM_DISABLE_FLASHINFER_CONCAT_MLA_K

and (self.num_heads == 128)
and (self.qk_nope_head_dim == 128)
and (self.qk_rope_head_dim == 64)
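The gating condition in the diff can be read as a pure predicate: the FlashInfer concat-k path is enabled only for the one MLA head configuration it supports (128 heads, 128-dim nope, 64-dim rope), with an environment-variable escape hatch. A standalone sketch, using a hypothetical helper name and explicit parameters rather than the class attributes from the diff:

```python
import os

def use_flashinfer_concat_mla_k(
    has_flashinfer: bool,
    num_heads: int,
    qk_nope_head_dim: int,
    qk_rope_head_dim: int,
) -> bool:
    """Hypothetical standalone version of the gating logic in the diff.

    The fast path requires FlashInfer to be available, the exact supported
    MLA shape, and the kill switch VLLM_DISABLE_FLASHINFER_CONCAT_MLA_K
    not being set to "1".
    """
    disabled = os.environ.get("VLLM_DISABLE_FLASHINFER_CONCAT_MLA_K", "0") == "1"
    return (
        has_flashinfer
        and not disabled
        and num_heads == 128
        and qk_nope_head_dim == 128
        and qk_rope_head_dim == 64
    )
```

Modeling the flag as an opt-out (default "0" means enabled) keeps existing deployments on the fast path while giving operators a way to disable it without a code change.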