[v0.9.1-dev]dynamic eplb metrics #2364
base: v0.9.1-dev
Conversation
Code Review
This pull request introduces metrics for dynamic Expert Parallel Load Balancing (EPLB) to monitor expert heat and mappings. The changes involve creating a new logger class, EplbStatLogger, to handle Prometheus metrics, and integrating it into the model runner. The review identified a few high-severity issues: a non-daemon background thread that could prevent clean shutdowns, an incorrect implementation of a singleton's get_instance method, and improper handling of an environment variable for a boolean flag. These issues should be addressed to ensure correctness and robustness.
vllm_ascend/eplb/eplb_loggers.py
Outdated
documentation="physical expert to logical expert per rank",
labelnames=labelnames_phy2log)

self.do_record_loop = threading.Thread(target=self.record_loop)
The background thread do_record_loop is not configured as a daemon thread. This can prevent the application from shutting down cleanly, as the non-daemon thread will keep the process alive. It is recommended to create it as a daemon thread to ensure the process can exit properly.
Suggested change:
- self.do_record_loop = threading.Thread(target=self.record_loop)
+ self.do_record_loop = threading.Thread(target=self.record_loop, daemon=True)
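For illustration, a minimal sketch of a daemon-ized record loop with an explicit stop signal (the stop event and interval are assumptions for this sketch, not part of this PR; the suggestion itself only needs daemon=True):

import threading

# Hypothetical stop event and polling interval for the sketch.
stop_event = threading.Event()
RECORD_INTERVAL_S = 10.0

def record_loop():
    # Event.wait doubles as an interruptible sleep: the loop exits promptly
    # once stop_event is set, and daemon=True keeps the thread from blocking
    # interpreter shutdown even if it never is.
    while not stop_event.wait(RECORD_INTERVAL_S):
        pass  # push current expert-heat values into the Prometheus gauges here

do_record_loop = threading.Thread(target=record_loop, daemon=True)
do_record_loop.start()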
vllm_ascend/eplb/eplb_loggers.py
Outdated
def get_instance():
    if EplbStatLogger._instance is None:
        raise ValueError(
            "ExpertLoadBalancer instance has not been initialized.")
    return EplbStatLogger
The get_instance method is implemented incorrectly. It returns the EplbStatLogger class object instead of the singleton instance _instance. Additionally, the ValueError message incorrectly refers to ExpertLoadBalancer instead of EplbStatLogger.
Suggested change:
  def get_instance():
      if EplbStatLogger._instance is None:
          raise ValueError(
-             "ExpertLoadBalancer instance has not been initialized.")
-     return EplbStatLogger
+             "EplbStatLogger instance has not been initialized.")
+     return EplbStatLogger._instance
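For context, a minimal sketch of the singleton access pattern the fix implies (the init_instance classmethod and its rank argument are assumptions for illustration, not the PR's actual constructor):

class EplbStatLogger:
    """Sketch of the singleton pattern implied by the suggested fix."""

    _instance = None

    def __init__(self, rank):
        self.rank = rank

    @classmethod
    def init_instance(cls, rank):
        # Create the singleton exactly once, e.g. from the model runner.
        if cls._instance is None:
            cls._instance = cls(rank)
        return cls._instance

    @staticmethod
    def get_instance():
        if EplbStatLogger._instance is None:
            raise ValueError(
                "EplbStatLogger instance has not been initialized.")
        return EplbStatLogger._instance

# Usage: initialize once, then retrieve the same object everywhere else.
EplbStatLogger.init_instance(rank=0)
assert EplbStatLogger.get_instance() is EplbStatLogger.get_instance()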
@@ -386,10 +388,12 @@ def __init__(self, vllm_config: VllmConfig, device: torch.device):

        #EPLB
        self.dynamic_eplb = ascend_config.dynamic_eplb
        self.dynamic_eplb_metrics = os.getenv("DYNAMIC_EXPERT_LOAD_METRICS", False)
The os.getenv function returns a string. When checking for a boolean environment variable, simply checking the truthiness of the returned value can lead to unexpected behavior. For example, if DYNAMIC_EXPERT_LOAD_METRICS is set to "0", it will be evaluated as True. The value should be explicitly compared to a string like "1".
Suggested change:
- self.dynamic_eplb_metrics = os.getenv("DYNAMIC_EXPERT_LOAD_METRICS", False)
+ self.dynamic_eplb_metrics = os.getenv("DYNAMIC_EXPERT_LOAD_METRICS", "0") == "1"
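More generally, a small helper along these lines makes the intent explicit (a sketch, not part of this PR; the accepted truthy spellings are an assumption):

import os

def env_flag(name, default=False):
    """Interpret an environment variable as a boolean flag."""
    value = os.getenv(name)
    if value is None:
        return default
    # Treat common truthy spellings as True; everything else (incl. "0") as False.
    return value.strip().lower() in ("1", "true", "yes", "on")

# Equivalent in spirit to the suggested == "1" check, but more forgiving.
dynamic_eplb_metrics = env_flag("DYNAMIC_EXPERT_LOAD_METRICS")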
ready for review |
What this PR does / why we need it?
In dynamic EPLB, expert heat is exported through the metrics interface so that it can be observed.
Two metrics are recorded: per-physical-expert heat and the physical-expert-to-logical-expert mapping table. Because redundant experts prevent logical-expert heat from accurately reflecting the load on each card, heat is recorded per physical expert, and the mapping table is recorded so that heat can be mapped back to logical experts.
Physical expert heat is labelled with layer, phy_expert_id, and rank, representing the MoE layer, physical expert ID, and card, respectively. The mapping table is labelled with layer, log_expert_id, phy_expert_id, and rank, representing the MoE layer, logical expert ID, physical expert ID, and card, respectively. The mapping table encodes the binding relationship: the indicator for a log_expert_id is 1 when the phy_expert_id is bound to it, and 0 otherwise.
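As an illustration of this label scheme, gauges along the following lines could express it with prometheus_client (the metric and variable names here are assumptions, not necessarily those used in the PR):

from prometheus_client import Gauge

# Physical expert heat, labelled by MoE layer, physical expert ID, and card (rank).
phy_expert_heat = Gauge(
    "vllm_ascend_phy_expert_heat",  # assumed metric name
    "Load (heat) observed on each physical expert per rank",
    labelnames=["layer", "phy_expert_id", "rank"])

# Physical-to-logical mapping: 1 if the physical expert is currently bound to
# the logical expert on that rank, 0 otherwise.
phy2log_mapping = Gauge(
    "vllm_ascend_phy2log_mapping",  # assumed metric name
    "Physical expert to logical expert mapping per rank",
    labelnames=["layer", "log_expert_id", "phy_expert_id", "rank"])

# Example updates for one (layer, expert, rank) combination.
phy_expert_heat.labels(layer=0, phy_expert_id=3, rank=1).set(1234)
phy2log_mapping.labels(layer=0, log_expert_id=7, phy_expert_id=3, rank=1).set(1)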
In testing, the impact on overall TPOT was less than 1%.
Does this PR introduce any user-facing change?
How was this patch tested?
Tested on vllm-ascend v0.9.1-dev with:
export DYNAMIC_EXPERT_LOAD_METRICS=1