Commit 128c120
[0.9.1][bugfix] Address abnormal VRAM increase in quantized models with floating-point MTP (#2554)
### **Problem & Cause**
VRAM usage increased abnormally during mixed-precision inference with
quantized models and floating-point MTP. This was caused by
`dist.all_to_all_single` creating extra HCCL communicators, which
produced unnecessary buffers that consumed more memory.
### **Solution**
This commit passes a communicator argument to the `dist.all_to_all_single`
calls. Because the communicator is the one already created by the
`vllm-ascend` framework, all communication operations share a single
communication domain, so no extra buffers are allocated and the abnormal
VRAM growth no longer occurs.
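
The effect described above can be illustrated with a small toy model. This is only a sketch, not the code changed in this commit: `ToyBackend`, `BUFFER_MB`, and the group name `framework_ep_group` are hypothetical, and the 200 MB buffer size is an arbitrary illustrative figure. The model assumes a backend that lazily creates one communicator, with its own buffer, for each process group the first time a collective runs on it. In real PyTorch, `torch.distributed.all_to_all_single` accepts a `group` keyword for this purpose. Whether the commit uses exactly that keyword is an assumption, since the diff body is not shown here.

```python
from dataclasses import dataclass, field

BUFFER_MB = 200  # hypothetical per-communicator buffer size, illustration only


@dataclass
class ToyBackend:
    """Toy stand-in for a collective backend (not HCCL, not torch.distributed).

    It lazily creates one communicator, with its own buffer, per process
    group the first time a collective is issued on that group.
    """
    communicators: dict = field(default_factory=dict)

    def all_to_all_single(self, output, data, group="default"):
        # The first collective on a group allocates that group's buffer.
        self.communicators.setdefault(group, BUFFER_MB)
        # Single-rank exchange: the input comes back unchanged.
        output[:] = data
        return output

    def buffer_mb(self):
        # Total buffer memory held by all communicators created so far.
        return sum(self.communicators.values())


def exchange(backend, group=None):
    """Run one all-to-all, optionally on an explicitly supplied group."""
    out = [0, 0, 0]
    if group is None:
        backend.all_to_all_single(out, [1, 2, 3])
    else:
        backend.all_to_all_single(out, [1, 2, 3], group=group)
    return out


# Before the fix: the framework already owns a communicator, but the call
# does not pass it, so a second communicator (and buffer) is created.
before = ToyBackend()
before.all_to_all_single([0], [0], group="framework_ep_group")
exchange(before)

# After the fix: the existing group is passed, so nothing new is allocated.
after = ToyBackend()
after.all_to_all_single([0], [0], group="framework_ep_group")
result = exchange(after, group="framework_ep_group")
```

In this model `before` ends up holding two communicator buffers and `after` only one, which mirrors the memory difference the commit message attributes to using a unified communication domain.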
### **Collaborators**
@kunpengW-code
cc @farawayboat @MengqingCao
Signed-off-by: SlightwindSec <[email protected]>

1 parent 60c2df2 commit 128c120
1 file changed, +18 -7 lines changed

| Hunk | Original file lines removed | Diff lines added |
|---|---|---|
| 1 | 529 | 529-531 |
| 2 | 545-548 | 547-556 |
| 3 | 596-597 | 604-608 |