Commit 1705501

[BugFix] Fix ACLgraph bug in Qwen3_32b_int8 case (#3204)
### What this PR does / why we need it?
1. Fixes the failure to capture sizes for the Qwen3-32b-int8 model when aclgraph, dp1, and tp4 are enabled.
2. Adds an exception that is thrown when size capture fails, together with suggested solutions.
3. Adds this common problem to the FAQ doc.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@releases/v0.11.0

Signed-off-by: lilinsiman <[email protected]>
1 parent a86ece5 commit 1705501

4 files changed, +47 -14 lines changed

docs/source/faqs.md

Lines changed: 15 additions & 0 deletions
@@ -196,3 +196,18 @@ export ATB_LLM_LCOC_ENABLE=0
 ### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?
 The `Qwen2.5-Omni` model requires the `librosa` package to be installed, you need to install the `qwen-omni-utils` package to ensure all dependencies are met `pip install qwen-omni-utils`,
 this package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring audio processing functionality works correctly.
+
+### 20. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
+
+```
+error example in detail:
+ERROR 09-26 10:48:07 [model_runner_v1.py:3029] ACLgraph sizes capture fail: RuntimeError:
+ERROR 09-26 10:48:07 [model_runner_v1.py:3029] ACLgraph has insufficient available streams to capture the configured number of sizes.Please verify both the availability of adequate streams and the appropriateness of the configured size count.
+```
+
+Recommended mitigation strategies:
+1. Manually configure the compilation_config parameter with a reduced size set: '{"cudagraph_capture_sizes":[size1, size2, size3, ...]}'.
+2. Employ ACLgraph's full graph mode as an alternative to the piece-wise approach.
+
+Root cause analysis:
+The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements - such as operator characteristics and specific hardware features - consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
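
Mitigation 1 from the FAQ entry can be sketched as follows. This is a minimal sketch, assuming the server is started with `vllm serve` and its `--compilation-config` option; the helper name `build_serve_command`, the model path, and the size list are hypothetical placeholders, not values recommended by this commit.

```python
import json
import shlex


def build_serve_command(model: str,
                        capture_sizes: list[int],
                        tensor_parallel_size: int = 4) -> list[str]:
    """Return an argv list for `vllm serve` with an explicit capture-size set.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    # Deduplicate and sort so the config is stable between runs.
    sizes = sorted(set(capture_sizes))
    compilation_config = json.dumps({"cudagraph_capture_sizes": sizes})
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--compilation-config", compilation_config,
    ]


# Placeholder model path and sizes; choose sizes that match your workload.
cmd = build_serve_command("path/to/Qwen3-32B-int8", [1, 2, 4, 8, 16, 32])
print(shlex.join(cmd))
```

Fewer capture sizes means fewer graphs to capture, which lowers the number of streams needed. Other flags a real deployment needs (for example quantization settings) are omitted here.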

tests/ut/test_utils.py

Lines changed: 2 additions & 2 deletions
@@ -260,7 +260,7 @@ def test_update_aclgraph_sizes(self):
         utils.update_aclgraph_sizes(test_vllm_config)
         del os.environ['HCCL_OP_EXPANSION_MODE']
         self.assertEqual(
-            138,
+            137,
             len(test_vllm_config.compilation_config.cudagraph_capture_sizes))
 
         test_vllm_config.speculative_config = mock.MagicMock()
@@ -273,7 +273,7 @@ def test_update_aclgraph_sizes(self):
         utils.update_aclgraph_sizes(test_vllm_config)
         del os.environ['HCCL_OP_EXPANSION_MODE']
         self.assertEqual(
-            112,
+            111,
             len(test_vllm_config.compilation_config.cudagraph_capture_sizes))
 
         # max_num_batch_sizes >= len(original_sizes)

vllm_ascend/utils.py

Lines changed: 13 additions & 8 deletions
@@ -40,14 +40,6 @@
 else:
     VllmConfig = None
 
-# NOTE: Currently, we can only capture 1800 graphs at most,
-# due to the limitation of ACL graph. This number is bounded by
-# the number of streams, which is 2048, we save 248 streams
-# as a buffer.
-# Maximum number of graphs that can be captured by ACL Graph
-# TODO: Find out whether we need to solve allreduce function
-MAX_CAPTURE_SIZE = 1800
-
 ASCEND_QUANTIZATION_METHOD = "ascend"
 SOC_VERSION_INFERENCE_SERIES = ["Ascend310P3"]
 REGISTERED_ASCEND_OPS = {}
@@ -293,6 +285,14 @@ def _rec_find(d):
 
 def update_aclgraph_sizes(vllm_config: VllmConfig) -> None:
     """Update ACL graph capture sizes based on hardware limitations"""
+    # NOTE: Currently, we can only capture 1800 graphs at most,
+    # due to the limitation of ACL graph. This number is bounded by
+    # the number of streams, which is 2048, we save 248 streams
+    # as a buffer.
+    # Maximum number of graphs that can be captured by ACL Graph
+    # TODO: Find out whether we need to solve allreduce function
+    MAX_CAPTURE_SIZE = 1800
+
     # Store original configuration and temporarily clear it
     compilation_config = vllm_config.compilation_config
     original_sizes, compilation_config.cudagraph_capture_sizes = \
@@ -326,6 +326,11 @@ def update_aclgraph_sizes(vllm_config: VllmConfig) -> None:
                 "multistream_overlap_shared_expert", False))
         if is_moe_model(vllm_config):
             parallel_factor += (parallel_config.data_parallel_size > 1)
+        else:
+            # When AIV mode is enabled, the allreduce operator of the dense
+            # layer model will occupy additional streams, which are buffered here.
+            MAX_CAPTURE_SIZE = MAX_CAPTURE_SIZE - parallel_factor * resources_per_graph
+
         # Calculate maximum supported batch sizes considering model architecture on the A2 Hardware Device
         # Assume the following case:
         # MAX_CAPTURE_SIZE = 1920, num_hidden_layers = 48, data_parallel_size is 1, tensor_parallel_size is 4,
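
The effect of the new `else` branch can be illustrated with a small calculation. This is a sketch under a stated assumption: the hunk does not show how the budget is turned into a size count, so the floor division by `resources_per_graph * parallel_factor` below is assumed, and the input values 49 and 2 are hypothetical. Only `MAX_CAPTURE_SIZE = 1800` and the subtraction come from the diff.

```python
MAX_CAPTURE_SIZE = 1800  # value from the diff above


def max_batch_sizes(resources_per_graph: int,
                    parallel_factor: int,
                    reserve_buffer: bool) -> int:
    """Estimate how many sizes fit in the stream budget.

    ASSUMPTION: sizes = budget // (resources_per_graph * parallel_factor).
    This division is not shown in the hunk; it is used for illustration only.
    """
    budget = MAX_CAPTURE_SIZE
    if reserve_buffer:
        # Adjustment added by this commit for dense (non-MoE) models.
        budget -= parallel_factor * resources_per_graph
    return budget // (resources_per_graph * parallel_factor)


# Hypothetical inputs: 49 resources per graph, parallel factor 2.
before = max_batch_sizes(49, 2, reserve_buffer=False)
after = max_batch_sizes(49, 2, reserve_buffer=True)
print(before, after)  # prints "18 17"
```

Under this assumed formula, subtracting one multiple of the divisor always lowers the result by exactly one, which is consistent with the unit-test expectations changing from 138 to 137 and from 112 to 111.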

vllm_ascend/worker/model_runner_v1.py

Lines changed: 17 additions & 4 deletions
@@ -3418,10 +3418,23 @@ def _capture_model(self):
             aclgraph_runtime_mode = aclgraph_mode.mixed_mode()
 
             compilation_cases = list(reversed(self.aclgraph_batch_sizes))
-            self._capture_aclgraphs(
-                compilation_cases,
-                aclgraph_runtime_mode=aclgraph_runtime_mode,
-                uniform_decode=False)
+
+            try:
+                self._capture_aclgraphs(
+                    compilation_cases,
+                    aclgraph_runtime_mode=aclgraph_runtime_mode,
+                    uniform_decode=False)
+            except Exception as e:
+                logger.error(
+                    f"ACLgraph sizes capture fail: {type(e).__name__}:\n"
+                    "ACLgraph has insufficient available streams to capture the configured number of sizes. "
+                    "Please verify both the availability of adequate streams and the appropriateness of the configured size count.\n\n"
+                    "Recommended solutions:\n"
+                    "1. Manually configure the compilation_config parameter "
+                    "with a reduced set of sizes: '{\"cudagraph_capture_sizes\":[size1, size2, size3, ...]}'.\n"
+                    "2. Utilize ACLgraph's full graph mode as an alternative to the piece-wise approach.\n\n"
+                    f"{str(e)}")
+                raise
 
         if aclgraph_mode.decode_mode() == CUDAGraphMode.FULL and \
                 aclgraph_mode.separate_routine():
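
The log-and-re-raise pattern used in this hunk can be shown in isolation. This is a minimal sketch using only the standard `logging` module; `capture_with_hint`, `failing_capture`, and the logger name are hypothetical stand-ins, not part of vllm-ascend.

```python
import logging

logger = logging.getLogger("aclgraph_sketch")  # hypothetical logger name


def capture_with_hint(capture_fn, sizes):
    """Run capture_fn(sizes); on failure, log a hint and re-raise."""
    try:
        return capture_fn(sizes)
    except Exception as e:
        logger.error(
            "ACLgraph sizes capture fail: %s:\n"
            "Consider a smaller cudagraph_capture_sizes list "
            "or full graph mode.\n\n%s",
            type(e).__name__, e)
        # A bare `raise` re-raises the same exception with its traceback.
        raise


def failing_capture(sizes):
    # Stand-in for a real capture routine; always fails for demonstration.
    raise RuntimeError(f"out of streams while capturing {len(sizes)} sizes")
```

Calling `capture_with_hint(failing_capture, [1, 2, 4])` logs the hint and then propagates the `RuntimeError` unchanged. Note that the handler in the diff catches every `Exception`, so the stream-exhaustion message is logged for unrelated failures as well; the appended `str(e)` is what identifies the actual cause.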
