Commit f60bb47

[CI] Enable linux-aarch64-a2 (64GB) and tp2 * 2 max-parallel to speed up CI (vllm-project#2065)
### What this PR does / why we need it?
Our workflow currently takes about 3 hours to run in total, which seriously hurts the developer experience, so an optimization is urgently needed. After this PR, the full CI run is expected to shorten to about 1h40min.

- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 to TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to the single-card test suite

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@a248025

---------

Signed-off-by: wangli <[email protected]>
1 parent ca8007f commit f60bb47

14 files changed: +75 / -75 lines changed
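Before the per-file diffs, here is a minimal sketch of what "TP2 * 2 max-parallel" means in practice for the multicard suite: each e2e test now requests tensor_parallel_size=2, so a single job fits on a two-card linux-aarch64-a2-2 runner and the workflow matrix (vllm_version: main and v0.10.0) can run two jobs side by side instead of serializing one TP4 job. The model name, prompt, dtype, and VllmRunner import path below are illustrative assumptions, not taken verbatim from the diff.

```python
# Sketch of the post-PR multicard test shape (illustrative names; the VllmRunner
# import path is assumed from the e2e suite's conftest).
from tests.e2e.conftest import VllmRunner  # import path assumed


def test_multicard_tp2_sketch():
    example_prompts = ["Hello, my name is"]
    max_tokens = 5

    with VllmRunner(
            "Qwen/QwQ-32B",                     # illustrative multicard model
            dtype="half",
            tensor_parallel_size=2,             # was 4 before this PR
            distributed_executor_backend="mp",
    ) as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)
```

The corresponding workflow change (max-parallel: 1 to 2, and linux-arm64-npu-4 to linux-aarch64-a2-2) appears in .github/workflows/vllm_ascend_test.yaml below.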

.github/actionlint.yaml

Lines changed: 5 additions & 3 deletions
@@ -1,8 +1,10 @@
 self-hosted-runner:
   # Labels of self-hosted runner in array of strings.
   labels:
-    - linux-arm64-npu-1
-    - linux-arm64-npu-2
-    - linux-arm64-npu-4
+    - linux-aarch64-a2-0
+    - linux-aarch64-a2-1
+    - linux-aarch64-a2-2
+    - linux-aarch64-a2-4
+    - linux-aarch64-a2-8
     - linux-arm64-npu-static-8
     - ubuntu-24.04-arm

.github/workflows/accuracy_test.yaml

Lines changed: 2 additions & 2 deletions
@@ -85,8 +85,8 @@ jobs:
       }}
     runs-on: >-
       ${{
-        (matrix.model_name == 'Qwen/Qwen3-30B-A3B' && 'linux-arm64-npu-4') ||
-        'linux-arm64-npu-2'
+        (matrix.model_name == 'Qwen/Qwen3-30B-A3B' && 'linux-aarch64-a2-2') ||
+        'linux-aarch64-a2-1'
       }}
     strategy:
       matrix:

.github/workflows/vllm_ascend_doctest.yaml

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ jobs:
       matrix:
         vllm_verison: [v0.9.1-dev, v0.9.1-dev-openeuler, main, main-openeuler]
     name: vLLM Ascend test
-    runs-on: linux-arm64-npu-1
+    runs-on: linux-aarch64-a2-1
     container:
       image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:${{ matrix.vllm_verison }}
     steps:

.github/workflows/vllm_ascend_test.yaml

Lines changed: 3 additions & 4 deletions
@@ -136,7 +136,7 @@ jobs:
     strategy:
       max-parallel: 2
       matrix:
-        os: [linux-arm64-npu-1]
+        os: [linux-aarch64-a2-1]
         vllm_version: [main, v0.10.0]
     name: singlecard e2e test
     runs-on: ${{ matrix.os }}
@@ -213,9 +213,9 @@
     needs: [e2e]
     if: ${{ needs.e2e.result == 'success' }}
     strategy:
-      max-parallel: 1
+      max-parallel: 2
       matrix:
-        os: [linux-arm64-npu-4]
+        os: [linux-aarch64-a2-2]
         vllm_version: [main, v0.10.0]
     name: multicard e2e test
     runs-on: ${{ matrix.os }}
@@ -275,7 +275,6 @@
           # To avoid oom, we need to run the test in a single process.
           pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_multistream_moe
           pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
-          pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W8A8
           pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_dbo
           pytest -sv tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeekV3_dbo
           pytest -sv tests/e2e/multicard/test_data_parallel.py

.github/workflows/vllm_ascend_test_long_term.yaml

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ jobs:
     strategy:
       max-parallel: 2
       matrix:
-        os: [linux-arm64-npu-1, linux-arm64-npu-4]
+        os: [linux-aarch64-a2-1, linux-aarch64-a2-2]
         vllm_version: [main, v0.10.0]
     name: vLLM Ascend long term test
     runs-on: ${{ matrix.os }}

benchmarks/scripts/run_accuracy.py

Lines changed: 6 additions & 6 deletions
@@ -50,17 +50,17 @@
 # Command templates for running evaluations
 MODEL_RUN_INFO = {
     "Qwen/Qwen3-30B-A3B": (
-        "export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.6,enable_expert_parallel=True'\n"
+        "export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6,enable_expert_parallel=True'\n"
         "lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
         "--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
     ),
     "Qwen/Qwen3-8B-Base": (
-        "export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6'\n"
+        "export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,tensor_parallel_size=1,gpu_memory_utilization=0.6'\n"
         "lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
         "--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
     ),
     "Qwen/Qwen2.5-VL-7B-Instruct": (
-        "export MODEL_ARGS='pretrained={model},max_model_len=8192,dtype=auto,tensor_parallel_size=2,max_images=2'\n"
+        "export MODEL_ARGS='pretrained={model},max_model_len=8192,dtype=auto,tensor_parallel_size=1,max_images=2'\n"
         "lm_eval --model vllm-vlm --model_args $MODEL_ARGS --tasks {datasets} \ \n"
         "--apply_chat_template --fewshot_as_multiturn --batch_size 1"
     ),
@@ -94,9 +94,9 @@
 
 # Model arguments for evaluation
 MODEL_ARGS = {
-    "Qwen/Qwen3-8B-Base": "pretrained=Qwen/Qwen3-8B-Base,max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6",
-    "Qwen/Qwen2.5-VL-7B-Instruct": "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_model_len=8192,dtype=auto,tensor_parallel_size=2,max_images=2",
-    "Qwen/Qwen3-30B-A3B": "pretrained=Qwen/Qwen3-30B-A3B,max_model_len=4096,dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.6,enable_expert_parallel=True",
+    "Qwen/Qwen3-8B-Base": "pretrained=Qwen/Qwen3-8B-Base,max_model_len=4096,dtype=auto,tensor_parallel_size=1,gpu_memory_utilization=0.6",
+    "Qwen/Qwen2.5-VL-7B-Instruct": "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_model_len=8192,dtype=auto,tensor_parallel_size=1,max_images=2",
+    "Qwen/Qwen3-30B-A3B": "pretrained=Qwen/Qwen3-30B-A3B,max_model_len=4096,dtype=auto,tensor_parallel_size=2,gpu_memory_utilization=0.6,enable_expert_parallel=True",
 }
 
 # Whether to apply chat template formatting
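As a usage note, the MODEL_RUN_INFO entries are plain Python format strings with {model} and {datasets} placeholders; the sketch below shows how the post-PR Qwen3-8B-Base template renders into its single-card lm_eval command. The template text is adapted from the hunk above, while the task name passed to datasets is an illustrative placeholder, not taken from this script.

```python
# Sketch: rendering a MODEL_RUN_INFO-style template into the command it documents.
# Adapted from the hunk above; "gsm8k" is an illustrative task name.
QWEN3_8B_TEMPLATE = (
    "export MODEL_ARGS='pretrained={model},max_model_len=4096,dtype=auto,"
    "tensor_parallel_size=1,gpu_memory_utilization=0.6'\n"
    "lm_eval --model vllm --model_args $MODEL_ARGS --tasks {datasets} \\ \n"
    "--apply_chat_template --fewshot_as_multiturn --num_fewshot 5 --batch_size 1"
)

command = QWEN3_8B_TEMPLATE.format(model="Qwen/Qwen3-8B-Base", datasets="gsm8k")
print(command)  # the shell command recorded for this model's accuracy run
```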

examples/disaggregated_prefill_v1/gen_ranktable.sh

Lines changed: 1 addition & 1 deletion
@@ -76,4 +76,4 @@ if [[ -n "${GEN_RANKTABLE}" || ! -e ${PWD}/ranktable.json ]]; then
     --master_addr ${MASTER_ADDR} \
     --master_port ${MASTER_PORT} \
     gen_ranktable.py --local-host $LOCAL_HOST --prefill-device-cnt $PREFILL_DEVICE_CNT --decode-device-cnt $DECODE_DEVICE_CNT
-fi
+fi

tests/e2e/long_term/accuracy/accuracy_multicard.py

Lines changed: 2 additions & 2 deletions
@@ -91,9 +91,9 @@
     "Qwen/Qwen2.5-0.5B-Instruct":
     None,
     "Qwen/Qwen3-30B-A3B":
-    "tensor_parallel_size=4,enable_expert_parallel=True,enforce_eager=True",
+    "tensor_parallel_size=2,enable_expert_parallel=True,enforce_eager=True",
     "deepseek-ai/DeepSeek-V2-Lite":
-    "tensor_parallel_size=4,trust_remote_code=True,enforce_eager=True"
+    "tensor_parallel_size=2,trust_remote_code=True,enforce_eager=True"
 }
 
 multiprocessing.set_start_method("spawn", force=True)

tests/e2e/multicard/test_fused_moe_allgather_ep.py

Lines changed: 2 additions & 2 deletions
@@ -46,7 +46,7 @@ def test_generate_with_allgather():
     sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
 
     with VllmRunner(snapshot_download("vllm-ascend/DeepSeek-V3-Pruning"),
-                    tensor_parallel_size=4,
+                    tensor_parallel_size=2,
                     enforce_eager=True,
                     max_model_len=1024,
                     dtype="auto",
@@ -74,7 +74,7 @@ def test_generate_with_alltoall():
     sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
 
     with VllmRunner(snapshot_download("vllm-ascend/DeepSeek-V3-Pruning"),
-                    tensor_parallel_size=4,
+                    tensor_parallel_size=2,
                     enforce_eager=True,
                     max_model_len=1024,
                     dtype="auto",

tests/e2e/multicard/test_offline_inference_distributed.py

Lines changed: 7 additions & 25 deletions
@@ -42,7 +42,7 @@ def test_models_distributed_QwQ():
     with VllmRunner(
             "Qwen/QwQ-32B",
             dtype=dtype,
-            tensor_parallel_size=4,
+            tensor_parallel_size=2,
             distributed_executor_backend="mp",
     ) as vllm_model:
         vllm_model.generate_greedy(example_prompts, max_tokens)
@@ -57,7 +57,7 @@ def test_models_distributed_DeepSeek_multistream_moe():
     with VllmRunner(
             "vllm-ascend/DeepSeek-V3-Pruning",
             dtype=dtype,
-            tensor_parallel_size=4,
+            tensor_parallel_size=2,
             distributed_executor_backend="mp",
             additional_config={
                 "torchair_graph_config": {
@@ -82,7 +82,7 @@ def test_models_distributed_DeepSeek_dbo():
     with VllmRunner(
             "deepseek-ai/DeepSeek-V2-Lite",
             dtype=dtype,
-            tensor_parallel_size=4,
+            tensor_parallel_size=2,
             distributed_executor_backend="mp",
     ) as vllm_model:
         model_arch = 'DeepseekV2ForCausalLM'
@@ -106,7 +106,7 @@ def test_models_distributed_DeepSeekV3_dbo():
     with VllmRunner(
             "vllm-ascend/DeepSeek-V3-Pruning",
             dtype=dtype,
-            tensor_parallel_size=4,
+            tensor_parallel_size=2,
             distributed_executor_backend="mp",
     ) as vllm_model:
         model_arch = 'DeepseekV3ForCausalLM'
@@ -118,24 +118,6 @@ def test_models_distributed_DeepSeekV3_dbo():
         vllm_model.generate(example_prompts, sampling_params)
 
 
-@pytest.mark.skip(reason="Due to OOM,waiting for 1311pr to merge in")
-def test_models_distributed_DeepSeek_W8A8():
-    example_prompts = [
-        "Hello, my name is",
-    ]
-    max_tokens = 5
-
-    with VllmRunner(
-            snapshot_download("vllm-ascend/DeepSeek-V2-Lite-W8A8"),
-            max_model_len=8192,
-            enforce_eager=True,
-            dtype="auto",
-            tensor_parallel_size=4,
-            quantization="ascend",
-    ) as vllm_model:
-        vllm_model.generate_greedy(example_prompts, max_tokens)
-
-
 def test_models_distributed_pangu():
     example_prompts = [
         "Hello, my name is",
@@ -147,7 +129,7 @@ def test_models_distributed_pangu():
             max_model_len=8192,
             enforce_eager=True,
             dtype="auto",
-            tensor_parallel_size=4,
+            tensor_parallel_size=2,
             distributed_executor_backend="mp",
     ) as vllm_model:
         vllm_model.generate_greedy(example_prompts, max_tokens)
@@ -169,7 +151,7 @@ def test_models_distributed_topk() -> None:
     with VllmRunner(
             "deepseek-ai/DeepSeek-V2-Lite",
             dtype=dtype,
-            tensor_parallel_size=4,
+            tensor_parallel_size=2,
             distributed_executor_backend="mp",
     ) as vllm_model:
         vllm_model.generate(example_prompts, sampling_params)
@@ -186,7 +168,7 @@ def test_models_distributed_Qwen3_W8A8():
             max_model_len=8192,
             enforce_eager=True,
             dtype="auto",
-            tensor_parallel_size=4,
+            tensor_parallel_size=2,
             quantization="ascend",
     ) as vllm_model:
         vllm_model.generate_greedy(example_prompts, max_tokens)
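
The DeepSeek-V2-Lite-W8A8 test deleted above is, per the commit message, moved to a single-card test rather than dropped; the destination file is not part of the hunks shown here. A minimal sketch of what the relocated test plausibly looks like follows, with tensor_parallel_size reduced to 1. The file location, test name, and import paths are assumptions.

```python
# Hypothetical single-card counterpart of the removed multicard W8A8 test
# (destination file and test name assumed, e.g. under tests/e2e/singlecard/).
from modelscope import snapshot_download  # snapshot_download source assumed

from tests.e2e.conftest import VllmRunner  # import path assumed


def test_deepseek_v2_lite_w8a8_single_card():
    example_prompts = ["Hello, my name is"]
    max_tokens = 5

    with VllmRunner(
            snapshot_download("vllm-ascend/DeepSeek-V2-Lite-W8A8"),
            max_model_len=8192,
            enforce_eager=True,
            dtype="auto",
            tensor_parallel_size=1,   # single card; was 4 in the multicard test
            quantization="ascend",
    ) as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)
```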
