Commit 9718a9e

[NPU] Upgrade to v0.17.0 (vllm-project#1890)

Signed-off-by: gcanlin <canlinguosdu@gmail.com>

1 parent 88caaf1 commit 9718a9e

9 files changed: +357 −72 lines

docker/Dockerfile.npu

Lines changed: 6 additions & 6 deletions

```diff
@@ -1,17 +1,17 @@
 ARG VLLM_ASCEND_IMAGE=quay.io/ascend/vllm-ascend
-ARG VLLM_ASCEND_TAG=v0.14.0rc1
+ARG VLLM_ASCEND_TAG=v0.17.0rc1
 FROM ${VLLM_ASCEND_IMAGE}:${VLLM_ASCEND_TAG}
 
-WORKDIR /vllm-workspace/vllm-ascend
-RUN git checkout e2175d9c7e62b437391dfee996b1375674ba7c18
-RUN pip install -v -e .
-
 ARG APP_DIR=/vllm-workspace/vllm-omni
 WORKDIR ${APP_DIR}
 
 COPY . .
 
-RUN pip install -v -e .
+RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
+    source /usr/local/Ascend/nnal/atb/set_env.sh && \
+    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
+    python3 -m pip install -v -e /vllm-workspace/vllm-omni/ --no-build-isolation
 
 ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
```
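The new RUN line builds the Ascend device-library path from the image's hardware platform before installing. A minimal sketch of how that path is composed (the toolkit prefix is taken from the Dockerfile above; `uname -i` expands to the platform name, e.g. `aarch64` or `x86_64`, though some kernels report `unknown`):

```shell
# Compose the devlib path the same way the Dockerfile's RUN line does,
# then append it to LD_LIBRARY_PATH so the Ascend device libraries resolve.
arch="$(uname -i)"
devlib="/usr/local/Ascend/ascend-toolkit/latest/${arch}-linux/devlib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${devlib}"
echo "$devlib"
```

Sourcing `set_env.sh` for the toolkit and ATB beforehand, as the Dockerfile does, is what makes the `--no-build-isolation` install see the already-provisioned torch-npu toolchain.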
docker/Dockerfile.npu.a3

Lines changed: 6 additions & 6 deletions

```diff
@@ -1,17 +1,17 @@
 ARG VLLM_ASCEND_IMAGE=quay.io/ascend/vllm-ascend
-ARG VLLM_ASCEND_TAG=v0.14.0rc1-a3
+ARG VLLM_ASCEND_TAG=v0.17.0rc1-a3
 FROM ${VLLM_ASCEND_IMAGE}:${VLLM_ASCEND_TAG}
 
-WORKDIR /vllm-workspace/vllm-ascend
-RUN git checkout e2175d9c7e62b437391dfee996b1375674ba7c18
-RUN pip install -v -e .
-
 ARG APP_DIR=/vllm-workspace/vllm-omni
 WORKDIR ${APP_DIR}
 
 COPY . .
 
-RUN pip install -v -e .
+RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
+    source /usr/local/Ascend/nnal/atb/set_env.sh && \
+    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
+    python3 -m pip install -v -e /vllm-workspace/vllm-omni/ --no-build-isolation
 
 ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
```
docs/getting_started/installation/npu/npu.inc.md

Lines changed: 14 additions & 7 deletions

````diff
@@ -33,18 +33,25 @@ docker run --rm \
   -p 8000:8000 \
   -it $IMAGE bash
 
+cd /vllm-workspace/vllm
+git pull origin main
+git fetch origin --tags
+git checkout v0.16.0
+
 # Because vllm-ascend will release v0.16.0rc1 after vllm-omni 0.16.0,
 # we have to pin vllm-ascend at the current commit.
 cd /vllm-workspace/vllm-ascend
+git pull origin main
 git checkout e2175d9c7e62b437391dfee996b1375674ba7c18
 pip install -v -e .
 
 # Inside the container, install vLLM-Omni from source
 cd /vllm-workspace
 git clone -b v0.16.0 https://github.com/vllm-project/vllm-omni.git
-
 cd vllm-omni
-pip install -v -e .
+pip install -v -e . --no-build-isolation
+# or VLLM_OMNI_TARGET_DEVICE=npu pip install -v -e .
+
 export VLLM_WORKER_MULTIPROC_METHOD=spawn
 ```
 
@@ -61,22 +68,22 @@ We are keeping [issue #886](https://github.com/vllm-project/vllm-omni/issues/886
 You can also build vLLM-Omni from the latest main branch if you want to use the latest features or bug fixes. (But sometimes it will break for a while. You can check [issue #886](https://github.com/vllm-project/vllm-omni/issues/886) for the status of the latest commit of vLLM-Omni main branch on NPU.)
 
 ```bash
-# Pin vLLM version to 0.16.0
+# Pin vLLM version to 0.17.0
 cd /vllm-workspace/vllm
 git pull origin main
 git fetch origin --tags
-git checkout v0.16.0
+git checkout v0.17.0
 VLLM_TARGET_DEVICE=empty pip install -v -e .
 
 # Because vllm-ascend has not yet entered continuous development and has not been officially released, we need to pin it to a specific commit. Please note that this commit may change over time.
-cd ../vllm-ascend
+cd /vllm-workspace/vllm-ascend
 git pull origin main
 git fetch origin --tags
-git checkout e2175d9c7e62b437391dfee996b1375674ba7c18
+git checkout v0.17.0
 pip install -v -e .
 
 # Install vLLM-Omni from the latest main branch
-cd ../vllm-omni
+cd /vllm-workspace/vllm-omni
 git clone https://github.com/vllm-project/vllm-omni.git
 pip install -v -e . --no-build-isolation
 # or VLLM_OMNI_TARGET_DEVICE=npu pip install -v -e .
````
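The docs above offer two equivalent installs: `--no-build-isolation` reuses the environment's already-installed torch/torch-npu at build time, while `VLLM_OMNI_TARGET_DEVICE=npu` pins the backend explicitly. A sketch of the env-var-driven selection (the `auto` fallback here is a hypothetical default for illustration, not necessarily vllm-omni's actual one):

```shell
# Read the backend override the way an env-var-driven setup script would;
# fall back to automatic detection when the variable is unset.
target="${VLLM_OMNI_TARGET_DEVICE:-auto}"
echo "vllm-omni build target: $target"
```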

vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_code_predictor_mtp.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -22,6 +22,8 @@
 from vllm.model_executor.layers.rotary_embedding import get_rope
 from vllm.model_executor.layers.vocab_parallel_embedding import VocabParallelEmbedding
 
+from vllm_omni.platforms import current_omni_platform
+
 logger = init_logger(__name__)
 
 
@@ -343,6 +345,10 @@ def _ensure_cached_refs(self) -> None:
     def _ensure_model_fwd(self) -> None:
         if self._model_fwd is not None:
             return
+        if not current_omni_platform.supports_torch_inductor():
+            logger.warning_once("code_predictor: torch.compile disabled")
+            self._model_fwd = self.model.forward
+            return
         self._model_fwd = torch.compile(
             self.model.forward,
             mode="default",
```
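The guard added here is a lazy-bind-with-fallback pattern: the forward callable is resolved once, and torch.compile is skipped on platforms without TorchInductor support. A reduced sketch that runs without torch — `supports_torch_inductor` stands in for `current_omni_platform.supports_torch_inductor()`, and the compile step is stubbed:

```python
def supports_torch_inductor() -> bool:
    # Assumption for this sketch: an NPU backend reports False.
    return False

class CodePredictor:
    def __init__(self) -> None:
        self._model_fwd = None  # resolved lazily on first use

    def _forward(self, x: int) -> int:
        return 2 * x  # placeholder for self.model.forward

    def _ensure_model_fwd(self) -> None:
        if self._model_fwd is not None:
            return  # already bound
        if not supports_torch_inductor():
            # Fall back to the eager forward instead of compiling.
            self._model_fwd = self._forward
            return
        # On Inductor-capable platforms the real code does:
        #   self._model_fwd = torch.compile(self.model.forward, mode="default")
        self._model_fwd = self._forward

predictor = CodePredictor()
predictor._ensure_model_fwd()
```

Binding `self.model.forward` directly keeps every call site unchanged; only the one-time resolution branches on platform capability.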

vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_code_predictor_vllm.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -21,6 +21,8 @@
 )
 from vllm.model_executor.models.utils import is_pp_missing_parameter
 
+from vllm_omni.platforms import current_omni_platform
+
 from .configuration_qwen3_tts import Qwen3TTSTalkerCodePredictorConfig, Qwen3TTSTalkerConfig
 
 logger = init_logger(__name__)
@@ -410,6 +412,10 @@ def _setup_compile(self) -> None:
         """
         if self._compiled_model_fwd is not None:
             return
+        if not current_omni_platform.supports_torch_inductor():
+            logger.warning_once("code_predictor: torch.compile disabled")
+            self._compiled_model_fwd = self.model.forward
+            return
         self._compiled_model_fwd = torch.compile(
             self.model.forward,
             mode="default",
```
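Both guards use `logger.warning_once`, which deduplicates the "torch.compile disabled" message so it is emitted a single time per process rather than on every request. A minimal stand-in for that behavior (illustrative only, not vLLM's implementation):

```python
import logging

_seen_messages: set[str] = set()

def warning_once(logger: logging.Logger, msg: str) -> None:
    """Log `msg` at WARNING level only the first time it is seen."""
    if msg in _seen_messages:
        return  # already warned; stay silent
    _seen_messages.add(msg)
    logger.warning(msg)
```

Keyed on the message text, every later call with the same string is a no-op, which matters here because `_setup_compile` can be reached on many inference paths.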
Lines changed: 101 additions & 0 deletions (new file)

```yaml
# Stage config for running Qwen3-Omni-MoE with 3-stage architecture
# Stage 0: Thinker (multimodal understanding + text generation)
# Stage 1: Talker (text embeddings → 16-layer RVQ codec codes)
# Stage 2: Code2Wav (16-layer RVQ codes → audio waveform)

# The following config has been verified on 2x H100-80G GPUs.
async_chunk: true
stage_args:
- stage_id: 0
  stage_type: llm # Use llm stage type to launch OmniLLM
  runtime:
    devices: "0,1"
    max_batch_size: 10
  engine_args:
    model_stage: thinker
    model_arch: Qwen3OmniMoeForConditionalGeneration
    worker_type: ar
    scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
    gpu_memory_utilization: 0.9
    enforce_eager: false
    trust_remote_code: true
    engine_output_type: latent # Output hidden states for talker
    distributed_executor_backend: "mp"
    enable_prefix_caching: false
    max_num_batched_tokens: 32768
    hf_config_name: thinker_config
    tensor_parallel_size: 2
    custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.thinker2talker_async_chunk
    final_output: true
    final_output_type: text
    is_comprehension: true
  default_sampling_params:
    temperature: 0.4
    top_p: 0.9
    top_k: 1
    max_tokens: 2048
    seed: 42
    detokenize: True
    repetition_penalty: 1.05

- stage_id: 1
  stage_type: llm # Use llm stage type to launch OmniLLM
  runtime:
    devices: "2"
    max_batch_size: 10
  engine_args:
    model_stage: talker
    model_arch: Qwen3OmniMoeForConditionalGeneration
    worker_type: ar
    scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
    gpu_memory_utilization: 0.6
    enforce_eager: true
    trust_remote_code: true
    engine_output_type: latent # Output codec codes for code2wav
    enable_prefix_caching: false
    max_num_batched_tokens: 32768
    distributed_executor_backend: "mp"
    hf_config_name: talker_config
    custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_omni.talker2code2wav_async_chunk
    engine_input_source: [0]
    # final_output: true
    # final_output_type: text
  default_sampling_params:
    temperature: 0.9
    top_k: 50
    max_tokens: 4096
    seed: 42
    detokenize: False
    repetition_penalty: 1.0
    stop_token_ids: [2150]

- stage_id: 2
  stage_type: llm # Use llm stage type to launch OmniLLM
  runtime:
    devices: "2"
    max_batch_size: 10
  engine_args:
    model_stage: code2wav
    model_arch: Qwen3OmniMoeForConditionalGeneration
    worker_type: generation
    scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
    enforce_eager: true
    trust_remote_code: true
    async_scheduling: false
    enable_prefix_caching: false
    engine_output_type: audio # Final output: audio waveform
    gpu_memory_utilization: 0.3
    distributed_executor_backend: "mp"
    max_num_batched_tokens: 51200 # [TODO] if max_num_batch_tokens < max_batch_size * 800, there will be precision problem.
    hf_config_name: thinker_config
    engine_input_source: [1]
    final_output: true
    final_output_type: audio
  default_sampling_params:
    temperature: 0.0
    top_p: 1.0
    top_k: -1
    max_tokens: 65536
    seed: 42
    detokenize: True
    repetition_penalty: 1.1
```
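The stages above form a pipeline: each `engine_input_source` must reference an already-defined stage, and exactly one stage produces the final audio. A quick structural check of that wiring, written against a plain-dict rendering of the config (field placement is an assumption from the flattened diff, not a schema):

```python
# Minimal pipeline-wiring check: sources must point at earlier stages,
# and exactly one stage may declare the audio final output.
stages = [
    {"stage_id": 0, "engine_input_source": None, "final_output_type": "text"},
    {"stage_id": 1, "engine_input_source": [0], "final_output_type": None},
    {"stage_id": 2, "engine_input_source": [1], "final_output_type": "audio"},
]

def check_pipeline(stages: list[dict]) -> bool:
    seen: set[int] = set()
    for stage in stages:
        sources = stage["engine_input_source"] or []
        # Every input source must be a stage defined earlier in the list.
        assert all(src in seen for src in sources), (
            f"stage {stage['stage_id']} reads an undefined stage"
        )
        seen.add(stage["stage_id"])
    audio_stages = [s for s in stages if s["final_output_type"] == "audio"]
    assert len(audio_stages) == 1, "exactly one stage should emit audio"
    return True
```

Running `check_pipeline(stages)` on the three stages above passes: stage 1 consumes stage 0's thinker latents and stage 2 consumes stage 1's codec codes.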
