Your current environment
Low generation throughput (~40 tokens/s) for Qwen3-Next-80B-A3B-Instruct-AWQ-4bit (MoE) on a single GPU
Describe the Issue
I am observing unexpectedly low generation throughput when running the Qwen3-Next-80B-A3B-Instruct-AWQ-4bit model (an AWQ-quantized Mixture of Experts architecture) on a single, high-end NVIDIA GPU.
The throughput is significantly lower than anticipated for this type of hardware and optimization.
Environment Details
- Model: cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit (80B MoE, AWQ 4-bit quantized)
- Hardware (single GPU used): NVIDIA RTX Pro 6000 Blackwell (96 GB VRAM)
- vLLM Image: vllm/vllm-openai:latest (exact vLLM version not pinned; it can be read from the container as shown below)
- Tensor Parallelism: TP=1 (single GPU)
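Since the image tag is latest, the exact vLLM build can be checked from the running container. A minimal sketch, assuming the container name used in the launch command below:

docker exec vllm-qwen-80b python3 -c "import vllm; print(vllm.__version__)"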
Steps to Reproduce
- Launch the vLLM Server using the following Docker command (targeting a single GPU):
docker run -d --name vllm-qwen-80b \
--gpus '"device=2"' --ipc=host \
-p 6000:6000 \
-v ./hf_hub/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit:/models:ro \
vllm/vllm-openai:latest \
--model /models \
--served-model-name Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
--tensor-parallel-size 1 \
--max-model-len 65536 \
--gpu-memory-utilization 0.90
- Submit a long-context request to the server (e.g., input ≈ 8,400 tokens, generate ≈ 500 tokens); one way to time such a request is sketched below.
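For reference, a single timed request against the OpenAI-compatible completions endpoint is enough to reproduce the measurement. This is a minimal sketch, assuming the server is reachable on localhost:6000 as mapped above; prompt.json is a hypothetical file holding the request body (served model name, an ~8,400-token prompt, max_tokens=500).

# Time one long-context request and report the wall-clock duration.
curl -s -o response.json -w 'wall_time_s=%{time_total}\n' \
  -H 'Content-Type: application/json' \
  -d @prompt.json \
  http://localhost:6000/v1/completions
# Generation speed ≈ usage.completion_tokens from response.json divided by the wall
# time (this includes prefill, so it slightly understates decode-only throughput).
python3 -c "import json; print(json.load(open('response.json'))['usage'])"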
Observed Performance
The generation throughput reported below is the average over N consecutive single-request runs.
The observed generation throughput is around 40-45 tokens/second (based on an average run in which 494 output tokens took 11.89 seconds, i.e., 41.5 tokens/s).
- Input Tokens: 8,417
- Output Tokens: 494
- Total Time (Average): 11.89 seconds
- Generation Speed (Average): 41.5 tokens/second
Expected Behavior
Given the large model size (80B) and MoE architecture, some performance penalty is expected on a single GPU. However, for an AWQ-quantized model on an NVIDIA Blackwell architecture card, I would expect the single-stream generation throughput to be substantially higher than the ~40-45 tokens/s observed here.
Could this low throughput be due to:
- Specific inefficiencies in vLLM's handling of this particular Qwen MoE + AWQ configuration on a single GPU?
- A suboptimal kernel or loading strategy for the MoE layers?
Please note: The model files are confirmed to be valid as they successfully load and run.
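To help narrow down the kernel question above, the server startup log can be inspected for the quantization and MoE kernels that vLLM actually selected. A rough, hedged filter (the exact log wording varies across vLLM versions):

# Show startup lines that mention AWQ/Marlin kernels or fused-MoE paths, if present.
docker logs vllm-qwen-80b 2>&1 | grep -iE 'awq|marlin|moe|quant'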
🐛 Describe the bug
docker run -d --name vllm-qwen-80b \
--gpus '"device=2"' --ipc=host \
-p 6000:6000 \
-v ./hf_hub/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit:/models:ro \
vllm/vllm-openai:latest \
--model /models \
--served-model-name Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
--tensor-parallel-size 1 \
--max-model-len 65536 \
--gpu-memory-utilization 0.90