
[Bug]: Poor Performance: ~40 t/s for Qwen3-80B-AWQ on Single RTX 6000 #28667

@Wanli-Lee

Description

Your current environment

🚀 Low generation throughput (~40 tokens/s) for Qwen3-Next-80B-A3B-Instruct-AWQ-4bit (MoE) on a single GPU

Describe the Issue

I am observing unexpectedly low generation throughput when running the Qwen3-Next-80B-A3B-Instruct-AWQ-4bit model (an AWQ-quantized Mixture-of-Experts architecture) on a single high-end NVIDIA GPU.

The throughput is significantly lower than anticipated for this class of hardware and this level of quantization.

Environment Details

  • Model: cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit (80B MoE, AWQ-4bit quantized)
  • Hardware (Single GPU used): NVIDIA RTX Pro 6000 Blackwell (96GB VRAM)
  • vLLM Image: vllm/vllm-openai:latest (exact vLLM version/commit not pinned)
  • Tensor Parallelism: TP=1 (Single GPU)

Steps to Reproduce

  1. Launch the vLLM Server using the following Docker command (targeting a single GPU):
docker run -d --name vllm-qwen-80b \
  --gpus '"device=2"' --ipc=host \
  -p 6000:6000 \
  -v ./hf_hub/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit:/models:ro \
  vllm/vllm-openai:latest \
  --model /models \
  --served-model-name Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
  --tensor-parallel-size 1 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90
  2. Submit a long-context request to the server (e.g., input ≈ 8,400 tokens, generating ≈ 500 tokens). A minimal client/timing sketch is given after this list.
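For reference, this is roughly how such a request can be submitted and timed. A minimal sketch, assuming the server is reachable at http://localhost:6000/v1 (per the -p 6000:6000 mapping above) and using a placeholder long_prompt.txt for the ~8,400-token prompt:

```python
# Minimal request/timing sketch (not the exact client used).
# Assumptions: server reachable at localhost:6000; "long_prompt.txt" is a
# placeholder file holding the ~8,400-token prompt.
import time
import requests

URL = "http://localhost:6000/v1/chat/completions"
MODEL = "Qwen3-Next-80B-A3B-Instruct-AWQ-4bit"

with open("long_prompt.txt") as f:  # placeholder prompt file
    prompt = f.read()

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 500,
    "temperature": 0.0,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.perf_counter() - start

usage = resp.json()["usage"]
print(f"prompt tokens:     {usage['prompt_tokens']}")
print(f"completion tokens: {usage['completion_tokens']}")
print(f"total time:        {elapsed:.2f} s")
print(f"generation speed:  {usage['completion_tokens'] / elapsed:.1f} tokens/s")
```

Note that the wall time here includes prefill of the long prompt, which matches how the "Total Time" figure below was measured.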

Observed Performance

The reported generation throughput is the average over N consecutive single-request runs.

The observed generation throughput is around 40-45 tokens/second (based on the average run, where 494 output tokens took 11.89 seconds, i.e. 41.5 tokens/s; the calculation is sketched after the figures below).

  • Input Tokens: 8,417
  • Output Tokens: 494
  • Total Time (Average): 11.89 seconds
  • Generation Speed (Average): 41.5 tokens/second
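
For clarity, the per-run figure is simply output tokens divided by total wall time; a minimal sketch of the calculation, using the run listed above as the example (per-run data for the other runs is not reproduced here):

```python
# Per-run throughput = output tokens / total wall time; the average is the
# mean over the N runs. Only the run reported above is shown here.
runs = [(494, 11.89)]  # (output_tokens, total_seconds)

per_run = [tokens / seconds for tokens, seconds in runs]
average = sum(per_run) / len(per_run)
print(f"per-run: {[round(t, 1) for t in per_run]}  average: {average:.1f} tokens/s")
```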

Expected Behavior

Given the large model size (80B) and MoE architecture, some performance penalty is expected on a single GPU. However, for an AWQ-quantized model on an NVIDIA Blackwell architecture card, I would expect the single-stream generation throughput to be substantially higher than ≈ 40 tokens/second.

Could this low throughput be due to:

  1. Specific inefficiencies in vLLM's handling of this particular Qwen MoE + AWQ configuration on a single GPU?
  2. A suboptimal kernel or loading strategy for the MoE layers?

Please note: The model files are confirmed to be valid as they successfully load and run.
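
One thing that can be checked cheaply is which quantization settings vLLM actually sees at load time, since that determines which (MoE) kernels it dispatches to. A minimal sketch, assuming the local model path from the docker command above (the expected values are assumptions based on the AWQ-4bit checkpoint name):

```python
# Sketch: inspect the checkpoint's quantization_config as vLLM will see it.
# The path matches the host directory mounted into the container above.
import json
from pathlib import Path

config_path = Path("./hf_hub/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit/config.json")
config = json.loads(config_path.read_text())

quant = config.get("quantization_config", {})
print("quant_method:", quant.get("quant_method"))  # expected: "awq" or an AWQ variant
print("bits:        ", quant.get("bits"))          # expected: 4
print("group_size:  ", quant.get("group_size"))
```

If the method reported here differs from what the server logs at startup, that might point at the loading/kernel-selection path rather than a raw hardware limit.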

πŸ› Describe the bug

(Same launch command as in "Steps to Reproduce" above.)

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
