
[Bug]: Poor Performance: ~40 t/s for Qwen3-80B-AWQ on Single RTX 6000 #28667

@Wanli-Lee

Description

Your current environment

🚀 Low generation throughput (~40 tokens/s) for Qwen3-Next-80B-A3B-Instruct-AWQ-4bit (MoE) on a single GPU

Describe the Issue

I am observing unexpectedly low generation throughput when running the Qwen3-Next-80B-A3B-Instruct-AWQ-4bit model (an AWQ-quantized Mixture-of-Experts architecture) on a single high-end NVIDIA GPU.

The throughput is significantly lower than anticipated for this class of hardware and this level of quantization.

Environment Details

  • Model: cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit (80B MoE, AWQ-4bit quantized)
  • Hardware (Single GPU used): NVIDIA RTX Pro 6000 Blackwell (96GB VRAM)
  • vLLM Image: vllm/vllm-openai:latest (exact vLLM version/commit not pinned)
  • Tensor Parallelism: TP=1 (Single GPU)

Steps to Reproduce

  1. Launch the vLLM Server using the following Docker command (targeting a single GPU):
docker run -d --name vllm-qwen-80b \
  --gpus '"device=2"' --ipc=host \
  -p 6000:6000 \
  -v ./hf_hub/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit:/models:ro \
  vllm/vllm-openai:latest \
  --model /models \
  --served-model-name Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
  --tensor-parallel-size 1 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90
  2. Submit a long-context request to the server (e.g., input ≈ 8,400 tokens, generating ≈ 500 tokens). A minimal client/timing sketch is given after this list.
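For reference, this is roughly how such a request can be submitted and timed. A minimal sketch, assuming the server is reachable at http://localhost:6000/v1 (per the -p 6000:6000 mapping above) and using a placeholder long_prompt.txt for the ~8,400-token prompt:

```python
# Minimal request/timing sketch (not the exact client used).
# Assumptions: server reachable at localhost:6000; "long_prompt.txt" is a
# placeholder file holding the ~8,400-token prompt.
import time
import requests

URL = "http://localhost:6000/v1/chat/completions"
MODEL = "Qwen3-Next-80B-A3B-Instruct-AWQ-4bit"

with open("long_prompt.txt") as f:  # placeholder prompt file
    prompt = f.read()

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 500,
    "temperature": 0.0,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.perf_counter() - start

usage = resp.json()["usage"]
print(f"prompt tokens:     {usage['prompt_tokens']}")
print(f"completion tokens: {usage['completion_tokens']}")
print(f"total time:        {elapsed:.2f} s")
print(f"generation speed:  {usage['completion_tokens'] / elapsed:.1f} tokens/s")
```

Note that the wall time here includes prefill of the long prompt, which matches how the "Total Time" figure below was measured.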

Observed Performance

The reported generation throughput is the average over N consecutive single-request runs.

The observed generation throughput is around 40-45 tokens/second (based on the average run, where 494 output tokens took 11.89 seconds, i.e. 41.5 tokens/s; the calculation is sketched after the figures below).

  • Input Tokens: 8,417
  • Output Tokens: 494
  • Total Time (Average): 11.89 seconds
  • Generation Speed (Average): 41.5 tokens/second
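
For clarity, the per-run figure is simply output tokens divided by total wall time; a minimal sketch of the calculation, using the run listed above as the example (per-run data for the other runs is not reproduced here):

```python
# Per-run throughput = output tokens / total wall time; the average is the
# mean over the N runs. Only the run reported above is shown here.
runs = [(494, 11.89)]  # (output_tokens, total_seconds)

per_run = [tokens / seconds for tokens, seconds in runs]
average = sum(per_run) / len(per_run)
print(f"per-run: {[round(t, 1) for t in per_run]}  average: {average:.1f} tokens/s")
```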

Expected Behavior

Given the large model size (80B) and MoE architecture, some performance penalty is expected on a single GPU. However, for an AWQ-quantized model on an NVIDIA Blackwell architecture card, I would expect the single-stream generation throughput to be substantially higher than ≈ 40 tokens/second.

Could this low throughput be due to:

  1. Specific inefficiencies in vLLM's handling of this particular Qwen MoE + AWQ configuration on a single GPU?
  2. A suboptimal kernel or loading strategy for the MoE layers?

Please note: The model files are confirmed to be valid as they successfully load and run.
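
One thing that can be checked cheaply is which quantization settings vLLM actually sees at load time, since that determines which (MoE) kernels it dispatches to. A minimal sketch, assuming the local model path from the docker command above (the expected values are assumptions based on the AWQ-4bit checkpoint name):

```python
# Sketch: inspect the checkpoint's quantization_config as vLLM will see it.
# The path matches the host directory mounted into the container above.
import json
from pathlib import Path

config_path = Path("./hf_hub/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit/config.json")
config = json.loads(config_path.read_text())

quant = config.get("quantization_config", {})
print("quant_method:", quant.get("quant_method"))  # expected: "awq" or an AWQ variant
print("bits:        ", quant.get("bits"))          # expected: 4
print("group_size:  ", quant.get("group_size"))
```

If the method reported here differs from what the server logs at startup, that might point at the loading/kernel-selection path rather than a raw hardware limit.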

πŸ› Describe the bug

(Same launch command as in "Steps to Reproduce" above.)

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
