[Feature Request] Quantized Whisper models for QCS6490 Linux devices #281

@Foadsf

Description

Is your feature request related to a problem? Please describe.

The Whisper models (whisper_tiny, whisper_small, whisper_base) cannot be deployed to QCS6490-based IoT devices (Radxa Dragon Q6A, RB3 Gen 2) because:

  1. Non-quantized variants only support --precision float, but QCS6490 NPU requires quantized I/O. Export fails with:

    ❌ FAILED  Tensor 'input_features' has a floating-point type which is not supported by the targeted device.
    Please quantize the model including its I/O and try again.
    
  2. Quantized variant (whisper_small_quantized) requires AIMET-ONNX, which has no aarch64 Linux wheels:

    RuntimeError: AIMET-ONNX is missing but must be installed.
    
  3. Pre-compiled assets via --fetch-static-assets don't recognize QCS6490 as a valid device.

  4. AI Hub website doesn't list QCS6490 as a download target for Whisper models.
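
For reference, the invocations that hit failures (1)-(3) look roughly like the following. They are shown as strings rather than executed, since each one fails as described above; the exact flag combinations are from memory and may need adjustment for qai-hub-models 0.45.0:

```shell
# Approximate repro commands; assumes qai-hub-models 0.45.0 and a configured
# qai-hub API token. Device/chipset names are taken from `qai-hub list-devices`.
EXPORT_FLOAT='python -m qai_hub_models.models.whisper_tiny.export --device "Dragonwing RB3 Gen 2 Vision Kit" --precision float'
EXPORT_QUANT='python -m qai_hub_models.models.whisper_small_quantized.export --device "Dragonwing RB3 Gen 2 Vision Kit"'
FETCH_ASSETS='python -m qai_hub_models.models.whisper_tiny.export --fetch-static-assets --chipset qualcomm-qcs6490'
printf '%s\n' "$EXPORT_FLOAT" "$EXPORT_QUANT" "$FETCH_ASSETS"
```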

Describe the solution you'd like

Any of the following would resolve this:

  1. Add quantization support to whisper_tiny/whisper_small/whisper_base - Allow --precision w8a8 or --quantize w8a8 similar to models like yolov11_det

  2. Provide pre-compiled assets for QCS6490 - Enable --fetch-static-assets to download pre-quantized Whisper models for IoT devices

  3. Add QCS6490 to AI Hub website downloads - Allow direct download of compiled Whisper models for this chipset

Describe alternatives you've considered

  • HuggingFace Transformers (CPU): Works, but Whisper Small takes roughly 2.3x the audio duration to transcribe (i.e., slower than real time), whereas the NPU would likely achieve real-time performance

    # CPU fallback: functional, but well short of real time for Whisper Small
    from transformers import pipeline

    pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny", device="cpu")
  • ONNX Runtime with downloaded models: Downloaded ONNX files contain com.microsoft.EPContext nodes requiring QNN Execution Provider, which isn't available for aarch64 Linux via pip

  • Building AIMET-ONNX from source: Complex dependency chain, not practical for end users
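
The "2.3x" figure above is a real-time factor (RTF). A quick sketch of how it is computed; the timings below are illustrative placeholders, not measurements from the board:

```python
# Real-time factor (RTF): processing time divided by audio duration.
# RTF > 1.0 means slower than real time. Numbers are illustrative only.
audio_seconds = 30.0       # length of the input clip
processing_seconds = 69.0  # hypothetical CPU transcription time for Whisper Small

rtf = processing_seconds / audio_seconds
print(f"RTF = {rtf:.1f}")  # RTF = 2.3
```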
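
Since none of these alternatives gets the NPU path working, the assistant ends up selecting a backend at runtime. A minimal sketch of that selection logic (the backend names here are hypothetical; `QNNExecutionProvider` is ONNX Runtime's actual QNN provider identifier):

```python
def choose_asr_backend(available_providers):
    """Pick a Whisper backend from ONNX Runtime's available execution providers.

    On aarch64 Linux, pip-installed onnxruntime lacks QNNExecutionProvider,
    so this currently always falls back to the CPU pipeline.
    """
    if "QNNExecutionProvider" in available_providers:
        return "onnx-qnn-npu"    # would run the NPU-compiled model
    return "transformers-cpu"    # HuggingFace CPU pipeline fallback

# In practice available_providers would come from:
#   import onnxruntime; onnxruntime.get_available_providers()
print(choose_asr_backend(["CPUExecutionProvider"]))  # transformers-cpu
```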

Additional context

  • Environment: Radxa Dragon Q6A (QCS6490), Ubuntu 24.04, qai-hub-models 0.45.0, Python 3.12
  • Device is recognized by qai-hub:
    | Dragonwing RB3 Gen 2 Vision Kit | Qc_Linux 1.6 | qualcomm-qcs6490 |
    
  • NPU works for other models: Llama 3.2 runs successfully on the NPU via pre-quantized context binaries (genie-t2t-run)
  • Use case: Voice-interactive AI assistant combining Whisper (speech) + Llama (reasoning) on edge devices

The QCS6490 is used in several IoT products and speech recognition is a common use case. NPU-accelerated Whisper would significantly benefit this ecosystem.

Metadata

    Labels

    question (Please ask any questions on Slack. This issue will be closed once responded to.)
    stale
