Description
Is your feature request related to a problem? Please describe.
The Whisper models (`whisper_tiny`, `whisper_small`, `whisper_base`) cannot be deployed to QCS6490-based IoT devices (Radxa Dragon Q6A, RB3 Gen 2) because:

- Non-quantized variants only support `--precision float`, but the QCS6490 NPU requires quantized I/O. Export fails with:

  ```
  ❌ FAILED Tensor 'input_features' has a floating-point type which is not supported by the targeted device. Please quantize the model including its I/O and try again.
  ```

- The quantized variant (`whisper_small_quantized`) requires AIMET-ONNX, which has no aarch64 Linux wheels:

  ```
  RuntimeError: AIMET-ONNX is missing but must be installed.
  ```

- Pre-compiled assets via `--fetch-static-assets` don't recognize QCS6490 as a valid device.
- The AI Hub website doesn't list QCS6490 as a download target for Whisper models.
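For context on the first failure, "quantize the model including its I/O" means the NPU expects inputs and outputs as 8-bit integers with an affine mapping back to float. A minimal sketch of that mapping (the uint8 scheme and the `scale`/`zero_point` values here are illustrative assumptions, not AIMET's actual configuration):

```python
def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float value to uint8 using an affine (asymmetric) scheme."""
    q = round(x / scale) + zero_point
    return max(0, min(255, q))  # clamp to the uint8 range


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover an approximate float value from its uint8 code."""
    return (q - zero_point) * scale


# Round-trip a feature value with assumed parameters
scale, zero_point = 0.1, 128
q = quantize(1.0, scale, zero_point)      # encoded as an integer code
x = dequantize(q, scale, zero_point)      # approximately 1.0 again
```

A float-I/O export skips this step entirely, which is why the `input_features` tensor is rejected by the targeted device.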
Describe the solution you'd like
Any of the following would resolve this:

- **Add quantization support to `whisper_tiny`/`whisper_small`/`whisper_base`**: allow `--precision w8a8` or `--quantize w8a8`, similar to models like `yolov11_det`.
- **Provide pre-compiled assets for QCS6490**: enable `--fetch-static-assets` to download pre-quantized Whisper models for IoT devices.
- **Add QCS6490 to AI Hub website downloads**: allow direct download of compiled Whisper models for this chipset.
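If the first option were implemented, the export would presumably look like this (the command shape mirrors the existing `qai_hub_models` exporter pattern, but the `--precision w8a8` flag for Whisper is the requested behavior, not something that exists today):

```shell
# Hypothetical: w8a8 quantized export of Whisper Small for the QCS6490 NPU
python -m qai_hub_models.models.whisper_small.export \
    --device "QCS6490 (Proxy)" \
    --precision w8a8
```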
Describe alternatives you've considered
- **HuggingFace Transformers (CPU)**: works, but runs at only ~2.3x realtime on Whisper Small, whereas the NPU would likely achieve real-time performance:

  ```python
  from transformers import pipeline

  pipe = pipeline(
      "automatic-speech-recognition",
      model="openai/whisper-tiny",
      device="cpu",
  )
  ```

- **ONNX Runtime with downloaded models**: the downloaded ONNX files contain `com.microsoft.EPContext` nodes requiring the QNN Execution Provider, which isn't available for aarch64 Linux via pip.
- **Building AIMET-ONNX from source**: complex dependency chain, not practical for end users.
Additional context
- Environment: Radxa Dragon Q6A (QCS6490), Ubuntu 24.04, qai-hub-models 0.45.0, Python 3.12
- Device is recognized by qai-hub:

  ```
  | Dragonwing RB3 Gen 2 Vision Kit | Qc_Linux 1.6 | qualcomm-qcs6490 |
  ```

- NPU works for other models: Llama 3.2 runs successfully on the NPU via pre-quantized context binaries (`genie-t2t-run`)
- Use case: Voice-interactive AI assistant combining Whisper (speech) + Llama (reasoning) on edge devices
The QCS6490 is used in several IoT products, and speech recognition is a common use case on them. NPU-accelerated Whisper would significantly benefit this ecosystem.