[Feature Request] Quantized Whisper models for QCS6490 Linux devices #281

@Foadsf

Description

Is your feature request related to a problem? Please describe.

The Whisper models (whisper_tiny, whisper_small, whisper_base) cannot be deployed to QCS6490-based IoT devices (Radxa Dragon Q6A, RB3 Gen 2) because:

  1. Non-quantized variants only support --precision float, but QCS6490 NPU requires quantized I/O. Export fails with:

    ❌ FAILED  Tensor 'input_features' has a floating-point type which is not supported by the targeted device.
    Please quantize the model including its I/O and try again.
    
  2. Quantized variant (whisper_small_quantized) requires AIMET-ONNX, which has no aarch64 Linux wheels:

    RuntimeError: AIMET-ONNX is missing but must be installed.
    
  3. Pre-compiled assets via --fetch-static-assets don't recognize QCS6490 as a valid device.

  4. AI Hub website doesn't list QCS6490 as a download target for Whisper models.
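
For reference, the invocations that hit failures (1)-(3) look roughly like the following. They are shown as strings rather than executed, since each one fails as described above; the exact flag combinations are from memory and may need adjustment for qai-hub-models 0.45.0:

```shell
# Approximate repro commands; assumes qai-hub-models 0.45.0 and a configured
# qai-hub API token. Device/chipset names are taken from `qai-hub list-devices`.
EXPORT_FLOAT='python -m qai_hub_models.models.whisper_tiny.export --device "Dragonwing RB3 Gen 2 Vision Kit" --precision float'
EXPORT_QUANT='python -m qai_hub_models.models.whisper_small_quantized.export --device "Dragonwing RB3 Gen 2 Vision Kit"'
FETCH_ASSETS='python -m qai_hub_models.models.whisper_tiny.export --fetch-static-assets --chipset qualcomm-qcs6490'
printf '%s\n' "$EXPORT_FLOAT" "$EXPORT_QUANT" "$FETCH_ASSETS"
```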

Describe the solution you'd like

Any of the following would resolve this:

  1. Add quantization support to whisper_tiny/whisper_small/whisper_base - Allow --precision w8a8 or --quantize w8a8 similar to models like yolov11_det

  2. Provide pre-compiled assets for QCS6490 - Enable --fetch-static-assets to download pre-quantized Whisper models for IoT devices

  3. Add QCS6490 to AI Hub website downloads - Allow direct download of compiled Whisper models for this chipset

Describe alternatives you've considered

  • HuggingFace Transformers (CPU): Works, but Whisper Small takes roughly 2.3x the audio duration to transcribe (i.e., slower than real time), whereas the NPU would likely achieve real-time performance

    # CPU fallback: functional, but well short of real time for Whisper Small
    from transformers import pipeline

    pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny", device="cpu")
  • ONNX Runtime with downloaded models: Downloaded ONNX files contain com.microsoft.EPContext nodes requiring QNN Execution Provider, which isn't available for aarch64 Linux via pip

  • Building AIMET-ONNX from source: Complex dependency chain, not practical for end users
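
The "2.3x" figure above is a real-time factor (RTF). A quick sketch of how it is computed; the timings below are illustrative placeholders, not measurements from the board:

```python
# Real-time factor (RTF): processing time divided by audio duration.
# RTF > 1.0 means slower than real time. Numbers are illustrative only.
audio_seconds = 30.0       # length of the input clip
processing_seconds = 69.0  # hypothetical CPU transcription time for Whisper Small

rtf = processing_seconds / audio_seconds
print(f"RTF = {rtf:.1f}")  # RTF = 2.3
```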
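
Since none of these alternatives gets the NPU path working, the assistant ends up selecting a backend at runtime. A minimal sketch of that selection logic (the backend names here are hypothetical; `QNNExecutionProvider` is ONNX Runtime's actual QNN provider identifier):

```python
def choose_asr_backend(available_providers):
    """Pick a Whisper backend from ONNX Runtime's available execution providers.

    On aarch64 Linux, pip-installed onnxruntime lacks QNNExecutionProvider,
    so this currently always falls back to the CPU pipeline.
    """
    if "QNNExecutionProvider" in available_providers:
        return "onnx-qnn-npu"    # would run the NPU-compiled model
    return "transformers-cpu"    # HuggingFace CPU pipeline fallback

# In practice available_providers would come from:
#   import onnxruntime; onnxruntime.get_available_providers()
print(choose_asr_backend(["CPUExecutionProvider"]))  # transformers-cpu
```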

Additional context

  • Environment: Radxa Dragon Q6A (QCS6490), Ubuntu 24.04, qai-hub-models 0.45.0, Python 3.12
  • Device is recognized by qai-hub:
    | Dragonwing RB3 Gen 2 Vision Kit | Qc_Linux 1.6 | qualcomm-qcs6490 |
    
  • NPU works for other models: Llama 3.2 runs successfully on the NPU via pre-quantized context binaries (genie-t2t-run)
  • Use case: Voice-interactive AI assistant combining Whisper (speech) + Llama (reasoning) on edge devices

The QCS6490 is used in several IoT products and speech recognition is a common use case. NPU-accelerated Whisper would significantly benefit this ecosystem.

Metadata

    Labels

    question (Please ask any questions on Slack. This issue will be closed once responded to.)
    stale
