Pure Rust implementation of Qwen3-ASR automatic speech recognition. The project builds a cross-platform CLI tool and an API server suitable for use as agentic skills by AI agents and bots.
- asr generates text from an input audio file (supports most codecs and file formats)
- asr-server runs an OpenAI-compatible HTTP API server for audio transcription
Supports two backends: libtorch (via the tch crate, cross-platform with optional CUDA) and MLX (Apple Silicon native via Metal GPU). Loads model weights directly from safetensors files and re-implements the complete neural network forward pass in Rust.
Learn more:
- A Rust implementation / CLI for Qwen3's TTS (Text-to-Speech or speech synthesis) models
- An OpenAI compatible API server for audio / speech
- An OpenClaw SKILL for voice recognition; copy and paste it into your lobster to install it
The install script automatically detects your platform (macOS/Linux, CPU/CUDA GPU), downloads the correct release binary, model weights, and a sample audio file:
curl -sSf https://raw.githubusercontent.com/second-state/qwen3_asr_rs/main/install.sh | bash
The installer will prompt you to choose a model size (0.6B recommended) and, on Linux with an NVIDIA GPU, whether to use CUDA or CPU.
Once complete, run your first transcription:
cd qwen3_asr_rs
./asr ./Qwen3-ASR-0.6B sample.wav
Output:
Language: English
Text: Thank you for your contribution to the most recent issue of Computer.
The implementation ports the Qwen3-ASR encoder-decoder architecture from PyTorch/Transformers to Rust with libtorch (via the tch crate):
- Audio Encoder (Whisper-style): 3x Conv2d downsampling → sinusoidal positional embeddings → 18 transformer encoder layers → output projection (896 → 1024)
- Text Decoder (Qwen3): 28 transformer decoder layers with Grouped Query Attention (16 Q heads / 8 KV heads), QK-normalization, MRoPE (Multimodal Rotary Position Embeddings), and SwiGLU MLP
- Audio preprocessing: FFmpeg decodes any audio format → resampled to mono 16kHz f32 → 128-bin log-mel spectrogram (Whisper-style)
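The decoder's grouped-query attention (16 Q heads sharing 8 KV heads) can be illustrated with a minimal numpy sketch. This is an illustration only, not the tch/MLX code path; the real decoder additionally applies causal masking, QK-normalization, and MRoPE, which are omitted here.

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention: q has more heads than k/v;
    each KV head is shared by a group of query heads."""
    T, n_q, d = q.shape
    n_kv = k.shape[1]
    group = n_q // n_kv              # 16 Q heads / 8 KV heads -> groups of 2
    out = np.empty_like(q)
    for h in range(n_q):
        kv = h // group              # the shared KV head for this Q head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d)
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = scores / scores.sum(axis=-1, keepdims=True)  # softmax
        out[:, h] = weights @ v[:, kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16, 64))   # (seq, q_heads, head_dim)
k = rng.standard_normal((4, 8, 64))    # (seq, kv_heads, head_dim)
v = rng.standard_normal((4, 8, 64))
print(gqa(q, k, v).shape)              # (4, 16, 64)
```

Sharing KV heads halves the KV cache size relative to full multi-head attention, which matters for the 28-layer decoder's memory footprint.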
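The log-mel step of the preprocessing pipeline can likewise be approximated in a few lines of numpy. This is a hedged sketch assuming Whisper's usual front-end parameters (n_fft=400, hop=160 at 16 kHz) and an HTK-style mel filterbank, not a byte-for-byte replica of mel.rs:

```python
import numpy as np

def log_mel(audio, sr=16000, n_fft=400, hop=160, n_mels=128):
    """128-bin log-mel spectrogram of mono 16 kHz f32 audio (sketch)."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # HTK mel scale filterbank: triangular filters on a mel-spaced grid
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        if c > l:
            fb[m - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[m - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return np.log10(np.maximum(fb @ power.T, 1e-10))

t = np.arange(16000) / 16000                 # 1 s of audio
mel = log_mel(np.sin(2 * np.pi * 440 * t))   # 440 Hz tone
print(mel.shape)                             # (128, 98): 128 bins, 98 frames
```

The resulting (n_mels, frames) matrix is what the Conv2d front end of the audio encoder consumes.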
| Model | Parameters | HuggingFace |
|---|---|---|
| Qwen3-ASR-0.6B | 0.6B | Qwen/Qwen3-ASR-0.6B |
| Qwen3-ASR-1.7B | 1.7B | Qwen/Qwen3-ASR-1.7B |
# Basic transcription (auto-detect language)
asr ./Qwen3-ASR-0.6B input.wav
# Force language
asr ./Qwen3-ASR-0.6B input.wav chinese
asr ./Qwen3-ASR-0.6B input.wav english
# Enable debug logging
RUST_LOG=debug asr ./Qwen3-ASR-0.6B input.wav
Language: Chinese
Text: 你好世界
The asr-server binary provides an OpenAI-compatible HTTP API for audio transcription.
asr-server --model-dir ./Qwen3-ASR-0.6B
Options:
--model-dir <PATH> Path to the Qwen3-ASR model directory (required)
--host <ADDR> Host address to bind to (default: 0.0.0.0)
--port <PORT> Port to listen on (default: 8080)
--language <LANG> Default language for transcription (e.g., chinese, english)
-v, -vv Verbose output (debug, trace)
OpenAI-compatible transcription endpoint. Accepts multipart form data.
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Audio file (any format supported by FFmpeg) |
| language | string | No | Language hint (e.g., english, chinese) |
| response_format | string | No | json (default), text, or verbose_json |
| model | string | No | Accepted for compatibility, ignored |
| temperature | float | No | Accepted for compatibility, ignored |
| prompt | string | No | Accepted for compatibility, ignored |
Examples:
# JSON response (default)
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F file=@recording.wav
# {"text":"Thank you for your contribution..."}
# Plain text response
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F file=@recording.wav \
-F response_format=text
# Thank you for your contribution...
# Verbose JSON with language and duration
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F file=@recording.wav \
-F response_format=verbose_json
# {"task":"transcribe","language":"English","duration":7.999,"text":"Thank you..."}
# Force language
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F file=@recording.wav \
  -F language=chinese
Lists available models.
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"qwen3-asr","object":"model","owned_by":"qwen"}]}Health check endpoint.
curl http://localhost:8080/health
# {"status":"ok"}Qwen3-ASR supports 30 languages: Chinese, English, Cantonese, Arabic, German, French, Spanish, Portuguese, Indonesian, Italian, Korean, Russian, Thai, Vietnamese, Japanese, Turkish, Hindi, Malay, Dutch, Swedish, Danish, Finnish, Polish, Czech, Filipino, Persian, Greek, Romanian, Hungarian, Macedonian.
Download model weights and generate the tokenizer:
pip install huggingface_hub transformers
huggingface-cli download Qwen/Qwen3-ASR-0.6B --local-dir Qwen3-ASR-0.6B
python -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('Qwen3-ASR-0.6B', trust_remote_code=True)
tok.backend_tokenizer.save('Qwen3-ASR-0.6B/tokenizer.json')
"
git submodule update --init --recursive
cargo build --release --no-default-features --features mlx
Download and extract libtorch for your platform from libtorch-releases:
# Linux x86_64 (CPU)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-x86_64-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-x86_64-2.7.1.tar.gz
# Linux x86_64 (CUDA 12.6)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-x86_64-cuda12.6-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-x86_64-cuda12.6-2.7.1.tar.gz
# Linux ARM64 (CPU)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-aarch64-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-aarch64-2.7.1.tar.gz
# Linux ARM64 (CUDA 12.6 / Jetson)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-aarch64-cuda12.6-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-aarch64-cuda12.6-2.7.1.tar.gz
Set environment variables:
export LIBTORCH=$(pwd)/libtorch
export LIBTORCH_BYPASS_VERSION_CHECK=1
Install dependencies and build:
cargo build --release
src/
├── main.rs # CLI binary entry point
├── bin/
│ └── server.rs # API server binary entry point
├── lib.rs # Library module declarations
├── tensor.rs # Unified Tensor abstraction (tch/MLX backend)
├── config.rs # Model configuration (from config.json)
├── error.rs # Error types
├── audio.rs # FFmpeg-based audio loading and format conversion
├── mel.rs # Whisper-style mel spectrogram feature extraction
├── weights.rs # Safetensors weight loading (bf16 → f32 conversion)
├── layers.rs # Neural network building blocks (LayerNorm, RMSNorm,
│ # attention, MLP, MRoPE, etc.)
├── audio_encoder.rs # Whisper-style audio encoder (Conv2d + Transformer)
├── text_decoder.rs # Qwen3 text decoder with KV cache
├── tokenizer.rs # HuggingFace tokenizer wrapper
├── inference.rs # End-to-end ASR inference pipeline
└── backend/
└── mlx/ # Apple MLX backend (Metal GPU)
├── ffi.rs # Raw C FFI bindings to mlx-c
├── array.rs # Safe RAII MlxArray wrapper
├── ops.rs # Safe operation wrappers
├── io.rs # Safetensors loading via mlx-c
├── signal.rs # STFT, mel spectrogram signal processing
└── stream.rs # Device/stream management
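The bf16 → f32 conversion noted for weights.rs is conceptually simple: a bfloat16 value is exactly the top 16 bits of the corresponding float32, so widening amounts to appending 16 zero bits. A numpy sketch (the Rust code operates on the raw safetensors buffer, but the bit manipulation is the same):

```python
import numpy as np

def bf16_to_f32(raw: np.ndarray) -> np.ndarray:
    """Widen bf16 values (stored as uint16 bit patterns) to float32.
    bf16 keeps float32's sign + exponent + top 7 mantissa bits,
    so shifting left by 16 bits recovers an exact float32."""
    return (raw.astype(np.uint32) << 16).view(np.float32)

# Round trip: truncate f32 -> bf16 bit pattern -> widen back.
x = np.array([3.140625], dtype=np.float32)      # exactly representable in bf16
bits = (x.view(np.uint32) >> 16).astype(np.uint16)
print(bf16_to_f32(bits)[0])                     # 3.140625
```

Because bf16 shares float32's exponent range, the widening is lossless; only values needing more than 8 mantissa bits are affected by the original truncation to bf16.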
Benchmarked on Apple M4 Mac Mini (16GB RAM), MLX Metal GPU backend. All times are warm runs (post-shader compilation), best-of-3.
| Test File | Audio Duration | Tokens | CLI | API Server |
|---|---|---|---|---|
| sample1.wav (English) | 8.0s | 31 | 2.35s | 2.10s |
| speech_en.wav (English) | 3.5s | 15 | 1.30s | 1.05s |
| sample2.wav (English) | 2.8s | 13 | 1.17s | 0.95s |
| sample3.wav (Chinese) | 5.6s | 15 | 1.31s | 1.07s |
| Test File | Audio Duration | Tokens | CLI | API Server |
|---|---|---|---|---|
| sample1.wav (English) | 8.0s | 31 | 6.26s | 5.80s |
| speech_en.wav (English) | 3.5s | 15 | 3.40s | 3.06s |
| sample2.wav (English) | 2.8s | 13 | 2.82s | 2.59s |
| sample3.wav (Chinese) | 5.6s | 15 | 3.31s | 2.94s |
The API server is faster per request because the model stays loaded in memory, avoiding the process startup and model loading overhead of the CLI.
Apache-2.0