| **Voxtral** | Mistral's speech model | Multiple | [mlx-community/Voxtral-Mini-3B-2507-bf16](https://huggingface.co/mlx-community/Voxtral-Mini-3B-2507-bf16) |
| **Voxtral Realtime** | Mistral's 4B streaming STT | Multiple | [4bit](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit), [fp16](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16) |
| **VibeVoice-ASR** | Microsoft's 9B ASR with diarization & timestamps | Multiple | [mlx-community/VibeVoice-ASR-bf16](https://huggingface.co/mlx-community/VibeVoice-ASR-bf16) |
| **Moonshine** | Useful Sensors' lightweight ASR | EN | [README](mlx_audio/stt/models/moonshine/README.md) |

### Voice Activity Detection / Speaker Diarization (VAD)
# Moonshine

MLX implementation of Useful Sensors' Moonshine, a lightweight ASR model that processes raw audio through a learned convolutional frontend rather than mel spectrograms.

## Available Models

| Model | Parameters | Description |
|-------|------------|-------------|
| [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) | 27M | Smallest variant |
| [UsefulSensors/moonshine-base](https://huggingface.co/UsefulSensors/moonshine-base) | 61M | Larger and more accurate |

## Python Usage

```python
from mlx_audio.stt import load

model = load("UsefulSensors/moonshine-tiny")

result = model.generate("audio.wav")
print(result.text)
```

## Architecture

- 3-layer convolutional frontend (strides 64, 3, 2) with GroupNorm
- Transformer encoder with RoPE (6 layers tiny, 8 layers base)
- Transformer decoder with cross-attention and SwiGLU (6 layers tiny, 8 layers base)
- Byte-level BPE tokenizer (32k vocab)
- 16 kHz raw audio input (no mel spectrogram)
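The conv strides above imply an overall downsampling factor of 64 × 3 × 2 = 384, so one second of 16 kHz audio yields roughly 41.7 encoder frames. A minimal plain-Python sketch of that relationship (stride-only approximation; real conv layers also shift the count slightly depending on kernel sizes and padding, which are not listed here):

```python
# Approximate encoder frame count from the conv frontend strides.
# Strides (64, 3, 2) come from the architecture list above; ignoring
# kernel-size/padding edge effects, each conv divides the sequence
# length by its stride, for a combined factor of 64 * 3 * 2 = 384.

SAMPLE_RATE = 16_000
STRIDES = (64, 3, 2)

def encoder_frames(num_samples: int) -> int:
    """Rough number of encoder frames for a raw-audio input."""
    length = num_samples
    for stride in STRIDES:
        length //= stride  # stride-only approximation
    return length

# 10 seconds of 16 kHz audio -> about 416 encoder frames (~41.7 fps).
print(encoder_frames(10 * SAMPLE_RATE))
```

This frame rate is why the model skips mel spectrograms: the strided convs already reduce raw samples to a transformer-friendly sequence length.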