1 change: 1 addition & 0 deletions README.md
@@ -101,6 +101,7 @@ for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
| **Voxtral Realtime** | Mistral's 4B streaming STT | Multiple | [4bit](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit), [fp16](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16) |
| **VibeVoice-ASR** | Microsoft's 9B ASR with diarization & timestamps | Multiple | [mlx-community/VibeVoice-ASR-bf16](https://huggingface.co/mlx-community/VibeVoice-ASR-bf16) |
| **Canary** | NVIDIA's multilingual ASR with translation | 25 EU + RU, UK | [README](mlx_audio/stt/models/canary/README.md) |
| **Moonshine** | Useful Sensors' lightweight ASR | EN | [README](mlx_audio/stt/models/moonshine/README.md) |


### Voice Activity Detection / Speaker Diarization (VAD)
29 changes: 29 additions & 0 deletions mlx_audio/stt/models/moonshine/README.md
@@ -0,0 +1,29 @@
# Moonshine

MLX implementation of Useful Sensors' Moonshine, a lightweight ASR model that processes raw 16 kHz audio through a learned convolutional frontend rather than mel spectrograms.

## Available Models

| Model | Parameters | Description |
|-------|------------|-------------|
| [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) | 27M | Smallest variant |
| [UsefulSensors/moonshine-base](https://huggingface.co/UsefulSensors/moonshine-base) | 61M | Larger, more accurate |

## Python Usage

```python
from mlx_audio.stt import load

model = load("UsefulSensors/moonshine-tiny")

result = model.generate("audio.wav")
print(result.text)
```

## Architecture

- 3-layer convolutional frontend (strides 64, 3, 2) with GroupNorm
- Transformer encoder with RoPE (6 layers tiny, 8 layers base)
- Transformer decoder with cross-attention and SwiGLU (6 layers tiny, 8 layers base)
- Byte-level BPE tokenizer (32k vocab)
- 16 kHz raw audio input (no mel spectrogram)
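The three frontend strides multiply to an overall downsampling factor of 64 × 3 × 2 = 384, so one second of 16 kHz audio yields on the order of 41 encoder frames. A rough sketch of the frame-rate arithmetic (plain Python; strides only, ignoring kernel sizes and padding, so real frame counts will differ slightly):

```python
# Approximate Moonshine encoder frame count from the conv-frontend strides.
# Kernel sizes and padding are ignored, so this is an estimate only.

STRIDES = (64, 3, 2)     # per-layer strides from the architecture notes above
SAMPLE_RATE = 16_000     # Hz, raw-audio input


def approx_encoder_frames(num_samples: int) -> int:
    """Floor-divide the sample count by each conv stride in turn."""
    n = num_samples
    for stride in STRIDES:
        n //= stride
    return n


if __name__ == "__main__":
    print(approx_encoder_frames(SAMPLE_RATE))  # one second of audio -> 41 frames
```

The aggressive first stride of 64 is what lets the model skip mel-spectrogram extraction: most of the temporal compression a spectrogram would provide happens in that first learned conv layer instead.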
1 change: 1 addition & 0 deletions mlx_audio/stt/models/moonshine/__init__.py
@@ -0,0 +1 @@
from .moonshine import Model, ModelConfig
45 changes: 45 additions & 0 deletions mlx_audio/stt/models/moonshine/config.py
@@ -0,0 +1,45 @@
import inspect
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ModelConfig:
model_type: str = "moonshine"
vocab_size: int = 32768
hidden_size: int = 288
intermediate_size: int = 1152
encoder_num_hidden_layers: int = 6
decoder_num_hidden_layers: int = 6
encoder_num_attention_heads: int = 8
decoder_num_attention_heads: int = 8
encoder_num_key_value_heads: Optional[int] = None
decoder_num_key_value_heads: Optional[int] = None
encoder_hidden_act: str = "gelu"
decoder_hidden_act: str = "silu"
max_position_embeddings: int = 512
attention_bias: bool = False
attention_dropout: float = 0.0
partial_rotary_factor: float = 0.9
rope_theta: float = 10000.0
bos_token_id: int = 1
eos_token_id: int = 2
decoder_start_token_id: int = 1
tie_word_embeddings: bool = True
pad_head_dim_to_multiple_of: Optional[int] = None

def __post_init__(self):
if self.encoder_num_key_value_heads is None:
self.encoder_num_key_value_heads = self.encoder_num_attention_heads
if self.decoder_num_key_value_heads is None:
self.decoder_num_key_value_heads = self.decoder_num_attention_heads

@classmethod
def from_dict(cls, params: Dict[str, Any]) -> "ModelConfig":
return cls(
**{
k: v
for k, v in params.items()
if k in inspect.signature(cls).parameters
}
)
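`from_dict` keeps only the keys that match declared `ModelConfig` fields and silently drops the rest, so extra entries in a checkpoint's `config.json` never raise a `TypeError`. A minimal standalone sketch of the same pattern (reduced field set; the input dict here is hypothetical, not from a real checkpoint):

```python
import inspect
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class MiniConfig:
    hidden_size: int = 288
    encoder_num_attention_heads: int = 8
    encoder_num_key_value_heads: Optional[int] = None

    def __post_init__(self):
        # Default KV heads to the attention-head count, as ModelConfig does.
        if self.encoder_num_key_value_heads is None:
            self.encoder_num_key_value_heads = self.encoder_num_attention_heads

    @classmethod
    def from_dict(cls, params: Dict[str, Any]) -> "MiniConfig":
        # Keep only keys matching declared fields; unknown keys are ignored.
        valid = inspect.signature(cls).parameters
        return cls(**{k: v for k, v in params.items() if k in valid})


cfg = MiniConfig.from_dict({"hidden_size": 416, "unknown_key": True})
print(cfg.hidden_size, cfg.encoder_num_key_value_heads)  # 416 8
```

Filtering through `inspect.signature(cls).parameters` rather than passing the dict straight to the constructor is what makes loading robust across config revisions that add fields this implementation does not model.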