Commit 5e895fc (parent f847627)

add readme and docs

2 files changed: +30 −0 lines

README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -100,6 +100,7 @@ for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
 | **Voxtral** | Mistral's speech model | Multiple | [mlx-community/Voxtral-Mini-3B-2507-bf16](https://huggingface.co/mlx-community/Voxtral-Mini-3B-2507-bf16) |
 | **Voxtral Realtime** | Mistral's 4B streaming STT | Multiple | [4bit](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit), [fp16](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16) |
 | **VibeVoice-ASR** | Microsoft's 9B ASR with diarization & timestamps | Multiple | [mlx-community/VibeVoice-ASR-bf16](https://huggingface.co/mlx-community/VibeVoice-ASR-bf16) |
+| **Moonshine** | Useful Sensors' lightweight ASR | EN | [README](mlx_audio/stt/models/moonshine/README.md) |
 
 
 ### Voice Activity Detection / Speaker Diarization (VAD)
```
mlx_audio/stt/models/moonshine/README.md

Lines changed: 29 additions & 0 deletions (new file)
# Moonshine

MLX implementation of Useful Sensors' Moonshine, a lightweight ASR model that processes raw audio through a learned convolutional frontend rather than mel spectrograms.

## Available Models

| Model | Parameters | Description |
|-------|------------|-------------|
| [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) | 27M | Smallest variant |
| [UsefulSensors/moonshine-base](https://huggingface.co/UsefulSensors/moonshine-base) | 61M | Larger, more accurate |

## Python Usage

```python
from mlx_audio.stt import load

# Load the tiny variant from the Hugging Face Hub
model = load("UsefulSensors/moonshine-tiny")

# Transcribe an audio file and print the text
result = model.generate("audio.wav")
print(result.text)
```

## Architecture

- 3-layer convolutional frontend (strides 64, 3, 2) with GroupNorm
- Transformer encoder with RoPE (6 layers tiny, 8 layers base)
- Transformer decoder with cross-attention and SwiGLU (6 layers tiny, 8 layers base)
- Byte-level BPE tokenizer (32k vocab)
- 16 kHz raw audio input (no mel spectrogram)
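The strides above imply an overall downsampling factor of 64 × 3 × 2 = 384, i.e. one encoder frame per 384 input samples (24 ms at 16 kHz). A minimal sketch of that arithmetic, using only the strides and sample rate stated in this README (exact kernel sizes and padding are not listed here, so output length is approximated as `samples // total_stride`):

```python
# Stride-based downsampling of the Moonshine conv frontend.
# Only the strides (64, 3, 2) and the 16 kHz sample rate come from
# this README; kernel sizes/padding are omitted, so the frame count
# below is an approximation.
SAMPLE_RATE = 16_000
STRIDES = (64, 3, 2)

total_stride = 1
for s in STRIDES:
    total_stride *= s  # 64 * 3 * 2 = 384 samples per encoder frame

frame_rate_hz = SAMPLE_RATE / total_stride    # ~41.7 encoder frames/sec
frame_ms = 1000 * total_stride / SAMPLE_RATE  # 24.0 ms per frame

ten_seconds = 10 * SAMPLE_RATE
approx_frames = ten_seconds // total_stride   # ~416 frames for 10 s of audio

print(total_stride, round(frame_rate_hz, 1), frame_ms, approx_frames)
```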
