1 change: 1 addition & 0 deletions README.md
@@ -101,6 +101,7 @@ for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
| **Voxtral Realtime** | Mistral's 4B streaming STT | Multiple | [4bit](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit), [fp16](https://huggingface.co/mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16) |
| **VibeVoice-ASR** | Microsoft's 9B ASR with diarization & timestamps | Multiple | [mlx-community/VibeVoice-ASR-bf16](https://huggingface.co/mlx-community/VibeVoice-ASR-bf16) |
| **Canary** | NVIDIA's multilingual ASR with translation | 25 EU + RU, UK | [README](mlx_audio/stt/models/canary/README.md) |
| **Moonshine** | Useful Sensors' lightweight ASR | EN | [README](mlx_audio/stt/models/moonshine/README.md) |


### Voice Activity Detection / Speaker Diarization (VAD)
29 changes: 29 additions & 0 deletions mlx_audio/stt/models/moonshine/README.md
@@ -0,0 +1,29 @@
# Moonshine

MLX implementation of Useful Sensors' Moonshine, a lightweight ASR model that processes raw 16 kHz audio through a learned convolutional frontend rather than mel spectrograms.

## Available Models

| Model | Parameters | Description |
|-------|------------|-------------|
| [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) | 27M | Smallest variant |
| [UsefulSensors/moonshine-base](https://huggingface.co/UsefulSensors/moonshine-base) | 61M | Larger, more accurate |

## Python Usage

```python
from mlx_audio.stt import load

model = load("UsefulSensors/moonshine-tiny")

result = model.generate("audio.wav")
print(result.text)
```

## Architecture

- 3-layer convolutional frontend (strides 64, 3, 2) with GroupNorm
- Transformer encoder with RoPE (6 layers tiny, 8 layers base)
- Transformer decoder with cross-attention and SwiGLU (6 layers tiny, 8 layers base)
- Byte-level BPE tokenizer (32k vocab)
- 16 kHz raw audio input (no mel spectrogram)
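The three frontend strides multiply to an overall downsampling factor of 64 × 3 × 2 = 384, so one second of 16 kHz audio yields on the order of 41 encoder frames. A rough sketch of the frame-rate arithmetic (plain Python; strides only, ignoring kernel sizes and padding, so real frame counts will differ slightly):

```python
# Approximate Moonshine encoder frame count from the conv-frontend strides.
# Kernel sizes and padding are ignored, so this is an estimate only.

STRIDES = (64, 3, 2)     # per-layer strides from the architecture notes above
SAMPLE_RATE = 16_000     # Hz, raw-audio input


def approx_encoder_frames(num_samples: int) -> int:
    """Floor-divide the sample count by each conv stride in turn."""
    n = num_samples
    for stride in STRIDES:
        n //= stride
    return n


if __name__ == "__main__":
    print(approx_encoder_frames(SAMPLE_RATE))  # one second of audio -> 41 frames
```

The aggressive first stride of 64 is what lets the model skip mel-spectrogram extraction: most of the temporal compression a spectrogram would provide happens in that first learned conv layer instead.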
1 change: 1 addition & 0 deletions mlx_audio/stt/models/moonshine/__init__.py
@@ -0,0 +1 @@
from .moonshine import Model, ModelConfig
45 changes: 45 additions & 0 deletions mlx_audio/stt/models/moonshine/config.py
@@ -0,0 +1,45 @@
import inspect
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ModelConfig:
model_type: str = "moonshine"
vocab_size: int = 32768
hidden_size: int = 288
intermediate_size: int = 1152
encoder_num_hidden_layers: int = 6
decoder_num_hidden_layers: int = 6
encoder_num_attention_heads: int = 8
decoder_num_attention_heads: int = 8
encoder_num_key_value_heads: Optional[int] = None
decoder_num_key_value_heads: Optional[int] = None
encoder_hidden_act: str = "gelu"
decoder_hidden_act: str = "silu"
max_position_embeddings: int = 512
attention_bias: bool = False
attention_dropout: float = 0.0
partial_rotary_factor: float = 0.9
rope_theta: float = 10000.0
bos_token_id: int = 1
eos_token_id: int = 2
decoder_start_token_id: int = 1
tie_word_embeddings: bool = True
pad_head_dim_to_multiple_of: Optional[int] = None

def __post_init__(self):
if self.encoder_num_key_value_heads is None:
self.encoder_num_key_value_heads = self.encoder_num_attention_heads
if self.decoder_num_key_value_heads is None:
self.decoder_num_key_value_heads = self.decoder_num_attention_heads

@classmethod
def from_dict(cls, params: Dict[str, Any]) -> "ModelConfig":
return cls(
**{
k: v
for k, v in params.items()
if k in inspect.signature(cls).parameters
}
)
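`from_dict` keeps only the keys that match declared `ModelConfig` fields and silently drops the rest, so extra entries in a checkpoint's `config.json` never raise a `TypeError`. A minimal standalone sketch of the same pattern (reduced field set; the input dict here is hypothetical, not from a real checkpoint):

```python
import inspect
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class MiniConfig:
    hidden_size: int = 288
    encoder_num_attention_heads: int = 8
    encoder_num_key_value_heads: Optional[int] = None

    def __post_init__(self):
        # Default KV heads to the attention-head count, as ModelConfig does.
        if self.encoder_num_key_value_heads is None:
            self.encoder_num_key_value_heads = self.encoder_num_attention_heads

    @classmethod
    def from_dict(cls, params: Dict[str, Any]) -> "MiniConfig":
        # Keep only keys matching declared fields; unknown keys are ignored.
        valid = inspect.signature(cls).parameters
        return cls(**{k: v for k, v in params.items() if k in valid})


cfg = MiniConfig.from_dict({"hidden_size": 416, "unknown_key": True})
print(cfg.hidden_size, cfg.encoder_num_key_value_heads)  # 416 8
```

Filtering through `inspect.signature(cls).parameters` rather than passing the dict straight to the constructor is what makes loading robust across config revisions that add fields this implementation does not model.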