
Commit 88f098d

Add comprehensive SNAC implementation documentation
- Document completed architecture infrastructure
- Explain SNAC components (encoder, quantizer, decoder)
- Describe Snake activation implementation approach
- Provide model conversion instructions
- List remaining implementation tasks
- Include integration notes for Orpheus TTS

Co-Authored-By: Jake Cosme <[email protected]>
1 parent d67d7c1 commit 88f098d

File tree: 1 file changed

docs/SNAC_IMPLEMENTATION.md (+185 −0)
# SNAC Decoder Implementation for Orpheus TTS

## Overview

This document describes the implementation of SNAC (Multi-Scale Neural Audio Codec) decoder support in llama.cpp for Orpheus TTS models.

## Current Status

### ✅ Completed

1. **Architecture Infrastructure**
   - Added the `LLM_ARCH_SNAC_DEC` architecture enum
   - Registered the "snac-dec" architecture name
   - Defined 31 SNAC-specific tensor types
   - Added tensor name mappings for decoder, quantizer, and encoder components

2. **GGUF Constants**
   - Added `MODEL_ARCH.SNAC_DEC` to the gguf constants
   - Defined tensor enums for all SNAC components
   - Added tensor name format strings

3. **Model Conversion**
   - Implemented the `SnacDecModel` class in `convert_hf_to_gguf.py`
   - Handles weight_norm parameters (skips the `_g` and `_v` suffixes)
   - Configures SNAC-specific hyperparameters

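For context on the weight_norm handling in item 3: when a checkpoint stores `_g`/`_v` pairs that must be merged rather than skipped, the standard weight-norm reconstruction is `w = g * v / ||v||`. A minimal pure-Python sketch, with illustrative names (this is not the actual converter code):

```python
import math

def fold_weight_norm(g, v):
    """Fold weight_norm parameters into a plain weight tensor.

    g: per-output-channel scales; v: direction vectors, one row per
    output channel. Computes w[i] = g[i] * v[i] / ||v[i]||.
    """
    w = []
    for gi, vi in zip(g, v):
        norm = math.sqrt(sum(x * x for x in vi))
        w.append([gi * x / norm for x in vi])
    return w
```

For example, `fold_weight_norm([2.0], [[3.0, 4.0]])` scales the unit direction `[0.6, 0.8]` by 2, giving `[[1.2, 1.6]]`.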
### 🚧 In Progress / TODO

1. **Model Loading (llama-model.cpp)**
   - Implement SNAC decoder model loading
   - Load decoder convolution layers
   - Load vector quantizer components (in_proj, out_proj, codebook)
   - Load attention layers if present
   - Handle Snake activation parameters

2. **Forward Pass Implementation (llama.cpp)**
   - Implement the SNAC decoder forward pass
   - Vector quantization decoding (from_codes)
   - Decoder blocks with:
     - Transposed convolutions (upsampling)
     - Residual units with dilated convolutions
     - Snake activation function
   - Local multi-head attention (if present)
   - Output convolution and tanh activation

3. **TTS Tool Integration (tools/tts/tts.cpp)**
   - Add a SNAC decoder option to the TTS tool
   - Support multi-scale code input
   - Audio generation from hierarchical codes
   - Integration with Orpheus TTS models

4. **Testing**
   - Download and convert SNAC models from HuggingFace
   - Test with Orpheus TTS models
   - Validate audio quality
   - Performance benchmarking

## SNAC Architecture

### Components

1. **Encoder** (not needed for TTS; only used during training)
   - Input convolution
   - Encoder blocks with strided convolutions
   - Local attention (optional)
   - Output convolution

2. **Vector Quantizer** (needed for decoding)
   - 4 quantization levels with different strides: [8, 4, 2, 1]
   - Each level has:
     - `in_proj`: projects the latent to the codebook dimension
     - `codebook`: embedding table (4096 x 8)
     - `out_proj`: projects back to the latent dimension
   - Residual quantization across levels

3. **Decoder** (the main component needed)
   - Input convolution (or direct from the quantizer output)
   - Local attention (optional)
   - Decoder blocks (4 blocks for the standard config):
     - Transposed convolution for upsampling
     - 3 residual units with dilations [1, 3, 9]
     - Snake activation
   - Output convolution + tanh

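The residual decode step (`from_codes`) in component 2 can be sketched in pure Python with toy types; `out_projs` stands in for the per-level `out_proj` layers, and all names here are illustrative rather than taken from the SNAC codebase:

```python
def vq_decode(codes, codebooks, out_projs, strides):
    """Reconstruct the latent by summing per-level contributions.

    codes[i]    : code indices for level i (coarsest stride first)
    codebooks[i]: embedding table for level i (list of vectors)
    out_projs[i]: function projecting a codebook vector back to latent dim
    strides[i]  : temporal stride of level i, e.g. [8, 4, 2, 1]
    """
    # the finest level (stride 1) defines the latent frame count
    n_frames = len(codes[-1]) * strides[-1]
    latent = None
    for lvl, (idx, stride) in enumerate(zip(codes, strides)):
        # look up codebook vectors and project back to the latent dimension
        vecs = [out_projs[lvl](codebooks[lvl][i]) for i in idx]
        # upsample by repeating each vector `stride` times
        up = [v for v in vecs for _ in range(stride)]
        assert len(up) == n_frames
        # residual sum across levels
        if latent is None:
            latent = [list(v) for v in up]
        else:
            for t in range(n_frames):
                for d in range(len(latent[t])):
                    latent[t][d] += up[t][d]
    return latent
```

With two levels of stride [2, 1], identity projections, and 1-d codebooks, a coarse code covering two frames is repeated and summed with the two fine codes.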
### Snake Activation

Formula: `x + (1/alpha) * sin^2(alpha * x)`

This can be built from existing ggml operations. A sketch, assuming `ggml_sin` is available in the ggml build and that `alpha` broadcasts against `x`:

```c
// x_scaled = x * alpha
struct ggml_tensor * x_scaled = ggml_mul(ctx, x, alpha);
// sin2_x = sin(x_scaled)^2
struct ggml_tensor * sin_x  = ggml_sin(ctx, x_scaled);
struct ggml_tensor * sin2_x = ggml_mul(ctx, sin_x, sin_x);
// result = x + sin2_x / alpha
struct ggml_tensor * result = ggml_add(ctx, x, ggml_div(ctx, sin2_x, alpha));
```

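A scalar reference implementation of the formula above is handy for validating a ggml graph against known values (illustrative, not taken from the SNAC codebase):

```python
import math

def snake(x, alpha):
    # Snake activation: x + (1/alpha) * sin^2(alpha * x)
    return x + (1.0 / alpha) * math.sin(alpha * x) ** 2
```

Note that `snake(0, alpha) == 0` for any alpha, so the function passes through the origin like the identity it perturbs.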
### Tensor Naming Convention

Decoder tensors:

- `decoder.conv_in` - Input convolution
- `decoder.attn_norm`, `decoder.attn_q/k/v/out` - Attention (if present)
- `decoder.block.{i}.conv_up` - Upsampling transposed convolution
- `decoder.block.{i}.conv1/2/3` - Residual unit convolutions
- `decoder.block.{i}.snake_alpha` - Snake activation parameters
- `decoder.conv_out` - Output convolution

Quantizer tensors:

- `quantizer.{i}.in_proj` - Input projection for level i
- `quantizer.{i}.out_proj` - Output projection for level i
- `quantizer.{i}.codebook` - Codebook embeddings for level i

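For the standard configuration (4 decoder blocks, 4 quantizer levels), the convention expands to a fixed name list. A quick enumeration sketch, with the optional attention tensors omitted:

```python
def snac_tensor_names(n_blocks=4, n_levels=4):
    # Enumerate the expected decoder and quantizer tensor names
    names = ["decoder.conv_in", "decoder.conv_out"]
    for i in range(n_blocks):
        names.append(f"decoder.block.{i}.conv_up")
        names += [f"decoder.block.{i}.conv{j}" for j in (1, 2, 3)]
        names.append(f"decoder.block.{i}.snake_alpha")
    for i in range(n_levels):
        names += [f"quantizer.{i}.in_proj",
                  f"quantizer.{i}.out_proj",
                  f"quantizer.{i}.codebook"]
    return names
```

This yields 34 names (2 top-level convolutions, 5 per decoder block, 3 per quantizer level), which is a useful checklist when wiring up model loading.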
## Model Conversion

### Converting SNAC Models

```bash
# Download the SNAC model
git clone https://huggingface.co/hubertsiuzdak/snac_24khz

# Convert to GGUF
python convert_hf_to_gguf.py snac_24khz \
    --outfile snac-24khz-f16.gguf \
    --outtype f16
```

### Expected Hyperparameters

From the SNAC `config.json`:

```json
{
    "sampling_rate": 24000,
    "encoder_dim": 64,
    "encoder_rates": [3, 3, 7, 7],
    "latent_dim": 1344,
    "decoder_dim": 1536,
    "decoder_rates": [7, 7, 3, 3],
    "attn_window_size": 32,
    "codebook_size": 4096,
    "codebook_dim": 8,
    "vq_strides": [8, 4, 2, 1]
}
```

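As a sanity check on the hyperparameters above: the product of the decoder rates gives the samples per latent frame, and assuming one `log2(codebook_size)`-bit code per stride-`s` hop at each level, the bitrate can be estimated back-of-the-envelope:

```python
import math

sampling_rate = 24000
decoder_rates = [7, 7, 3, 3]
vq_strides = [8, 4, 2, 1]
codebook_size = 4096

samples_per_frame = math.prod(decoder_rates)      # 441 samples per latent frame
frame_rate = sampling_rate / samples_per_frame    # ~54.4 latent frames per second
bits_per_code = math.log2(codebook_size)          # 12 bits per code
codes_per_frame = sum(1 / s for s in vq_strides)  # 1.875 codes per frame on average
bitrate_kbps = frame_rate * codes_per_frame * bits_per_code / 1000
print(round(bitrate_kbps, 2))  # ~1.22 kbps
```

The estimate lands inside the 0.98-2.6 kbps range that SNAC targets across its model variants.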
## Integration with Orpheus TTS

Orpheus TTS uses a two-model architecture:

1. **Text-to-Codes Model**: LLM that generates hierarchical audio codes
2. **Codes-to-Speech Model**: SNAC decoder that converts codes to audio

Usage flow:

```
Text → Orpheus LLM → Multi-scale codes → SNAC Decoder → Audio waveform
```

## References

- SNAC Paper: https://arxiv.org/abs/2410.14411
- SNAC GitHub: https://github.com/hubertsiuzdak/snac
- Orpheus Models: https://huggingface.co/collections/canopylabs/orpheus-tts-67d9ea3f6c05a941c06ad9d2
- OuteTTS Reference: PR #10784 in llama.cpp

## Implementation Notes

### Key Differences from WavTokenizer

1. **Multi-scale Quantization**: SNAC uses 4 levels with different temporal resolutions
2. **Snake Activation**: custom activation function (WavTokenizer uses standard activations)
3. **Simpler Architecture**: no PosNet or ConvNeXt blocks
4. **Hierarchical Codes**: variable-length codes at different scales

### Performance Considerations

- SNAC is designed for low bitrates (0.98-2.6 kbps)
- The decoder is relatively lightweight
- The main computation is in the transposed convolutions and residual blocks
- Attention is optional and can be disabled for faster inference

## Next Steps

1. Implement model loading in `llama-model.cpp`
2. Implement the forward pass in `llama.cpp`
3. Add SNAC support to the TTS tool
4. Test with Orpheus models
5. Add documentation and examples
6. Performance optimization
