Commit f7fce44

feat: LFM2.5-Audio
1 parent 251949e commit f7fce44

7 files changed: +277 -31 lines changed

README.md

Lines changed: 22 additions & 16 deletions
@@ -4,6 +4,9 @@ We present LFM2-Audio-1.5B, [Liquid AI](https://www.liquid.ai/)'s first end-to-e
 
 LFM2-Audio supports two generation modes, interleaved and sequential, to maximize performance and quality across different tasks. Interleaved generation outputs text and audio tokens in a fixed interleaved pattern. This approach minimizes time to first audio output and the number of tokens generated, making it ideal for naturally flowing real-time speech-to-speech interactions on resource-constrained devices. Sequential generation mode, where the model decides when to switch modalities via special tokens, is suitable for non-conversational tasks such as speech-to-text (ASR) or text-to-speech (TTS).
 
+### Updates
+- [LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) is released! This model is based on the stronger LFM2.5-1.2B base and comes with a lightning-fast LFM2-based audio detokenizer, stronger ASR, and better TTS voices. To use the new detokenizer, simply use `processor.decode`; see the examples below for more details. For the improved TTS voices, see the [TTS](#tts) section.
+
 ## Installation
 The package can be installed via `pip`
 ```bash
@@ -61,7 +64,7 @@ import torchaudio
 from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState, LFMModality
 
 # Load models
-HF_REPO = "LiquidAI/LFM2-Audio-1.5B"
+HF_REPO = "LiquidAI/LFM2.5-Audio-1.5B"
 
 processor = LFM2AudioProcessor.from_pretrained(HF_REPO).eval()
 model = LFM2AudioModel.from_pretrained(HF_REPO).eval()
@@ -97,9 +100,8 @@ for t in model.generate_interleaved(**chat, max_new_tokens=512, audio_temperatur
 
 # Detokenize audio, removing the last "end-of-audio" codes
 # Mimi returns audio at 24kHz
-mimi_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
-with torch.no_grad():
-    waveform = processor.mimi.decode(mimi_codes)[0]
+audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
+waveform = processor.decode(audio_codes)
 torchaudio.save("answer1.wav", waveform.cpu(), 24_000)
 
 # Append newly generated tokens to chat history
@@ -128,9 +130,8 @@ for t in model.generate_interleaved(**chat, max_new_tokens=512, audio_temperatur
 # output: Sure thing! How about “Comfortable Chairs, Crafted with Care” or “Elegant Seats, Handcrafted for You”? Let me know if you’d like a few more options.
 
 # Detokenize second turn audio, removing the last "end-of-audio" codes
-mimi_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
-with torch.no_grad():
-    waveform = processor.mimi.decode(mimi_codes)[0]
+audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
+waveform = processor.decode(audio_codes)
 torchaudio.save("answer2.wav", waveform.cpu(), 24_000)
 ```

@@ -154,7 +155,7 @@ import torchaudio
 from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState, LFMModality
 
 # Load models
-HF_REPO = "LiquidAI/LFM2-Audio-1.5B"
+HF_REPO = "LiquidAI/LFM2.5-Audio-1.5B"
 
 processor = LFM2AudioProcessor.from_pretrained(HF_REPO).eval()
 model = LFM2AudioModel.from_pretrained(HF_REPO).eval()
@@ -182,19 +183,25 @@ for t in model.generate_sequential(**chat, max_new_tokens=512):
 ```
 
 ### TTS
-For TTS, we also use sequential generation, with the fixed system prompt `Perform TTS.`. In addition, we can prompt the voice and a style using a natural language description.
+For TTS, we also use sequential generation. We support four pre-defined voices, which can be selected by choosing one of the four system prompts below
+```
+Perform TTS. Use the US male voice.
+Perform TTS. Use the US female voice.
+Perform TTS. Use the UK male voice.
+Perform TTS. Use the UK female voice.
+```
 
 <details>
 
 <summary>TTS Sample</summary>
 
-**Voice description**: A male speaker delivers his lines with a low-pitched voice and an animated tone. The recording is of excellent quality with almost no noise and a very close-sounding atmosphere.
+**System prompt**: Perform TTS. Use the UK male voice.
 
 **Input sentence**: What is this obsession people have with books? They put them in their houses—like they're trophies. What do you need it for after you read it?
 
 **Output audio**
 
-https://github.com/user-attachments/assets/2fa953cf-d8a8-477a-b841-c4f18d9266e6
+https://github.com/user-attachments/assets/8d57c184-b92e-4e1a-983b-d1f9d16d0d92
 
 </details>
 
@@ -204,7 +211,7 @@ import torchaudio
 from liquid_audio import LFM2AudioModel, LFM2AudioProcessor, ChatState, LFMModality
 
 # Load models
-HF_REPO = "LiquidAI/LFM2-Audio-1.5B"
+HF_REPO = "LiquidAI/LFM2.5-Audio-1.5B"
 
 processor = LFM2AudioProcessor.from_pretrained(HF_REPO).eval()
 model = LFM2AudioModel.from_pretrained(HF_REPO).eval()
@@ -213,7 +220,7 @@ model = LFM2AudioModel.from_pretrained(HF_REPO).eval()
 chat = ChatState(processor)
 
 chat.new_turn("system")
-chat.add_text("Perform TTS.\nUse the following voice: A male speaker delivers his lines with a low-pitched voice and an animated tone. The recording is of excellent quality with almost no noise and a very close-sounding atmosphere.")
+chat.add_text("Perform TTS. Use the UK male voice.")
 chat.end_turn()
 
 chat.new_turn("user")
@@ -229,9 +236,8 @@ for t in model.generate_sequential(**chat, max_new_tokens=512, audio_temperature
 audio_out.append(t)
 
 # Detokenize audio
-mimi_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
-with torch.no_grad():
-    waveform = processor.mimi.decode(mimi_codes)[0]
+audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
+waveform = processor.decode(audio_codes)
 torchaudio.save("tts.wav", waveform.cpu(), 24_000)
 ```

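For anyone updating scripts written against liquid-audio 1.0.0, the README hunks above reduce to a single change in the detokenization step. Below is a minimal before/after sketch, assuming the variable names from the README examples (`audio_out` is the list of audio tokens collected during generation; the generation loop itself is elided here):

```python
import torch
import torchaudio
from liquid_audio import LFM2AudioProcessor

processor = LFM2AudioProcessor.from_pretrained("LiquidAI/LFM2.5-Audio-1.5B").eval()

# ... run model.generate_interleaved(...) or model.generate_sequential(...)
# as shown in the README and collect the audio tokens in `audio_out` ...

# Before (liquid-audio 1.0.0): decode Mimi codes through the codec directly
#   mimi_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
#   with torch.no_grad():
#       waveform = processor.mimi.decode(mimi_codes)[0]

# After (liquid-audio 1.1.0): processor.decode wraps the new LFM2-based detokenizer
audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)  # drop the trailing end-of-audio codes
waveform = processor.decode(audio_codes)
torchaudio.save("answer.wav", waveform.cpu(), 24_000)  # output is still saved at 24 kHz
```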
pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "liquid-audio"
-version = "1.0.0"
+version = "1.1.0"
 description = "Liquid Audio - Speech-to-Speech audio models"
 readme = "README.md"
 authors = [
@@ -16,6 +16,7 @@ dependencies = [
     "sentencepiece>=0.2.1",
     "torch>=2.8.0",
     "torchaudio>=2.8.0",
+    "torchcodec>=0.9.1",
     "transformers>=4.55.4",
 ]
 keywords = ["Liquid AI", "LFM", "LFM2", "Audio", "Speech-to-Speech"]

src/liquid_audio/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -1,5 +1,6 @@
+from liquid_audio.detokenizer import LFM2AudioDetokenizer
 from liquid_audio.model.lfm2_audio import LFM2AudioModel
 from liquid_audio.processor import ChatState, LFM2AudioProcessor
 from liquid_audio.utils import LFMModality
 
-__all__ = ["ChatState", "LFM2AudioModel", "LFM2AudioProcessor", "LFMModality"]
+__all__ = ["ChatState", "LFM2AudioDetokenizer", "LFM2AudioModel", "LFM2AudioProcessor", "LFMModality"]
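With this re-export, the detokenizer joins the package's public names; downstream code can import everything from the package root (a trivial sketch of the resulting import surface):

```python
# All names listed in __all__ after this change
from liquid_audio import (
    ChatState,
    LFM2AudioDetokenizer,
    LFM2AudioModel,
    LFM2AudioProcessor,
    LFMModality,
)
```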

src/liquid_audio/demo/model.py

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
 
 __all__ = ["lfm2_audio", "mimi", "proc"]
 
-HF_DIR = "LiquidAI/LFM2-Audio-1.5B"
+HF_DIR = "LiquidAI/LFM2.5-Audio-1.5B"
 
 logging.info("Loading processor")
 proc = LFM2AudioProcessor.from_pretrained(HF_DIR).eval()

src/liquid_audio/detokenizer.py

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
+import torch
+from torch import nn
+from transformers import Lfm2Config, Lfm2Model
+
+
+class FusedEmbedding(nn.Module):
+    """Turn codes into embeddings"""
+
+    def __init__(
+        self,
+        dim: int,
+        codeboooks: int = 8,
+        vocab_size: int = 2048,
+    ):
+        super().__init__()
+        self.emb = nn.Embedding(codeboooks * vocab_size, dim)
+
+        self.codeboooks = codeboooks
+        self.vocab_size = vocab_size
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        offsets = torch.arange(self.codeboooks, device=x.device) * self.vocab_size  # TODO: buffer?
+        offset_x = offsets[:, None] + x
+        return self.emb(offset_x).mean(1)  # B L D
+
+
+class ISTFT(nn.Module):
+    """
+    Custom implementation of ISTFT since torch.istft doesn't allow custom padding (other than `center=True`) with
+    windowing. This is because the NOLA (Nonzero Overlap Add) check fails at the edges.
+    See issue: https://github.com/pytorch/pytorch/issues/62323
+    Specifically, in the context of neural vocoding we are interested in "same" padding analogous to CNNs.
+    The NOLA constraint is met as we trim padded samples anyway.
+
+    Adapted from Vocos: https://github.com/gemelo-ai/vocos/blob/c859e3b7b534f3776a357983029d34170ddd6fc3/vocos/spectral_ops.py#L7
+    Args:
+        n_fft (int): Size of Fourier transform.
+        hop_length (int): The distance between neighboring sliding window frames.
+        win_length (int): The size of window frame and STFT filter.
+        padding (str, optional): Type of padding. Options are "center" or "same". Defaults to "same".
+    """
+
+    def __init__(self, n_fft: int, hop_length: int, win_length: int, padding: str = "same"):
+        super().__init__()
+        if padding not in ["center", "same"]:
+            raise ValueError("Padding must be 'center' or 'same'.")
+        self.padding = padding
+        self.n_fft = n_fft
+        self.hop_length = hop_length
+        self.win_length = win_length
+        window = torch.hann_window(win_length)
+        self.register_buffer("window", window)
+
+    def forward(self, spec: torch.Tensor) -> torch.Tensor:
+        """
+        Compute the Inverse Short Time Fourier Transform (ISTFT) of a complex spectrogram.
+        Args:
+            spec (Tensor): Input complex spectrogram of shape (B, N, T), where B is the batch size,
+                N is the number of frequency bins, and T is the number of time frames.
+        Returns:
+            Tensor: Reconstructed time-domain signal of shape (B, L), where L is the length of the output signal.
+        """
+        if self.padding == "center":
+            # Fallback to pytorch native implementation
+            return torch.istft(
+                spec,
+                self.n_fft,
+                self.hop_length,
+                self.win_length,
+                self.window,  # type: ignore[arg-type]
+                center=True,
+            )
+        elif self.padding == "same":
+            pad = (self.win_length - self.hop_length) // 2
+        else:
+            raise ValueError("Padding must be 'center' or 'same'.")
+
+        assert spec.dim() == 3, "Expected a 3D tensor as input"
+        _B, _N, T = spec.shape
+
+        # Inverse FFT
+        ifft = torch.fft.irfft(spec, self.n_fft, dim=1, norm="backward")
+        ifft = ifft * self.window[None, :, None]  # type: ignore[index]
+
+        # Overlap and Add
+        output_size = (T - 1) * self.hop_length + self.win_length
+        y = torch.nn.functional.fold(
+            ifft,
+            output_size=(1, output_size),
+            kernel_size=(1, self.win_length),
+            stride=(1, self.hop_length),
+        )[:, 0, 0, pad:-pad]
+
+        # Window envelope
+        window_sq = self.window.square().expand(1, T, -1).transpose(1, 2)  # type: ignore[operator]
+        window_envelope = torch.nn.functional.fold(
+            window_sq,
+            output_size=(1, output_size),
+            kernel_size=(1, self.win_length),
+            stride=(1, self.hop_length),
+        ).squeeze()[pad:-pad]
+
+        # Normalize
+        assert (window_envelope > 1e-11).all()
+        y = y / window_envelope
+
+        return y
+
+
+class LFM2AudioDetokenizer(nn.Module):
+    def __init__(self, backbone_config: Lfm2Config):
+        super().__init__()
+        self.emb = FusedEmbedding(512)
+        self.lfm = Lfm2Model(backbone_config)
+        self.lin = nn.Linear(512, 1282)  # half are log-magnitude, half are angle
+
+        self.istft = ISTFT(1280, 320, 1280, padding="same")
+        self.sliding_window_size = getattr(backbone_config, "sliding_window", 30)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x = self.emb(x)
+        upsample_size = 6 * x.shape[1]
+        x = nn.functional.interpolate(x.mT, upsample_size, mode="nearest-exact").mT
+
+        # Set attn mask
+        idx = torch.arange(x.shape[1], device=x.device)
+        d_idx = idx - idx[:, None]
+        mask = torch.logical_and(d_idx <= 0, d_idx > -self.sliding_window_size)[None, None, ...]
+
+        x = self.lfm(inputs_embeds=x, attention_mask=mask, use_cache=False).last_hidden_state
+        x = self.lin(x)
+
+        log_abs, angle = torch.chunk(x.mT.contiguous(), 2, 1)
+        y = torch.polar(log_abs.exp(), angle)
+
+        return self.istft(y)
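As a reading aid for the new file, here is an approximate shape trace of `LFM2AudioDetokenizer.forward`, assuming codes arrive as `(batch, 8 codebooks, frames)` per the `FusedEmbedding` defaults; the 24 kHz figure comes from the README examples, not from this file:

```python
# Illustrative shape trace (not part of the commit):
#   codes x: (B, 8, L) integer codebook indices
#   FusedEmbedding (mean over the 8 codebook embeddings)   -> (B, L, 512)
#   nearest-exact interpolation, 6x temporal upsampling    -> (B, 6L, 512)
#   Lfm2Model backbone + Linear(512, 1282)                 -> (B, 6L, 1282)
#   chunk into log-magnitude / angle halves                -> 2 tensors of (B, 641, 6L)
#   torch.polar(exp(log_abs), angle) + ISTFT(1280, 320)    -> (B, 6L * 320) samples
#
# With hop_length=320 and 24 kHz output, each code frame yields 6 * 320 = 1920
# samples, i.e. 80 ms of audio per frame (a 12.5 Hz token rate).
```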
