Hi team,
Thank you for releasing the model; it’s been really fun to experiment with.
I’m currently playing with it for speech generation and noticed it uses a Mimi codec with a VAE bottleneck (rather than the RVQ version). Do you have plans to release the Mimi VAE codec (standalone codec checkpoint) as well?
In the meantime, I tried extracting the Mimi module from Pocket-TTS to test a simple audio reconstruction path (encode, then decode), but I’m clearly missing something: the reconstructed audio is heavily degraded and essentially unusable. Since I haven’t worked with VAE-style codecs before, I suspect I’m calling the interface incorrectly.
Here’s the minimal snippet I’m trying:
import torch
from pocket_tts.modules.stateful_module import init_states

# `audio` (a 1-D waveform tensor) and `tts_model` (the loaded Pocket-TTS model) are defined elsewhere.
audio_input = audio.unsqueeze(0).unsqueeze(0).float()  # [B, C, T]
mimi = tts_model.mimi

with torch.no_grad():
    mimi_state = init_states(mimi, batch_size=1, sequence_length=1000)
    latents = mimi.encode_to_latent(audio_input)
    reconstructed = mimi.decode_from_latent(latents, mimi_state)

reconstructed_audio = reconstructed.squeeze(0).squeeze(0)
Could you please point me to the correct way to do reconstruction with the Mimi VAE module (or share a reference snippet)?
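To put a number on "heavily degraded" when comparing candidate fixes, a rough waveform-level check may help. The sketch below is not part of Pocket-TTS: `reconstruction_snr_db` is a hypothetical helper, and the 24 kHz rate in the synthetic check is an assumption used only to build a test tone. It trims both signals to a common length and reports the signal-to-noise ratio in dB.

```python
import math

import torch


def reconstruction_snr_db(original: torch.Tensor, reconstructed: torch.Tensor) -> float:
    """Signal-to-noise ratio in dB between two waveforms (hypothetical helper).

    A codec may pad or trim its output, so both signals are cut to the
    shorter length along the last (time) axis before comparing.
    """
    n = min(original.shape[-1], reconstructed.shape[-1])
    ref = original[..., :n].float()
    est = reconstructed[..., :n].float()
    signal_power = ref.pow(2).mean()
    noise_power = (ref - est).pow(2).mean()
    eps = 1e-12  # avoids division by zero and log of zero
    return (10.0 * torch.log10((signal_power + eps) / (noise_power + eps))).item()


# Synthetic check that needs no model: a 440 Hz tone plus mild noise.
torch.manual_seed(0)
sample_rate = 24000  # assumed rate, used only for this test tone
t = torch.arange(sample_rate) / sample_rate
clean = torch.sin(2 * math.pi * 440.0 * t)
noisy = clean + 0.01 * torch.randn_like(clean)
snr = reconstruction_snr_db(clean, noisy)
print(f"SNR: {snr:.1f} dB")
```

In the snippet above this would be called as `reconstruction_snr_db(audio, reconstructed_audio)`. Waveform SNR is sensitive to any delay between the two signals, and neural codecs generally do not preserve phase, so a low value alone does not prove the interface is being misused; it is mainly useful for comparing one attempt against another.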