Question about Mimi VAE codec release + encode/decode usage

Hi team,

Thank you for releasing the model, it’s been really fun to experiment with.

I’m currently playing with it for speech generation and noticed it uses a Mimi codec with a VAE bottleneck (rather than the RVQ version). Do you have plans to release the Mimi VAE codec (standalone codec checkpoint) as well?

In the meantime, I tried extracting the Mimi module from Pocket-TTS to test a simple audio reconstruction path (encode, then decode), but I'm clearly missing something: the reconstructed audio is heavily degraded, to the point of being unusable. Since I haven't worked with VAE-style codecs before, I suspect I'm calling the interface incorrectly.

Here’s the minimal snippet I’m trying:
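```python
import torch
from pocket_tts.modules.stateful_module import init_states

# Assumed to exist already (not shown here):
#   tts_model: a loaded Pocket-TTS model
#   audio:     a 1-D mono waveform tensor of shape [T]
audio_input = audio.unsqueeze(0).unsqueeze(0).float()  # [B, C, T]
mimi = tts_model.mimi

with torch.no_grad():
    # State for the stateful modules, as used by the decode call below
    mimi_state = init_states(mimi, batch_size=1, sequence_length=1000)

    latents = mimi.encode_to_latent(audio_input)
    reconstructed = mimi.decode_from_latent(latents, mimi_state)

# Drop the batch and channel dimensions again
reconstructed_audio = reconstructed.squeeze(0).squeeze(0)
```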
Could you please point me to the correct way to do reconstruction with the Mimi VAE module (or share a reference snippet)?
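
In case it helps with diagnosing this, below is a minimal sketch of how the degradation could be quantified instead of judged by ear. It computes a scale-invariant SNR between the original and the reconstruction over their overlapping samples. The helper `si_snr_db` is hypothetical and not part of pocket_tts; it assumes both signals are mono, share the same sample rate, and are roughly time-aligned (any codec delay will lower the score).

```python
import math
from typing import Sequence


def si_snr_db(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Scale-invariant SNR in dB between two mono waveforms (hypothetical helper)."""
    # Codecs often pad or trim the output, so compare only the overlapping part.
    n = min(len(reference), len(estimate))
    if n == 0:
        raise ValueError("both signals must be non-empty")
    ref = [float(x) for x in reference[:n]]
    est = [float(x) for x in estimate[:n]]

    # Remove the DC offset from each signal.
    ref_mean = sum(ref) / n
    est_mean = sum(est) / n
    ref = [x - ref_mean for x in ref]
    est = [x - est_mean for x in est]

    ref_energy = sum(x * x for x in ref)
    if ref_energy == 0.0:
        raise ValueError("reference is silent or constant")

    # Project the estimate onto the reference; the remainder counts as noise.
    scale = sum(e * r for e, r in zip(est, ref)) / ref_energy
    target_energy = scale * scale * ref_energy
    noise_energy = sum((e - scale * r) ** 2 for e, r in zip(est, ref))

    if noise_energy == 0.0:
        return math.inf
    if target_energy == 0.0:
        return -math.inf
    return 10.0 * math.log10(target_energy / noise_energy)
```

With the snippet above, this could be called as `si_snr_db(audio.tolist(), reconstructed_audio.tolist())`, assuming both are 1-D tensors. A strongly negative or near-zero value would suggest an interface problem (wrong sample rate, wrong input scaling, or misuse of the state) rather than ordinary codec loss.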
