Question about Mimi VAE codec release + encode/decode usage #74

@orech

Hi team,

Thank you for releasing the model, it’s been really fun to experiment with.

I’m currently playing with it for speech generation and noticed it uses a Mimi codec with a VAE bottleneck (rather than the RVQ version). Do you have plans to release the Mimi VAE codec (standalone codec checkpoint) as well?

In the meantime, I tried extracting the Mimi module from Pocket-TTS to test a simple audio reconstruction path (encode, then decode), but I’m clearly missing something: the reconstructed audio is heavily degraded, to the point of being unusable. Since I haven’t worked with VAE-style codecs before, I suspect I’m calling the interface incorrectly.

Here’s the minimal snippet I’m trying:

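import torch
from pocket_tts.modules.stateful_module import init_states

# `audio` (1-D waveform tensor) and `tts_model` (loaded Pocket-TTS model)
# are defined earlier and not shown here.
audio_input = audio.unsqueeze(0).unsqueeze(0).float()  # [B, C, T]
mimi = tts_model.mimi

with torch.no_grad():
    # Fresh streaming state for the Mimi module
    mimi_state = init_states(mimi, batch_size=1, sequence_length=1000)

    # Encode the waveform to latents, then decode them straight back
    latents = mimi.encode_to_latent(audio_input)
    reconstructed = mimi.decode_from_latent(latents, mimi_state)

# Drop the batch and channel dimensions again
reconstructed_audio = reconstructed.squeeze(0).squeeze(0)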

Could you please point me to the correct way to do reconstruction with the Mimi VAE module (or share a reference snippet)?
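
To help narrow this down, basic statistics of the input and the reconstructed waveform can be compared before looking at the codec interface itself. The helper below is a hypothetical, dependency-free sketch (the name `waveform_stats` is not part of Pocket-TTS); it only reports length, peak amplitude, and RMS level, and says nothing about perceptual quality.

```python
import math


def waveform_stats(samples):
    """Return length, peak absolute value, and RMS of a mono waveform.

    `samples` is a plain sequence of floats, e.g. `tensor.flatten().tolist()`.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("waveform is empty")
    peak = max(abs(x) for x in samples)
    rms = math.sqrt(sum(x * x for x in samples) / n)
    return {"length": n, "peak": peak, "rms": rms}
```

For the snippet above, this would be called as `waveform_stats(audio.flatten().tolist())` and `waveform_stats(reconstructed_audio.flatten().tolist())`. An input peak far above 1.0 would suggest integer PCM that was cast with `.float()` without rescaling, and a large length mismatch would suggest a sample-rate or framing problem. These are only possible causes, not a confirmed diagnosis.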
