Generate the audio in one go? #39

@rikhuijzer

More advanced models such as Zyphra Zonos or Google's en-US-Chirp-HD-D sound as if the whole recording environment changes between separate audio generations. Google is usable, but Kokoro is far more consistent. I'm already passing a seed to Zyphra, but the audio still sounds very different between clips.

One way to generate the audio in one go would be to run Whisper on the generated audio and use the SRT timestamps to figure out where the slides should transition. This means handling incorrectly transcribed SRT output and guessing which text is the right match. It would probably improve the quality, but it would also require many more tokens and running Whisper locally for each video.
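
To make the matching step concrete, here is a rough sketch of how the SRT timestamps could be mapped to slide transitions. It assumes Whisper has already written a standard SRT file and that the narration text of each slide is known. The function names (`parse_srt`, `slide_start_times`) are hypothetical and not part of this project, and the fuzzy matching uses `difflib` from the Python standard library as one possible way to tolerate transcription errors.

```python
import re
from difflib import SequenceMatcher

TIMESTAMP = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def to_seconds(stamp):
    """Convert an SRT timestamp such as '00:01:02,500' to seconds."""
    h, m, s, ms = TIMESTAMP.search(stamp).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt):
    """Return a list of (start, end, text) tuples, one per SRT cue."""
    cues = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->")
                text = " ".join(lines[i + 1:]).strip()
                cues.append((to_seconds(start), to_seconds(end), text))
                break
    return cues


def normalize(text):
    """Lowercase and drop punctuation so small differences matter less."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def slide_start_times(cues, slide_texts):
    """Guess when each slide starts by fuzzy-matching the opening of its
    narration against the cues. Assumes slides appear in order and that
    the first slide is shown from the start of the audio."""
    starts = [0.0]
    pos = 0  # index of the cue matched to the previous slide
    for text in slide_texts[1:]:
        target = normalize(text)
        best_i, best_score = None, -1.0
        for i in range(pos + 1, len(cues)):
            cue_text = normalize(cues[i][2])
            # Compare the cue with the equally long opening of the slide text.
            score = SequenceMatcher(None, cue_text, target[:len(cue_text)]).ratio()
            if score > best_score:
                best_i, best_score = i, score
        if best_i is None:
            raise ValueError("ran out of cues before all slides were matched")
        starts.append(cues[best_i][0])
        pos = best_i
    return starts
```

Because the match is fuzzy, a cue that Whisper transcribed slightly wrong can still be picked as the start of a slide. A real implementation would probably also want a minimum score threshold to detect cases where no cue is a plausible match.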
