More advanced models like Zyphra Zonos or Google's en-US-Chirp-HD-D sound as if the whole recording environment changes between audio generations. Google is usable, but Kokoro is far more consistent. I'm already passing a seed to Zyphra, yet the audio still sounds very different between clips.
One way to do this would be to run Whisper on the generated audio and use the SRT timestamps to figure out where the slides should transition. This means handling badly transcribed SRT output and guessing which piece of text is the right match. It would probably improve quality, but it would also require many more tokens and running Whisper locally for each video.
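A minimal sketch of that alignment idea, assuming the `openai-whisper` package is installed and that each slide's narration text is available as a string. The function name `find_slide_transitions` and the `slide_texts` parameter are illustrative, not part of any existing codebase; fuzzy matching via `difflib` stands in for "guessing which text is the right match".

```python
import difflib

import whisper


def find_slide_transitions(audio_path: str, slide_texts: list[str]) -> list[float]:
    """Return an approximate start time (in seconds) for each slide.

    Transcribes the full narration once, then greedily matches each slide's
    text against the transcript words, tolerating Whisper mis-transcriptions
    via fuzzy similarity instead of exact string equality.
    """
    model = whisper.load_model("base")
    result = model.transcribe(audio_path, word_timestamps=True)

    # Flatten the per-segment word list into (word, start_time) pairs.
    words = [
        (w["word"].strip().lower(), w["start"])
        for seg in result["segments"]
        for w in seg.get("words", [])
    ]

    transitions = []
    cursor = 0  # index into `words`; slides are assumed to appear in order
    for text in slide_texts:
        target = text.lower().split()
        best_idx, best_score = cursor, 0.0
        # Slide a window over the remaining transcript and keep the position
        # whose words look most like this slide's narration text.
        for i in range(cursor, max(cursor + 1, len(words) - len(target) + 1)):
            window = [w for w, _ in words[i : i + len(target)]]
            score = difflib.SequenceMatcher(None, window, target).ratio()
            if score > best_score:
                best_idx, best_score = i, score
        if words:
            transitions.append(words[min(best_idx, len(words) - 1)][1])
        else:
            transitions.append(0.0)
        cursor = best_idx + len(target)
    return transitions


# Hypothetical usage:
# times = find_slide_transitions("narration.wav", ["First slide text", "Second slide text"])
```

This keeps everything local (one Whisper pass per video) and sidesteps SRT parsing entirely by using word-level timestamps, but it shares the same trade-off described above: extra compute per video and a heuristic match that can still pick the wrong spot when the transcription is poor.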