
Conversation

primepake

This PR introduces Spatiotemporal Skip Guidance (STG) support to CosyVoice, inspired by the video diffusion framework from STGuidance. The changes integrate skip-layer guidance into the text-to-speech pipeline to improve generation quality in the diffusion-based flow-matching stage, in particular speaker consistency.

Changes Made

  • Updated cosyvoice/flow/decoder.py to incorporate skip-guided decoding logic for better alignment with the STG sampling process (see the sketch after this list).

  • Modified cosyvoice/flow/flow_matching.py to adapt the flow-matching sampler to support STG's skip-layer guidance.

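For context, the core idea is sketched below. This is not the code from this PR: the function name, the estimator interface, the `skip_layers` keyword, and `stg_scale` are illustrative assumptions. The model is run once normally and once with some transformer blocks skipped, and the sampler extrapolates away from the weakened prediction, much like classifier-free guidance but without a separate unconditional pass.

```python
import torch

@torch.no_grad()
def euler_step_with_stg(estimator, x, t, dt, cond, stg_scale=1.0, skip_layers=(8,)):
    """One Euler step of the flow-matching ODE with skip-layer guidance (sketch).

    `estimator` is assumed to predict a velocity field and to accept an optional
    `skip_layers` argument that bypasses the listed transformer blocks.
    """
    v_full = estimator(x, t, cond)                            # full-model velocity
    v_weak = estimator(x, t, cond, skip_layers=skip_layers)   # perturbed, layer-skipped velocity
    v_guided = v_full + stg_scale * (v_full - v_weak)         # extrapolate away from the weak model
    return x + dt * v_guided
```

The guidance strength and which blocks to skip are tuning knobs; the sketch only shows where the extra forward pass and the extrapolation fit into the sampling loop.
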
Motivation

The addition of STG support aims to leverage skip-layer guided diffusion sampling to enhance the quality of speech synthesis, bringing techniques from recent video diffusion work into CosyVoice. This builds on the concepts from junhahyung/STGuidance, adapted for audio generation.

@johnwick123f

Looks interesting, but may I ask, what are the effects of adding STG? Better voice cloning quality or better emotion?

@primepake
Author

Yes, it improves model quality. For example, the flow-matching model sometimes struggles to maintain speaker consistency: the voice identity can drift, e.g. from male to female, within the same audio clip. With STG this is improved.
