Add STG Support for Video Diffusion in CosyVoice Audio #1391
This PR introduces Spatiotemporal Skip Guidance (STG) support to CosyVoice, adapting the sampling-guidance technique from the STGuidance video diffusion framework to audio. The changes integrate STG into the text-to-speech pipeline, aiming to improve generation quality in the diffusion-based decoding stage.
Changes Made
Updated cosyvoice/flow/decoder.py to incorporate STG-guided decoding logic.
Modified cosyvoice/flow/flow_matching.py to apply the STG guidance term during flow-matching sampling.
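For reviewers unfamiliar with STG: at each sampling step, the model produces an extra "perturbed" prediction (e.g., with selected layers skipped), and the guided output pushes the conditional prediction away from it, analogous to classifier-free guidance. The sketch below illustrates only that combination step; `stg_combine`, its argument names, and the scale parameters are illustrative and not the actual API added in this PR.

```python
import numpy as np

def stg_combine(v_uncond, v_cond, v_perturb, cfg_scale=1.0, stg_scale=1.0):
    """Combine model predictions for one guided sampling step.

    v_uncond  -- unconditional velocity/noise estimate
    v_cond    -- conditional estimate
    v_perturb -- estimate from the perturbed (e.g., layer-skipped) forward pass
    The CFG term steers toward the condition; the STG term steers away
    from the weaker, perturbed prediction.
    """
    return (v_uncond
            + cfg_scale * (v_cond - v_uncond)
            + stg_scale * (v_cond - v_perturb))

# With stg_scale=0 this reduces to ordinary classifier-free guidance.
v_u = np.array([0.0, 1.0])
v_c = np.array([1.0, 2.0])
v_p = np.array([0.5, 1.5])
guided = stg_combine(v_u, v_c, v_p, cfg_scale=1.0, stg_scale=2.0)
```

In the PR, this combination would live inside the flow-matching solver loop in `flow_matching.py`, with `decoder.py` providing the layer-skipped forward pass.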
Motivation
Adding STG support aims to leverage skip-guidance techniques from video diffusion to enhance the quality of speech synthesis, aligning CosyVoice with recent diffusion sampling methods. It builds on the concepts from junhahyung/STGuidance, adapted for audio generation.