MetaVoice 1B TTS: New and Improved Artificial Intelligence Capabilities as well as Improved User Interface. #194
Open
RahulVadisetty91 wants to merge 2 commits into metavoiceio:main from
This commit introduces several key enhancements to the MetaVoice-1B text-to-speech (TTS) model, focusing on improving AI capabilities and user interaction:

- Advanced Speech Parameters: Added functionality for dynamic adjustment of speech stability and speaker similarity. Users can now fine-tune the top_p (speech stability) and guidance (speaker similarity) parameters through sliders, allowing for more personalized and controlled speech output.
- Enhanced Voice Cloning: Improved handling of uploaded voice samples for cloning. The script now includes validation for file size and duration, ensuring that uploaded samples are suitable for high-quality voice synthesis. Samples must be between 30-90 seconds and less than 50MB to ensure optimal performance.
- User Interface Improvements: Updated the user interface to provide a more intuitive experience. Users can choose between preset voices and uploaded target voices, with automatic layout adjustments based on the selected option. The interface now features clear labels and better organization for ease of use.
- Robust Error Handling: Enhanced error handling to manage edge cases and provide informative feedback. The script includes comprehensive checks and error messages for input validation, such as handling text length limits and ensuring uploaded files meet the required criteria.

These updates aim to enhance the functionality, usability, and robustness of the MetaVoice-1B TTS model, delivering a more versatile and user-friendly text-to-speech solution.

Signed-off-by: Rahul Vadisetty <rahulvy91@gmail.com>
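The sample limits stated in the commit message (30-90 seconds, less than 50MB) could be checked roughly as in the sketch below. This is not the code from the PR: the function name `validate_voice_sample`, the constants, and the choice of 1 MB = 1024 × 1024 bytes are all assumptions made for illustration, and the caller is assumed to have already measured the file size and audio duration.

```python
from typing import List

# Limits taken from the commit message; the byte definition of "MB" is an assumption.
MAX_SIZE_BYTES = 50 * 1024 * 1024
MIN_DURATION_S = 30.0
MAX_DURATION_S = 90.0


def validate_voice_sample(size_bytes: int, duration_s: float) -> List[str]:
    """Return a list of human-readable problems; an empty list means the sample passes.

    Hypothetical helper: the caller supplies the measured file size and duration.
    """
    errors: List[str] = []
    if size_bytes >= MAX_SIZE_BYTES:
        errors.append("Uploaded sample must be smaller than 50MB.")
    if duration_s < MIN_DURATION_S or duration_s > MAX_DURATION_S:
        errors.append("Uploaded sample must be between 30 and 90 seconds long.")
    return errors
```

Returning every failed check at once, rather than raising on the first, lets the interface show the user all problems with an upload in a single message.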
Enhance TTS Model with Advanced Speech Parameters and Improved Voice Cloning
Awesome! With cloning, do you do training or zero-shot? If zero-shot, how many seconds of source speaking video do you suggest?
Author
Thank you for the feedback! Regarding cloning, we are currently using zero-shot learning to clone voices. From our tests, a source video with around 5-10 seconds of clear, high-quality audio provides optimal results for speaker similarity. However, if needed, we can experiment with different durations to see how they affect the quality.
1. Summary:
This pull request adds new AI capabilities to the MetaVoice-1B TTS model and makes significant changes to the user interface. It introduces dynamic speech parameters: a `top_p` slider for speech stability and a `guidance` slider for speaker similarity, so users can tune the synthesis to their preference. Voice cloning gains stricter validation of uploaded voice samples and better handling of edge cases. The interface presents voice selection more clearly, and improved error handling produces messages that point users toward a fix when something goes wrong.
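As a rough illustration of how a slider position might be turned into a model parameter such as `top_p` or `guidance`, the sketch below maps a UI range linearly onto a parameter range and clamps out-of-range input. The function name `slider_to_param` and every numeric range in the usage note are hypothetical; the PR text does not state the actual ranges or mapping.

```python
def slider_to_param(value: float,
                    slider_min: float, slider_max: float,
                    param_min: float, param_max: float) -> float:
    """Linearly map a slider value onto [param_min, param_max].

    Hypothetical helper, not taken from the PR. Values outside the slider
    range are clamped to the nearest end before mapping.
    """
    if slider_max <= slider_min:
        raise ValueError("slider_max must be greater than slider_min")
    clamped = min(max(value, slider_min), slider_max)
    fraction = (clamped - slider_min) / (slider_max - slider_min)
    return param_min + fraction * (param_max - param_min)
```

For example, with an assumed slider range of 1 to 5 and an assumed parameter range of 1.0 to 3.0, `slider_to_param(3, 1, 5, 1.0, 3.0)` lands at the midpoint of the parameter range.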
2. Related Issues:
These updates address areas of speech synthesis that need further development, including voice cloning accuracy and the interface. The dynamic speech parameters and the error handling improvements respond to user complaints and to earlier tests.
3. Discussions:
Concerns were raised about the need to give users some control over the generated speech in terms of stability and speaker similarity. The need to validate voice samples in order to achieve high-quality cloning, and the need for interface improvements, were discussed as well. The importance of accurate and detailed error messages for the user's experience was also underlined.
4. QA Instructions:
Test the `top_p` and `guidance` sliders and make sure that they work as intended and allow for the desired control over speech smoothness and speaker similarity.
5. Merge Plan:
Once QA testing has completed successfully, the branch will be merged into the main branch. The merge will be done in a way that minimizes interference with ongoing development, with particular attention to the new dynamic speech parameters and the voice cloning improvements.
6. Motivation and Context:
These updates are intended to improve the efficiency, scalability, and practicality of the MetaVoice-1B TTS model. The dynamic speech parameters give users a high level of control over the speech output, making it more versatile. The improved voice cloning and interface design respond to user complaints and provide a better service. Together, these changes reduce the inconvenience users may experience and improve the overall effectiveness of text-to-speech conversion.
7. Types of Changes: