Text-To-Speech (TTS)
Generate character speech audio files from text.
RIDE Cognition package:
- AWSPollyTextToSpeechSystem
- ElevenLabsTextToSpeechSystem
- TextToSpeechSystemAzure
TTS is used to create real-time character speech. The text to be spoken, together with a voice ID, is sent to a third-party provider. The resulting audio is processed, including the creation of a lipsync schedule.
Main developer functions:
- Get Available Voices
- m_currentTTS.GetAvailableVoices()
- At startup, each system will query what voices are available for the key provided
- This may take a few seconds to receive the response, so results may not be available immediately
- Generate Text to Speech
- m_currentTTS.CreateTextToSpeech(currentVoice, m_inputText, audioFilePath => { Debug.Log(audioFilePath); });
- The callback provided will be called when the system finishes generating the audio file. The audioFilePath parameter will contain the location of the audio file. If the system is cloud based (WebGL), the audioFilePath will be an S3 bucket location.
- Generate Speech and Lipsync schedule
- m_currentTTS.CreateTextToSpeech(currentVoice, m_inputText, (string lipsyncXML, string audioFilePath) => { Debug.Log(lipsyncXML + " " + audioFilePath); });
- lipsyncXML will contain the lipsync schedule that can be used with character lipsync
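Put together, the developer functions above can be sketched as a small Unity component. This is a minimal sketch, not a definitive implementation: the field name m_currentTTS and the callback signatures follow the snippets above, but the concrete system type, the return type of GetAvailableVoices(), and the voice selection are assumptions for illustration.

```csharp
using UnityEngine;

// Minimal sketch of the workflow above. m_currentTTS stands in for one
// of the RIDE Cognition TTS systems (e.g., AWSPollyTextToSpeechSystem);
// the GetAvailableVoices() return type and the voice selection below
// are assumptions, not the toolkit's exact API.
public class TtsSpeechExample : MonoBehaviour
{
    AWSPollyTextToSpeechSystem m_currentTTS; // assumed system type
    string m_inputText = "Hello, welcome to the simulation.";

    void Speak()
    {
        // Voices are queried at startup and may take a few seconds to
        // arrive, so check that the list is populated before use.
        var voices = m_currentTTS.GetAvailableVoices();
        if (voices == null || voices.Count == 0)
            return; // results not available yet; try again later

        var currentVoice = voices[0];

        // Audio + lipsync schedule variant; the callback fires once the
        // audio file has been generated (an S3 location under WebGL).
        m_currentTTS.CreateTextToSpeech(currentVoice, m_inputText,
            (string lipsyncXML, string audioFilePath) =>
            {
                Debug.Log(lipsyncXML + " " + audioFilePath);
            });
    }
}
```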
VHToolkit supports Amazon Polly, ElevenLabs, and Azure Text to Speech. All of these require an API key; see Getting Started.
Polly and Azure provide both audio and a phoneme schedule. The phonemes (individual sounds) are mapped to visemes (character mouth shapes) to create a lipsync schedule. ElevenLabs provides audio only; therefore, the speech text is also sent to Azure to obtain an approximate phoneme schedule. Because this schedule does not match the ElevenLabs audio exactly, there is a trade-off: higher quality audio, but lower quality lip syncing. We are in the process of exploring options to improve lipsync quality.
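To make the phoneme-to-viseme step concrete, a mapping of this kind can be sketched as a simple lookup table. The phoneme and viseme names below are illustrative assumptions; the toolkit's actual character viseme set and mapping may differ.

```csharp
using System.Collections.Generic;

// Illustrative phoneme-to-viseme lookup. Several phonemes typically
// share one mouth shape; these names are assumptions, not the
// toolkit's actual viseme set.
static class VisemeMap
{
    public static readonly Dictionary<string, string> PhonemeToViseme =
        new Dictionary<string, string>
        {
            { "p", "BMP" },  // bilabial stops share a closed-lips shape
            { "b", "BMP" },
            { "m", "BMP" },
            { "f", "FV" },   // labiodental: lower lip against upper teeth
            { "v", "FV" },
            { "aa", "open" } // open vowel: wide mouth shape
        };
}
```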
- Pre-recorded audio (e.g., from an actor) can be pre-processed for lipsyncing and optional nonverbal behavior generation. The VHToolkit uses FaceFX to generate lipsync information. Currently, no end-to-end example is included; please contact us if you need this use case.
- The VHToolkit may only support a subset of the features offered by the external TTS API. For adding more control over the TTS request or to see alternative methods of sending and receiving data, see the external documentation (e.g., for ElevenLabs: https://elevenlabs.io/docs/api-reference/text-to-speech).
- Similarly, the VHToolkit typically sends the agent text as-is to the external TTS service, which may offer more fine-grained control: https://elevenlabs.io/docs/overview/capabilities/text-to-speech/best-practices