
Text-To-Speech (TTS)

Arno Hartholt edited this page Feb 24, 2026 · 4 revisions

Purpose

Generate character speech audio files from text.

Organization

RIDE Cognition package:

  • AWSPollyTextToSpeechSystem
  • ElevenLabsTextToSpeechSystem
  • TextToSpeechSystemAzure

Approach

TTS is used to create real-time character speech. The text to be spoken, together with a voice ID, is sent to a third-party provider. The resulting audio is processed, including the creation of a lipsync schedule.

Main developer functions:

  • Get Available Voices
    • m_currentTTS.GetAvailableVoices()
    • At startup, each system will query what voices are available for the key provided
    • This may take a few seconds to receive the response, so results may not be available immediately
  • Generate Text to Speech
    • m_currentTTS.CreateTextToSpeech(currentVoice, m_inputText, audioFilePath => { Debug.Log(audioFilePath); });
    • The provided callback will be invoked when the system finishes generating the audio file. The audioFilePath parameter will contain the location of the audio file. If the system is cloud-based (WebGL), audioFilePath will be an S3 bucket location.
  • Generate Speech and Lipsync schedule
    • m_currentTTS.CreateTextToSpeech(currentVoice, m_inputText, (string lipsyncXML, string audioFilePath) => { Debug.Log(lipsyncXML); Debug.Log(audioFilePath); });
    • lipsyncXML will contain the lipsync schedule that can be used with character lipsync
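
The callback contract above can be sketched in a language-agnostic way. Below is a minimal Python simulation of the pattern: voices are queried asynchronously at startup, and speech generation hands results to a callback when finished. The class, method, voice names, and file paths are stand-ins invented for this sketch, not the real RIDE API.

```python
# Sketch of the asynchronous TTS callback pattern described above.
# FakeTextToSpeechSystem is a stand-in: a background thread "renders"
# audio after a delay, mimicking a network round trip to a provider.
import threading
import time

class FakeTextToSpeechSystem:
    """Stand-in TTS system; names mirror the wiki text, not a real API."""

    def __init__(self):
        self.available_voices = []  # filled asynchronously, as at startup
        threading.Thread(target=self._query_voices, daemon=True).start()

    def _query_voices(self):
        time.sleep(0.1)  # simulate the provider round trip for voice listing
        self.available_voices = ["Joanna", "Matthew"]

    def create_text_to_speech(self, voice, text, on_done):
        def render():
            time.sleep(0.1)  # simulate audio generation latency
            audio_path = f"/tmp/{voice}_{abs(hash(text))}.wav"  # hypothetical path
            lipsync_xml = "<lipsync/>"  # placeholder lipsync schedule
            on_done(lipsync_xml, audio_path)
        threading.Thread(target=render, daemon=True).start()

tts = FakeTextToSpeechSystem()
done = threading.Event()
result = {}

def on_done(lipsync_xml, audio_path):
    # Called when generation finishes, like the C# lambda callbacks above.
    result["xml"], result["path"] = lipsync_xml, audio_path
    done.set()

tts.create_text_to_speech("Joanna", "Hello there", on_done)
done.wait(timeout=2)  # results are not available immediately
print(result["path"].endswith(".wav"))  # → True
```

The key design point this illustrates: because generation is asynchronous, the caller never blocks on the provider; all follow-up work (playing audio, scheduling lipsync) belongs inside the callback.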

VHToolkit supports Amazon Polly, ElevenLabs, and Azure Text to Speech. All three require a key; see Getting Started.

Polly and Azure provide both audio and a phoneme schedule. The phonemes (individual sounds) are mapped to visemes (character mouth shapes) to create a lipsync schedule. ElevenLabs only provides audio; therefore, the speech text is also sent to Azure to obtain an approximate phoneme schedule. As this schedule does not match the ElevenLabs audio exactly, there is a trade-off: higher-quality audio, but lower-quality lip syncing. We are exploring options to improve lipsync quality.
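
The phoneme-to-viseme step can be sketched as a lookup over timed phonemes. A minimal Python illustration follows; the phoneme symbols, viseme names, and timings are invented for the example and do not match the actual Polly or Azure inventories.

```python
# Sketch of mapping a timed phoneme schedule to a viseme (lipsync) schedule.
# The mapping table is illustrative only; real providers define their own
# phoneme sets and viseme categories.
PHONEME_TO_VISEME = {
    "p": "BMP", "b": "BMP", "m": "BMP",  # lips pressed together
    "f": "FV", "v": "FV",                # lower lip against upper teeth
    "aa": "open", "iy": "wide", "uw": "round",
}

def lipsync_schedule(phoneme_timings):
    """phoneme_timings: list of (phoneme, start_seconds) pairs.
    Unknown phonemes fall back to a neutral 'rest' mouth shape."""
    return [(PHONEME_TO_VISEME.get(p, "rest"), t) for p, t in phoneme_timings]

# "ma...p" as three timed phonemes -> three timed mouth shapes
schedule = lipsync_schedule([("m", 0.00), ("aa", 0.08), ("p", 0.20)])
print(schedule)  # → [('BMP', 0.0), ('open', 0.08), ('BMP', 0.2)]
```

This also shows why the ElevenLabs path is approximate: the viseme timings come from Azure's phoneme schedule, which only roughly aligns with audio rendered by a different provider.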

Limitations

  • Pre-recorded audio (e.g., from an actor) can be pre-processed for lipsyncing and optional nonverbal behavior generation. The VHToolkit uses FaceFX to generate lipsync information. Currently, no end-to-end example is included; please contact us if you need this use case.

Known Issues
