
Text-To-Speech (TTS)

Arno Hartholt edited this page Feb 24, 2026 · 4 revisions

Purpose

Generate character speech audio files from text.

Organization

RIDE Cognition package:

  • AWSPollyTextToSpeechSystem
  • ElevenLabsTextToSpeechSystem
  • TextToSpeechSystemAzure

Approach

TTS is used to create real-time character speech. The text to be spoken, together with a voice ID, is sent to a third-party provider. The resulting audio is processed, including the creation of a lipsync schedule.

Main developer functions:

  • Get Available Voices
    • m_currentTTS.GetAvailableVoices()
    • At startup, each system will query what voices are available for the key provided
    • This may take a few seconds to receive the response, so results may not be available immediately
  • Generate Text to Speech
    • m_currentTTS.CreateTextToSpeech(currentVoice, m_inputText, audioFilePath => { Debug.Log(audioFilePath); });
    • The provided callback will be invoked when the system finishes generating the audio file. The audioFilePath parameter will contain the location of the audio file. If the system is cloud-based (WebGL), audioFilePath will be an S3 bucket location.
  • Generate Speech and Lipsync schedule
    • m_currentTTS.CreateTextToSpeech(currentVoice, m_inputText, (string lipsyncXML, string audioFilePath) => { Debug.Log(lipsyncXML); Debug.Log(audioFilePath); });
    • lipsyncXML will contain the lipsync schedule that can be used with character lipsync
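
The callback contract above can be sketched in a language-agnostic way. Below is a minimal Python simulation of the pattern: voices are queried asynchronously at startup, and speech generation hands results to a callback when finished. The class, method, voice names, and file paths are stand-ins invented for this sketch, not the real RIDE API.

```python
# Sketch of the asynchronous TTS callback pattern described above.
# FakeTextToSpeechSystem is a stand-in: a background thread "renders"
# audio after a delay, mimicking a network round trip to a provider.
import threading
import time

class FakeTextToSpeechSystem:
    """Stand-in TTS system; names mirror the wiki text, not a real API."""

    def __init__(self):
        self.available_voices = []  # filled asynchronously, as at startup
        threading.Thread(target=self._query_voices, daemon=True).start()

    def _query_voices(self):
        time.sleep(0.1)  # simulate the provider round trip for voice listing
        self.available_voices = ["Joanna", "Matthew"]

    def create_text_to_speech(self, voice, text, on_done):
        def render():
            time.sleep(0.1)  # simulate audio generation latency
            audio_path = f"/tmp/{voice}_{abs(hash(text))}.wav"  # hypothetical path
            lipsync_xml = "<lipsync/>"  # placeholder lipsync schedule
            on_done(lipsync_xml, audio_path)
        threading.Thread(target=render, daemon=True).start()

tts = FakeTextToSpeechSystem()
done = threading.Event()
result = {}

def on_done(lipsync_xml, audio_path):
    # Called when generation finishes, like the C# lambda callbacks above.
    result["xml"], result["path"] = lipsync_xml, audio_path
    done.set()

tts.create_text_to_speech("Joanna", "Hello there", on_done)
done.wait(timeout=2)  # results are not available immediately
print(result["path"].endswith(".wav"))  # → True
```

The key design point this illustrates: because generation is asynchronous, the caller never blocks on the provider; all follow-up work (playing audio, scheduling lipsync) belongs inside the callback.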

VHToolkit supports Amazon Polly, ElevenLabs, and Azure Text to Speech. All three require a key; see Getting Started.

Polly and Azure provide both audio and a phoneme schedule. The phonemes (individual sounds) are mapped to visemes (character mouth shapes) to create a lipsync schedule. ElevenLabs only provides audio; therefore, the speech text is also sent to Azure to obtain an approximate phoneme schedule. As this schedule does not match the ElevenLabs audio exactly, there is a trade-off: higher-quality audio, but lower-quality lip syncing. We are exploring options to improve lipsync quality.
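
The phoneme-to-viseme step can be sketched as a lookup over timed phonemes. A minimal Python illustration follows; the phoneme symbols, viseme names, and timings are invented for the example and do not match the actual Polly or Azure inventories.

```python
# Sketch of mapping a timed phoneme schedule to a viseme (lipsync) schedule.
# The mapping table is illustrative only; real providers define their own
# phoneme sets and viseme categories.
PHONEME_TO_VISEME = {
    "p": "BMP", "b": "BMP", "m": "BMP",  # lips pressed together
    "f": "FV", "v": "FV",                # lower lip against upper teeth
    "aa": "open", "iy": "wide", "uw": "round",
}

def lipsync_schedule(phoneme_timings):
    """phoneme_timings: list of (phoneme, start_seconds) pairs.
    Unknown phonemes fall back to a neutral 'rest' mouth shape."""
    return [(PHONEME_TO_VISEME.get(p, "rest"), t) for p, t in phoneme_timings]

# "ma...p" as three timed phonemes -> three timed mouth shapes
schedule = lipsync_schedule([("m", 0.00), ("aa", 0.08), ("p", 0.20)])
print(schedule)  # → [('BMP', 0.0), ('open', 0.08), ('BMP', 0.2)]
```

This also shows why the ElevenLabs path is approximate: the viseme timings come from Azure's phoneme schedule, which only roughly aligns with audio rendered by a different provider.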

Limitations

  • Pre-recorded audio (e.g., from an actor) can be pre-processed for lipsyncing and optional nonverbal behavior generation. The VHToolkit uses FaceFX to generate lipsync information. Currently, no end-to-end example is included; please contact us if you need this use case.

Known Issues
