This document outlines the Text-to-Speech (TTS) feature implemented in the application, enabling chat messages to be spoken aloud.
Currently, the `generateAudio` service (see `src/lib/audioService.ts`) returns a static test audio URL (e.g., a T-Rex roar) for development and testing purposes, rather than dynamically generating speech for each message using Google Cloud TTS. The implementation details below describe the Google Cloud TTS integration as it was prototyped or is planned for full integration.
The TTS system is designed to provide a distinct and consistent voice for each user in the chat.
When a new chat message is received that is not from the current user and is not a command message (e.g., one starting with `/`), the system automatically synthesizes the message text into speech and plays it.
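The trigger rule above can be sketched as a small predicate. The `ChatMessageLike` shape and the `shouldSpeak` name are assumptions made for illustration; the message type used in the application may differ.

```typescript
// Hypothetical message shape; the real application type may carry more fields.
interface ChatMessageLike {
  userId: string;
  text: string;
}

// Returns true when a message should be read aloud:
// it comes from another user and is not a command (does not start with "/").
function shouldSpeak(message: ChatMessageLike, currentUserId: string): boolean {
  const isOwnMessage = message.userId === currentUserId;
  const isCommand = message.text.charAt(0) === "/";
  return !isOwnMessage && !isCommand;
}
```

Keeping the rule in one pure function makes it straightforward to unit test without rendering a component.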
- Action Location: `src/app/actions/tts.ts`
- Client Initialization: `src/app/actions/ttsClient.ts` (handles Google Cloud TTS client setup using the `GOOGLE_APP_CREDS_JSON` environment variable).
- Function: `synthesizeSpeechAction`
  - Input: `text` (string), `userId` (string), `voiceName` (optional string).
  - Output: `audioBase64` (string) or an `error` object.
- Voice Assignment:
  - A predefined list of Google Cloud 'Chirp3' voices is used (see `chirp3Voices` in `tts.ts`).
  - If `voiceName` is not provided in the parameters, a voice is deterministically assigned to a `userId` using a simple hashing mechanism on the `userId`. This ensures that each user consistently has the same voice.
  - The language code is derived from the selected voice name (e.g., `en-US`, `en-GB`).
- Authentication: The Google Cloud Text-to-Speech client is initialized using credentials provided via the `GOOGLE_APP_CREDS_JSON` environment variable. This JSON string contains the service account key for accessing the Google Cloud TTS API.
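The voice assignment described above can be sketched as follows. This is a minimal illustration, not the code in `tts.ts`: the hash function, the helper names (`hashUserId`, `pickVoice`, `languageCodeFromVoice`), and the short `exampleVoices` list are all assumptions, and the real list lives in `chirp3Voices`.

```typescript
// Illustrative subset of voice names; the actual list is `chirp3Voices` in tts.ts.
const exampleVoices: string[] = [
  "en-US-Chirp3-HD-Aoede",
  "en-US-Chirp3-HD-Charon",
  "en-GB-Chirp3-HD-Kore",
  "en-GB-Chirp3-HD-Puck",
];

// Simple deterministic string hash (assumption: the hash used in tts.ts may differ).
function hashUserId(userId: string): number {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    // Multiply-and-add, truncated to an unsigned 32-bit integer.
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// An explicit voiceName wins; otherwise the same userId always maps to the same voice.
function pickVoice(
  userId: string,
  voiceName?: string,
  voices: string[] = exampleVoices
): string {
  if (voiceName) return voiceName;
  if (voices.length === 0) throw new Error("no voices configured");
  return voices[hashUserId(userId) % voices.length];
}

// Takes the first two dash-separated parts, e.g. "en-US-Chirp3-HD-Aoede" -> "en-US".
function languageCodeFromVoice(voiceName: string): string {
  return voiceName.split("-").slice(0, 2).join("-");
}
```

Because the mapping depends only on the `userId` and the order of the voice list, reordering or resizing the list would change which voice a user gets.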
- Chat Message Component: `src/components/ChatMessage.tsx`
  - This component is responsible for rendering individual chat messages.
  - It receives the `currentUserId` as a prop.
  - TTS Trigger: When a new message is rendered, a `useEffect` hook checks if the message is from a different user. If so, it calls `synthesizeSpeechAction` with the message text and the sender's `userId`.
  - Audio Playback:
    - The `audioBase64` data returned from the action is used to create an `HTMLAudioElement` (`new Audio("data:audio/mp3;base64," + audioBase64)`).
    - The audio is then played automatically.
  - State Management: The component manages `audioStatus` (`idle`, `loading`, `playing`, `error`) and `error` states to provide feedback.
  - Visual Indicators: Small icons (🎤) are displayed next to messages to indicate TTS status (loading, playing, error).
- Chat Window Component: `src/components/ChatWindow.tsx`
  - Passes the `myId` (current user's ID) prop to each `ChatMessage` instance as `currentUserId`.
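The playback and status handling in `ChatMessage` can be sketched with two pure helpers. The four status values come from the description above; the event names (`request`, `play`, `ended`, `fail`) and the transition rules are assumptions made for illustration and may not match the component's actual logic.

```typescript
// Status values named in the component description.
type AudioStatus = "idle" | "loading" | "playing" | "error";

// Hypothetical events driving the status; the real component may use different triggers.
type AudioEvent = "request" | "play" | "ended" | "fail";

// Builds the data URI that is passed to `new Audio(...)` in the browser.
function toAudioDataUri(audioBase64: string): string {
  return "data:audio/mp3;base64," + audioBase64;
}

// Computes the next status for a given event; unknown combinations keep the current status.
function nextStatus(current: AudioStatus, event: AudioEvent): AudioStatus {
  switch (event) {
    case "request":
      return "loading";
    case "play":
      return current === "loading" ? "playing" : current;
    case "ended":
      return current === "playing" ? "idle" : current;
    case "fail":
      return "error";
    default:
      return current;
  }
}

// In a browser, playback would look roughly like:
//   const audio = new Audio(toAudioDataUri(audioBase64));
//   audio.play().catch(() => { /* move to the "error" status */ });
// `play()` returns a Promise that can reject, for example under autoplay restrictions.
```

Handling the rejected `play()` Promise matters here because playback starts automatically, without a user gesture.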
- `GOOGLE_APP_CREDS_JSON`: Required. A JSON string containing the Google Cloud service account credentials necessary for the Text-to-Speech API. This should be set in your `.env.local` or server environment.
- `NEXT_PUBLIC_TTS_ENABLED`: Optional. Set to `"false"` to disable Text-to-Speech functionality across the application. If not set, or set to any other value (e.g., `"true"`), TTS will be enabled by default. This is useful for disabling TTS during development, testing (e.g., end-to-end tests), or if a user wishes to globally turn off the feature via environment configuration.
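As a rough sketch of how these two variables might be read, the helpers below take the raw values as arguments so they can run outside the application. The function names and the `ServiceAccountCredentials` interface are hypothetical; only `client_email` and `private_key` are checked, although a real service account key contains more fields.

```typescript
// TTS is enabled unless the flag is exactly the string "false".
function isTtsEnabled(flag: string | undefined): boolean {
  return flag !== "false";
}

// Minimal subset of a service account key, checked for illustration only.
interface ServiceAccountCredentials {
  client_email: string;
  private_key: string;
}

// Parses and validates the JSON string held in GOOGLE_APP_CREDS_JSON.
function parseCredentials(raw: string | undefined): ServiceAccountCredentials {
  if (!raw) {
    throw new Error("GOOGLE_APP_CREDS_JSON is not set");
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("GOOGLE_APP_CREDS_JSON is not valid JSON");
  }
  const creds = parsed as Partial<ServiceAccountCredentials> | null;
  if (
    !creds ||
    typeof creds.client_email !== "string" ||
    typeof creds.private_key !== "string"
  ) {
    throw new Error("GOOGLE_APP_CREDS_JSON is missing client_email or private_key");
  }
  return { client_email: creds.client_email, private_key: creds.private_key };
}

// Typical call sites (assumed, not taken from the source):
//   isTtsEnabled(process.env.NEXT_PUBLIC_TTS_ENABLED)
//   parseCredentials(process.env.GOOGLE_APP_CREDS_JSON)
```

Failing early with a specific message when the credentials are missing or malformed tends to be easier to debug than a later authentication error from the API.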
- The `synthesizeSpeechAction` returns error objects for issues like invalid parameters, TTS synthesis failures, or rate limit errors from the Google Cloud API.
- The `ChatMessage` component handles these errors by displaying an error message and a visual indicator.
- Audio playback errors on the client side are also caught and reported.
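One way to model the error handling above is a result type that carries either audio or an error. The source only states that the action returns `audioBase64` (string) or an `error` object, so the `TtsResult` shape, the `message` field, and both helper functions are assumptions for illustration.

```typescript
// Assumed result shape: either the audio payload or an error object.
type TtsResult =
  | { audioBase64: string }
  | { error: { message: string } };

// Hypothetical parameter check; returns an error result, or null when the input is valid.
function validateParams(text: string, userId: string): TtsResult | null {
  if (text.trim().length === 0) {
    return { error: { message: "text must not be empty" } };
  }
  if (userId.trim().length === 0) {
    return { error: { message: "userId must not be empty" } };
  }
  return null;
}

// Narrows the union with the `in` operator and produces text suitable for display.
function describeResult(result: TtsResult): string {
  if ("error" in result) {
    return "TTS failed: " + result.error.message;
  }
  return "TTS ok: " + result.audioBase64.length + " base64 chars";
}
```

Returning errors as values rather than throwing lets the component branch on the result and set its error state directly.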
- User preference to disable TTS (via UI, complementing the environment variable for global control).
- User preference for voice selection.
- More sophisticated rate limiting and queueing if API limits become an issue.
- Admin panel to manage available voices.