This document outlines the updated interaction flow for AI agents within the Planeo application, focusing on how actions, chat messages, and audio playback are synchronized.
The previous AI interaction model involved the AI generating a chat message, which was then processed for Text-to-Speech (TTS), and separately, actions might be inferred or directly generated. The new model tightens this loop, ensuring that an AI agent completes its spoken message and associated action before a new decision cycle is initiated for that agent.
The flow for each AI agent generally follows these steps:
1. **Client-Side Trigger**: An agent becomes ready for a new decision. This readiness is now primarily determined by the completion of audio playback for its previous chat message (or if it had no message to speak).
2. **Visual Data Capture (Client)**: The client captures the current visual information from the specific AI agent's perspective (e.g., a screenshot of its view in the 3D world).
3. **Request to Backend (Client -> Server)**:
   - The client sends the captured image data and the current chat history to the `generateAiActionAndChat` server action.
   - It's crucial that the chat history sent is as up-to-date as possible to provide the LLM with the latest conversational context.
4. **LLM Processing (Server)**:
   - The `generateAiActionAndChat` function on the server constructs a prompt for the Google Generative AI model (e.g., Gemini).
   - This prompt includes the system instructions, the agent's current visual data (image), and the provided chat history.
   - The LLM is instructed to return a JSON object containing:
     - `chatMessage` (optional): A brief text message for the agent to say.
     - `action`: An object describing the action the agent should take (e.g., `{ "type": "move", "direction": "forward", "distance": 1 }` or `{ "type": "turn", "direction": "left", "degrees": 30 }`). A sketch of this payload shape follows the list.
5. **Audio Generation (Server)**:
   - If the LLM response includes a `chatMessage`, the server calls the `generateAudio` function (from `src/lib/audioService.ts`).
   - Currently, for development and testing, this service returns a URL to a standard test audio file (e.g., a T-Rex roar) instead of generating unique TTS for each message; a stub sketch follows the list.
   - The `audioSrc` (URL to the audio file) is added to the response object.
6. **Response to Client (Server -> Client)**: The server action returns the LLM's `action`, the `chatMessage`, and the `audioSrc` to the client.
7. **Client-Side Orchestration**: A client-side component (conceptually, the `AIInteractionOrchestrator` or similar logic) manages the following sequence for the specific agent (see the loop sketch after the list):
   - **Apply Action**: The `action` received from the server is immediately applied to the game state (e.g., the AI agent's model in the 3D world is moved or rotated).
   - **Play Audio**: If an `audioSrc` is present, the audio is played using an HTMLAudioElement.
   - **Wait for Audio Completion**: The orchestrator listens for the `onended` event of the audio element.
   - **Trigger Next Cycle**: Only after the `onended` event fires (or if there was no audio to play) does the orchestrator consider the current cycle complete for that agent and initiate a new cycle by returning to Step 1 (or Step 2 directly).
8. **Chat Message Broadcast (Server via SSE)**: Concurrently with steps 5 and 6, if a `chatMessage` was generated by the LLM, the `postChatMessageToEvents` function is called on the server. This broadcasts the chat message to all connected clients via Server-Sent Events (SSE), allowing it to appear in the shared chat UI; a client-side subscription sketch follows the list.
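As a reference for steps 4 and 6, here is a minimal sketch of the payload shape in TypeScript. The field names (`chatMessage`, `action`, `audioSrc`) come from the flow above; the type names and the exact set of action variants are illustrative assumptions.

```typescript
// Illustrative types only; the real project may model this differently.
type MoveAction = { type: "move"; direction: string; distance: number };
type TurnAction = { type: "turn"; direction: "left" | "right"; degrees: number };
type AgentAction = MoveAction | TurnAction;

interface AiTurnResponse {
  chatMessage?: string; // optional line for the agent to say
  action: AgentAction;  // e.g. { type: "move", direction: "forward", distance: 1 }
  audioSrc?: string;    // added by the server when chatMessage produced audio
}
```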
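Step 5's stubbed audio service could look roughly like this; the asset path and exact function signature are assumptions based on the description above, not the actual source of `src/lib/audioService.ts`.

```typescript
// Sketch of the current stub behavior: every message maps to the same
// test clip, so no real TTS call is made yet.
const TEST_AUDIO_URL = "/audio/t-rex-roar.mp3"; // hypothetical asset path

export async function generateAudio(_message: string): Promise<string> {
  // TODO: swap in a real TTS call; the fixed URL lets the client-side
  // playback loop be exercised end to end during development.
  return TEST_AUDIO_URL;
}
```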
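The heart of step 7 is waiting on the audio element before starting the next cycle. Below is a minimal sketch of that loop, reusing the `AgentAction`/`AiTurnResponse` types from the first sketch; the helper names (`captureAgentView`, `getLatestChatHistory`, `applyAction`) are hypothetical stand-ins for the real app functions.

```typescript
// Hypothetical helpers standing in for the real app functions.
declare function captureAgentView(agentId: string): Promise<Blob>;
declare function getLatestChatHistory(): unknown[];
declare function applyAction(agentId: string, action: AgentAction): void;
declare function generateAiActionAndChat(input: {
  image: Blob;
  history: unknown[];
}): Promise<AiTurnResponse>;

// Resolve on both ended and error so a bad audio URL never stalls the loop.
function playAndWaitForEnd(audioSrc: string): Promise<void> {
  return new Promise((resolve) => {
    const audio = new Audio(audioSrc);
    audio.onended = () => resolve();
    audio.onerror = () => resolve();
    void audio.play().catch(() => resolve()); // autoplay may be blocked
  });
}

async function runAgentCycle(agentId: string): Promise<void> {
  const image = await captureAgentView(agentId);   // step 2
  const history = getLatestChatHistory();          // step 3, from the SSE-fed store
  const { action, audioSrc } = await generateAiActionAndChat({ image, history });

  applyAction(agentId, action);                    // apply the action immediately
  if (audioSrc) {
    await playAndWaitForEnd(audioSrc);             // wait for the speech to finish
  }
  void runAgentCycle(agentId);                     // back to step 1 for this agent
}
```

Because each recursion happens only after an `await`, the loop does not grow the call stack; a real implementation would also need a stop condition (e.g., when the agent is removed from the world).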
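On the receiving end of step 8, each client subscribes to the SSE stream and feeds the shared chat store that step 3 reads from. A sketch, where the `/api/events` endpoint path and the message shape are assumptions:

```typescript
// Hypothetical SSE subscription; adjust the endpoint and payload to match
// whatever postChatMessageToEvents actually broadcasts.
const chatHistory: { sender: string; text: string }[] = [];

const events = new EventSource("/api/events"); // assumed endpoint
events.onmessage = (event) => {
  const message = JSON.parse(event.data) as { sender: string; text: string };
  chatHistory.push(message); // the store getLatestChatHistory() would read
};
```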
This tighter loop has several benefits:

- **Synchronization**: Actions and spoken words are more closely tied: an agent acts, and then its corresponding line is "spoken." The next turn for that agent waits for the speech to finish.
- **Controlled Pacing**: Prevents agents from making new decisions and speaking again before their previous utterance has finished, leading to a more natural and understandable interaction pace.
- **Reduced Rapid Cycling**: Avoids overwhelming the LLM with the too-frequent requests that could occur if decisions were made without waiting for audio.
Implementation notes:

- A central orchestrator (such as the proposed `AIInteractionOrchestrator.tsx` component) is recommended to manage the state for each AI agent (e.g., `isProcessing`, `isWaitingForAudio`); a sketch follows this list.
- It's essential to correctly handle both the `onended` and `onerror` events of the HTMLAudioElement to control the loop; otherwise a failed playback can stall an agent indefinitely.
- Fetching the most recent chat history from a shared client-side store (updated by SSE) just before making the request to the backend is critical for providing good conversational context to the LLM.
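One possible shape for the per-agent state mentioned in the first note; the `Map` keyed by agent id and the `canStartNewCycle` helper are assumptions, not the proposed component's actual API.

```typescript
// Per-agent flags the orchestrator can consult before starting a new cycle.
interface AgentCycleState {
  isProcessing: boolean;      // a generateAiActionAndChat request is in flight
  isWaitingForAudio: boolean; // playback started; onended/onerror not yet fired
}

const agentStates = new Map<string, AgentCycleState>();

function canStartNewCycle(agentId: string): boolean {
  const state = agentStates.get(agentId);
  return !state || (!state.isProcessing && !state.isWaitingForAudio);
}
```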