This document outlines the updated interaction flow for AI agents within the Planeo application, focusing on how actions, chat messages, and audio playback are synchronized.
The previous AI interaction model involved the AI generating a chat message, which was then processed for Text-to-Speech (TTS), and separately, actions might be inferred or directly generated. The new model tightens this loop, ensuring that an AI agent completes its spoken message and associated action before a new decision cycle is initiated for that agent.
The flow for each AI agent generally follows these steps:
1. **Client-Side Trigger**: An agent becomes ready for a new decision. This readiness is now primarily determined by the completion of audio playback for its previous chat message (or if it had no message to speak).
2. **Visual Data Capture (Client)**: The client captures the current visual information from the specific AI agent's perspective (e.g., a screenshot of its view in the 3D world).
3. **Request to Backend (Client -> Server)**:
   - The client sends the captured image data and the current chat history to the `generateAiActionAndChat` server action.
   - It's crucial that the chat history sent is as up-to-date as possible to provide the LLM with the latest conversational context.
4. **LLM Processing (Server)**:
   - The `generateAiActionAndChat` function on the server constructs a prompt for the Google Generative AI model (e.g., Gemini).
   - This prompt includes the system instructions, the agent's current visual data (image), and the provided chat history.
   - The LLM is instructed to return a JSON object containing:
     - `chatMessage` (optional): A brief text message for the agent to say.
     - `action`: An object describing the action the agent should take (e.g., `{ "type": "move", "direction": "forward", "distance": 1 }` or `{ "type": "turn", "direction": "left", "degrees": 30 }`). A sketch of this payload shape follows the list.
5. **Audio Generation (Server)**:
   - If the LLM response includes a `chatMessage`, the server calls the `generateAudio` function (from `src/lib/audioService.ts`).
   - Currently, for development and testing, this service returns a URL to a standard test audio file (e.g., a T-Rex roar) instead of generating unique TTS for each message; a stub sketch follows the list.
   - The `audioSrc` (URL to the audio file) is added to the response object.
6. **Response to Client (Server -> Client)**: The server action returns the LLM's `action`, the `chatMessage`, and the `audioSrc` to the client.
7. **Client-Side Orchestration**: A client-side component (conceptually, the `AIInteractionOrchestrator` or similar logic) manages the following sequence for the specific agent (see the loop sketch after the list):
   - **Apply Action**: The `action` received from the server is immediately applied to the game state (e.g., the AI agent's model in the 3D world is moved or rotated).
   - **Play Audio**: If an `audioSrc` is present, the audio is played using an HTMLAudioElement.
   - **Wait for Audio Completion**: The orchestrator listens for the `onended` event of the audio element.
   - **Trigger Next Cycle**: Only after the `onended` event fires (or if there was no audio to play) does the orchestrator consider the current cycle complete for that agent and initiate a new cycle by returning to Step 1 (or Step 2 directly).
8. **Chat Message Broadcast (Server via SSE)**: Concurrently with steps 5 and 6, if a `chatMessage` was generated by the LLM, the `postChatMessageToEvents` function is called on the server. This broadcasts the chat message to all connected clients via Server-Sent Events (SSE), allowing it to appear in the shared chat UI; a client-side subscription sketch follows the list.
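As a reference for steps 4 and 6, here is a minimal sketch of the payload shape in TypeScript. The field names (`chatMessage`, `action`, `audioSrc`) come from the flow above; the type names and the exact set of action variants are illustrative assumptions.

```typescript
// Illustrative types only; the real project may model this differently.
type MoveAction = { type: "move"; direction: string; distance: number };
type TurnAction = { type: "turn"; direction: "left" | "right"; degrees: number };
type AgentAction = MoveAction | TurnAction;

interface AiTurnResponse {
  chatMessage?: string; // optional line for the agent to say
  action: AgentAction;  // e.g. { type: "move", direction: "forward", distance: 1 }
  audioSrc?: string;    // added by the server when chatMessage produced audio
}
```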
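Step 5's stubbed audio service could look roughly like this; the asset path and exact function signature are assumptions based on the description above, not the actual source of `src/lib/audioService.ts`.

```typescript
// Sketch of the current stub behavior: every message maps to the same
// test clip, so no real TTS call is made yet.
const TEST_AUDIO_URL = "/audio/t-rex-roar.mp3"; // hypothetical asset path

export async function generateAudio(_message: string): Promise<string> {
  // TODO: swap in a real TTS call; the fixed URL lets the client-side
  // playback loop be exercised end to end during development.
  return TEST_AUDIO_URL;
}
```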
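The heart of step 7 is waiting on the audio element before starting the next cycle. Below is a minimal sketch of that loop, reusing the `AgentAction`/`AiTurnResponse` types from the first sketch; the helper names (`captureAgentView`, `getLatestChatHistory`, `applyAction`) are hypothetical stand-ins for the real app functions.

```typescript
// Hypothetical helpers standing in for the real app functions.
declare function captureAgentView(agentId: string): Promise<Blob>;
declare function getLatestChatHistory(): unknown[];
declare function applyAction(agentId: string, action: AgentAction): void;
declare function generateAiActionAndChat(input: {
  image: Blob;
  history: unknown[];
}): Promise<AiTurnResponse>;

// Resolve on both ended and error so a bad audio URL never stalls the loop.
function playAndWaitForEnd(audioSrc: string): Promise<void> {
  return new Promise((resolve) => {
    const audio = new Audio(audioSrc);
    audio.onended = () => resolve();
    audio.onerror = () => resolve();
    void audio.play().catch(() => resolve()); // autoplay may be blocked
  });
}

async function runAgentCycle(agentId: string): Promise<void> {
  const image = await captureAgentView(agentId);   // step 2
  const history = getLatestChatHistory();          // step 3, from the SSE-fed store
  const { action, audioSrc } = await generateAiActionAndChat({ image, history });

  applyAction(agentId, action);                    // apply the action immediately
  if (audioSrc) {
    await playAndWaitForEnd(audioSrc);             // wait for the speech to finish
  }
  void runAgentCycle(agentId);                     // back to step 1 for this agent
}
```

Because each recursion happens only after an `await`, the loop does not grow the call stack; a real implementation would also need a stop condition (e.g., when the agent is removed from the world).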
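On the receiving end of step 8, each client subscribes to the SSE stream and feeds the shared chat store that step 3 reads from. A sketch, where the `/api/events` endpoint path and the message shape are assumptions:

```typescript
// Hypothetical SSE subscription; adjust the endpoint and payload to match
// whatever postChatMessageToEvents actually broadcasts.
const chatHistory: { sender: string; text: string }[] = [];

const events = new EventSource("/api/events"); // assumed endpoint
events.onmessage = (event) => {
  const message = JSON.parse(event.data) as { sender: string; text: string };
  chatHistory.push(message); // the store getLatestChatHistory() would read
};
```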
This tighter loop has several benefits:

- **Synchronization**: Actions and spoken words are more closely tied: an agent acts, and then its corresponding line is "spoken." The next turn for that agent waits for the speech to finish.
- **Controlled Pacing**: Prevents agents from making new decisions and speaking again before their previous utterance has finished, leading to a more natural and understandable interaction pace.
- **Reduced Rapid Cycling**: Avoids overwhelming the LLM with the too-frequent requests that could occur if decisions were made without waiting for audio.
Implementation notes:

- A central orchestrator (such as the proposed `AIInteractionOrchestrator.tsx` component) is recommended to manage the state for each AI agent (e.g., `isProcessing`, `isWaitingForAudio`); a sketch follows this list.
- It's essential to correctly handle both the `onended` and `onerror` events of the HTMLAudioElement to control the loop; otherwise a failed playback can stall an agent indefinitely.
- Fetching the most recent chat history from a shared client-side store (updated by SSE) just before making the request to the backend is critical for providing good conversational context to the LLM.
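One possible shape for the per-agent state mentioned in the first note; the `Map` keyed by agent id and the `canStartNewCycle` helper are assumptions, not the proposed component's actual API.

```typescript
// Per-agent flags the orchestrator can consult before starting a new cycle.
interface AgentCycleState {
  isProcessing: boolean;      // a generateAiActionAndChat request is in flight
  isWaitingForAudio: boolean; // playback started; onended/onerror not yet fired
}

const agentStates = new Map<string, AgentCycleState>();

function canStartNewCycle(agentId: string): boolean {
  const state = agentStates.get(agentId);
  return !state || (!state.isProcessing && !state.isWaitingForAudio);
}
```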