# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with LiveKit Agents.
## Project Overview
This covers multimodal AI agent development with LiveKit Agents, a realtime framework for production-grade voice, text, and vision AI agents. While this guide focuses on Python development, LiveKit also supports Node.js (beta). The concepts and patterns described here apply to building, extending, and improving LiveKit-based conversational AI agents across multiple platforms and use cases.
## Development Commands
- **Function Tools** - Methods decorated with `@function_tool` that extend agent capabilities
- **Entrypoint Function** - Sets up the voice AI pipeline with STT/LLM/TTS components
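The two building blocks above fit together roughly as follows. This is a minimal sketch assuming the livekit-agents 1.x Python API and its separately installed plugin packages; `lookup_weather` is a hypothetical tool, and the specific Deepgram/OpenAI/Silero plugin choices are illustrative, not the only options:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession, RunContext, function_tool
from livekit.plugins import deepgram, openai, silero  # separate plugin packages


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")

    # Function tool: the decorated method is exposed to the LLM, which may
    # call it mid-conversation. `lookup_weather` is a hypothetical example.
    @function_tool
    async def lookup_weather(self, context: RunContext, location: str) -> str:
        """Look up the current weather for a location."""
        return f"The weather in {location} is sunny."  # stub result


async def entrypoint(ctx: agents.JobContext):
    # Entrypoint: assembles the pipeline from swappable STT/LLM/TTS/VAD
    # components, then starts a session in the job's room.
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=openai.TTS(),
        vad=silero.VAD.load(),
    )
    await session.start(room=ctx.room, agent=Assistant())


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

A worker script like this is typically launched through the LiveKit CLI subcommands (for example `python agent.py dev` for local development).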
### Multimodal AI Pipeline Architecture
LiveKit agents use a modular pipeline approach with swappable components:
- **STT (Speech-to-Text)**: Converts audio input to text transcripts
- **LLM (Large Language Model)**: Processes conversations, text, and vision inputs to generate responses
- **TTS (Text-to-Speech)**: Converts text responses back to synthesized speech
- **Vision Processing**: Handles image and video understanding for multimodal interactions
- **Turn Detection**: Determines when users finish speaking for natural conversation flow
- **VAD (Voice Activity Detection)**: Detects when users are speaking vs. silent
- **Background Audio Handling**: Manages background audio and interruption scenarios
- **Interrupt Management**: Handles conversation interruptions and context switching