<imgsrc="assets/demo_thumbnail.png"alt="Watch the demo"style="max-width:800px; width:60%">
16
+
<imgsrc="assets/demo_thumbnail.png"alt="Watch the demo"style="width:100%; max-width:900px;">
11
17
</a>
12
18
13
-
### Multi-modal AI agents that watch, listen, and understand video.

Vision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.

### Key Highlights

- **Video AI:** Built for real-time video AI. Combine YOLO, Roboflow, and others with Gemini/OpenAI in real time.
- **Low Latency:** Join quickly (500 ms) and maintain audio/video latency under 30 ms using [Stream's edge network](https://getstream.io/video/).
- **Open:** Built by Stream, but works with any video edge network.
- **Native APIs:** Native SDK methods from OpenAI (`create response`), Gemini (`generate`), and Claude (`create message`), so you can always access the latest LLM capabilities.
- **SDKs:** SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.
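
To make this concrete, here is a minimal sketch of wiring an agent together. The module paths, class names, and call patterns below are illustrative assumptions based on the descriptions above, not the library's confirmed API; see the docs at [VisionAgents.ai](https://visionagents.ai) for the real thing.

```python
# Illustrative sketch only: module paths, class names, and signatures below
# are assumptions, not the confirmed vision-agents API.
import asyncio

from vision_agents.core import Agent                              # assumed core entry point
from vision_agents.plugins import getstream, gemini, ultralytics  # assumed plugin modules


async def main() -> None:
    agent = Agent(
        edge=getstream.Edge(),                         # transport: Stream's edge, or swap in another
        llm=gemini.Realtime(),                         # realtime multimodal model
        processors=[ultralytics.YOLOPoseProcessor()],  # fast local detection running beside the LLM
        instructions="You are a realtime video coach. Comment on what you see.",
    )
    call = agent.edge.client.video.call("default", "demo-call")  # assumed call handle
    await agent.join(call)    # assumed coroutine: connect and start streaming
    await agent.finish()      # assumed: block until the session ends


if __name__ == "__main__":
    asyncio.run(main())
```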

---

## See It In Action
### Sports Coaching
Combining a fast object detection model (like YOLO) with a full realtime AI is useful for many applications.
For example: drone fire detection, sports/video game coaching, physical therapy, workout coaching, Just Dance-style games, and more.
Get a free API key from [Stream](https://getstream.io/). Developers receive **333,000 participant minutes** per month, plus extra credits via the Maker Program.
## Features

| Feature | Description |
| --- | --- |
| **True real-time via WebRTC** | Stream directly to model providers that support it for instant visual understanding. |
| **Interval/processor pipeline** | For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before/after model calls. |
| **Turn detection & diarization** | Keep conversations natural; know when the agent should speak or stay quiet, and who's talking. |
| **Voice activity detection (VAD)** | Trigger actions intelligently and use resources efficiently. |
| **Speech↔Text↔Speech** | Enable low-latency loops for smooth, conversational voice UX. |
| **Tool/function calling** | Execute arbitrary code and APIs mid-conversation. Create Linear issues, query weather, trigger telephony, or hit internal services. |
| **Built-in memory via Stream Chat** | Agents recall context naturally across turns and sessions. |
| **Text back-channel** | Message the agent silently during a call. |
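
As one example of the tool/function calling row above, here is a hedged sketch of registering a tool with an agent. The `register_function` decorator and its placement on `agent.llm` are hypothetical names for illustration, not the documented API; consult the function-calling docs for the actual interface.

```python
# Hypothetical sketch: the register_function decorator and its signature are
# assumptions for illustration, not the documented vision-agents API.
from vision_agents.core import Agent  # assumed import


def add_tools(agent: Agent) -> None:
    @agent.llm.register_function(  # hypothetical decorator for tool calling
        description="Create a Linear issue from an in-call request.",
    )
    def create_linear_issue(title: str, body: str) -> str:
        # Call your issue tracker here; return a short, speakable result.
        return f"Created issue: {title}"
```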
## Integrations

| Plugin | Description | Docs |
| --- | --- | --- |
| **Cartesia** | TTS plugin for realistic voice synthesis in real-time voice applications. | [View Docs](https://visionagents.ai/integrations/cartesia) |
| **Deepgram** | STT plugin for fast, accurate real-time transcription with speaker diarization. | [View Docs](https://visionagents.ai/integrations/deepgram) |
| **ElevenLabs** | TTS plugin with highly realistic and expressive voices for conversational agents. | [View Docs](https://visionagents.ai/integrations/elevenlabs) |
| **Kokoro** | Local TTS engine for offline voice synthesis with low latency. | [View Docs](https://visionagents.ai/integrations/kokoro) |
| **Moonshine** | STT plugin optimized for fast, locally runnable transcription on constrained devices. | [View Docs](https://visionagents.ai/integrations/moonshine) |
| **OpenAI** | LLM plugin for real-time reasoning, conversation, and multimodal capabilities using OpenAI's Realtime API. | [View Docs](https://visionagents.ai/integrations/openai) |
| **Gemini** | Multimodal plugin for real-time audio, video, and text understanding powered by Google's Gemini Live models. | [View Docs](https://visionagents.ai/integrations/gemini) |
| **Silero** | VAD plugin for voice activity detection and turn-taking in low-latency real-time conversations. | [View Docs](https://visionagents.ai/integrations/silero) |
| **Wizper** | Real-time variant of OpenAI's Whisper v3 for speech-to-text and on-the-fly translation, hosted by Fal.ai. | [View Docs](https://visionagents.ai/integrations/wizper) |
## Processors

Processors let your agent **manage state** and **handle audio/video** in real time.

They take care of the hard stuff, like:

- Running smaller models (like YOLO or Roboflow) alongside the LLM
- Making API calls to maintain relevant info or game state
- Transforming media, such as avatars
- Capturing audio/video

… so you can focus on your agent logic.
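
For instance, a custom processor might look roughly like the sketch below. The `VideoProcessor` base class and the `process_frame` hook are assumptions for illustration, not the documented interface; the real extension points live in the processors docs.

```python
# Rough sketch of a custom processor. The base class and hook name are
# assumptions, not the confirmed vision-agents interface.
import numpy as np

from vision_agents.core.processors import VideoProcessor  # assumed base class


class MotionLogger(VideoProcessor):  # hypothetical processor
    """Tracks how much each frame changed: simple state your agent logic can read."""

    def __init__(self) -> None:
        self.last_frame: np.ndarray | None = None
        self.motion_score: float = 0.0  # state exposed to the rest of the agent

    async def process_frame(self, frame) -> None:  # assumed per-frame hook
        pixels = np.asarray(frame, dtype=np.int16)  # widen to avoid uint8 wraparound
        if self.last_frame is not None:
            self.motion_score = float(np.mean(np.abs(pixels - self.last_frame)))
        self.last_frame = pixels
```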
## Documentation
Check out our getting started guide at [VisionAgents.ai](https://visionagents.ai/).

- Quickstart: [Building a Voice AI app](https://visionagents.ai/introduction/voice-agents)
- Quickstart: [Building a Video AI app](https://visionagents.ai/introduction/video-agents)
| | | |
| --- | --- | --- |
| [@demishassabis](https://x.com/demishassabis)<br>CEO @ Google DeepMind<br><sub>Won a Nobel prize</sub> | [@OfficialLoganK](https://x.com/OfficialLoganK)<br>Product Lead @ Gemini<br><sub>Posts about robotics vision</sub> | [@ultralytics](https://x.com/ultralytics)<br>Various fast vision AI models<br><sub>Pose, detect, segment, classify</sub> |
| [@skalskip92](https://x.com/skalskip92)<br>Open Source Lead @ Roboflow<br><sub>Building tools for vision AI</sub> | [@moondreamai](https://x.com/moondreamai)<br>The tiny vision model that could<br><sub>Lightweight, fast, efficient</sub> | [@kwindla](https://x.com/kwindla)<br>Pipecat / Daily<br><sub>Sharing AI and vision insights</sub> |
| [@juberti](https://x.com/juberti)<br>Head of Realtime AI @ OpenAI<br><sub>Realtime AI systems</sub> | [@romainhuet](https://x.com/romainhuet)<br>Head of DX @ OpenAI<br><sub>Developer tooling & APIs</sub> | [@thorwebdev](https://x.com/thorwebdev)<br>Eleven Labs<br><sub>Voice and AI experiments</sub> |
| [@mervenoyann](https://x.com/mervenoyann)<br>Hugging Face<br><sub>Posts extensively about Video AI</sub> | [@stash_pomichter](https://x.com/stash_pomichter)<br>Spatial memory for robots<br><sub>Robotics & AI navigation</sub> | |
## Inspiration

- LiveKit Agents: Great syntax, but LiveKit only.
- Pipecat: Flexible, but more verbose.
- OpenAI Agents: Focused on OpenAI only.

## Open Platform

We'd like to add support for, and are reaching out to:

* Mediasoup
* Janus
* Cloudflare
* Twilio
* AWS IVS
* Vonage
* And others.

Reach out to [email protected], and we'll collaborate on getting you added.
## Roadmap

### 0.1 – First Release

- Support for 10+ out-of-the-box [integrations](https://visionagents.ai/integrations/introduction-to-integrations)
- Video processors
- Native Stream Chat integration for memory
- MCP & function calling for Gemini and OpenAI
- Realtime WebRTC video and voice with GPT Realtime

### Coming Soon

- [ ] Improved Python WebRTC library
- [ ] Hosting & production deploy example
- [ ] More built-in YOLO processors (object & person detection)
- [ ] Roboflow support
- [ ] Computer use support
- [ ] AI avatar integrations (e.g., Tavus)
- [ ] Qwen3 vision support
- [ ] Buffered video capture (for "catch the moment" scenarios)
- [ ] Moondream vision
## Star History

[Star History Chart](https://www.star-history.com/#GetStream/vision-agents&type=timeline&legend=top-left)