
Commit 290849e

Feat: Add support for Moondream VLM functions (#154)
* Scaffolding setup for Moondream VLM
* Basic (broken) impl
* Fix parsing
* Add some handling around processing
* Basic Moondream VLM example
* Remove extra character
* Clean up folder structure
* WIP local version
* Fix broken track imports
* LocalVLM tests
* Unused param
* Ensure processors are warmed up during launch
* Ruff and MyPy
* PR review - CloudVLM
* Add missing debug log for processor warmup
* Improve local device detection
* Formatting and clean up
* More clean up
* Fix bug with processing lock
* Ruff and MyPy final checks
* Expose device for verification
* Simplify example
* Update public doc strings
* Update readme
* Unused import
1 parent 58fd257 commit 290849e

File tree: 16 files changed (+1154 / -147 lines)

agents-core/vision_agents/core/agents/agent_launcher.py

Lines changed: 8 additions & 0 deletions
@@ -92,6 +92,14 @@ async def warmup(self, **kwargs) -> None:
         if agent.turn_detection and hasattr(agent.turn_detection, 'warmup'):
             logger.debug("Warming up turn detection: %s", agent.turn_detection.__class__.__name__)
             warmup_tasks.append(agent.turn_detection.warmup())
+
+        # Warmup processors
+        if agent.processors and hasattr(agent.processors, 'warmup'):
+            logger.debug("Warming up processors")
+            for processor in agent.processors:
+                if hasattr(processor, 'warmup'):
+                    logger.debug("Warming up processor: %s", processor.__class__.__name__)
+                    warmup_tasks.append(processor.warmup())
 
         # Run all warmups in parallel
         if warmup_tasks:
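Any processor attached to the agent is now warmed up during launch, as long as it exposes an async `warmup()` coroutine; all warmups are then awaited in parallel with the other components. A minimal sketch of a processor compatible with this hook — the class name and loading logic below are illustrative assumptions, only the async `warmup()` contract comes from the diff above:

```python
import asyncio


class ExampleProcessor:
    """Hypothetical processor that pre-loads its model during agent launch."""

    def __init__(self) -> None:
        self._model = None

    async def warmup(self) -> None:
        # Heavy model loading runs in a worker thread so the launcher's event
        # loop stays free while other components warm up concurrently.
        if self._model is None:
            self._model = await asyncio.to_thread(self._load_model)

    def _load_model(self):
        ...  # e.g. download weights, build the inference session
```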

plugins/moondream/README.md

Lines changed: 151 additions & 9 deletions
@@ -1,27 +1,48 @@
 # Moondream Plugin
 
-This plugin provides Moondream 3 detection capabilities for vision-agents, enabling real-time zero-shot object detection on video streams. Choose between cloud-hosted or local processing depending on your needs.
+This plugin provides Moondream 3 vision capabilities for vision-agents, including:
+- **Object Detection**: Real-time zero-shot object detection on video streams
+- **Visual Question Answering (VQA)**: Answer questions about video frames
+- **Image Captioning**: Generate descriptions of video frames
+
+Choose between cloud-hosted or local processing depending on your needs. When running locally, we recommend you do so on CUDA-enabled devices.
 
 ## Installation
 
 ```bash
-uv add vision-agents-plugins-moondream
+uv add vision-agents[moondream]
 ```
 
-## Choosing the Right Processor
+## Choosing the Right Component
+
+### Detection Processors
 
-### CloudDetectionProcessor (Recommended for Most Users)
+#### CloudDetectionProcessor (Recommended for Most Users)
 - **Use when:** You want a simple setup with no infrastructure management
 - **Pros:** No model download, no GPU required, automatic updates
 - **Cons:** Requires API key, 2 RPS rate limit by default (can be increased)
 - **Best for:** Development, testing, low-to-medium volume applications
 
-### LocalDetectionProcessor (For Advanced Users)
+#### LocalDetectionProcessor (For Advanced Users)
 - **Use when:** You need higher throughput, have your own GPU infrastructure, or want to avoid rate limits
 - **Pros:** No rate limits, no API costs, full control over hardware
 - **Cons:** Requires GPU for best performance, model download on first use, infrastructure management
 - **Best for:** Production deployments, high-volume applications, Digital Ocean Gradient AI GPUs, or custom infrastructure
 
+### Vision Language Models (VLM)
+
+#### CloudVLM (Recommended for Most Users)
+- **Use when:** You want visual question answering or captioning without managing infrastructure
+- **Pros:** No model download, no GPU required, automatic updates
+- **Cons:** Requires API key, rate limits apply
+- **Best for:** Development, testing, applications requiring VQA or captioning
+
+#### LocalVLM (For Advanced Users)
+- **Use when:** You need VQA or captioning with higher throughput or want to avoid rate limits
+- **Pros:** No rate limits, no API costs, full control over hardware
+- **Cons:** Requires GPU for best performance, model download on first use, infrastructure management
+- **Best for:** Production deployments, high-volume applications, or custom infrastructure
+
 ## Quick Start
 
 ### Using CloudDetectionProcessor (Hosted)
@@ -64,7 +85,7 @@ from vision_agents.core import Agent
 processor = moondream.LocalDetectionProcessor(
     detect_objects=["person", "car", "dog"],
     conf_threshold=0.3,
-    device="cuda",  # Auto-detects CUDA, MPS, or CPU
+    force_cpu=False,  # Auto-detects CUDA, MPS, or CPU
     fps=30
 )
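The resolved device is exposed on the processor, so the auto-detection result can be verified after construction. A short sketch — the `device` attribute matches what this commit's tests assert, the rest is illustrative:

```python
from vision_agents.plugins import moondream

# force_cpu=True pins inference to the CPU; with the default (False) the
# processor prefers CUDA, then MPS, then falls back to CPU.
cpu_processor = moondream.LocalDetectionProcessor(force_cpu=True)
assert cpu_processor.device == "cpu"

auto_processor = moondream.LocalDetectionProcessor()
print(f"Detection running on: {auto_processor.device}")
```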

@@ -87,6 +108,107 @@ processor = moondream.CloudDetectionProcessor(
 )
 ```
 
+## Vision Language Model (VLM) Quick Start
+
+### Using CloudVLM (Hosted)
+
+The `CloudVLM` uses Moondream's hosted API for visual question answering and captioning. It automatically processes video frames and responds to questions asked via STT (Speech-to-Text).
+
+```python
+import asyncio
+import os
+from dotenv import load_dotenv
+from vision_agents.core import User, Agent, cli
+from vision_agents.core.agents import AgentLauncher
+from vision_agents.plugins import deepgram, getstream, elevenlabs, moondream
+from vision_agents.core.events import CallSessionParticipantJoinedEvent
+
+load_dotenv()
+
+async def create_agent(**kwargs) -> Agent:
+    # Create a cloud VLM for visual question answering
+    llm = moondream.CloudVLM(
+        api_key=os.getenv("MOONDREAM_API_KEY"),  # or set MOONDREAM_API_KEY env var
+        mode="vqa",  # or "caption" for image captioning
+    )
+
+    agent = Agent(
+        edge=getstream.Edge(),
+        agent_user=User(name="My happy AI friend", id="agent"),
+        llm=llm,
+        tts=elevenlabs.TTS(),
+        stt=deepgram.STT(),
+    )
+    return agent
+
+async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
+    await agent.create_user()
+    call = await agent.create_call(call_type, call_id)
+
+    @agent.events.subscribe
+    async def on_participant_joined(event: CallSessionParticipantJoinedEvent):
+        if event.participant.user.id != "agent":
+            await asyncio.sleep(2)
+            # Ask the agent to describe what it sees
+            await agent.simple_response("Describe what you currently see")
+
+    with await agent.join(call):
+        await agent.edge.open_demo(call)
+        await agent.finish()
+
+if __name__ == "__main__":
+    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
+```
+
+### Using LocalVLM (On-Device)
+
+The `LocalVLM` downloads the model from HuggingFace and runs on device. It supports both VQA and captioning modes.
+
+**Note:** The moondream3-preview model is gated and requires HuggingFace authentication:
+- Request access at https://huggingface.co/moondream/moondream3-preview
+- Set `HF_TOKEN` environment variable: `export HF_TOKEN=your_token_here`
+- Or run: `huggingface-cli login`
+
+```python
+from vision_agents.plugins import moondream
+from vision_agents.core import Agent
+
+# Create a local VLM (no API key needed)
+llm = moondream.LocalVLM(
+    mode="vqa",  # or "caption" for image captioning
+    force_cpu=False,  # Auto-detects CUDA, MPS, or CPU
+)
+
+# Use in an agent
+agent = Agent(
+    llm=llm,
+    tts=your_tts,
+    stt=your_stt,
+    # ... other components
+)
+```
+
+### VLM Modes
+
+The VLM supports two modes:
+
+- **`"vqa"`** (Visual Question Answering): Answers questions about video frames. Questions come from STT transcripts.
+- **`"caption"`** (Image Captioning): Generates descriptions of video frames automatically.
+
+```python
+# VQA mode - answers questions about frames
+llm = moondream.CloudVLM(
+    api_key="your-api-key",
+    mode="vqa"
+)
+
+# Caption mode - generates automatic descriptions
+llm = moondream.CloudVLM(
+    api_key="your-api-key",
+    mode="caption"
+)
+```
+
 ## Configuration
 
 ### CloudDetectionProcessor Parameters
@@ -107,12 +229,30 @@ processor = moondream.CloudDetectionProcessor(
 - `fps`: int - Frame processing rate (default: 30)
 - `interval`: int - Processing interval in seconds (default: 0)
 - `max_workers`: int - Thread pool size for CPU-intensive operations (default: 10)
-- `device`: str - Device to run inference on ('cuda', 'mps', or 'cpu'). Auto-detects CUDA, then MPS (Apple Silicon), then defaults to CPU. Default: `None` (auto-detect)
+- `force_cpu`: bool - If True, forces CPU usage even if CUDA/MPS is available. Otherwise auto-detects CUDA, then MPS (Apple Silicon), then falls back to CPU. We recommend running on CUDA for best performance. (default: False)
 - `model_name`: str - Hugging Face model identifier (default: "moondream/moondream3-preview")
 - `options`: AgentOptions - Model directory configuration. If not provided, defaults to tempfile.gettempdir()
 
 **Performance:** Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model will be downloaded from HuggingFace on first use.
 
+### CloudVLM Parameters
+
+- `api_key`: str - API key for the Moondream Cloud API. If not provided, it is read from the `MOONDREAM_API_KEY` environment variable.
+- `mode`: Literal["vqa", "caption"] - "vqa" for visual question answering or "caption" for image captioning (default: "vqa")
+- `max_workers`: int - Thread pool size for CPU-intensive operations (default: 10)
+
+**Rate Limits:** By default, the Moondream Cloud API has rate limits. Contact the Moondream team to request higher limits.
+
+### LocalVLM Parameters
+
+- `mode`: Literal["vqa", "caption"] - "vqa" for visual question answering or "caption" for image captioning (default: "vqa")
+- `max_workers`: int - Thread pool size for async operations (default: 10)
+- `force_cpu`: bool - If True, forces CPU usage even if CUDA/MPS is available. Otherwise auto-detects CUDA, then MPS (Apple Silicon), then falls back to CPU. Note: MPS is automatically converted to CPU due to model compatibility. We recommend running on CUDA for best performance. (default: False)
+- `model_name`: str - Hugging Face model identifier (default: "moondream/moondream3-preview")
+- `options`: AgentOptions - Model directory configuration. If not provided, uses default_agent_options()
+
+**Performance:** Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model will be downloaded from HuggingFace on first use.
+
 ## Video Publishing
 
 The processor publishes annotated video frames with bounding boxes drawn on detected objects:
@@ -146,16 +286,18 @@ pytest plugins/moondream/tests/ -k "annotation" -v
 
 ### Required
 - `vision-agents` - Core framework
-- `moondream` - Moondream SDK for cloud API (CloudDetectionProcessor only)
+- `moondream` - Moondream SDK for cloud API (CloudDetectionProcessor and CloudVLM)
 - `numpy>=2.0.0` - Array operations
 - `pillow>=10.0.0` - Image processing
 - `opencv-python>=4.8.0` - Video annotation
 - `aiortc` - WebRTC support
 
-### LocalDetectionProcessor Additional Dependencies
+### Local Components Additional Dependencies
 - `torch` - PyTorch for model inference
 - `transformers` - HuggingFace transformers library for model loading
 
+**Note:** LocalDetectionProcessor and LocalVLM both require these dependencies. We recommend running models locally only on CUDA devices.
+
 ## Links
 
 - [Moondream Documentation](https://docs.moondream.ai/)
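The local components run on `torch`, and the `force_cpu` parameter described above implies a CUDA → MPS → CPU preference order. A sketch of how that selection is commonly implemented with PyTorch, as an illustration of the documented behaviour rather than the plugin's actual internals (`resolve_device` is a hypothetical helper name):

```python
import torch


def resolve_device(force_cpu: bool = False, allow_mps: bool = True) -> str:
    """Illustrative device selection: CUDA first, then MPS, then CPU."""
    if force_cpu:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # LocalVLM converts MPS to CPU for model compatibility, so a caller
    # could pass allow_mps=False to mirror that behaviour.
    if allow_mps and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```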
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+## Moondream example
+Please see the root README for details.

plugins/moondream/example/__init__.py

Whitespace-only changes.
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+import asyncio
+import logging
+from dotenv import load_dotenv
+
+from vision_agents.core import User, Agent, cli
+from vision_agents.core.agents import AgentLauncher
+from vision_agents.plugins import deepgram, getstream, elevenlabs, moondream
+from vision_agents.core.events import CallSessionParticipantJoinedEvent
+import os
+
+logger = logging.getLogger(__name__)
+
+load_dotenv()
+
+async def create_agent(**kwargs) -> Agent:
+    llm = moondream.CloudVLM(
+        api_key=os.getenv("MOONDREAM_API_KEY"),
+    )
+    # Create an agent that pairs Stream's edge with the Moondream cloud VLM
+    agent = Agent(
+        edge=getstream.Edge(),  # low-latency edge; clients for React, iOS, Android, RN, Flutter etc.
+        agent_user=User(
+            name="My happy AI friend", id="agent"
+        ),
+        llm=llm,
+        tts=elevenlabs.TTS(),
+        stt=deepgram.STT(),
+    )
+    return agent
+
+
+async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
+    # Ensure the agent user is created
+    await agent.create_user()
+    # Create a call
+    call = await agent.create_call(call_type, call_id)
+
+    @agent.events.subscribe
+    async def on_participant_joined(event: CallSessionParticipantJoinedEvent):
+        if event.participant.user.id != "agent":
+            await asyncio.sleep(2)
+            await agent.simple_response("Describe what you currently see")
+
+    # Have the agent join the call/room
+    with await agent.join(call):
+        # Open the demo UI
+        await agent.edge.open_demo(call)
+        # Run till the call ends
+        await agent.finish()
+
+
+if __name__ == "__main__":
+    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+[project]
+name = "moondream-example"
+version = "0.1.0"
+description = "Example using Moondream Detect and VLM with Vision Agents"
+requires-python = ">=3.10"
+dependencies = [
+    "vision-agents",
+    "vision-agents-plugins-moondream",
+    "vision-agents-plugins-getstream",
+    "vision-agents-plugins-deepgram",
+    "vision-agents-plugins-elevenlabs",
+    "vision-agents-plugins-vogent",
+    "python-dotenv",
+]
+
+[tool.uv.sources]
+vision-agents = { workspace = true }
+vision-agents-plugins-moondream = { workspace = true }
+vision-agents-plugins-getstream = { workspace = true }
+vision-agents-plugins-deepgram = { workspace = true }
+vision-agents-plugins-elevenlabs = { workspace = true }
+vision-agents-plugins-vogent = { workspace = true }

plugins/moondream/tests/test_moondream_local.py

Lines changed: 4 additions & 4 deletions
@@ -41,7 +41,7 @@ def golf_image(self, assets_dir) -> Iterator[Image.Image]:
     @pytest.fixture
     def moondream_processor(self) -> Iterator[LocalDetectionProcessor]:
         """Create and manage MoondreamLocalProcessor lifecycle."""
-        processor = LocalDetectionProcessor(device="cpu")
+        processor = LocalDetectionProcessor(force_cpu=True)
         try:
             yield processor
         finally:
@@ -261,7 +261,7 @@ def is_available():
             processor.close()
 
         # Also test explicit MPS parameter
-        processor2 = LocalDetectionProcessor(device="mps")
+        processor2 = LocalDetectionProcessor(force_cpu=True)
         try:
             # Verify explicit MPS is also converted to CPU
             assert processor2.device == "cpu"
@@ -270,7 +270,7 @@ def is_available():
 
     def test_device_explicit_cpu(self):
         """Test explicit CPU device selection."""
-        processor = LocalDetectionProcessor(device="cpu")
+        processor = LocalDetectionProcessor(force_cpu=True)
         try:
             assert processor.device == "cpu"
         finally:
@@ -282,7 +282,7 @@ def test_device_explicit_cpu(self):
     )
     def test_device_explicit_cuda(self):
        """Test explicit CUDA device selection (only if CUDA available)."""
-        processor = LocalDetectionProcessor(device="cuda")
+        processor = LocalDetectionProcessor()
        try:
            assert processor.device == "cuda"
        finally:
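The same `force_cpu` pattern presumably carries over to the new `LocalVLM`. A hedged sketch of an analogous test, assuming `LocalVLM` exposes the same `device` attribute and `close()` method as `LocalDetectionProcessor` (neither is shown in this diff):

```python
from vision_agents.plugins import moondream


def test_local_vlm_force_cpu():
    """Illustrative: a LocalVLM pinned to CPU should report device == 'cpu'."""
    vlm = moondream.LocalVLM(mode="vqa", force_cpu=True)
    try:
        # Assumes LocalVLM exposes `device` like LocalDetectionProcessor does.
        assert vlm.device == "cpu"
    finally:
        # Assumes a close() method analogous to LocalDetectionProcessor's.
        vlm.close()
```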
