Feat: Add support for Moondream VLM functions #154
Merged
Changes from 21 commits
25 commits
d681edd Scaffolding setup for Moondream VLM (Nash0x7E2)
67a8529 Basic (broken) impl (Nash0x7E2)
45509da Fix parsing (Nash0x7E2)
722662d Add some handling around processing (Nash0x7E2)
a184c78 Basic Moondream VLM example (Nash0x7E2)
a9a092e Remove extra character (Nash0x7E2)
f838a1e Clean up folder structure (Nash0x7E2)
a0b5c9d WIP local version (Nash0x7E2)
e0b31d3 Fix broken track imports (Nash0x7E2)
eaddf22 LocalVLM tests (Nash0x7E2)
e32af63 Unused param (Nash0x7E2)
02fad43 Ensure processors are warmed up during launch (Nash0x7E2)
a82e2e0 Ruff and MyPy (Nash0x7E2)
d1af35c PR review - CloudVLM (Nash0x7E2)
ec534fe Add missing debug log for processor warmup (Nash0x7E2)
2d4f0bc Improve local device detection (Nash0x7E2)
fa9847d Formatting and clean up (Nash0x7E2)
f1ba327 More clean up (Nash0x7E2)
f9c91e9 Fix bug with processing lock (Nash0x7E2)
97bc613 Ruff and MyPy final checks (Nash0x7E2)
a801788 Expose device for verification (Nash0x7E2)
83b32f0 Simplify example (Nash0x7E2)
46f0f53 Update public doc strings (Nash0x7E2)
0de1cdd Update readme (Nash0x7E2)
13fb325 unused import (Nash0x7E2)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| import asyncio | ||
| import logging | ||
| from dotenv import load_dotenv | ||
| | ||
| from vision_agents.core import User, Agent, cli | ||
| from vision_agents.core.agents import AgentLauncher | ||
| from vision_agents.plugins import deepgram, getstream, vogent, elevenlabs, moondream, gemini | ||
| from vision_agents.core.events import CallSessionParticipantJoinedEvent | ||
| | ||
| logger = logging.getLogger(__name__) | ||
| | ||
| load_dotenv() | ||
| | ||
| async def create_agent(**kwargs) -> Agent: | ||
|     processor = moondream.LocalDetectionProcessor( | ||
|         # api_key=os.getenv("MOONDREAM_API_KEY"), | ||
| | ||
|     ) | ||
|     # create an agent to run with Stream's edge and a Gemini LLM | ||
|     agent = Agent( | ||
|         edge=getstream.Edge(),  # low latency edge. clients for React, iOS, Android, RN, Flutter etc. | ||
|         agent_user=User( | ||
|             name="My happy AI friend", id="agent" | ||
|         ), | ||
|         llm=gemini.LLM("gemini-2.0-flash"), | ||
|         tts=elevenlabs.TTS(), | ||
|         stt=deepgram.STT(), | ||
|         turn_detection=vogent.TurnDetection(), | ||
|         processors=[processor] | ||
|     ) | ||
|     return agent | ||
| | ||
| | ||
| async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None: | ||
|     # ensure the agent user is created | ||
|     await agent.create_user() | ||
|     # Create a call | ||
|     call = await agent.create_call(call_type, call_id) | ||
| | ||
|     @agent.events.subscribe | ||
|     async def on_participant_joined(event: CallSessionParticipantJoinedEvent): | ||
|         if event.participant.user.id != "agent": | ||
|             await asyncio.sleep(2) | ||
|             await agent.simple_response("Describe what you currently see") | ||
| | ||
|     # Have the agent join the call/room | ||
|     with await agent.join(call): | ||
|         # Open the demo UI | ||
|         await agent.edge.open_demo(call) | ||
|         # run till the call ends | ||
|         await agent.finish() | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
|     cli(AgentLauncher(create_agent=create_agent, join_call=join_call)) | ||
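
The join handler in the example only prompts the agent when someone other than the agent itself joins, and waits briefly before doing so. A minimal sketch of that guard in isolation, runnable without the SDK: `FakeUser`, `FakeParticipant`, `FakeJoinedEvent`, `FakeAgent`, and `greet_on_join` are hypothetical stand-ins introduced here for illustration and are not part of `vision_agents`.

```python
import asyncio
from dataclasses import dataclass, field

AGENT_USER_ID = "agent"  # same id the example passes to User(...)


# Hypothetical stand-ins that mimic only the attributes the handler reads.
@dataclass
class FakeUser:
    id: str


@dataclass
class FakeParticipant:
    user: FakeUser


@dataclass
class FakeJoinedEvent:
    participant: FakeParticipant


@dataclass
class FakeAgent:
    prompts: list[str] = field(default_factory=list)

    async def simple_response(self, text: str) -> None:
        # Record the prompt instead of calling a real model.
        self.prompts.append(text)


async def greet_on_join(agent: FakeAgent, event: FakeJoinedEvent, delay: float = 2.0) -> bool:
    """Prompt the agent unless the joining participant is the agent itself."""
    if event.participant.user.id == AGENT_USER_ID:
        return False
    await asyncio.sleep(delay)  # give the participant's video a moment to arrive
    await agent.simple_response("Describe what you currently see")
    return True


def demo() -> list[str]:
    agent = FakeAgent()

    async def run() -> None:
        for user_id in ("agent", "alice"):
            event = FakeJoinedEvent(FakeParticipant(FakeUser(user_id)))
            await greet_on_join(agent, event, delay=0)

    asyncio.run(run())
    return agent.prompts
```

Calling `demo()` simulates the agent and one human joining; only the human's join produces a prompt.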
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| [project] | ||
| name = "moondream-example" | ||
| version = "0.1.0" | ||
| description = "Example using Moondream Detect and VLM with Vision Agents" | ||
| requires-python = ">=3.10" | ||
| dependencies = [ | ||
|     "vision-agents", | ||
|     "vision-agents-plugins-moondream", | ||
|     "vision-agents-plugins-getstream", | ||
|     "vision-agents-plugins-deepgram", | ||
|     "vision-agents-plugins-elevenlabs", | ||
|     "vision-agents-plugins-vogent", | ||
|     "python-dotenv", | ||
| ] | ||
| | ||
| [tool.uv.sources] | ||
| vision-agents = { workspace = true } | ||
| vision-agents-plugins-moondream = { workspace = true } | ||
| vision-agents-plugins-getstream = { workspace = true } | ||
| vision-agents-plugins-deepgram = { workspace = true } | ||
| vision-agents-plugins-elevenlabs = { workspace = true } | ||
| vision-agents-plugins-vogent = { workspace = true } | ||
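
A quick way to review a manifest like this is to compare the plugins the example script imports against the plugin packages it declares. The sketch below does that with values copied from this diff. It assumes each plugin `<name>` ships as `vision-agents-plugins-<name>`, which is inferred only from the names listed above, and the helper functions are hypothetical. A plugin it reports may still be installed some other way (for instance as a transitive dependency), so treat the result as a prompt to check rather than a confirmed bug.

```python
# Import line copied from the example script in this diff.
IMPORT_LINE = (
    "from vision_agents.plugins import "
    "deepgram, getstream, vogent, elevenlabs, moondream, gemini"
)

# Dependency names copied from the pyproject.toml in this diff.
DECLARED = [
    "vision-agents",
    "vision-agents-plugins-moondream",
    "vision-agents-plugins-getstream",
    "vision-agents-plugins-deepgram",
    "vision-agents-plugins-elevenlabs",
    "vision-agents-plugins-vogent",
    "python-dotenv",
]

# Assumed naming convention, inferred from the declared names above.
PREFIX = "vision-agents-plugins-"


def imported_plugins(import_line: str) -> set[str]:
    """Return the names listed after 'import' on a single-line import."""
    _, _, names = import_line.partition(" import ")
    return {name.strip() for name in names.split(",") if name.strip()}


def declared_plugins(dependencies: list[str]) -> set[str]:
    """Return plugin names for dependencies that follow the assumed prefix."""
    return {dep[len(PREFIX):] for dep in dependencies if dep.startswith(PREFIX)}


def undeclared_plugins(import_line: str, dependencies: list[str]) -> set[str]:
    """Plugins that are imported but not declared as dependencies."""
    return imported_plugins(import_line) - declared_plugins(dependencies)


if __name__ == "__main__":
    print(sorted(undeclared_plugins(IMPORT_LINE, DECLARED)))  # → ['gemini']
```

With the values from this diff, the only imported plugin without a matching declared dependency is `gemini`.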
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,102 @@ | ||
| """ | ||
| Tests for the Moondream LocalVLM plugin. | ||
| | ||
| Integration tests require the HF_TOKEN environment variable (for gated model access): | ||
| | ||
|     export HF_TOKEN="your-token-here" | ||
|     uv run pytest plugins/moondream/tests/test_moondream_local_vlm.py -m integration -v | ||
| """ | ||
| import os | ||
| from pathlib import Path | ||
| from typing import AsyncIterator, Iterator | ||
| | ||
| import pytest | ||
| import av | ||
| from PIL import Image | ||
| | ||
| from vision_agents.plugins.moondream import LocalVLM | ||
| | ||
| | ||
| @pytest.fixture(scope="session") | ||
| def golf_image(assets_dir) -> Iterator[Image.Image]: | ||
|     """Load the local golf swing test image from tests/test_assets.""" | ||
|     asset_path = Path(assets_dir) / "golf_swing.png" | ||
|     with Image.open(asset_path) as img: | ||
|         yield img.convert("RGB") | ||
| | ||
| | ||
| @pytest.fixture | ||
| def golf_frame(golf_image: Image.Image) -> av.VideoFrame: | ||
|     """Create an av.VideoFrame from the golf image.""" | ||
|     return av.VideoFrame.from_image(golf_image) | ||
| | ||
| | ||
| @pytest.fixture | ||
| async def local_vlm_vqa() -> AsyncIterator[LocalVLM]: | ||
|     """Create LocalVLM in VQA mode.""" | ||
|     hf_token = os.getenv("HF_TOKEN") | ||
|     if not hf_token: | ||
|         pytest.skip("HF_TOKEN not set") | ||
| | ||
|     vlm = LocalVLM(mode="vqa") | ||
|     try: | ||
|         await vlm.warmup() | ||
|         yield vlm | ||
|     finally: | ||
|         vlm.close() | ||
| | ||
| | ||
| @pytest.fixture | ||
| async def local_vlm_caption() -> AsyncIterator[LocalVLM]: | ||
|     """Create LocalVLM in caption mode.""" | ||
|     hf_token = os.getenv("HF_TOKEN") | ||
|     if not hf_token: | ||
|         pytest.skip("HF_TOKEN not set") | ||
| | ||
|     vlm = LocalVLM(mode="caption") | ||
|     try: | ||
|         await vlm.warmup() | ||
|         yield vlm | ||
|     finally: | ||
|         vlm.close() | ||
| | ||
| | ||
| @pytest.mark.integration | ||
| @pytest.mark.skipif(not os.getenv("HF_TOKEN"), reason="HF_TOKEN not set") | ||
| async def test_local_vqa_mode(golf_frame: av.VideoFrame, local_vlm_vqa: LocalVLM): | ||
|     """Test LocalVLM VQA mode with a question about the image.""" | ||
| | ||
|     await local_vlm_vqa.warmup() | ||
|     assert local_vlm_vqa.model is not None, "Model must be loaded before test" | ||
| | ||
|     local_vlm_vqa._latest_frame = golf_frame | ||
| | ||
|     question = "What sport is being played in this image?" | ||
|     response = await local_vlm_vqa.simple_response(question) | ||
| | ||
|     assert response is not None | ||
|     assert response.text is not None | ||
|     assert len(response.text) > 0 | ||
|     assert response.exception is None | ||
| | ||
|     assert "golf" in response.text.lower() | ||
| | ||
| | ||
| @pytest.mark.integration | ||
| @pytest.mark.skipif(not os.getenv("HF_TOKEN"), reason="HF_TOKEN not set") | ||
| async def test_local_caption_mode(golf_frame: av.VideoFrame, local_vlm_caption: LocalVLM): | ||
|     """Test LocalVLM caption mode to generate a description of the image.""" | ||
| | ||
|     await local_vlm_caption.warmup() | ||
|     assert local_vlm_caption.model is not None, "Model must be loaded before test" | ||
| | ||
|     local_vlm_caption._latest_frame = golf_frame | ||
| | ||
|     response = await local_vlm_caption.simple_response("") | ||
| | ||
|     assert response is not None | ||
|     assert response.text is not None | ||
|     assert len(response.text) > 0 | ||
|     assert response.exception is None | ||
| | ||
|     assert len(response.text.strip()) > 0 | ||
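
Both LocalVLM fixtures follow the same lifecycle: construct, `await warmup()`, yield, and always `close()` in a `finally` block. A minimal sketch of that lifecycle as a reusable async context manager is shown below. `FakeVLM` and `warmed_vlm` are hypothetical names used only for illustration; `FakeVLM` mimics just the `warmup()`, `close()`, and `model` members the tests touch and does not represent the plugin's actual implementation.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator


class FakeVLM:
    """Hypothetical stand-in exposing only what the fixtures use."""

    def __init__(self, mode: str) -> None:
        self.mode = mode
        self.model = None
        self.closed = False

    async def warmup(self) -> None:
        # A real implementation would load model weights here.
        self.model = object()

    def close(self) -> None:
        self.closed = True


@asynccontextmanager
async def warmed_vlm(mode: str) -> AsyncIterator[FakeVLM]:
    """Yield a warmed-up VLM and close it even if the body raises."""
    vlm = FakeVLM(mode=mode)
    try:
        await vlm.warmup()
        yield vlm
    finally:
        vlm.close()


async def run_demo() -> tuple[bool, bool]:
    async with warmed_vlm("vqa") as vlm:
        loaded = vlm.model is not None
    # After the block exits, close() has run.
    return loaded, vlm.closed


def demo() -> tuple[bool, bool]:
    return asyncio.run(run_demo())
```

Because the fixtures already await `warmup()` before yielding, a test that receives one could rely on `model` being set without calling `warmup()` a second time, assuming `warmup()` behaves as the stand-in does.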
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,105 @@ | ||
| """ | ||
| Tests for the Moondream CloudVLM plugin. | ||
| | ||
| Integration tests require the MOONDREAM_API_KEY environment variable: | ||
| | ||
|     export MOONDREAM_API_KEY="your-key-here" | ||
|     uv run pytest plugins/moondream/tests/test_moondream_vlm.py -m integration -v | ||
| | ||
| To run only unit tests (no API key needed): | ||
| | ||
|     uv run pytest plugins/moondream/tests/test_moondream_vlm.py -m "not integration" -v | ||
| """ | ||
| import os | ||
| from pathlib import Path | ||
| from typing import AsyncIterator, Iterator | ||
| | ||
| import pytest | ||
| import av | ||
| from PIL import Image | ||
| | ||
| from vision_agents.plugins.moondream import CloudVLM | ||
| | ||
| | ||
| @pytest.fixture(scope="session") | ||
| def golf_image(assets_dir) -> Iterator[Image.Image]: | ||
|     """Load the local golf swing test image from tests/test_assets.""" | ||
|     asset_path = Path(assets_dir) / "golf_swing.png" | ||
|     with Image.open(asset_path) as img: | ||
|         yield img.convert("RGB") | ||
| | ||
| | ||
| @pytest.fixture | ||
| def golf_frame(golf_image: Image.Image) -> av.VideoFrame: | ||
|     """Create an av.VideoFrame from the golf image.""" | ||
|     return av.VideoFrame.from_image(golf_image) | ||
| | ||
| | ||
| @pytest.fixture | ||
| async def vlm_vqa() -> AsyncIterator[CloudVLM]: | ||
|     """Create CloudVLM in VQA mode.""" | ||
|     api_key = os.getenv("MOONDREAM_API_KEY") | ||
|     if not api_key: | ||
|         pytest.skip("MOONDREAM_API_KEY not set") | ||
| | ||
|     vlm = CloudVLM(api_key=api_key, mode="vqa") | ||
|     try: | ||
|         yield vlm | ||
|     finally: | ||
|         vlm.close() | ||
| | ||
| | ||
| @pytest.fixture | ||
| async def vlm_caption() -> AsyncIterator[CloudVLM]: | ||
|     """Create CloudVLM in caption mode.""" | ||
|     api_key = os.getenv("MOONDREAM_API_KEY") | ||
|     if not api_key: | ||
|         pytest.skip("MOONDREAM_API_KEY not set") | ||
| | ||
|     vlm = CloudVLM(api_key=api_key, mode="caption") | ||
|     try: | ||
|         yield vlm | ||
|     finally: | ||
|         vlm.close() | ||
| | ||
| | ||
| @pytest.mark.integration | ||
| @pytest.mark.skipif(not os.getenv("MOONDREAM_API_KEY"), reason="MOONDREAM_API_KEY not set") | ||
| async def test_vqa_mode(golf_frame: av.VideoFrame, vlm_vqa: CloudVLM): | ||
|     """Test VQA mode with a question about the image.""" | ||
|     # Set the latest frame so _process_frame can access it | ||
|     vlm_vqa._latest_frame = golf_frame | ||
| | ||
|     # Ask a question about the image | ||
|     question = "What sport is being played in this image?" | ||
|     response = await vlm_vqa.simple_response(question) | ||
| | ||
|     # Verify we got a response | ||
|     assert response is not None | ||
|     assert response.text is not None | ||
|     assert len(response.text) > 0 | ||
|     assert response.exception is None | ||
| | ||
|     # Verify the response mentions golf (should be in the image) | ||
|     assert "golf" in response.text.lower() | ||
| | ||
| | ||
| @pytest.mark.integration | ||
| @pytest.mark.skipif(not os.getenv("MOONDREAM_API_KEY"), reason="MOONDREAM_API_KEY not set") | ||
| async def test_caption_mode(golf_frame: av.VideoFrame, vlm_caption: CloudVLM): | ||
|     """Test caption mode to generate a description of the image.""" | ||
|     # Set the latest frame so _process_frame can access it | ||
|     vlm_caption._latest_frame = golf_frame | ||
| | ||
|     # Generate caption (text is not needed for caption mode) | ||
|     response = await vlm_caption.simple_response("") | ||
| | ||
|     # Verify we got a response | ||
|     assert response is not None | ||
|     assert response.text is not None | ||
|     assert len(response.text) > 0 | ||
|     assert response.exception is None | ||
| | ||
|     # Verify the caption is not just whitespace | ||
|     assert len(response.text.strip()) > 0 | ||
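
The VQA and caption tests repeat the same four response checks, and the caption test's final whitespace check already implies the plain length check. One possible way to consolidate them is a single helper, sketched below. `FakeResponse` and `check_response` are hypothetical names introduced for illustration; `FakeResponse` models only the `text` and `exception` attributes these tests read and is not the type returned by `simple_response`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in with the two attributes the tests inspect."""

    text: Optional[str]
    exception: Optional[Exception] = None


def check_response(response: Optional[FakeResponse], must_mention: Optional[str] = None) -> None:
    """Assert that a response succeeded and carries non-blank text."""
    assert response is not None, "no response returned"
    assert response.exception is None, f"response carried an error: {response.exception!r}"
    assert response.text is not None, "response text is None"
    # strip() covers both the empty-string and whitespace-only cases.
    assert response.text.strip(), "response text is empty or whitespace"
    if must_mention is not None:
        assert must_mention.lower() in response.text.lower(), (
            f"expected {must_mention!r} in {response.text!r}"
        )


if __name__ == "__main__":
    # Passes: non-blank text that mentions the keyword, case-insensitively.
    check_response(FakeResponse("A golfer mid-swing"), must_mention="golf")
```

In the tests above, the VQA case would correspond to `check_response(response, must_mention="golf")` and the caption case to `check_response(response)`.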