🐠 Nemo's Vision: Edge Multimodal Agent

"A privacy-first, dual-spectrum autonomous agent running locally on Raspberry Pi 4B."

Nemo's Vision is an experiment in multimodal orchestration on the edge. It combines real-time object detection, small language models (SLMs), and text-to-speech (TTS) into a single autonomous loop, running entirely on a standard Raspberry Pi 4B (8GB).

Unlike cloud-based solutions, Nemo processes dual video streams and generates contextual narration entirely on-device without any external accelerators (no Hailo, no Coral, no GPU). To achieve near real-time performance on a Cortex-A72 CPU, we moved away from monolithic Vision Language Models (VLMs) in favor of a modular, orchestrated pipeline.

(The Interface: Simultaneous RGB and NoIR streams with bounding box inference, feeding into a generated narrative.)

🧠 The Architecture: Why Modular Wins on Edge

We initially experimented with lightweight Vision Language Models (VLMs) such as Moondream2 and NanoLLaVA. These models are impressive, but running one on the Pi's CPU took more than 10 s per inference, which destroyed the sense of presence.

To solve this, we decoupled Vision from Reasoning:

Vision (The "Eyes"): Ultralytics YOLO26 (Small).

Evolution: We upgraded from YOLO11s to the newly released YOLO26s. Its NMS-free architecture gives slightly better detection accuracy at similar CPU speed, and it emits structured data (labels + bounding box coordinates) without a separate NMS post-processing step.

Reasoning (The "Brain"): Qwen 2.5 (0.5B) via Ollama.

Why: By feeding the structured YOLO data into a tiny but capable SLM, we get descriptive scene understanding ("The teddy bear is in front of the book") at a fraction of the compute cost of a full VLM.

Speech (The "Voice"): KittenTTS.

Why: Ultra-low latency synthesis that fits within the remaining CPU cycles.
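
The hand-off from the "Eyes" to the "Brain" can be sketched as follows. This is a minimal illustration, not the project's actual code: the detection dictionary layout, the helper names (`describe_position`, `build_prompt`, `build_ollama_payload`), the 640-pixel frame width, and the prompt wording are all assumptions. The payload fields (`model`, `prompt`, `stream`) follow Ollama's `/api/generate` request format; the request itself is not sent here.

```python
import json


def describe_position(box, frame_width):
    """Map a bounding box (x1, y1, x2, y2) to a coarse horizontal region."""
    center_x = (box[0] + box[2]) / 2
    if center_x < frame_width / 3:
        return "left"
    if center_x > 2 * frame_width / 3:
        return "right"
    return "center"


def build_prompt(detections, frame_width=640):
    """Turn structured detections into a compact text prompt for the SLM."""
    if not detections:
        return "No objects are visible. Say so in one short sentence."
    lines = [
        f"- {d['label']} ({describe_position(d['box'], frame_width)}, "
        f"confidence {d['conf']:.2f})"
        for d in detections
    ]
    header = "Describe this scene in one short sentence.\nDetected objects:\n"
    return header + "\n".join(lines)


def build_ollama_payload(prompt, model="qwen2.5:0.5b"):
    """Build the JSON body for a non-streaming Ollama generate request."""
    return {"model": model, "prompt": prompt, "stream": False}


# Hypothetical detections, shaped like what a YOLO post-processing step might emit.
detections = [
    {"label": "teddy bear", "box": (50, 120, 210, 400), "conf": 0.91},
    {"label": "book", "box": (300, 150, 420, 380), "conf": 0.84},
]

prompt = build_prompt(detections)
payload = build_ollama_payload(prompt)
print(json.dumps(payload, indent=2))
```

Because the SLM only ever sees a few short lines of text, the prompt stays small no matter how large the camera frames are, which is where the compute saving over a full VLM comes from.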

🚦 "Traffic Light" Orchestration (The Secret Sauce)

Running three neural networks simultaneously on a Raspberry Pi CPU is a recipe for contention and stalls. Our resource logs show the system frequently hitting 97-98% CPU utilization.

(High CPU contention requires strict thread management to maintain stability.)

To manage this, we engineered a custom "Traffic Light" Service Orchestrator:

State Management: A shared, thread-safe state object acts as the central nervous system.
Sequential Locking: The system uses a semaphore-style lock. When the Narrator (LLM) needs to "think," it signals the Vision service to throttle down (sleep), freeing up critical CPU cycles for inference. Once the thought is generated, the Vision service wakes up while the TTS service takes over.
Result: A fluid, non-blocking experience that feels "alive" despite the hardware operating at its absolute limit.
Efficiency: Through careful memory management and quantization, the entire stack (Dual Camera Streams + YOLO26s + Qwen 2.5 + TTS + Web Server) stays well within 2GB of RAM, leaving plenty of headroom on the 8GB board.
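
The throttling idea described above can be sketched with Python's standard `threading` primitives. This is an illustrative sketch under stated assumptions, not the project's orchestrator: the `TrafficLight` class, its method names, and the `vision_loop` function are hypothetical, and `time.sleep` calls stand in for YOLO and LLM inference.

```python
import threading
import time
from contextlib import contextmanager


class TrafficLight:
    """Shared, thread-safe state: vision may run only while the light is green."""

    def __init__(self):
        self._vision_may_run = threading.Event()
        self._vision_may_run.set()               # start green
        self._narrator_lock = threading.Lock()   # one "thought" at a time

    @contextmanager
    def thinking(self):
        """Turn the light red for the duration of an LLM inference."""
        with self._narrator_lock:
            self._vision_may_run.clear()
            try:
                yield
            finally:
                self._vision_may_run.set()       # green again, even on error

    def wait_until_green(self, timeout=None):
        return self._vision_may_run.wait(timeout)

    def is_green(self):
        return self._vision_may_run.is_set()


def vision_loop(light, stop, frames):
    """Stand-in for the detection loop: record a timestamp per processed frame."""
    while not stop.is_set():
        if not light.wait_until_green(timeout=0.05):
            continue                             # still red: re-check the stop flag
        frames.append(time.monotonic())
        time.sleep(0.01)                         # placeholder for YOLO inference


light = TrafficLight()
stop = threading.Event()
frames = []
worker = threading.Thread(target=vision_loop, args=(light, stop, frames), daemon=True)
worker.start()

time.sleep(0.1)                                  # vision runs freely
with light.thinking():
    time.sleep(0.05)                             # let any in-flight frame finish
    before = len(frames)
    time.sleep(0.2)                              # placeholder for LLM inference
    after = len(frames)
time.sleep(0.1)                                  # vision resumes; TTS would run here

stop.set()
worker.join()
print(f"frames during thinking: {after - before}, total frames: {len(frames)}")
```

An `Event` is used for the green/red signal because the vision thread only needs to wait, never to own anything; the separate `Lock` keeps two narration requests from overlapping. The `finally` clause ensures the vision loop is not left paused if inference raises.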

🛠️ Tech Stack

Hardware: Raspberry Pi 4B (8GB RAM, Cortex-A72 CPU). No external accelerators.
Vision: Ultralytics YOLO26s (ONNX Runtime for CPU acceleration).
LLM: Qwen 2.5 (0.5B), quantized and served via Ollama (model tag qwen2.5:0.5b).
TTS: KittenTTS (Real-time synthesis).
Backend: Python (Flask), Threading, OpenCV.
