A fully offline, Iron Man-style voice + vision AI assistant running entirely on a $249 Jetson Orin Nano Super (8 GB). No cloud. No API keys. No subscriptions. Just you and your AI.
"At your service, sir."
Quick Start · Features · Performance · Architecture · Roadmap · Community
Most "local AI assistants" are a chatbot with a microphone. This is what happens when you actually build the full Iron Man experience on a $249 board:
| What others do | What J.A.R.V.I.S. does |
|---|---|
| Text chat with local LLM | Wake word → STT → LLM with tools → TTS through Bluetooth earbuds |
| Maybe a webcam feed | TensorRT YOLOE detection + optical flow + ego-motion + trajectory prediction + 3D holograms |
| "Works on my 4090" | Runs on 8 GB shared RAM — LLM + vision + depth + vitals simultaneously |
| Cloud fallback "for now" | Zero cloud dependencies. Everything local. Always. |
| Basic web UI | SvelteKit PWA with live camera, Three.js holograms, vitals, Iron Man HUD, threat alerts |
| No health awareness | rPPG heart rate, fatigue detection, posture scoring, proactive health alerts |
| Crashes on OOM | Multi-layer CUDA OOM recovery with automatic context reduction and model reload |
- openWakeWord — custom wake word, always listening
- Faster-Whisper — local STT, no cloud transcription, warm-started at boot
- Piper TTS — British male voice (Paul Bettany energy)
- Bluetooth — full HFP/A2DP with auto-reconnect daemon (exponential backoff)
- WebRTC VAD — adaptive end-of-speech detection (no more fixed 5s recording)
- Qwen3:1.7b (Q4_K_M) via Ollama — native tool-calling, 100% GPU offload
- 8192-token context — sweet spot for 8 GB: fast inference, no swap pressure
- Intent-based routing — only sends tool schemas when needed (0.5s greetings, not 8s)
- Adaptive thinking — `think=false` for chat, `think=true` for tool calls
- JARVIS persona — formal British wit, sarcasm toggle, MCU-accurate responses
- YOLOE-26N (TensorRT FP16) — open-vocabulary detection, set any prompt at runtime
- ByteTrack — multi-object tracking with flow-assisted prediction (reduced ID switches)
- DepthAnything V2 Small (TensorRT FP16) — real-time depth maps for 3D holograms
- MediaPipe — face mesh (EAR fatigue, rPPG heart rate) + pose (posture scoring)
- Threat detection — anomaly scoring with trajectory-based collision prediction
- Always-on background scene — continuous context updated every 5s for spatial awareness
- Proactive intelligence — detects person enter/leave, new objects, env changes
- Proximity alerts — distance-based audio cues in portable mode ("Sir, obstacle ahead")
- Portable mode — 320×320 @ 10 FPS with thermal throttling + battery monitoring
- Optical flow (DIS default, Farneback available) — dense motion vectors with pre-allocated buffers (~6ms at 320x240)
- Ego-motion estimation — RANSAC fundamental matrix with result caching for static scenes (~0.04ms cached, ~2ms uncached)
- Object velocities in m/s — flow + depth fusion via pinhole camera model
- Trajectory prediction — vectorised NumPy batch computation (all objects at once), stationary skip (~0.5ms for 10 objects)
- Collision detection — time-to-collision estimation with proactive voiced alerts: "Sir, bicycle from left at 8 km/h — collision in 2.4 seconds"
- Walk-around awareness — detects user walking/panning/turning, stabilises detections during ego-motion
- Motion-aware context — LLM receives speeds, distances, trajectories, ego-motion state automatically
- Zero extra GPU — entire perception pipeline is CPU-only (OpenCV/NumPy), ~8ms avg / ~10ms p95
- Ambient awareness — always-on DIS flow at 160x120 (~2ms), detects motion/scene changes without full YOLOE
- Zero manual triggers — ambient events auto-escalate to full perception when significant change detected
- Proactive verbalization — collision alerts, scene changes, walking/stationary transitions spoken automatically
- Cooldown system — prevents verbal spam (10s non-critical, 0s safety-critical like collisions)
- Thermal/battery adaptive — auto-reduces duty cycle at >70°C or <15% battery, pauses at >80°C
- State machine — IDLE (2 Hz) → ACTIVE (5 Hz) → COOLDOWN, with configurable durations
- Live MJPEG camera feed with detection overlays and threat-level borders
- Three.js holograms — real-time 3D point cloud visualization (2D Canvas fallback)
- HUD overlay — Iron Man-style AR tracking with real-time annotations
- Vitals dashboard — fatigue, posture, heart rate, all via WebSocket
- Jetson stats — GPU/CPU/thermal monitoring
- Reminders — create and manage via voice or UI
- Accessible from any device on the LAN
- 367+ unit + E2E tests with pytest (344 unit, 23+ E2E)
- Preflight system checks — validates all subsystems at startup with verbal status
- Multi-layer CUDA OOM protection — pauses vision, unloads model, drops caches, retries with smaller context
- Bluetooth auto-reconnect — daemon monitors every 10s, verifies audio route after reconnect
- BT-aware VAD — longer silence threshold (2.0s) when BT audio detected (compensates codec latency)
- Listening chime — pre-synthesized "Listening, sir" played instantly on wake word detection
- Camera auto-reconnect on USB disconnect
- WebSocket reliability — message sequencing, reorder buffer (50ms hold for gaps), rate limiting, heartbeat with health tracking (good/degraded/lost), ack-based loading
- PWA button debouncing — ack-based loading states (resolved on server response, not timeout), `aria-busy` accessibility
- Verbal error recovery — TTS-spoken recovery messages on STT/LLM/vision failures instead of silent drops
- Connection health — pong-based heartbeat monitoring, auto-reconnect on 3 missed pongs, state resync on reconnect
- Graceful degradation — every subsystem is optional, pipeline continues if one fails
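The flow-plus-depth velocity fusion and time-to-collision logic listed above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the project's actual `perception.py` or `trajectory.py`; the focal lengths, array shapes, and thresholds are assumed values:

```python
import numpy as np

def object_velocities(flow_px, depth_m, fx=500.0, fy=500.0, dt=0.1):
    """Convert per-object pixel flow to metric velocity via the pinhole model.

    flow_px : (N, 2) mean optical-flow vector per object, pixels/frame
    depth_m : (N,)   estimated object depth in metres
    fx, fy  : camera focal lengths in pixels (illustrative values)
    """
    # Pinhole model: lateral metres-per-pixel scales linearly with depth.
    vx = flow_px[:, 0] * depth_m / fx / dt
    vy = flow_px[:, 1] * depth_m / fy / dt
    return np.stack([vx, vy], axis=1)  # (N, 2) in m/s

def time_to_collision(pos_m, vel_m, zone_m=2.0):
    """Vectorised closing-speed TTC for all objects at once.

    pos_m : (N, 2) object positions relative to the camera, metres
    vel_m : (N, 2) object velocities, m/s
    Returns (N,) seconds until the object reaches the alert zone;
    inf for objects that are not closing in.
    """
    dist = np.linalg.norm(pos_m, axis=1)
    # Closing speed = -d|p|/dt = -(p . v) / |p|
    closing = -np.einsum("ij,ij->i", pos_m, vel_m) / np.maximum(dist, 1e-6)
    ttc = np.where(closing > 1e-3,
                   (dist - zone_m) / np.maximum(closing, 1e-6),
                   np.inf)
    return np.maximum(ttc, 0.0)
```

Because both functions operate on whole `(N, 2)` arrays, adding more tracked objects costs almost nothing, which is the point of the "vectorised batch computation" bullet above.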
Real benchmarks on Jetson Orin Nano Super (8 GB), MAXN_SUPER, jetson_clocks:
| Scenario | Latency | Notes |
|---|---|---|
| Greeting / status / time | 0.5 – 0.7s | `think=false`, no tools — instant |
| Tool call (joke, reminder) | 3.6 – 8.4s | `think=true`, selected tools only |
| Vision query (pre-fetched) | 0.7s | Scene already in context |
| Full voice loop (wake → reply) | < 4s | STT + LLM + TTS for simple queries |
Context size benchmarks
| num_ctx | VRAM | GPU% | Chat Latency | Verdict |
|---|---|---|---|---|
| 2048 | 1.6 GB | 100% | 12.9s | KV thrashing — unusable |
| 4096 | 1.7 GB | 100% | 4.1s | Acceptable |
| 8192 | 2.0 GB | 100% | 3.5s | Production pick |
| 12288 | 2.3 GB | 100% | ~4s | Swap pressure |
| 16384 | 2.6 GB | 30/70 | Slow | Spills to CPU — no go |
Memory budget breakdown
| Component | RAM | Notes |
|---|---|---|
| Qwen3:1.7b @ 8192 ctx | ~2.0 GB | 100% GPU, flash attention + q8_0 KV |
| YOLOE-26N TensorRT | ~0.3 GB | FP16 engine |
| DepthAnything V2 Small | ~0.4 GB | FP16 engine, optional |
| Perception pipeline | ~0.0 GB | CPU-only (OpenCV/NumPy), ~8ms avg (DIS + cache) |
| Ambient awareness | ~0.001 GB | CPU-only, 160x120 DIS flow, ~2ms per check |
| MediaPipe (face + pose) | ~0.1 GB | CPU inference |
| Faster-Whisper small | ~0.5 GB | Loaded on demand |
| OS + Desktop + Python | ~3.5 GB | JetPack 6.x + X11 |
| Total | ~6.8 GB | Fits in 7.6 GB with headroom |
- Jetson Orin Nano Super (8 GB) with JetPack 6.x
- USB webcam + Bluetooth earbuds (or USB mic + speakers)
- Ollama installed (one-line install)
```bash
# Clone and enter
git clone https://github.com/steffenpharai/Jarvis.git && cd Jarvis

# Setup Python environment
python3 -m venv venv && source venv/bin/activate
pip install --upgrade pip && pip install -r requirements.txt

# Download all models (wake word, STT, TTS voice)
bash scripts/bootstrap_models.sh

# Pull the LLM
ollama pull qwen3:1.7b

# Configure Ollama for 8GB Jetson (flash attention, 8-bit KV cache, etc.)
sudo bash scripts/configure-ollama-systemd.sh
sudo systemctl daemon-reload && sudo systemctl restart ollama

# Build the PWA frontend
cd pwa && npm install && npm run build && cd ..

# Launch! (full-stack: voice + vision + PWA + Iron Man HUD)
python main.py --serve
```

Open `http://<jetson-ip>:8000` from any device on your network. That's it.
Optional: TensorRT engines for vision
```bash
source venv/bin/activate && . /etc/profile.d/cuda.sh

# YOLOE-26N detection engine (required for vision)
bash scripts/export_yolo_engine.sh

# DepthAnything V2 depth engine (required for 3D holograms)
bash scripts/export_depth_engine.sh
```

Engine builds run on-device and take several minutes. Once built, they're cached in `models/`.
Optional: CUDA + PyTorch for Jetson
```bash
# System dependencies
sudo apt-get install -y python3-pip libopenblas-dev

# cuSPARSELt (required for PyTorch 24.06+ on JetPack 6.x)
bash scripts/install-cusparselt.sh

# CUDA in PATH
sudo bash scripts/install-cuda-path.sh

# PyTorch with CUDA (Jetson wheel)
source venv/bin/activate && . /etc/profile.d/cuda.sh
bash scripts/install-pytorch-cuda-nvidia.sh

# Verify
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
```

```bash
source venv/bin/activate
python main.py --serve              # Full-stack: API + PWA + voice + vision
python main.py --serve --portable   # Walk-around mode: 320x320, 10 FPS, thermal-aware
python main.py --orchestrator       # Voice-only agentic loop (no web UI)
python main.py --e2e                # Voice loop without tools
python main.py --one-shot "Hello"   # Single text query (no mic needed)
python main.py --dry-run            # Validate config
python main.py --test-audio         # List audio devices
python main.py --yolo-visualize     # Live camera + YOLOE detections (OpenCV window)
```

| Tool | What it does |
|---|---|
| `vision_analyze` | Re-scan camera with optional open-vocabulary prompt |
| `hologram_render` | Generate 3D hologram and push to all connected PWA clients |
| `create_reminder` | Save a reminder with optional time |
| `tell_joke` | Deliver a J.A.R.V.I.S.-quality one-liner |
| `toggle_sarcasm` | Toggle sarcasm mode (you've been warned) |
Time, system stats, scene description, vitals, threat level, and reminders are injected directly into context — no tool call overhead for those.
```mermaid
graph TB
    subgraph VOICE["🎙️ Voice Pipeline"]
        WW[openWakeWord] --> STT[Faster-Whisper STT]
        STT --> ORCH
        TTS[Piper TTS<br/>British Male] --> BT[Bluetooth<br/>HFP/A2DP]
    end
    subgraph BRAIN["🧠 LLM Brain"]
        ORCH[Orchestrator<br/>Intent Router] --> LLM[Qwen3:1.7b<br/>Ollama · 100% GPU]
        LLM --> TOOLS[Tool Executor]
        TOOLS --> ORCH
        MEM[Short/Long-term<br/>Memory] --> ORCH
    end
    subgraph VISION["👁️ Vision Suite"]
        CAM[USB Camera] --> YOLO[YOLOE-26N<br/>TensorRT]
        CAM --> FLOW[Optical Flow<br/>Farneback/DIS]
        CAM --> DEPTH[DepthAnything V2<br/>TensorRT]
        CAM --> MP[MediaPipe<br/>Face + Pose]
        FLOW --> EGO[Ego-Motion<br/>RANSAC]
        FLOW --> TRACK[ByteTrack<br/>Flow-Assisted]
        YOLO --> TRACK
        TRACK --> TRAJ[Trajectory<br/>Prediction]
        DEPTH --> TRAJ
        EGO --> TRAJ
        TRAJ --> THREAT[Threat<br/>Scorer]
        MP --> VITALS[Vitals<br/>EAR · Posture · rPPG]
    end
    subgraph SERVER["🌐 Server"]
        API[FastAPI] --> WS[WebSocket<br/>Bridge]
        API --> MJPEG[MJPEG<br/>Stream]
        API --> REST[REST API]
    end
    subgraph PWA["📱 SvelteKit PWA"]
        CHAT[Chat Panel]
        HOLO[Three.js<br/>Hologram]
        HUD[Iron Man<br/>HUD Overlay]
        VIT[Vitals Panel]
        DASH[Jetson Stats]
    end
    ORCH --> TTS
    VISION --> WS
    VISION --> ORCH
    WS --> PWA
    LLM --> API
    style VOICE fill:#1a1a2e,stroke:#e94560,color:#fff
    style BRAIN fill:#1a1a2e,stroke:#0f3460,color:#fff
    style VISION fill:#1a1a2e,stroke:#16213e,color:#fff
    style SERVER fill:#1a1a2e,stroke:#533483,color:#fff
    style PWA fill:#1a1a2e,stroke:#e94560,color:#fff
```
Vision pipeline detail
```
Camera Frame (t)
 ├─ YOLOE-26N (TensorRT) → detections + open-vocab prompting
 ├─ Optical Flow (DIS, pre-alloc buffer, 320x240) → dense motion vectors (~6ms)
 ├─ DepthAnything V2 Small → depth map + 3D point cloud
 ├─ MediaPipe Face Mesh → EAR fatigue detection, rPPG heart rate
 ├─ MediaPipe Pose → posture scoring
 │
 ▼ Perception Fusion (CPU-only, ~8ms avg / ~10ms p95)
 ├─ Ego-motion estimation (RANSAC + cache for static scenes, ~0.04ms cached)
 ├─ Flow-assisted ByteTrack (60% flow / 40% Kalman prediction)
 ├─ Ego-motion compensation → true object velocities (m/s, vectorised NumPy)
 ├─ Trajectory prediction (vectorised batch, stationary skip, ~0.5ms)
 ├─ Collision detection (time-to-collision + severity alerts)
 └─ ThreatScorer → threat assessment with trajectory awareness
      ↓
 WebSocket broadcast → PWA (hologram, vitals, threat, collisions)
      ↓
 Enriched LLM context → "person approaching at 1.2m/s, 3.8m away"
```
Ambient Awareness (always-on, parallel thread):
```
Camera Frame → DIS Flow 160x120 (~2ms) → ego-motion check + motion energy
      ↓
Trigger: motion_detected | ego_motion_start/stop | scene_change
      ↓
Escalate → Full YOLOE + Perception → Proactive TTS
```
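The IDLE → ACTIVE → COOLDOWN duty cycle behind this escalation can be sketched as a small state machine. The class and its durations are illustrative, not the project's `ambient.py`; only the polling rates (2 Hz idle, 5 Hz active) come from this README:

```python
import time

class AmbientStateMachine:
    """Illustrative IDLE (2 Hz) -> ACTIVE (5 Hz) -> COOLDOWN duty cycle."""

    RATES_HZ = {"IDLE": 2.0, "ACTIVE": 5.0, "COOLDOWN": 2.0}

    def __init__(self, active_sec=10.0, cooldown_sec=5.0):
        self.state = "IDLE"
        self.active_sec = active_sec      # how long to run full perception
        self.cooldown_sec = cooldown_sec  # how long before re-arming
        self._entered = time.monotonic()

    def _goto(self, state, now):
        self.state, self._entered = state, now

    def on_event(self, motion_detected: bool, now=None):
        """Advance the machine one tick; returns the polling interval in seconds."""
        now = time.monotonic() if now is None else now
        elapsed = now - self._entered
        if self.state == "IDLE" and motion_detected:
            self._goto("ACTIVE", now)       # escalate to full YOLOE + perception
        elif self.state == "ACTIVE" and elapsed >= self.active_sec:
            self._goto("COOLDOWN", now)     # burst done, back off
        elif self.state == "COOLDOWN" and elapsed >= self.cooldown_sec:
            self._goto("IDLE", now)         # re-arm cheap ambient checks
        return 1.0 / self.RATES_HZ[self.state]
```

Returning the polling interval from the transition function lets the caller drive a single `sleep`-based loop that automatically slows down in IDLE and speeds up in ACTIVE.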
```
main.py              CLI dispatcher and entry point
orchestrator.py      Async agentic loop (context, tools, proactive vision)
tools.py             Tool registry (vision, hologram, reminders, joke, sarcasm)
memory.py            Session summary and persistence
run_tests.py         Test runner helper
config/
  settings.py        Jetson/Ollama tuning parameters
  prompts.py         JARVIS persona and system prompts
audio/
  input.py           Mic selection and audio capture
  output.py          Audio playback (PulseAudio / ALSA)
  vad.py             WebRTC VAD-based adaptive recording
  bluetooth.py       BT HFP/A2DP auto-reconnect daemon
voice/
  wakeword.py        openWakeWord wake word detection
  stt.py             Faster-Whisper local STT (warm-started)
  tts.py             Piper TTS (British male voice)
llm/
  ollama_client.py   Ollama client (OOM-hardened, context reduction)
  context.py         XML-tagged context builder for LLM
utils/
  autoconfig.py      Preflight checks and startup validation
  logging_config.py  Centralised logging setup
  power.py           Jetson power, thermal, battery, GPU monitoring
  reminders.py       Local JSON-based reminder CRUD
vision/
  camera.py          USB camera with auto-reconnect + portable mode
  detector_yolo.py   YOLOE-26N TensorRT (open-vocab via set_classes)
  detector_mediapipe.py  MediaPipe face mesh + pose detector
  tracker.py         ByteTrack tracking with flow-assisted prediction
  depth.py           DepthAnything V2 Small TensorRT (depth + point clouds)
  flow.py            Optical flow estimation (Farneback/DIS + sparse LK)
  ego_motion.py      Camera ego-motion via RANSAC fundamental matrix
  trajectory.py      Trajectory prediction + collision detection + alerts
  perception.py      Fused perception pipeline (flow→ego→velocity→trajectory)
  ambient.py         Ambient awareness — always-on motion detection (hands-free mode)
  vitals.py          Fatigue (EAR), posture scoring, rPPG heart rate
  threat.py          Threat/anomaly scoring with trajectory awareness
  proximity.py       Distance-based proximity alerts for portable mode
  scene.py           Natural-language scene description for LLM context
  shared.py          Pipeline orchestration and singletons
  visualize.py       OpenCV live visualization (--yolo-visualize)
server/
  app.py             FastAPI: REST, MJPEG, vision broadcast loop
  bridge.py          WebSocket bridge (hologram, vitals, threat broadcasts)
  streaming.py       MJPEG frame streaming helpers
pwa/                 SvelteKit PWA frontend
  ChatPanel          Voice/text interaction + chat persistence
  CameraStream       Live MJPEG with detection overlays
  HologramView       Three.js 3D / 2D Canvas fallback
  HudOverlay         Iron Man-style AR tracking annotations
  VitalsPanel        Real-time fatigue, posture, heart rate
  VitalsMini         Compact vitals strip for mobile
  Dashboard          Jetson GPU/CPU/thermal stats
  Reminders          Voice/UI reminder management
  ListeningOrb       Animated listening state indicator
  VoiceControls      Mic/speaker toggle controls
  SettingsPanel      Runtime configuration UI
  StatusBar          Connection status + system indicators
  Toast              Notification toasts
scripts/             Setup, export, and bootstrap scripts
tests/               367+ tests (344 unit + 23+ E2E) with pytest
models/              TTS voices, TensorRT engines
```
| Component | Recommendation | Notes |
|---|---|---|
| Compute | Jetson Orin Nano Super 8GB | $249, 67 TOPS, shared 8GB LPDDR5 |
| Storage | 128GB+ NVMe SSD or high-speed microSD | SSD strongly recommended for swap |
| Camera | Any USB UVC webcam | Logitech C920/C922 work great |
| Component | Why |
|---|---|
| Bluetooth earbuds (e.g. Pixel Buds) | Wireless voice I/O via HFP/A2DP |
| USB microphone | More reliable than BT for mic input |
| Active cooling / fan | Sustained vision workloads generate heat |
| NVMe SSD (512GB) | Faster model loading, better swap |
```bash
sudo nvpmodel -q     # Should show MAXN_SUPER
sudo jetson_clocks   # Lock max CPU/GPU/EMC clocks
jtop                 # Monitor (install: sudo pip3 install jetson-stats)
```

All settings are environment variables with sane defaults. Key ones:
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_MODEL` | `qwen3:1.7b` | LLM model |
| `OLLAMA_NUM_CTX` | `8192` | Context window (sweet spot for 8GB) |
| `OLLAMA_NUM_PREDICT` | `512` | Max output tokens |
| `JARVIS_DEPTH_ENABLED` | `0` | Enable 3D depth / holograms |
| `JARVIS_PERCEPTION_ENABLED` | `1` | Enable advanced perception pipeline |
| `JARVIS_PORTABLE` | `0` | Portable mode (lower res, thermal-aware) |
| `JARVIS_SERVE_PORT` | `8000` | Server port |
| `JARVIS_VISION_BROADCAST_SEC` | `2` | Vision broadcast interval |
Full environment variable reference
| Variable | Default | Description |
|---|---|---|
| LLM / Ollama | | |
| `OLLAMA_BASE_URL` | `http://127.0.0.1:11434` | Ollama API endpoint |
| `OLLAMA_MODEL` | `qwen3:1.7b` | Default LLM model |
| `OLLAMA_FALLBACK_MODEL` | `qwen3:1.7b` | Fallback model on OOM |
| `OLLAMA_NUM_CTX` | `8192` | Context window size |
| `OLLAMA_NUM_CTX_MAX` | `8192` | Hard cap for context |
| `OLLAMA_NUM_PREDICT` | `512` | Max output tokens (includes thinking tokens) |
| `OLLAMA_THINK` | `0` | Global think flag (1 = enable) |
| `OLLAMA_TEMPERATURE` | `0.6` | Sampling temperature |
| Vision | | |
| `JARVIS_CAMERA_INDEX` | `0` | Camera device index |
| `JARVIS_CAMERA_DEVICE` | (none) | Force camera device path |
| `JARVIS_DEPTH_ENABLED` | `0` | Enable DepthAnything depth |
| `JARVIS_VISION_BROADCAST_SEC` | `2` | Vision broadcast interval (seconds) |
| `JARVIS_VISION_DEPTH_EVERY` | `3` | Depth every Nth broadcast |
| Perception | | |
| `JARVIS_PERCEPTION_ENABLED` | `1` | Enable advanced perception pipeline |
| `JARVIS_FLOW_METHOD` | `dis` | Optical flow method (`dis` or `farneback`) |
| `JARVIS_FLOW_WIDTH` | `320` | Flow computation width |
| `JARVIS_FLOW_HEIGHT` | `240` | Flow computation height |
| `JARVIS_TRAJ_HORIZON` | `3.0` | Trajectory prediction horizon (seconds) |
| `JARVIS_COLLISION_ZONE_M` | `2.0` | Collision alert distance threshold (metres) |
| `JARVIS_MOTION_WAKE_THRESHOLD` | `0.05` | Motion magnitude to trigger active scanning |
| Voice / Audio | | |
| `JARVIS_TTS_VOICE` | `models/voices/en_GB-alan-medium.onnx` | Piper voice model path |
| Server | | |
| `JARVIS_SERVE_HOST` | `0.0.0.0` | Server bind address |
| `JARVIS_SERVE_PORT` | `8000` | Server port |
| `JARVIS_WS_PATH` | `/ws` | WebSocket endpoint path |
| `JARVIS_HTTPS_CERT` | (none) | Path to TLS certificate (.pem) for wss:// |
| `JARVIS_HTTPS_KEY` | (none) | Path to TLS private key (.key) for wss:// |
| Orchestrator | | |
| `JARVIS_CONTEXT_MAX_TURNS` | `4` | Max history turns |
| `JARVIS_SUMMARY_EVERY_N` | `6` | Summarise memory every N turns |
| `JARVIS_PROACTIVE_IDLE_SEC` | `300` | Seconds idle before proactive comment |
| `JARVIS_MAX_TOOL_CALLS` | `3` | Max tool calls per LLM turn |
| Ambient / Hands-free | | |
| `JARVIS_AMBIENT_ENABLED` | `0` | Enable always-on ambient awareness (auto-enabled in portable mode) |
| `JARVIS_PROACTIVE_WALK_SEC` | `15` | Full scan interval in walk mode (seconds) |
| `JARVIS_THERMAL_AMBIENT_C` | `70` | Thermal threshold for ambient duty cycle reduction |
| `JARVIS_BATTERY_LOW_PCT` | `15` | Battery % threshold for conservation mode |
| `JARVIS_PROACTIVE_COOLDOWN_SEC` | `10` | Min seconds between non-critical proactive messages |
| Portable mode | | |
| `JARVIS_PORTABLE` | `0` | Enable portable mode |
| `JARVIS_PORTABLE_WIDTH` | `320` | Camera width (portable) |
| `JARVIS_PORTABLE_HEIGHT` | `320` | Camera height (portable) |
| `JARVIS_PORTABLE_FPS` | `10` | Camera FPS (portable) |
| `JARVIS_PORTABLE_DEPTH_SKIP` | `3` | Run depth every Nth frame |
| `JARVIS_PORTABLE_VITALS_SKIP` | `5` | Run vitals every Nth frame |
| `JARVIS_PORTABLE_PERCEPTION_SKIP` | `2` | Skip perception every Nth frame |
| `JARVIS_THERMAL_PAUSE_C` | `80` | Pause vision above this temp (°C) |
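Every variable above follows the same pattern: read from the environment with a typed default. A minimal sketch of that pattern (the helper names are illustrative, not the project's actual `config/settings.py`):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to default."""
    raw = os.environ.get(name)
    try:
        return int(raw) if raw is not None else default
    except ValueError:
        return default  # malformed value: prefer a sane default over a crash

def env_flag(name: str, default: bool = False) -> bool:
    """Boolean settings use '1' for enabled, anything else for disabled."""
    return os.environ.get(name, "1" if default else "0") == "1"

# Example usage mirroring the defaults in the table above.
NUM_CTX = env_int("OLLAMA_NUM_CTX", 8192)
PORTABLE = env_flag("JARVIS_PORTABLE")
```

Swallowing malformed values instead of raising keeps a typo in one variable from taking down the whole assistant at boot, which matches the project's graceful-degradation philosophy.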
```bash
source venv/bin/activate
ruff check .               # Lint
pytest tests/unit/         # 344 unit tests
pytest tests/e2e/ -m e2e   # E2E tests (requires hardware)
python main.py --dry-run   # Smoke test
```

| Module | Coverage |
|---|---|
| `audio/*` | Playback, Bluetooth reconnect/daemon, VAD recording (BT-aware threshold) |
| `vision/*` | Scene, pipeline, tracker, depth, vitals, threat, proximity, flow, ego-motion, trajectory, perception, ambient awareness |
| `server/*` | WebSocket bridge, message sequencing, hologram/vitals/threat handling |
| `llm/*` | Ollama client, context builder, OOM recovery with vision pause |
| `tools.py` | Tool schemas, registry, execution |
| `orchestrator.py` | Intent routing, tool dispatch, proactive intelligence, background scene, ambient event handling |
| `utils/*` | Preflight checks, power/battery monitoring, reminders |
| E2E | Vision benchmarks, hologram pipeline, vitals, portable mode, perception latency (<15ms), ambient awareness, hands-free mode |
- Advanced perception pipeline — optical flow, ego-motion, trajectory prediction, collision detection (Tesla FSD / SpaceX Dragon inspired)
- Flow-assisted tracking — 60/40 flow/Kalman blending in ByteTrack for fewer ID switches
- Walk-around awareness — ego-motion estimation with walking/panning/turning classification
- Proactive collision alerts — time-to-collision estimation with voiced warnings
- Perception <15ms — DIS default, pre-alloc buffers, ego-motion caching, vectorised trajectory (avg 8ms, p95 10ms)
- Hands-free walk-around mode — ambient awareness loop (160x120 DIS, 2-5 Hz), proactive verbalization, thermal/battery auto-pause
- Fluidity fixes — BT-aware VAD, listening chime, verbal error recovery, ack-based PWA loading, WS reorder buffer, connection health
- RAFT TensorRT — neural optical flow for higher accuracy at ~30ms (replace DIS for high-accuracy mode)
- Lightweight SLAM — ORB-SLAM3 mini or DROID-SLAM lite for persistent 3D maps
- VLM integration — LLaVA / Qwen-VL for native image understanding (replace scene-description injection)
- Multi-room / multi-camera — USB hub + camera switching per room
- ROS 2 bridge — publish detections/depth/vitals as ROS topics for robotics integration
- Multi-agent support — multiple JARVIS instances coordinating across Jetsons
- Speaker diarization — distinguish between household members
- Docker image — one-pull setup for JetPack 6.x (see Dockerfile)
- Home Assistant integration — control smart home devices via voice
- Fine-tuned JARVIS voice — custom Piper voice model trained on Paul Bettany samples
- Mobile app — React Native companion for push notifications + remote mic
- Gesture control — MediaPipe hands for Iron Man-style hand gestures
Want to tackle one of these? See CONTRIBUTING.md.
Ollama OOM / cudaMalloc failed
```bash
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
sudo bash scripts/configure-ollama-systemd.sh
sudo systemctl daemon-reload && sudo systemctl restart ollama
```

The Python client also auto-recovers: on OOM it unloads the model, drops caches, and retries with progressively smaller context (8192→4096→2048→1024).
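That recovery behaviour amounts to a retry loop over a ladder of shrinking context sizes, roughly like the sketch below. The exception type and the `generate` callback are illustrative stand-ins, not the real client's API:

```python
class CudaOOM(RuntimeError):
    """Stand-in for the CUDA OOM error surfaced by the real Ollama client."""

def chat_with_oom_recovery(generate, ctx_ladder=(8192, 4096, 2048, 1024)):
    """Retry generate(num_ctx) down a ladder of shrinking context sizes.

    generate   : callable taking a context size; raises CudaOOM on failure
    ctx_ladder : context sizes to try, largest first
    """
    last_err = None
    for num_ctx in ctx_ladder:
        try:
            return generate(num_ctx)
        except CudaOOM as err:
            # The real client also pauses vision, unloads the model, and
            # drops OS caches between attempts; elided here for brevity.
            last_err = err
            continue
    raise last_err  # exhausted the ladder: surface the final OOM
```

The key property is that each attempt is strictly cheaper than the last, so the loop terminates quickly and degrades quality (shorter context) before availability (a crash).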
Model only partially on GPU
Check with `ollama ps`. If you see CPU%, drop caches and restart Ollama. Memory fragmentation from repeated context changes can cause spill. Close unnecessary desktop apps.
Slow responses (>10s)
Check `ollama ps` — model should be 100% GPU at 8192 ctx. For plain chat, ensure intent routing sends no tools (should be 0.5–0.7s). If still slow, reduce `OLLAMA_NUM_CTX`.
Bluetooth mic not working
Switch buds to HFP profile in `bluetoothctl` or Blueman. Or use a USB microphone for input and keep A2DP for TTS output. The auto-reconnect daemon will monitor and re-establish BT connections automatically — check logs for "BT auto-reconnect" messages.
No camera / vision errors
Plug in a USB UVC camera. Set `JARVIS_CAMERA_INDEX` or `JARVIS_CAMERA_DEVICE` to select the right device. Check `ls /dev/video*`.
Hologram shows "No data"
Ensure `--serve` is running and the WebSocket is connected (check StatusBar in the PWA). Run `bash scripts/export_depth_engine.sh` and set `JARVIS_DEPTH_ENABLED=1` for 3D point clouds.
If you're running this on your Jetson, star the repo! It helps others find it.
- Issues — Report bugs or request features
- Pull Requests — Contribute code (see CONTRIBUTING.md)
- Discussions — Ask questions, share your setup
Running J.A.R.V.I.S. on your Jetson? We'd love to see it! Open a Discussion with photos/video of your setup and we'll feature it here.
Built on the shoulders of giants:
- NVIDIA Jetson — the hardware that makes edge AI real
- Ollama — local LLM inference done right
- Ultralytics YOLOE — state-of-the-art open-vocab detection
- DepthAnything V2 — monocular depth estimation
- Faster-Whisper — CTranslate2-powered STT
- Piper TTS — fast local text-to-speech
- openWakeWord — custom wake word detection
- MediaPipe — face and pose estimation
- Three.js — 3D visualization in the browser
- dusty-nv/jetson-containers — inspiration for Jetson AI packaging
- Jetson AI Lab — the Jetson community's home base
J.A.R.V.I.S. is MIT licensed. Built with unreasonable ambition on a tiny board.
"I do have a life outside of making you look good, sir. It's just not very interesting."