A fully offline, Iron Man-style voice + vision AI assistant running entirely on a $249 Jetson Orin Nano Super (8 GB). No cloud. No API keys. No subscriptions. Just you and your AI.
"At your service, sir."
Quick Start · Features · Performance · Architecture · Roadmap · Community
Most "local AI assistants" are a chatbot with a microphone. This is what happens when you actually build the full Iron Man experience on a $249 board:
| What others do | What J.A.R.V.I.S. does |
|---|---|
| Text chat with local LLM | Wake word → STT → LLM with tools → TTS through Bluetooth earbuds |
| Maybe a webcam feed | TensorRT YOLOE detection + optical flow + ego-motion + trajectory prediction + 3D holograms |
| "Works on my 4090" | Runs on 8 GB shared RAM — LLM + vision + depth + vitals simultaneously |
| Cloud fallback "for now" | Zero cloud dependencies. Everything local. Always. |
| Basic web UI | SvelteKit PWA with live camera, Three.js holograms, vitals, Iron Man HUD, threat alerts |
| No health awareness | rPPG heart rate, fatigue detection, posture scoring, proactive health alerts |
| Crashes on OOM | Multi-layer CUDA OOM recovery with automatic context reduction and model reload |
- openWakeWord — custom wake word, always listening
- Faster-Whisper — local STT, no cloud transcription, warm-started at boot
- Piper TTS — British male voice (Paul Bettany energy)
- Bluetooth — full HFP/A2DP with auto-reconnect daemon (exponential backoff)
- WebRTC VAD — adaptive end-of-speech detection (no more fixed 5s recording)
- Qwen3:1.7b (Q4_K_M) via Ollama — native tool-calling, 100% GPU offload
- 8192-token context — sweet spot for 8 GB: fast inference, no swap pressure
- Intent-based routing — only sends tool schemas when needed (0.5s greetings, not 8s)
- Adaptive thinking — `think=false` for chat, `think=true` for tool calls
- JARVIS persona — formal British wit, sarcasm toggle, MCU-accurate responses
- YOLOE-26N (TensorRT FP16) — open-vocabulary detection, set any prompt at runtime
- ByteTrack — multi-object tracking with flow-assisted prediction (reduced ID switches)
- DepthAnything V2 Small (TensorRT FP16) — real-time depth maps for 3D holograms
- MediaPipe — face mesh (EAR fatigue, rPPG heart rate) + pose (posture scoring)
- Threat detection — anomaly scoring with trajectory-based collision prediction
- Always-on background scene — continuous context updated every 5s for spatial awareness
- Proactive intelligence — detects person enter/leave, new objects, env changes
- Proximity alerts — distance-based audio cues in portable mode ("Sir, obstacle ahead")
- Portable mode — 320×320 @ 10 FPS with thermal throttling + battery monitoring
- Optical flow (DIS default, Farneback available) — dense motion vectors with pre-allocated buffers (~6ms at 320x240)
- Ego-motion estimation — RANSAC fundamental matrix with result caching for static scenes (~0.04ms cached, ~2ms uncached)
- Object velocities in m/s — flow + depth fusion via pinhole camera model
- Trajectory prediction — vectorised NumPy batch computation (all objects at once), stationary skip (~0.5ms for 10 objects)
- Collision detection — time-to-collision estimation with proactive voiced alerts: "Sir, bicycle from left at 8 km/h — collision in 2.4 seconds"
- Walk-around awareness — detects user walking/panning/turning, stabilises detections during ego-motion
- Motion-aware context — LLM receives speeds, distances, trajectories, ego-motion state automatically
- Zero extra GPU — entire perception pipeline is CPU-only (OpenCV/NumPy), ~8ms avg / ~10ms p95
- Ambient awareness — always-on DIS flow at 160x120 (~2ms), detects motion/scene changes without full YOLOE
- Zero manual triggers — ambient events auto-escalate to full perception when significant change detected
- Proactive verbalization — collision alerts, scene changes, walking/stationary transitions spoken automatically
- Cooldown system — prevents verbal spam (10s non-critical, 0s safety-critical like collisions)
- Thermal/battery adaptive — auto-reduces duty cycle at >70°C or <15% battery, pauses at >80°C
- State machine — IDLE (2 Hz) → ACTIVE (5 Hz) → COOLDOWN, with configurable durations
- Live MJPEG camera feed with detection overlays and threat-level borders
- Three.js holograms — real-time 3D point cloud visualization (2D Canvas fallback)
- HUD overlay — Iron Man-style AR tracking with real-time annotations
- Vitals dashboard — fatigue, posture, heart rate, all via WebSocket
- Jetson stats — GPU/CPU/thermal monitoring
- Reminders — create and manage via voice or UI
- Accessible from any device on the LAN
- 367+ unit + E2E tests with pytest (344 unit, 23+ E2E)
- Preflight system checks — validates all subsystems at startup with verbal status
- Multi-layer CUDA OOM protection — pauses vision, unloads model, drops caches, retries with smaller context
- Bluetooth auto-reconnect — daemon monitors every 10s, verifies audio route after reconnect
- BT-aware VAD — longer silence threshold (2.0s) when BT audio detected (compensates codec latency)
- Listening chime — pre-synthesized "Listening, sir" played instantly on wake word detection
- Camera auto-reconnect on USB disconnect
- WebSocket reliability — message sequencing, reorder buffer (50ms hold for gaps), rate limiting, heartbeat with health tracking (good/degraded/lost), ack-based loading
- PWA button debouncing — ack-based loading states (resolved on server response, not timeout), `aria-busy` accessibility
- Verbal error recovery — TTS-spoken recovery messages on STT/LLM/vision failures instead of silent drops
- Connection health — pong-based heartbeat monitoring, auto-reconnect on 3 missed pongs, state resync on reconnect
- Graceful degradation — every subsystem is optional, pipeline continues if one fails
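The flow-plus-depth velocity fusion and time-to-collision logic listed above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the project's actual `perception.py` or `trajectory.py`; the focal lengths, array shapes, and thresholds are assumed values:

```python
import numpy as np

def object_velocities(flow_px, depth_m, fx=500.0, fy=500.0, dt=0.1):
    """Convert per-object pixel flow to metric velocity via the pinhole model.

    flow_px : (N, 2) mean optical-flow vector per object, pixels/frame
    depth_m : (N,)   estimated object depth in metres
    fx, fy  : camera focal lengths in pixels (illustrative values)
    """
    # Pinhole model: lateral metres-per-pixel scales linearly with depth.
    vx = flow_px[:, 0] * depth_m / fx / dt
    vy = flow_px[:, 1] * depth_m / fy / dt
    return np.stack([vx, vy], axis=1)  # (N, 2) in m/s

def time_to_collision(pos_m, vel_m, zone_m=2.0):
    """Vectorised closing-speed TTC for all objects at once.

    pos_m : (N, 2) object positions relative to the camera, metres
    vel_m : (N, 2) object velocities, m/s
    Returns (N,) seconds until the object reaches the alert zone;
    inf for objects that are not closing in.
    """
    dist = np.linalg.norm(pos_m, axis=1)
    # Closing speed = -d|p|/dt = -(p . v) / |p|
    closing = -np.einsum("ij,ij->i", pos_m, vel_m) / np.maximum(dist, 1e-6)
    ttc = np.where(closing > 1e-3,
                   (dist - zone_m) / np.maximum(closing, 1e-6),
                   np.inf)
    return np.maximum(ttc, 0.0)
```

Because both functions operate on whole `(N, 2)` arrays, adding more tracked objects costs almost nothing, which is the point of the "vectorised batch computation" bullet above.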
Real benchmarks on Jetson Orin Nano Super (8 GB), MAXN_SUPER, jetson_clocks:
| Scenario | Latency | Notes |
|---|---|---|
| Greeting / status / time | 0.5 – 0.7s | `think=false`, no tools — instant |
| Tool call (joke, reminder) | 3.6 – 8.4s | `think=true`, selected tools only |
| Vision query (pre-fetched) | 0.7s | Scene already in context |
| Full voice loop (wake → reply) | < 4s | STT + LLM + TTS for simple queries |
Context size benchmarks
| num_ctx | VRAM | GPU% | Chat Latency | Verdict |
|---|---|---|---|---|
| 2048 | 1.6 GB | 100% | 12.9s | KV thrashing — unusable |
| 4096 | 1.7 GB | 100% | 4.1s | Acceptable |
| 8192 | 2.0 GB | 100% | 3.5s | Production pick |
| 12288 | 2.3 GB | 100% | ~4s | Swap pressure |
| 16384 | 2.6 GB | 30/70 | Slow | Spills to CPU — no go |
Memory budget breakdown
| Component | RAM | Notes |
|---|---|---|
| Qwen3:1.7b @ 8192 ctx | ~2.0 GB | 100% GPU, flash attention + q8_0 KV |
| YOLOE-26N TensorRT | ~0.3 GB | FP16 engine |
| DepthAnything V2 Small | ~0.4 GB | FP16 engine, optional |
| Perception pipeline | ~0.0 GB | CPU-only (OpenCV/NumPy), ~8ms avg (DIS + cache) |
| Ambient awareness | ~0.001 GB | CPU-only, 160x120 DIS flow, ~2ms per check |
| MediaPipe (face + pose) | ~0.1 GB | CPU inference |
| Faster-Whisper small | ~0.5 GB | Loaded on demand |
| OS + Desktop + Python | ~3.5 GB | JetPack 6.x + X11 |
| Total | ~6.8 GB | Fits in 7.6 GB with headroom |
- Jetson Orin Nano Super (8 GB) with JetPack 6.x
- USB webcam + Bluetooth earbuds (or USB mic + speakers)
- Ollama installed (one-line install)
```bash
# Clone and enter
git clone https://github.com/steffenpharai/Jarvis.git && cd Jarvis

# Setup Python environment
python3 -m venv venv && source venv/bin/activate
pip install --upgrade pip && pip install -r requirements.txt

# Download all models (wake word, STT, TTS voice)
bash scripts/bootstrap_models.sh

# Pull the LLM
ollama pull qwen3:1.7b

# Configure Ollama for 8GB Jetson (flash attention, 8-bit KV cache, etc.)
sudo bash scripts/configure-ollama-systemd.sh
sudo systemctl daemon-reload && sudo systemctl restart ollama

# Build the PWA frontend
cd pwa && npm install && npm run build && cd ..

# Launch! (full-stack: voice + vision + PWA + Iron Man HUD)
python main.py --serve
```

Open `http://<jetson-ip>:8000` from any device on your network. That's it.
Optional: TensorRT engines for vision
```bash
source venv/bin/activate && . /etc/profile.d/cuda.sh

# YOLOE-26N detection engine (required for vision)
bash scripts/export_yolo_engine.sh

# DepthAnything V2 depth engine (required for 3D holograms)
bash scripts/export_depth_engine.sh
```

Engine builds run on-device and take several minutes. Once built, they're cached in `models/`.
Optional: CUDA + PyTorch for Jetson
```bash
# System dependencies
sudo apt-get install -y python3-pip libopenblas-dev

# cuSPARSELt (required for PyTorch 24.06+ on JetPack 6.x)
bash scripts/install-cusparselt.sh

# CUDA in PATH
sudo bash scripts/install-cuda-path.sh

# PyTorch with CUDA (Jetson wheel)
source venv/bin/activate && . /etc/profile.d/cuda.sh
bash scripts/install-pytorch-cuda-nvidia.sh

# Verify
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
```

```bash
source venv/bin/activate
python main.py --serve              # Full-stack: API + PWA + voice + vision
python main.py --serve --portable   # Walk-around mode: 320x320, 10 FPS, thermal-aware
python main.py --orchestrator       # Voice-only agentic loop (no web UI)
python main.py --e2e                # Voice loop without tools
python main.py --one-shot "Hello"   # Single text query (no mic needed)
python main.py --dry-run            # Validate config
python main.py --test-audio         # List audio devices
python main.py --yolo-visualize     # Live camera + YOLOE detections (OpenCV window)
```

| Tool | What it does |
|---|---|
| `vision_analyze` | Re-scan camera with optional open-vocabulary prompt |
| `hologram_render` | Generate 3D hologram and push to all connected PWA clients |
| `create_reminder` | Save a reminder with optional time |
| `tell_joke` | Deliver a J.A.R.V.I.S.-quality one-liner |
| `toggle_sarcasm` | Toggle sarcasm mode (you've been warned) |
Time, system stats, scene description, vitals, threat level, and reminders are injected directly into context — no tool call overhead for those.
```mermaid
graph TB
    subgraph VOICE["🎙️ Voice Pipeline"]
        WW[openWakeWord] --> STT[Faster-Whisper STT]
        STT --> ORCH
        TTS[Piper TTS<br/>British Male] --> BT[Bluetooth<br/>HFP/A2DP]
    end
    subgraph BRAIN["🧠 LLM Brain"]
        ORCH[Orchestrator<br/>Intent Router] --> LLM[Qwen3:1.7b<br/>Ollama · 100% GPU]
        LLM --> TOOLS[Tool Executor]
        TOOLS --> ORCH
        MEM[Short/Long-term<br/>Memory] --> ORCH
    end
    subgraph VISION["👁️ Vision Suite"]
        CAM[USB Camera] --> YOLO[YOLOE-26N<br/>TensorRT]
        CAM --> FLOW[Optical Flow<br/>Farneback/DIS]
        CAM --> DEPTH[DepthAnything V2<br/>TensorRT]
        CAM --> MP[MediaPipe<br/>Face + Pose]
        FLOW --> EGO[Ego-Motion<br/>RANSAC]
        FLOW --> TRACK[ByteTrack<br/>Flow-Assisted]
        YOLO --> TRACK
        TRACK --> TRAJ[Trajectory<br/>Prediction]
        DEPTH --> TRAJ
        EGO --> TRAJ
        TRAJ --> THREAT[Threat<br/>Scorer]
        MP --> VITALS[Vitals<br/>EAR · Posture · rPPG]
    end
    subgraph SERVER["🌐 Server"]
        API[FastAPI] --> WS[WebSocket<br/>Bridge]
        API --> MJPEG[MJPEG<br/>Stream]
        API --> REST[REST API]
    end
    subgraph PWA["📱 SvelteKit PWA"]
        CHAT[Chat Panel]
        HOLO[Three.js<br/>Hologram]
        HUD[Iron Man<br/>HUD Overlay]
        VIT[Vitals Panel]
        DASH[Jetson Stats]
    end
    ORCH --> TTS
    VISION --> WS
    VISION --> ORCH
    WS --> PWA
    LLM --> API
    style VOICE fill:#1a1a2e,stroke:#e94560,color:#fff
    style BRAIN fill:#1a1a2e,stroke:#0f3460,color:#fff
    style VISION fill:#1a1a2e,stroke:#16213e,color:#fff
    style SERVER fill:#1a1a2e,stroke:#533483,color:#fff
    style PWA fill:#1a1a2e,stroke:#e94560,color:#fff
```
Vision pipeline detail
```
Camera Frame (t)
 ├─ YOLOE-26N (TensorRT) → detections + open-vocab prompting
 ├─ Optical Flow (DIS, pre-alloc buffer, 320x240) → dense motion vectors (~6ms)
 ├─ DepthAnything V2 Small → depth map + 3D point cloud
 ├─ MediaPipe Face Mesh → EAR fatigue detection, rPPG heart rate
 ├─ MediaPipe Pose → posture scoring
 │
 ▼ Perception Fusion (CPU-only, ~8ms avg / ~10ms p95)
 ├─ Ego-motion estimation (RANSAC + cache for static scenes, ~0.04ms cached)
 ├─ Flow-assisted ByteTrack (60% flow / 40% Kalman prediction)
 ├─ Ego-motion compensation → true object velocities (m/s, vectorised NumPy)
 ├─ Trajectory prediction (vectorised batch, stationary skip, ~0.5ms)
 ├─ Collision detection (time-to-collision + severity alerts)
 └─ ThreatScorer → threat assessment with trajectory awareness
      ↓
 WebSocket broadcast → PWA (hologram, vitals, threat, collisions)
      ↓
 Enriched LLM context → "person approaching at 1.2m/s, 3.8m away"
```
Ambient Awareness (always-on, parallel thread):
```
Camera Frame → DIS Flow 160x120 (~2ms) → ego-motion check + motion energy
      ↓
Trigger: motion_detected | ego_motion_start/stop | scene_change
      ↓
Escalate → Full YOLOE + Perception → Proactive TTS
```
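The IDLE → ACTIVE → COOLDOWN duty cycle behind this escalation can be sketched as a small state machine. The class and its durations are illustrative, not the project's `ambient.py`; only the polling rates (2 Hz idle, 5 Hz active) come from this README:

```python
import time

class AmbientStateMachine:
    """Illustrative IDLE (2 Hz) -> ACTIVE (5 Hz) -> COOLDOWN duty cycle."""

    RATES_HZ = {"IDLE": 2.0, "ACTIVE": 5.0, "COOLDOWN": 2.0}

    def __init__(self, active_sec=10.0, cooldown_sec=5.0):
        self.state = "IDLE"
        self.active_sec = active_sec      # how long to run full perception
        self.cooldown_sec = cooldown_sec  # how long before re-arming
        self._entered = time.monotonic()

    def _goto(self, state, now):
        self.state, self._entered = state, now

    def on_event(self, motion_detected: bool, now=None):
        """Advance the machine one tick; returns the polling interval in seconds."""
        now = time.monotonic() if now is None else now
        elapsed = now - self._entered
        if self.state == "IDLE" and motion_detected:
            self._goto("ACTIVE", now)       # escalate to full YOLOE + perception
        elif self.state == "ACTIVE" and elapsed >= self.active_sec:
            self._goto("COOLDOWN", now)     # burst done, back off
        elif self.state == "COOLDOWN" and elapsed >= self.cooldown_sec:
            self._goto("IDLE", now)         # re-arm cheap ambient checks
        return 1.0 / self.RATES_HZ[self.state]
```

Returning the polling interval from the transition function lets the caller drive a single `sleep`-based loop that automatically slows down in IDLE and speeds up in ACTIVE.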
```
main.py              CLI dispatcher and entry point
orchestrator.py      Async agentic loop (context, tools, proactive vision)
tools.py             Tool registry (vision, hologram, reminders, joke, sarcasm)
memory.py            Session summary and persistence
run_tests.py         Test runner helper
config/
  settings.py        Jetson/Ollama tuning parameters
  prompts.py         JARVIS persona and system prompts
audio/
  input.py           Mic selection and audio capture
  output.py          Audio playback (PulseAudio / ALSA)
  vad.py             WebRTC VAD-based adaptive recording
  bluetooth.py       BT HFP/A2DP auto-reconnect daemon
voice/
  wakeword.py        openWakeWord wake word detection
  stt.py             Faster-Whisper local STT (warm-started)
  tts.py             Piper TTS (British male voice)
llm/
  ollama_client.py   Ollama client (OOM-hardened, context reduction)
  context.py         XML-tagged context builder for LLM
utils/
  autoconfig.py      Preflight checks and startup validation
  logging_config.py  Centralised logging setup
  power.py           Jetson power, thermal, battery, GPU monitoring
  reminders.py       Local JSON-based reminder CRUD
vision/
  camera.py          USB camera with auto-reconnect + portable mode
  detector_yolo.py   YOLOE-26N TensorRT (open-vocab via set_classes)
  detector_mediapipe.py  MediaPipe face mesh + pose detector
  tracker.py         ByteTrack tracking with flow-assisted prediction
  depth.py           DepthAnything V2 Small TensorRT (depth + point clouds)
  flow.py            Optical flow estimation (Farneback/DIS + sparse LK)
  ego_motion.py      Camera ego-motion via RANSAC fundamental matrix
  trajectory.py      Trajectory prediction + collision detection + alerts
  perception.py      Fused perception pipeline (flow→ego→velocity→trajectory)
  ambient.py         Ambient awareness — always-on motion detection (hands-free mode)
  vitals.py          Fatigue (EAR), posture scoring, rPPG heart rate
  threat.py          Threat/anomaly scoring with trajectory awareness
  proximity.py       Distance-based proximity alerts for portable mode
  scene.py           Natural-language scene description for LLM context
  shared.py          Pipeline orchestration and singletons
  visualize.py       OpenCV live visualization (--yolo-visualize)
server/
  app.py             FastAPI: REST, MJPEG, vision broadcast loop
  bridge.py          WebSocket bridge (hologram, vitals, threat broadcasts)
  streaming.py       MJPEG frame streaming helpers
pwa/                 SvelteKit PWA frontend
  ChatPanel          Voice/text interaction + chat persistence
  CameraStream       Live MJPEG with detection overlays
  HologramView       Three.js 3D / 2D Canvas fallback
  HudOverlay         Iron Man-style AR tracking annotations
  VitalsPanel        Real-time fatigue, posture, heart rate
  VitalsMini         Compact vitals strip for mobile
  Dashboard          Jetson GPU/CPU/thermal stats
  Reminders          Voice/UI reminder management
  ListeningOrb       Animated listening state indicator
  VoiceControls      Mic/speaker toggle controls
  SettingsPanel      Runtime configuration UI
  StatusBar          Connection status + system indicators
  Toast              Notification toasts
scripts/             Setup, export, and bootstrap scripts
tests/               367+ tests (344 unit + 23+ E2E) with pytest
models/              TTS voices, TensorRT engines
```
| Component | Recommendation | Notes |
|---|---|---|
| Compute | Jetson Orin Nano Super 8GB | $249, 67 TOPS, shared 8GB LPDDR5 |
| Storage | 128GB+ NVMe SSD or high-speed microSD | SSD strongly recommended for swap |
| Camera | Any USB UVC webcam | Logitech C920/C922 work great |
| Component | Why |
|---|---|
| Bluetooth earbuds (e.g. Pixel Buds) | Wireless voice I/O via HFP/A2DP |
| USB microphone | More reliable than BT for mic input |
| Active cooling / fan | Sustained vision workloads generate heat |
| NVMe SSD (512GB) | Faster model loading, better swap |
```bash
sudo nvpmodel -q     # Should show MAXN_SUPER
sudo jetson_clocks   # Lock max CPU/GPU/EMC clocks
jtop                 # Monitor (install: sudo pip3 install jetson-stats)
```

All settings are environment variables with sane defaults. Key ones:
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_MODEL` | `qwen3:1.7b` | LLM model |
| `OLLAMA_NUM_CTX` | `8192` | Context window (sweet spot for 8GB) |
| `OLLAMA_NUM_PREDICT` | `512` | Max output tokens |
| `JARVIS_DEPTH_ENABLED` | `0` | Enable 3D depth / holograms |
| `JARVIS_PERCEPTION_ENABLED` | `1` | Enable advanced perception pipeline |
| `JARVIS_PORTABLE` | `0` | Portable mode (lower res, thermal-aware) |
| `JARVIS_SERVE_PORT` | `8000` | Server port |
| `JARVIS_VISION_BROADCAST_SEC` | `2` | Vision broadcast interval |
Full environment variable reference
| Variable | Default | Description |
|---|---|---|
| LLM / Ollama | | |
| `OLLAMA_BASE_URL` | `http://127.0.0.1:11434` | Ollama API endpoint |
| `OLLAMA_MODEL` | `qwen3:1.7b` | Default LLM model |
| `OLLAMA_FALLBACK_MODEL` | `qwen3:1.7b` | Fallback model on OOM |
| `OLLAMA_NUM_CTX` | `8192` | Context window size |
| `OLLAMA_NUM_CTX_MAX` | `8192` | Hard cap for context |
| `OLLAMA_NUM_PREDICT` | `512` | Max output tokens (includes thinking tokens) |
| `OLLAMA_THINK` | `0` | Global think flag (1 = enable) |
| `OLLAMA_TEMPERATURE` | `0.6` | Sampling temperature |
| Vision | | |
| `JARVIS_CAMERA_INDEX` | `0` | Camera device index |
| `JARVIS_CAMERA_DEVICE` | (none) | Force camera device path |
| `JARVIS_DEPTH_ENABLED` | `0` | Enable DepthAnything depth |
| `JARVIS_VISION_BROADCAST_SEC` | `2` | Vision broadcast interval (seconds) |
| `JARVIS_VISION_DEPTH_EVERY` | `3` | Depth every Nth broadcast |
| Perception | | |
| `JARVIS_PERCEPTION_ENABLED` | `1` | Enable advanced perception pipeline |
| `JARVIS_FLOW_METHOD` | `dis` | Optical flow method (`dis` or `farneback`) |
| `JARVIS_FLOW_WIDTH` | `320` | Flow computation width |
| `JARVIS_FLOW_HEIGHT` | `240` | Flow computation height |
| `JARVIS_TRAJ_HORIZON` | `3.0` | Trajectory prediction horizon (seconds) |
| `JARVIS_COLLISION_ZONE_M` | `2.0` | Collision alert distance threshold (metres) |
| `JARVIS_MOTION_WAKE_THRESHOLD` | `0.05` | Motion magnitude to trigger active scanning |
| Voice / Audio | | |
| `JARVIS_TTS_VOICE` | `models/voices/en_GB-alan-medium.onnx` | Piper voice model path |
| Server | | |
| `JARVIS_SERVE_HOST` | `0.0.0.0` | Server bind address |
| `JARVIS_SERVE_PORT` | `8000` | Server port |
| `JARVIS_WS_PATH` | `/ws` | WebSocket endpoint path |
| `JARVIS_HTTPS_CERT` | (none) | Path to TLS certificate (.pem) for wss:// |
| `JARVIS_HTTPS_KEY` | (none) | Path to TLS private key (.key) for wss:// |
| Orchestrator | | |
| `JARVIS_CONTEXT_MAX_TURNS` | `4` | Max history turns |
| `JARVIS_SUMMARY_EVERY_N` | `6` | Summarise memory every N turns |
| `JARVIS_PROACTIVE_IDLE_SEC` | `300` | Seconds idle before proactive comment |
| `JARVIS_MAX_TOOL_CALLS` | `3` | Max tool calls per LLM turn |
| Ambient / Hands-free | | |
| `JARVIS_AMBIENT_ENABLED` | `0` | Enable always-on ambient awareness (auto-enabled in portable mode) |
| `JARVIS_PROACTIVE_WALK_SEC` | `15` | Full scan interval in walk mode (seconds) |
| `JARVIS_THERMAL_AMBIENT_C` | `70` | Thermal threshold for ambient duty cycle reduction |
| `JARVIS_BATTERY_LOW_PCT` | `15` | Battery % threshold for conservation mode |
| `JARVIS_PROACTIVE_COOLDOWN_SEC` | `10` | Min seconds between non-critical proactive messages |
| Portable mode | | |
| `JARVIS_PORTABLE` | `0` | Enable portable mode |
| `JARVIS_PORTABLE_WIDTH` | `320` | Camera width (portable) |
| `JARVIS_PORTABLE_HEIGHT` | `320` | Camera height (portable) |
| `JARVIS_PORTABLE_FPS` | `10` | Camera FPS (portable) |
| `JARVIS_PORTABLE_DEPTH_SKIP` | `3` | Run depth every Nth frame |
| `JARVIS_PORTABLE_VITALS_SKIP` | `5` | Run vitals every Nth frame |
| `JARVIS_PORTABLE_PERCEPTION_SKIP` | `2` | Skip perception every Nth frame |
| `JARVIS_THERMAL_PAUSE_C` | `80` | Pause vision above this temp (°C) |
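Every variable above follows the same pattern: read from the environment with a typed default. A minimal sketch of that pattern (the helper names are illustrative, not the project's actual `config/settings.py`):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to default."""
    raw = os.environ.get(name)
    try:
        return int(raw) if raw is not None else default
    except ValueError:
        return default  # malformed value: prefer a sane default over a crash

def env_flag(name: str, default: bool = False) -> bool:
    """Boolean settings use '1' for enabled, anything else for disabled."""
    return os.environ.get(name, "1" if default else "0") == "1"

# Example usage mirroring the defaults in the table above.
NUM_CTX = env_int("OLLAMA_NUM_CTX", 8192)
PORTABLE = env_flag("JARVIS_PORTABLE")
```

Swallowing malformed values instead of raising keeps a typo in one variable from taking down the whole assistant at boot, which matches the project's graceful-degradation philosophy.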
```bash
source venv/bin/activate
ruff check .               # Lint
pytest tests/unit/         # 344 unit tests
pytest tests/e2e/ -m e2e   # E2E tests (requires hardware)
python main.py --dry-run   # Smoke test
```

| Module | Coverage |
|---|---|
| `audio/*` | Playback, Bluetooth reconnect/daemon, VAD recording (BT-aware threshold) |
| `vision/*` | Scene, pipeline, tracker, depth, vitals, threat, proximity, flow, ego-motion, trajectory, perception, ambient awareness |
| `server/*` | WebSocket bridge, message sequencing, hologram/vitals/threat handling |
| `llm/*` | Ollama client, context builder, OOM recovery with vision pause |
| `tools.py` | Tool schemas, registry, execution |
| `orchestrator.py` | Intent routing, tool dispatch, proactive intelligence, background scene, ambient event handling |
| `utils/*` | Preflight checks, power/battery monitoring, reminders |
| E2E | Vision benchmarks, hologram pipeline, vitals, portable mode, perception latency (<15ms), ambient awareness, hands-free mode |
- Advanced perception pipeline — optical flow, ego-motion, trajectory prediction, collision detection (Tesla FSD / SpaceX Dragon inspired)
- Flow-assisted tracking — 60/40 flow/Kalman blending in ByteTrack for fewer ID switches
- Walk-around awareness — ego-motion estimation with walking/panning/turning classification
- Proactive collision alerts — time-to-collision estimation with voiced warnings
- Perception <15ms — DIS default, pre-alloc buffers, ego-motion caching, vectorised trajectory (avg 8ms, p95 10ms)
- Hands-free walk-around mode — ambient awareness loop (160x120 DIS, 2-5 Hz), proactive verbalization, thermal/battery auto-pause
- Fluidity fixes — BT-aware VAD, listening chime, verbal error recovery, ack-based PWA loading, WS reorder buffer, connection health
- RAFT TensorRT — neural optical flow for higher accuracy at ~30ms (replace DIS for high-accuracy mode)
- Lightweight SLAM — ORB-SLAM3 mini or DROID-SLAM lite for persistent 3D maps
- VLM integration — LLaVA / Qwen-VL for native image understanding (replace scene-description injection)
- Multi-room / multi-camera — USB hub + camera switching per room
- ROS 2 bridge — publish detections/depth/vitals as ROS topics for robotics integration
- Multi-agent support — multiple JARVIS instances coordinating across Jetsons
- Speaker diarization — distinguish between household members
- Docker image — one-pull setup for JetPack 6.x (see Dockerfile)
- Home Assistant integration — control smart home devices via voice
- Fine-tuned JARVIS voice — custom Piper voice model trained on Paul Bettany samples
- Mobile app — React Native companion for push notifications + remote mic
- Gesture control — MediaPipe hands for Iron Man-style hand gestures
Want to tackle one of these? See CONTRIBUTING.md.
Ollama OOM / cudaMalloc failed
```bash
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
sudo bash scripts/configure-ollama-systemd.sh
sudo systemctl daemon-reload && sudo systemctl restart ollama
```

The Python client also auto-recovers: on OOM it unloads the model, drops caches, and retries with progressively smaller context (8192→4096→2048→1024).
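That recovery behaviour amounts to a retry loop over a ladder of shrinking context sizes, roughly like the sketch below. The exception type and the `generate` callback are illustrative stand-ins, not the real client's API:

```python
class CudaOOM(RuntimeError):
    """Stand-in for the CUDA OOM error surfaced by the real Ollama client."""

def chat_with_oom_recovery(generate, ctx_ladder=(8192, 4096, 2048, 1024)):
    """Retry generate(num_ctx) down a ladder of shrinking context sizes.

    generate   : callable taking a context size; raises CudaOOM on failure
    ctx_ladder : context sizes to try, largest first
    """
    last_err = None
    for num_ctx in ctx_ladder:
        try:
            return generate(num_ctx)
        except CudaOOM as err:
            # The real client also pauses vision, unloads the model, and
            # drops OS caches between attempts; elided here for brevity.
            last_err = err
            continue
    raise last_err  # exhausted the ladder: surface the final OOM
```

The key property is that each attempt is strictly cheaper than the last, so the loop terminates quickly and degrades quality (shorter context) before availability (a crash).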
Model only partially on GPU
Check with `ollama ps`. If you see CPU%, drop caches and restart Ollama. Memory fragmentation from repeated context changes can cause spill. Close unnecessary desktop apps.
Slow responses (>10s)
Check `ollama ps` — model should be 100% GPU at 8192 ctx. For plain chat, ensure intent routing sends no tools (should be 0.5–0.7s). If still slow, reduce `OLLAMA_NUM_CTX`.
Bluetooth mic not working
Switch buds to HFP profile in `bluetoothctl` or Blueman. Or use a USB microphone for input and keep A2DP for TTS output. The auto-reconnect daemon will monitor and re-establish BT connections automatically — check logs for "BT auto-reconnect" messages.
No camera / vision errors
Plug in a USB UVC camera. Set `JARVIS_CAMERA_INDEX` or `JARVIS_CAMERA_DEVICE` to select the right device. Check `ls /dev/video*`.
Hologram shows "No data"
Ensure `--serve` is running and the WebSocket is connected (check StatusBar in the PWA). Run `bash scripts/export_depth_engine.sh` and set `JARVIS_DEPTH_ENABLED=1` for 3D point clouds.
If you're running this on your Jetson, star the repo! It helps others find it.
- Issues — Report bugs or request features
- Pull Requests — Contribute code (see CONTRIBUTING.md)
- Discussions — Ask questions, share your setup
Running J.A.R.V.I.S. on your Jetson? We'd love to see it! Open a Discussion with photos/video of your setup and we'll feature it here.
Built on the shoulders of giants:
- NVIDIA Jetson — the hardware that makes edge AI real
- Ollama — local LLM inference done right
- Ultralytics YOLOE — state-of-the-art open-vocab detection
- DepthAnything V2 — monocular depth estimation
- Faster-Whisper — CTranslate2-powered STT
- Piper TTS — fast local text-to-speech
- openWakeWord — custom wake word detection
- MediaPipe — face and pose estimation
- Three.js — 3D visualization in the browser
- dusty-nv/jetson-containers — inspiration for Jetson AI packaging
- Jetson AI Lab — the Jetson community's home base
J.A.R.V.I.S. is MIT licensed. Built with unreasonable ambition on a tiny board.
"I do have a life outside of making you look good, sir. It's just not very interesting."