Jarvis 2.0 is a next-generation multimodal conversational AI assistant 🗣️, designed for real-time ⚡, low-latency, and emotionally intelligent ❤️ interaction.
This project integrates 🔗 the high-performance, websocket-based audio streaming 🌊 architecture of Unmute with the powerful audio-language reasoning 🦩 of Audio Flamingo 3.
We utilize Unmute's robust Voice Activity Detection (VAD) 🎙️ and its integration with Kyutai's STT/TTS models to create a seamless, responsive conversational pipeline. Instead of a standard text LLM, Jarvis 2.0 uses Nvidia's Audio Flamingo 3 as its central "brain" 🧠, allowing for a deeper understanding 👂 of not just what is said, but how it's said.
Jarvis 2.0 functions by creating a real-time, bidirectional audio stream 🔄🔊 between the user and the AI.
- VAD & Streaming: 🎤 The frontend captures user audio and, using Unmute's VAD implementation, streams it over a websocket 🕸️ to the backend as the user speaks.
- Transcription: ✍️ The backend forwards this audio to Kyutai's Speech-to-Text (STT) model, which generates a live transcription.
- Core Reasoning: 💡 The transcribed text is sent to the Audio Flamingo 3 🦩 model. This advanced Audio-Language Model (ALM) generates a context-aware, nuanced, and intelligent response.
- Speech Synthesis: 🗣️ The text response from Audio Flamingo 3 is streamed, as it's generated, to Kyutai's Text-to-Speech (TTS) model.
- Response: 🎧 The TTS model generates audio, which is streamed back to the user's browser 💻, enabling a fluid, low-latency conversation.
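The steps above can be sketched as a chain of lazy generators: each stage consumes the previous one as data arrives, which is what keeps time-to-first-word low. This is only an illustrative sketch; the function bodies are stand-ins, not the real STT/AF3/TTS APIs.

```python
# Illustrative sketch of the streaming pipeline. The stage names
# (stt_stream, af3_stream, tts_stream) are placeholders, not Jarvis's API.
from typing import Iterable, Iterator

def stt_stream(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Placeholder: yield partial transcripts as audio chunks arrive."""
    for chunk in audio_chunks:
        yield chunk.decode()  # stand-in for Kyutai STT

def af3_stream(words: Iterable[str]) -> Iterator[str]:
    """Placeholder: yield response tokens as the transcript grows."""
    for word in words:
        yield word.upper()    # stand-in for Audio Flamingo 3

def tts_stream(tokens: Iterable[str]) -> Iterator[bytes]:
    """Placeholder: yield synthesized audio per response token."""
    for token in tokens:
        yield token.encode()  # stand-in for Kyutai TTS

def pipeline(audio_chunks: Iterable[bytes]) -> Iterator[bytes]:
    # Each stage pulls from the previous one lazily, so response audio can
    # start streaming back before the user's utterance is fully processed.
    return tts_stream(af3_stream(stt_stream(audio_chunks)))
```

Because every stage is a generator, no stage waits for the previous one to finish before producing output.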
```mermaid
graph LR
    UVI[User Voice Input] --> F(Frontend)
    F -->|Audio Stream| B(Backend)
    B <-->|WebSocket| STT(STT)
    B <-->|WebSocket| TTS(TTS)
    B <-->|HTTP| AF3(AF3)
    B <--> LLM(LLM)
    LLM <--> SDK(OpenAI Agent SDK)
    SDK <--> TC(Tool Calling)
```
- ⚡ Extremely Low Latency: Built on Unmute's architecture, streaming STT, LLM, and TTS tokens simultaneously for a lower time-to-first-word.
- 🧠 Advanced AI Reasoning: Powered by Audio Flamingo 3 🦩, providing state-of-the-art responses.
- 🌊 Real-time Streaming: Full-duplex audio transport over websockets.
- 🎙️ Robust VAD: Intelligently detects end-of-speech and natural pauses to provide a natural turn-taking experience.
- 🧩 Modular: Easily swap out the core model (Audio Flamingo 3) for other backends like GPT-4o, Ollama, or Mistral.
- 👂 Spatial & Emotion Detection: The core model (Audio Flamingo 3) understands audio directly, detecting the surrounding environment 🌍 and the user's tone 😄😢 from the input audio, a capability that remains rare among open-source models.
Alternatively, you can run all services manually. This is more complex due to dependencies.
💻 Software requirements:
- `uv`: Install with `curl -LsSf https://astral.sh/uv/install.sh | sh`
- `cargo`: Install with `curl https://sh.rustup.rs -sSf | sh`
- `pnpm`: Install with `curl -fsSL https://get.pnpm.io/install.sh | sh -`
- CUDA 12.1: Needed for the Rust processes (tts and stt).
```shell
./dockerless/start_frontend.sh
./dockerless/start_backend.sh
./dockerless/start_llm.sh  # Requires GPU VRAM
./dockerless/start_stt.sh  # Requires GPU VRAM
./dockerless/start_tts.sh  # Requires GPU VRAM
```

The website should now be accessible at 🌐 http://localhost:3000.
If you're running Jarvis 2.0 on a remote machine (e.g., jarvis-box) and accessing it from your local machine, you must use SSH port forwarding.
> [!NOTE]
> 🔒 Browsers restrict microphone 🎤 access on non-secure (`http://`) connections, except for `localhost`. Port forwarding makes the remote server accessible via your localhost, bypassing this restriction.
🐳 For Docker Compose: The default setup runs on port 80. Forward this to your local port 3333 🔑:
```shell
ssh -N -L 3333:localhost:80 jarvis-box
```

Now open http://localhost:3333 in your browser.
🛠️ For Dockerless: You must forward the frontend (3000) and backend (8000) ports separately 🔑:
```shell
ssh -N -L 8000:localhost:8000 -L 3000:localhost:3000 jarvis-box
```

Now open http://localhost:3000 in your browser.
For simplicity, HTTPS is not included in the default setups. For production deployments, we recommend using a reverse proxy like Caddy or Nginx, or adapting the Docker Swarm documentation provided by the Unmute project.
- Press "S" to toggle subtitles for both you and Jarvis.
- A dev mode can be enabled in `useKeyboardShortcuts.ts` by changing `ALLOW_DEV_MODE` to `true`. Press "D" to see the debug view.
All character prompts, voices, and system messages are defined in `voices.yaml`. To add a new character, simply add a new entry. The backend caches this file on startup, so you will need to restart the backend service for changes to take effect.
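A new entry might look like the fragment below. The field names and the voice identifier are illustrative assumptions; mirror the fields used by the existing entries in `voices.yaml` for the exact schema.

```yaml
# Hypothetical character entry -- match the schema of existing entries.
- name: jarvis-formal
  voice: kyutai/voice-en-male-1   # illustrative voice identifier
  system_prompt: |
    You are Jarvis, a concise and formal assistant.
```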
The backend is compatible with any OpenAI-compatible API. While it's configured for our vLLM-hosted Audio Flamingo 3 by default, you can easily point it to another service.
Edit your docker-compose.yml and change the environment variables for the backend service.
Example: Using Ollama (🦙)
```yaml
backend:
  image: jarvis-backend:latest
  [..]
  environment:
    [..]
    - KYUTAI_LLM_URL=http://host.docker.internal:11434
    - KYUTAI_LLM_MODEL=llama3 # or any model you have pulled
    - KYUTAI_LLM_API_KEY=ollama
  extra_hosts:
    - "host.docker.internal:host-gateway"
```

Example: Using OpenAI (🤖)
```yaml
backend:
  image: jarvis-backend:latest
  [..]
  environment:
    [..]
    - KYUTAI_LLM_URL=https://api.openai.com/v1
    - KYUTAI_LLM_MODEL=gpt-4o
    - KYUTAI_LLM_API_KEY=sk-..
```

If you use an external API, you can remove the llm (vLLM) service from your docker-compose.yml to save 💾 GPU resources.
Tool calling is not yet natively supported by the backend, but it's a highly requested feature.
The easiest way to integrate it is to make it invisible to the Jarvis backend. You can create a small FastAPI server that wraps VLLM, intercepts the requests, performs tool calls, and then returns the final response. See this comment for a conceptual overview.
Jarvis 2.0 stands on the shoulders of giants 🧑🔬. This project would not be possible without the foundational work from the Kyutai team on Unmute. We extend our sincere thanks 💖 to them for open-sourcing their high-performance audio pipeline, which serves as the backbone of this project.
This project is licensed under the MIT License. See the LICENSE file for details.