Jarvis 2.0 is a next-generation multimodal conversational AI assistant 🗣️, designed for real-time ⚡, low-latency, and emotionally intelligent ❤️ interaction.
This project integrates 🔗 the high-performance, websocket-based audio streaming 🌊 architecture of Unmute with the powerful audio-language reasoning 🦩 of Audio Flamingo 3.
We utilize Unmute's robust Voice Activity Detection (VAD) 🎙️ and its integration with Kyutai's STT/TTS models to create a seamless, responsive conversational pipeline. Instead of a standard text LLM, Jarvis 2.0 uses Nvidia's Audio Flamingo 3 as its central "brain" 🧠, allowing for a deeper understanding 👂 of not just what is said, but how it's said.
Jarvis 2.0 functions by creating a real-time, bidirectional audio stream 🔄🔊 between the user and the AI.
- VAD & Streaming: 🎤 The frontend captures user audio and, using Unmute's VAD implementation, streams it over a websocket 🕸️ to the backend as the user speaks.
- Transcription: ✍️ The backend forwards this audio to Kyutai's Speech-to-Text (STT) model, which generates a live transcription.
- Core Reasoning: 💡 The transcribed text is sent to the Audio Flamingo 3 🦩 model. This advanced Audio-Language Model (ALM) generates a context-aware, nuanced, and intelligent response.
- Speech Synthesis: 🗣️ The text response from Audio Flamingo 3 is streamed, as it's generated, to Kyutai's Text-to-Speech (TTS) model.
- Response: 🎧 The TTS model generates audio, which is streamed back to the user's browser 💻, enabling a fluid, low-latency conversation.
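The steps above can be sketched as a chain of lazy generators: each stage consumes the previous one as data arrives, which is what keeps time-to-first-word low. This is only an illustrative sketch; the function bodies are stand-ins, not the real STT/AF3/TTS APIs.

```python
# Illustrative sketch of the streaming pipeline. The stage names
# (stt_stream, af3_stream, tts_stream) are placeholders, not Jarvis's API.
from typing import Iterable, Iterator

def stt_stream(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Placeholder: yield partial transcripts as audio chunks arrive."""
    for chunk in audio_chunks:
        yield chunk.decode()  # stand-in for Kyutai STT

def af3_stream(words: Iterable[str]) -> Iterator[str]:
    """Placeholder: yield response tokens as the transcript grows."""
    for word in words:
        yield word.upper()    # stand-in for Audio Flamingo 3

def tts_stream(tokens: Iterable[str]) -> Iterator[bytes]:
    """Placeholder: yield synthesized audio per response token."""
    for token in tokens:
        yield token.encode()  # stand-in for Kyutai TTS

def pipeline(audio_chunks: Iterable[bytes]) -> Iterator[bytes]:
    # Each stage pulls from the previous one lazily, so response audio can
    # start streaming back before the user's utterance is fully processed.
    return tts_stream(af3_stream(stt_stream(audio_chunks)))
```

Because every stage is a generator, no stage waits for the previous one to finish before producing output.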
```mermaid
graph LR
    UVI[User Voice Input] --> F(Frontend)
    F -->|Audio Stream| B(Backend)
    B <-->|WebSocket| STT(STT)
    B <-->|WebSocket| TTS(TTS)
    B <-->|HTTP| AF3(AF3)
    B <--> LLM(LLM)
    LLM <--> SDK(OpenAI Agent SDK)
    SDK <--> TC(Tool Calling)
```
- ⚡ Extremely Low Latency: Built on Unmute's architecture, streaming STT, LLM, and TTS tokens simultaneously for a lower time-to-first-word.
- 🧠 Advanced AI Reasoning: Powered by Audio Flamingo 3 🦩, providing state-of-the-art responses.
- 🌊 Real-time Streaming: Full-duplex audio transport over websockets.
- 🎙️ Robust VAD: Intelligently detects end-of-speech and natural pauses to provide a natural turn-taking experience.
- 🧩 Modular: Easily swap out the core model (Audio Flamingo 3) for other backends like GPT-4o, Ollama, or Mistral.
- 👂 Spatial & Emotion Detection: The core model (Audio Flamingo 3) understands audio directly, detecting the surrounding environment 🌍 and the user's tone 😄😢 from the input audio, a capability that remains rare among open-source models.
Alternatively, you can run all services manually. This is more complex due to dependencies.
💻 Software requirements:
- `uv`: Install with `curl -LsSf https://astral.sh/uv/install.sh | sh`
- `cargo`: Install with `curl https://sh.rustup.rs -sSf | sh`
- `pnpm`: Install with `curl -fsSL https://get.pnpm.io/install.sh | sh -`
- CUDA 12.1: Needed for the Rust processes (tts and stt).
```shell
./dockerless/start_frontend.sh
./dockerless/start_backend.sh
./dockerless/start_llm.sh  # Requires GPU VRAM
./dockerless/start_stt.sh  # Requires GPU VRAM
./dockerless/start_tts.sh  # Requires GPU VRAM
```

The website should now be accessible at 🌐 http://localhost:3000.
If you're running Jarvis 2.0 on a remote machine (e.g., jarvis-box) and accessing it from your local machine, you must use SSH port forwarding.
> [!NOTE]
> 🔒 Browsers restrict microphone 🎤 access on non-secure (`http://`) connections, except for `localhost`. Port forwarding makes the remote server accessible via your localhost, bypassing this restriction.
🐳 For Docker Compose: The default setup runs on port 80. Forward this to your local port 3333 🔑:
```shell
ssh -N -L 3333:localhost:80 jarvis-box
```

Now open http://localhost:3333 in your browser.
🛠️ For Dockerless: You must forward the frontend (3000) and backend (8000) ports separately 🔑:
```shell
ssh -N -L 8000:localhost:8000 -L 3000:localhost:3000 jarvis-box
```

Now open http://localhost:3000 in your browser.
For simplicity, HTTPS is not included in the default setups. For production deployments, we recommend using a reverse proxy like Caddy or Nginx, or adapting the Docker Swarm documentation provided by the Unmute project.
- Press "S" to toggle subtitles for both you and Jarvis.
- A dev mode can be enabled in `useKeyboardShortcuts.ts` by changing `ALLOW_DEV_MODE` to `true`. Press "D" to see the debug view.
All character prompts, voices, and system messages are defined in `voices.yaml`. To add a new character, simply add a new entry. The backend caches this file on startup, so you will need to restart the backend service for changes to take effect.
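A new entry might look like the fragment below. The field names and the voice identifier are illustrative assumptions; mirror the fields used by the existing entries in `voices.yaml` for the exact schema.

```yaml
# Hypothetical character entry -- match the schema of existing entries.
- name: jarvis-formal
  voice: kyutai/voice-en-male-1   # illustrative voice identifier
  system_prompt: |
    You are Jarvis, a concise and formal assistant.
```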
The backend is compatible with any OpenAI-compatible API. While it's configured for our vLLM-hosted Audio Flamingo 3 by default, you can easily point it to another service.
Edit your docker-compose.yml and change the environment variables for the backend service.
Example: Using Ollama (🦙)
```yaml
backend:
  image: jarvis-backend:latest
  [..]
  environment:
    [..]
    - KYUTAI_LLM_URL=http://host.docker.internal:11434
    - KYUTAI_LLM_MODEL=llama3 # or any model you have pulled
    - KYUTAI_LLM_API_KEY=ollama
  extra_hosts:
    - "host.docker.internal:host-gateway"
```

Example: Using OpenAI (🤖)
```yaml
backend:
  image: jarvis-backend:latest
  [..]
  environment:
    [..]
    - KYUTAI_LLM_URL=https://api.openai.com/v1
    - KYUTAI_LLM_MODEL=gpt-4o
    - KYUTAI_LLM_API_KEY=sk-..
```

If you use an external API, you can remove the llm (vLLM) service from your docker-compose.yml to save 💾 GPU resources.
Tool calling is not yet natively supported by the backend, but it's a highly requested feature.
The easiest way to integrate it is to make it invisible to the Jarvis backend. You can create a small FastAPI server that wraps VLLM, intercepts the requests, performs tool calls, and then returns the final response. See this comment for a conceptual overview.
Jarvis 2.0 stands on the shoulders of giants 🧑🔬. This project would not be possible without the foundational work from the Kyutai team on Unmute. We extend our sincere thanks 💖 to them for open-sourcing their high-performance audio pipeline, which serves as the backbone of this project.
This project is licensed under the MIT License. See the LICENSE file for details.