AISocietyIITJ/Jarvis2.O

🤖✨ Jarvis 2.0

Jarvis 2.0 is a next-generation multimodal conversational AI assistant 🗣️, designed for real-time ⚡, low-latency, and emotionally intelligent ❤️ interaction.

This project integrates 🔗 the high-performance, websocket-based audio streaming 🌊 architecture of Unmute with the powerful audio-language reasoning 🦩 of Audio Flamingo 3.

We utilize Unmute's robust Voice Activity Detection (VAD) 🎙️ and its integration with Kyutai's STT/TTS models to create a seamless, responsive conversational pipeline. Instead of a standard text LLM, Jarvis 2.0 uses Nvidia's Audio Flamingo 3 as its central "brain" 🧠, allowing for a deeper understanding 👂 of not just what is said, but how it's said.

⚙️ How It Works

Jarvis 2.0 functions by creating a real-time, bidirectional audio stream 🔄🔊 between the user and the AI.

  1. VAD & Streaming: 🎤 The frontend captures user audio and, using Unmute's VAD implementation, streams it over a websocket 🕸️ to the backend as the user speaks.
  2. Transcription: ✍️ The backend forwards this audio to Kyutai's Speech-to-Text (STT) model, which generates a live transcription.
  3. Core Reasoning: 💡 The transcribed text is sent to the Audio Flamingo 3 🦩 model. This advanced Audio-Language Model (ALM) generates a context-aware, nuanced, and intelligent response.
  4. Speech Synthesis: 🗣️ The text response from Audio Flamingo 3 is streamed, as it's generated, to Kyutai's Text-to-Speech (TTS) model.
  5. Response: 🎧 The TTS model generates audio, which is streamed back to the user's browser 💻, enabling a fluid, low-latency conversation.
```mermaid
graph LR
    UVI[User Voice Input] --> F(Frontend)
    F -->|Audio File| B(Backend)
    B <-->|WEB SOCKET| STT(STT)
    B <-->|WEB SOCKET| TTS(TTS)
    B <-->|HTTP| AF3(AF3)
    B <--> LLM(LLM)
    LLM <--> SDK(OpenAI Agent SDK)
    SDK <--> TC(Tool Calling)
```
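The streaming hand-offs above can be sketched as a chain of async generators. This is a minimal illustration with stub STT/LLM/TTS stages standing in for the real Kyutai and Audio Flamingo 3 services, not the actual implementation:

```python
import asyncio

# Stub stages standing in for Kyutai STT, Audio Flamingo 3, and Kyutai
# TTS. Each consumes a stream and yields as soon as it has output,
# which is what keeps time-to-first-word low.

async def mic_chunks():
    # Pretend VAD passed us three audio chunks before end-of-speech.
    for chunk in [b"aud1", b"aud2", b"aud3"]:
        yield chunk

async def stt(audio):
    # Stub: emit one transcript token per audio chunk.
    i = 0
    async for _chunk in audio:
        i += 1
        yield f"word{i}"

async def llm(tokens):
    # Stub "brain": respond to each transcript token as it arrives.
    async for tok in tokens:
        yield f"reply-to-{tok}"

async def tts(text):
    # Stub: one audio frame per response token, streamed back out.
    async for piece in text:
        yield f"<audio:{piece}>"

async def pipeline():
    frames = []
    async for frame in tts(llm(stt(mic_chunks()))):
        frames.append(frame)  # in Jarvis, sent back over the websocket
    return frames

print(asyncio.run(pipeline()))
```

Because every stage yields incrementally, the first TTS frame can go out before the user's full utterance has even been transcribed.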

🌟 Features

  • ⚡ Extremely Low Latency: Built on Unmute's architecture, streaming STT, LLM, and TTS tokens simultaneously for a lower time-to-first-word.
  • 🧠 Advanced AI Reasoning: Powered by Audio Flamingo 3 🦩, providing state-of-the-art responses.
  • 🌊 Real-time Streaming: Full-duplex audio transport over websockets.
  • 🎙️ Robust VAD: Intelligently detects end-of-speech and natural pauses to provide a natural turn-taking experience.
  • 🧩 Modular: Easily swap out the core model (Audio Flamingo 3) for other backends like GPT-4o, Ollama, or Mistral.
  • 👂 Spatial & Emotion Detection: The core model (Audio Flamingo 3) understands raw audio, detecting the surrounding environment 🌍 and the user's tone 😄😢 from the input audio, a capability still rare among open-source models.

🛠️ Running without Docker (Dockerless)

Alternatively, you can run all services manually. This is more complex because each service's dependencies must be installed separately.

💻 Software requirements:

  • uv: Install with curl -LsSf https://astral.sh/uv/install.sh | sh
  • cargo: Install with curl https://sh.rustup.rs -sSf | sh
  • pnpm: Install with curl -fsSL https://get.pnpm.io/install.sh | sh -
  • CUDA 12.1: needed for the Rust processes (STT and TTS).

▶️ Start services: Start each service in a separate terminal 🖥️:

./dockerless/start_frontend.sh
./dockerless/start_backend.sh
./dockerless/start_llm.sh        # Requires GPU VRAM
./dockerless/start_stt.sh        # Requires GPU VRAM
./dockerless/start_tts.sh        # Requires GPU VRAM

The website should now be accessible at 🌐 http://localhost:3000.
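To confirm the frontend and backend came up, you can probe their ports. A small sketch, assuming the default ports from this README (3000 for the frontend, 8000 for the backend):

```python
import socket

# Default ports assumed by this README's dockerless setup.
SERVICES = {
    "frontend": 3000,
    "backend": 8000,
}

def is_up(port, host="localhost", timeout=1.0):
    """Return True if something is accepting TCP connections on the port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in SERVICES.items():
    state = "up" if is_up(port) else "down"
    print(f"{name:10s} :{port}  {state}")
```

If the backend shows `down`, check its terminal for missing model weights or CUDA errors before debugging the frontend.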

📡 Connecting to a Remote Server

If you're running Jarvis 2.0 on a remote machine (e.g., jarvis-box) and accessing it from your local machine, you must use SSH port forwarding.

Note

🔒 Browsers restrict microphone 🎤 access on non-secure (http://) connections, except for localhost. Port forwarding makes the remote server accessible via your localhost, bypassing this restriction.

🐳 For Docker Compose: The default setup runs on port 80. Forward this to your local port 3333 🔑:

ssh -N -L 3333:localhost:80 jarvis-box

Now open http://localhost:3333 in your browser.

🛠️ For Dockerless: You must forward the frontend (3000) and backend (8000) ports separately 🔑:

ssh -N -L 8000:localhost:8000 -L 3000:localhost:3000 jarvis-box

Now open http://localhost:3000 in your browser.

🔐 HTTPS Support

For simplicity, HTTPS is not included in the default setups. For production deployments, we recommend using a reverse proxy like Caddy or Nginx, or adapting the Docker Swarm documentation provided by the Unmute project.
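As an illustration, a reverse proxy in front of the Docker Compose setup can be as small as this Caddyfile sketch (the domain is a placeholder; Caddy obtains and renews TLS certificates automatically):

```
# Hypothetical Caddyfile: terminate TLS for a placeholder domain and
# proxy to the docker-compose setup listening on port 80.
jarvis.example.com {
    reverse_proxy localhost:80
}
```

With HTTPS in place, browsers will allow microphone access without the SSH port-forwarding workaround described above.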

🔧 Modifying Jarvis 2.0

💬🐞 Subtitles and Dev Mode

  • Press "S" to toggle subtitles for both you and Jarvis.
  • A dev mode can be enabled in useKeyboardShortcuts.ts by changing ALLOW_DEV_MODE to true. Press "D" to see the debug view.

🎭🗣️ Changing Characters/Voices

All character prompts, voices, and system messages are defined in voices.yaml. To add a new character, simply add a new entry. The backend caches this file on startup, so you will need to restart the backend service to see changes.
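As an illustration only, a new entry might look like the sketch below. The field names here are hypothetical; copy the structure of an existing character in voices.yaml rather than this sketch:

```yaml
# Hypothetical entry: check an existing character in voices.yaml for
# the real field names before copying this.
- name: butler
  voice: kyutai/voice-en-male-1   # hypothetical voice id
  instructions: |
    You are a courteous British butler. Answer briefly.
```

Remember to restart the backend after editing, since the file is cached on startup.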

🔄🧠 Using Different LLM/ALM Servers

The backend is compatible with any OpenAI-compatible API. While it's configured for our vLLM-hosted Audio Flamingo 3 by default, you can easily point it to another service.

Edit your docker-compose.yml and change the environment variables for the backend service.

Example: Using Ollama (🦙)

  backend:
    image: jarvis-backend:latest
    [..]
    environment:
      [..]
      - KYUTAI_LLM_URL=http://host.docker.internal:11434
      - KYUTAI_LLM_MODEL=llama3 # or any model you have pulled
      - KYUTAI_LLM_API_KEY=ollama
    extra_hosts:
      - "host.docker.internal:host-gateway"

Example: Using OpenAI (🤖)

  backend:
    image: jarvis-backend:latest
    [..]
    environment:
      [..]
      - KYUTAI_LLM_URL=https://api.openai.com/v1
      - KYUTAI_LLM_MODEL=gpt-4o
      - KYUTAI_LLM_API_KEY=sk-..

If you use an external API, you can remove the llm (vLLM) service from your docker-compose.yml to free up 💾 GPU resources.
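Whatever server you point KYUTAI_LLM_URL at must accept the standard OpenAI chat-completions request shape. Roughly (model name and messages are examples only):

```python
import json

# Minimal chat-completions payload that any OpenAI-compatible server
# (vLLM, Ollama's /v1 endpoint, OpenAI itself) should accept at
# POST <KYUTAI_LLM_URL>/chat/completions.
payload = {
    "model": "llama3",   # must match KYUTAI_LLM_MODEL
    "stream": True,      # Jarvis streams tokens for low latency
    "messages": [
        {"role": "system", "content": "You are Jarvis."},
        {"role": "user", "content": "Hello!"},
    ],
}

body = json.dumps(payload)
print(body[:60])
```

If a candidate backend can answer this request with streamed chunks, it should slot in via the environment variables above.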

🛠️📞 Tool Calling

Tool calling is not yet natively supported by the backend, but it's a highly requested feature.

The easiest way to integrate it is to make it invisible to the Jarvis backend. You can create a small FastAPI server that wraps VLLM, intercepts the requests, performs tool calls, and then returns the final response. See this comment for a conceptual overview.
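The interception idea can be sketched with plain functions. The tool registry and function names below are hypothetical; a real wrapper would run this logic inside a FastAPI route sitting in front of vLLM:

```python
import json

# Hypothetical local tool registry the wrapper can dispatch to.
TOOLS = {
    "get_time": lambda args: "12:00",
}

def handle_llm_response(response):
    """If the model asked for a tool, run it and return a follow-up
    message to feed back to the model; otherwise pass the text through."""
    choice = response["choices"][0]["message"]
    calls = choice.get("tool_calls")
    if not calls:
        # No tool requested: forward the answer to the Jarvis backend.
        return {"final": True, "content": choice["content"]}
    call = calls[0]
    name = call["function"]["name"]
    args = json.loads(call["function"]["arguments"])
    result = TOOLS[name](args)
    # A real wrapper would append this message and re-query the model,
    # only returning to Jarvis once the model produces plain text.
    return {"final": False,
            "message": {"role": "tool",
                        "tool_call_id": call["id"],
                        "content": result}}

# Example: a response in which the model requested get_time.
resp = {"choices": [{"message": {
    "tool_calls": [{"id": "c1", "function": {
        "name": "get_time", "arguments": "{}"}}]}}]}
print(handle_llm_response(resp))
```

Because the loop resolves tool calls before replying, the Jarvis backend only ever sees plain text and needs no changes.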

🙏 Acknowledgements

Jarvis 2.0 stands on the shoulders of giants 🧑‍🔬. This project would not be possible without the foundational work from the Kyutai team on Unmute. We extend our sincere thanks 💖 to them for open-sourcing their high-performance audio pipeline, which serves as the backbone of this project.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.
