tuann04/duvent-mvp

System Architecture

1. Overview

Duvent is a real-time voice language learning application designed for "Home Lab" deployment with a path to cloud scalability. It uses a Hybrid Microservices architecture to separate business logic from heavy AI computation.

2. High-Level Architecture Diagram

+-----------------+       +-----------------------+
|                 |       |                       |
|  User / Browser |<----->|   Frontend (Next.js)  |
|                 |       |                       |
+-----------------+       +-----------+-----------+
        ^                             |
        | WS (Audio Stream)           | REST (Auth/State)
        v                             v
+-------------------------------------------------------+
|                 Backend (Golang)                      |
|             "The Orchestrator"                        |
+-------------------------------------------------------+
        |                  |                 |
   gRPC | (Bidirectional)  | HTTP (JSON)     | TCP (SQL)
        |                  v                 v
        |         +-----------------+  +--------------+
        |         |  Ollama (LLM)   |  |  PostgreSQL  |
        |         |  (Localhost)    |  |              |
        |         +-----------------+  +--------------+
        v
+-----------------------+
|   AI Service (Python) |
|      "The Worker"     |
|  (Whisper + Kokoro)   |
+-----------------------+

3. Component Details

3.1. Frontend (Next.js)

  • Responsibility:
    • Voice Activity Detection (VAD): Detects speech to optimize bandwidth.
    • WebSocket Client: Streams audio chunks to the Go Backend.
    • UI: Chat interface, topic selection, audio visualization.
  • Communication:
    • WebSocket -> Backend (Real-time Audio).
    • REST/HTTP -> Backend (Management).

3.2. Backend (Golang)

  • Responsibility:
    • WebSocket Server: Terminates user connections.
    • Session Management: Tracks conversation state.
    • Orchestration:
      • Routes incoming audio -> Python AI Service (gRPC).
      • Routes recognized text -> Ollama (HTTP).
      • Routes LLM response -> Python AI Service (gRPC) for TTS.
      • Streams TTS audio -> User.
  • Key Libraries: gorilla/websocket, grpc-go, pgx (Postgres driver).

3.3. AI Service (Python)

  • Responsibility:
    • gRPC Server: Exposes StreamAudio and SynthesizeSpeech endpoints.
    • STT (Speech-to-Text): Runs faster-whisper (or whisper.cpp) on CPU/GPU.
    • TTS (Text-to-Speech): Runs kokoro on CPU/GPU.
    • Optimization: Keeps models loaded in memory.
  • Key Libraries: grpcio, faster-whisper, kokoro, torch.
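
The gRPC contract between Go and Python could be sketched as below. The two RPC names come from the text above; the message shapes and field names are assumptions, not the repo's actual .proto:

```protobuf
syntax = "proto3";

package duvent.ai;

service AIService {
  // Bidirectional: audio chunks in, partial transcripts out.
  rpc StreamAudio(stream AudioChunk) returns (stream Transcript);
  // Server-streaming: text in, synthesized audio chunks out.
  rpc SynthesizeSpeech(SynthesisRequest) returns (stream AudioChunk);
}

message AudioChunk {
  bytes data = 1;        // raw PCM or encoded audio bytes
  int32 sample_rate = 2;
}

message Transcript {
  string text = 1;
  bool is_final = 2;     // distinguishes partial vs final hypotheses
}

message SynthesisRequest {
  string text = 1;
  string voice = 2;      // e.g. a kokoro voice preset
}
```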

3.4. LLM Provider (Ollama)

  • Responsibility:
    • Runs the Large Language Model (e.g., Llama 3 8B Quantized).
    • Handles chat context and generation.
  • Interface: Standard Ollama REST API (/api/chat).

3.5. Database (PostgreSQL)

  • Responsibility: Persists Users, Topics, Conversations, and Feedback logs.
  • Schema: Relational data + JSONB for flexible feedback structures.

4. Data Flow: The "Phone Call" Loop

  1. User Speaks: Audio chunks sent via WS to Golang.
  2. Transcribe: Golang streams audio via gRPC to Python AI Service.
  3. Result: Python streams text transcript back to Golang.
  4. Think: Golang buffers transcript (until silence/sentence end), then sends text to Ollama (HTTP).
  5. Respond: Ollama returns text response to Golang.
  6. Synthesize: Golang streams response text via gRPC to Python AI Service.
  7. Play: Python streams audio bytes back to Golang -> Golang forwards to User via WS.

5. Deployment Strategy

  • Development (Local): All components run on localhost. Docker for Postgres.
  • Production (Hybrid):
    • Backend/DB on Cloud VPS.
    • AI Service/Ollama on GPU Node (or Local machine via Tunnel).

6. Database Schema (Draft)

Users

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Primary Key |
| created_at | TIMESTAMP | |

Topics

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Primary Key |
| name | VARCHAR | e.g., "Software Engineering Interview" |
| system_prompt | TEXT | The initial instruction for the AI |

Conversations

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Primary Key |
| user_id | UUID | FK to Users |
| topic_id | UUID | FK to Topics |
| status | VARCHAR | active, completed |
| created_at | TIMESTAMP | |

Messages

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Primary Key |
| conversation_id | UUID | FK to Conversations |
| role | VARCHAR | user, assistant |
| content | TEXT | The text content |
| audio_path | VARCHAR | Path to local file: /uploads/conv_{id}_{seq}.mp3 |
| feedback | JSONB | Structured feedback for this turn |
| created_at | TIMESTAMP | |
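
The Messages draft above translates to DDL roughly as follows. Types and constraints are a sketch, and gen_random_uuid() assumes PostgreSQL 13+ (or the pgcrypto extension on older versions):

```sql
CREATE TABLE messages (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    conversation_id UUID NOT NULL REFERENCES conversations(id),
    role            VARCHAR(16) NOT NULL CHECK (role IN ('user', 'assistant')),
    content         TEXT NOT NULL,
    audio_path      VARCHAR(255),
    feedback        JSONB,
    created_at      TIMESTAMP NOT NULL DEFAULT now()
);
```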

7. API Endpoints

REST (Management)

  • POST /api/session/init - Create guest user.
  • GET /api/topics - List topics.
  • GET /api/conversations/{id}/history - Get past messages.

WebSocket (Real-time)

  • WS /ws/conversation/{topic_id} - Main entry point for the call.
    • Upstream (Client -> Server):
      • Binary: Audio Data (PCM/WebM).
      • JSON: { "event": "start_speaking" }
      • JSON: { "event": "stop_speaking" }
    • Downstream (Server -> Client):
      • Binary: Audio Data (MP3/PCM).
      • JSON: { "event": "transcript", "data": "..." }
      • JSON: { "event": "feedback", "data": "..." }
