wanjawischmeier/realtime-translation-backend

The backend for our realtime translation project. It is expected to be run alongside the frontend.

This project uses the wanjawischmeier/WhisperLiveKit fork of QuentinFuxa's Whisper wrapper to transcribe audio locally and in realtime. The transcript can be translated into a list of dynamically requested languages using LibreTranslate, and transcript chunks are sent out to the respective frontends over a websocket connection. The pipeline supports multiple streamers and viewers through a room system: when streamers connect to and activate a room, they can send their microphone audio to the server for processing.
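
The per-sentence translation fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the backend's actual code: `translate` is a hypothetical callable standing in for a request to the local LibreTranslate instance.

```python
# Sketch of the translation fan-out: one transcribed sentence is translated
# into every subscribed target language. `translate(text, source, target)`
# is a stand-in for a call to LibreTranslate.

def fan_out(sentence: str, source_lang: str, target_langs: list, translate) -> dict:
    """Return a dict mapping language codes to translated versions of the sentence."""
    content = {source_lang: sentence}  # the original always ships as-is
    for lang in target_langs:
        if lang == source_lang:
            continue  # no need to translate into the source language
        content[lang] = translate(sentence, source_lang, lang)
    return content
```

A chunk sent to the frontends would then carry one such dict per sentence, which also explains why not every sentence is available in every language at all times.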

Getting started

Dependencies

  • Python 3.9.23 (pyenv)
  • Poetry
  • FFmpeg
sudo apt-get install ffmpeg

# If using pyenv
pyenv install 3.9.23 # if not installed already
pyenv local 3.9.23
poetry env use /home/username/.pyenv/versions/3.9.23/bin/python

Run using

# With predefined parameters
bash backend.sh

# Or manually
poetry run python src/whisper_server.py

Parameter explanation

-vac # Voice activity controller; very important, should always be on
--buffer-trimming sentence # waits for a sentence to finish before processing
--buffer-trimming segment # processes after a certain amount of time, without waiting for context
# Segment is more stable when people speak very fast without pauses
# Sentence is a bit more accurate, but may cause lag when people speak too fast
--confidence-validation # Makes it a lot faster, but slightly less accurate
--punctuation-split # Splits chunks at punctuation, whether or not a chunk is a full sentence
--min-chunk-size 1 # Default 1; higher values lead to cut sentences, lower values to more accuracy but a higher GPU workload
--device cuda # Run on CPU or GPU (e.g. cpu, cuda)
--compute-type float16 # float16 or float32; float32 is more precise but takes more computing power - depends on GPU architecture

Architecture

(Architecture diagram)

Endpoints

  • http://localhost:3000: Umami frontend stats
  • http://localhost:8090: Beszel backend performance stats
  • http://localhost:5000: LibreTranslate instance
  • http://localhost:8000: FastAPI backend for http traffic
    • GET /health: Health check, returns status
    • GET /room_list: Returns a room list
    • GET /vote: Get vote list
    • GET /vote/{id}/{action}: Action can be add or remove
    • POST /auth: Checks password, returns result
    • POST /transcript_list: Returns a list of transcript infos
    • POST /room/{room_id}/transcript/{target_lang}: Compiles and returns the entire transcript of a given room in the target_lang as a string. Joins all partial transcripts available for that room.
    • POST /room/{room_id}/close: Closes that room, can only be performed with admin password.
  • ws://localhost:8000/room/{room_id}/{role}/{source_lang}/{target_lang}
    • FastAPI websocket for handling streaming
    • Bidirectional
      • expects audio stream from host (audio/webm;codecs=opus)
      • sends all available transcriptions to host and clients in chunks
    • Expects correct password in authenticated cookie, otherwise refuses connection
    • Parameters
      • room_id: unique room identifier
      • role: Can be host or client
      • source_lang/target_lang: The respective language codes, e.g. de, en
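
A minimal sketch of assembling the websocket URL from these parameters; the host and port are assumptions matching the local setup above, and `room_ws_url` is a hypothetical helper, not part of the backend's API.

```python
# Build the streaming websocket URL for a given room, role, and language pair.
# Base URL is an assumption matching the local dev setup (ws://localhost:8000).

def room_ws_url(room_id: str, role: str, source_lang: str, target_lang: str,
                base: str = "ws://localhost:8000") -> str:
    if role not in ("host", "client"):
        raise ValueError("role must be 'host' or 'client'")
    return f"{base}/room/{room_id}/{role}/{source_lang}/{target_lang}"
```

A host streaming German audio to a room `talk1` would connect to `room_ws_url("talk1", "host", "de", "en")`; remember that the connection is refused without a valid auth cookie.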

Ngrok config:

endpoints:
  - name: frontend
    upstream:
      url: 5173
  - name: backend
    url: https://dynamic-freely-chigger.ngrok-free.app
    upstream:
      url: 8000

Start using

ngrok start --all

Data structures

Room list

{
  # Languages available for transcription by the whisper engine
  "available_source_langs": [
    "de",
    "en",
    # ...
  ],

  # Languages that can be translated into by LibreTranslate
  "available_target_langs": [
    "ar",
    "az",
    # ...
  ],

  # The maximum number of rooms that can be handled by the hardware simultaneously
  "max_active_rooms": 2,

  # List of all rooms that are relevant at this point in time
  "rooms": [
    {
      # Information provided per room
      "id": "",
      "title": "",
      "description": "",
      "track": "",
      "location": "",
      "presenter": "",
      "host_connection_id": "",
      "source_lang": ""
    }
  ]
}
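
As an illustration of how a frontend might use this structure, the sketch below counts free room slots. It assumes that a room with a non-empty `host_connection_id` counts as active; `free_room_slots` is a hypothetical helper, not part of the backend.

```python
# Given a room list as returned by GET /room_list, estimate how many more
# rooms the backend can handle. Assumption: a room is "active" when its
# host_connection_id is non-empty.

def free_room_slots(room_list: dict) -> int:
    active = [r for r in room_list["rooms"] if r.get("host_connection_id")]
    return max(room_list["max_active_rooms"] - len(active), 0)
```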

Transcript chunk

{
  "last_n_sents": [
    {
      "line_idx": 0,
      "beg": 0,
      "end": 13,
      "speaker": -1,
      "sentences": [
        {
          "sent_idx": 0,
          "content": {
            "en": "",
            "de": ""
          }
        },
        {
          "sent_idx": 1,
          "content": {
            "en": "",
            # NOTE: Not all sentences will be available in the same languages, as translation happens asynchronously
          }
        },
        {
          "sent_idx": 2,
          "content": {
            "en": "",
            "de": ""
          }
        }
      ]
    }
  ],
  "incomplete_sentence": "",
  "transcription_delay": 10.610000000000001,
  "translation_delay": 0
}
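
Since translation happens asynchronously, a client rendering a chunk has to cope with sentences that are not yet available in its target language. A minimal sketch of that, assuming the field names above and falling back to whatever language is available (`chunk_text` is a hypothetical helper):

```python
# Flatten a transcript chunk into display text for one target language.
# Sentences not yet translated into target_lang fall back to the first
# available language version.

def chunk_text(chunk: dict, target_lang: str) -> str:
    parts = []
    for line in chunk["last_n_sents"]:
        for sent in line["sentences"]:
            content = sent["content"]
            # Prefer the requested language, otherwise take any available one
            text = content.get(target_lang) or next(iter(content.values()), "")
            if text:
                parts.append(text)
    return " ".join(parts)
```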

Health check

# If server is ready to accept requests
{"status": "ok"}

# If server is running, but not ready to accept requests
{"status": "not ready"}

Auth check

# If password is valid
{"status": "ok"}

# If password is invalid
{"status": "fail"}

Transcript infos

[
  {
    "id": "room_id_0",
    "firstChunkTimestamp": 0,
    "lastChunkTimestamp": 0
  },
  {
    "id": "room_id_1",
    "firstChunkTimestamp": 0,
    "lastChunkTimestamp": 0
  },
  # ...
]
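
For example, a frontend could use the timestamps to pick the most recently updated transcript. A small sketch, assuming the field names above (`latest_transcript` is a hypothetical helper):

```python
from typing import Optional

# Pick the transcript info with the newest lastChunkTimestamp, or None
# if no transcripts are stored.

def latest_transcript(infos: list) -> Optional[dict]:
    return max(infos, key=lambda i: i["lastChunkTimestamp"], default=None)
```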

Umami

Used for tracking certain events and pageviews coming in from the frontend.

To run:

cd stats/umami
docker compose up -d

Beszel

Used for tracking backend performance metrics (GPU utilization, etc.)

To run:

# To start the beszel server
cd stats/beszel
docker compose up -d

# To start the agent instance for the current system
cd agent # in stats/beszel/agent
docker compose up -d

TODOs

Important

  • Bind the whisper engine to room instances
  • Open/close rooms correctly
    • A room is opened when the host joins
    • A room is closed when the host has left (plus a 5 min buffer so the host can rejoin after a brief dropout)
    • When the host language changes (requires an engine restart), the host should leave the room and rejoin with the new language
    • If the host joins an already open room with changed parameters, the room manager restarts the room
    • Send "ready" packet
  • Eine Restart-Option für Räume im Frontend implementieren
  • Websocket connects/disconnects handlen und Bugs fixen
    • Unique host id
    • Fix: Client disconnects dont get recognized correctly
    • Fix: Rooms get prematurely closed upon host reconnects
    • Preserve source lang across host reconnects
    • Everyone should get kicked out of room if it closes
    • Fix host disconnect after long time
  • Send the room list to the frontend (endpoint)
  • Use the auth cookie for authentication
  • Check if room is "DO-NOT-RECORD" and prevent activating it
  • Use AVAILABLE_WHISPER_LANGS & AVAILABLE_LT_LANGS to verify frontend requests
  • Endpoint to fetch human readable transcript for room (join all partial transcripts, with date timestamp)
    • Provide endpoint
    • Join all partial transcripts
    • Load from memory, or from disk if that's not available
    • Endpoint to provide a list of all room IDs that have transcripts stored to disk
      • Available as transcript info at /transcript_list
      • Also store and provide room metadata alongside (@whoami)
      • Respect user preferences on whether to store transcripts (@substatoo)
      • Respect user preferences on whether clients can download transcripts (@substatoo)
  • Respect whisper instance limit when activating rooms
  • Whisper device, compute_type passthrough to cli from custom WhisperLiveKit fork
  • Support whisper model unloading (in custom fork)
    • Probably fine; now handled by GC
  • Performance monitoring
    • https://beszel.dev/guide/gpu
    • (Write stats to log file? Not strictly necessary) -> Is now in umami
    • Docker compose is set up in stats/beszel
  • Umami stats
    • Docker compose is set up in stats/umami
  • Fix country coding in transcription chunks
    • No longer provide a default sentence; instead make a sentence's content field a dict of country codes
  • Move the whisper engine to a separate process
  • Proper target langs subscribe/unsubscribe
    • Prevent doubling of target langs
    • Ignore target langs that are equal to source lang (don't add to list)
  • Send initial transcript chunk on client connection
  • Move the transcript and room system to separate files in dedicated dirs
  • Pace translation worker (@substratoo)
    • As of now, it will just work through all sentences in one loop if a new language gets subscribed to
  • Add admin acc
    • Ability to force close rooms as admin
  • Help markdown file (@whoami)
  • Translation worker should only try to fetch the most recent n sentences (in reverse order, so most recent first)

For potential future updates

  • Fix: Ending the process does not work properly; some threads seem to stay running
    • Fix CTRL-C
  • (Pause fetch loop when connected host is not streaming?)
  • Fix: Multiple hosts not allowed error
    • Very rare, have not been able to pin it down
    • Is maybe fine for now as rooms can be restarted
  • Convert pickle files for transcripts into conventional database implementation
