Add JS client / web API bindings #3

@Technologicat

Description

Having JS bindings for Raven-server's web API would be useful, especially to be able to use Raven-avatar in web apps, such as LLM frontends.

We have existing Python bindings; see raven.client.api (the avatar_* functions) and raven.client.tts (tts_prepare and tts_speak_lipsynced). See also raven.client.avatar_renderer (receiver and GUI blitter) and raven.client.avatar_controller (TTS preprocessing, playback control, and subtitling).
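
As a rough starting point, the JS bindings could mirror that surface. Below is a minimal sketch assuming a plain fetch-based wrapper; the endpoint paths, payload shapes, and function names are placeholders for illustration, not the actual Raven-server routes.

```typescript
// Sketch of a JS/TS counterpart to raven.client.api and raven.client.tts.
// All endpoint paths and payload shapes below are assumptions, not the real routes.

const BASE_URL = "http://localhost:5100";  // assumed server address

// Hypothetical helper: POST JSON to a server endpoint and return the parsed response.
async function post(endpoint: string, payload: unknown): Promise<any> {
  const response = await fetch(`${BASE_URL}${endpoint}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!response.ok) throw new Error(`${endpoint} failed: ${response.status}`);
  return response.json();
}

// Counterparts of the Python avatar_* functions (names and paths hypothetical).
export const avatarLoad = (character: string) => post("/api/avatar/load", { character });
export const avatarStart = () => post("/api/avatar/start", {});
export const avatarStop = () => post("/api/avatar/stop", {});

// Counterpart of tts_prepare: request speech audio plus word-level timestamps and phonemes.
export const ttsPrepare = (text: string, voice: string) =>
  post("/api/tts/prepare", { text, voice });
```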

Most of this should be quick and straightforward to port to JS, but there are some issues worth considering:

  • Once everything was optimized, one of the remaining speed bottlenecks of the original Talkinghead was the PNG encoder.

    Upscaling makes the issue much worse. PNG encoders are simply not fast enough for sending a 1024x1024 video stream at 25 fps.

    PNG is easy to display on the client side, since browsers natively support multipart/x-mixed-replace streams when the video frames are encoded in a browser-supported image format (a minimal display sketch follows this list).

    Since the avatar already uses a GPU on the server side, I experimented with nvImageCodec. Encoding PNG on the GPU was fast, but the library was brittle to use, and I couldn't get it to work at all with RGBA images, which the avatar absolutely needs. Also, I don't want to add a hard NVIDIA dependency (see AMD GPU support #1).

    If anyone has written a fast PNG RGBA encoder in PyTorch, using that on the server side could be one option.

    So the question is, how do we encode (and decode) the video stream quickly enough for realtime, when the server is Python and the client is JS? Almost any solution that works is fine.

  • Hence, Raven-avatar currently uses an alternative lossless RGBA format, QOI (Quite OK Image), which encodes much faster (roughly 30x) and compresses almost as tightly as PNG. This is fast enough for 1024x1024 at 25 fps.

    But AFAIK, QOI is not natively supported by web browsers. There are some JS decoders [1] [2], so theoretically we could decode each video frame into an array and then blit it onto a canvas (see the QOI sketch after this list), but I haven't benchmarked them. That also requires a bit more client-side code (although probably not more than a similar Python-based client GUI app has; for example, raven.avatar.settings_editor_app works this way).

    The avatar's send format is configurable, so in a pinch we can still use PNG, but that means no upscaling, as well as higher CPU usage, both due to the slow encoder.

  • The lipsync code is a couple hundred SLOC, but ultimately simple. Once the lipsync driver gets the word-level timestamps and phonemes from the server (together with the speech audio file), it essentially just applies a phoneme-to-morph lookup table and sends mouth morph overrides to the avatar in realtime (a lipsync sketch follows this list).

    There's no need to port the Kokoro-FastAPI part, since currently only the internal Kokoro backend in Raven-server serves the phonemes with word breaks matching those of the word-level timestamps. This is required in order not to crash on some inputs; e.g. "2025" → "twenty twenty five" is treated as a single word.

    The main question mark here is how to play audio in a web browser. We absolutely need to be notified when the audio ends (or when the user stops it), so that the lipsync driver knows to shut down (thus disabling the overrides on the avatar's mouth morphs); an audio playback sketch follows this list.

    Ideally, we would also like to get the audio playback latency from the audio player, to be able to sync the video correctly. But there is a time offset feature that can be used in a pinch.

  • The GUI renderer and TTS/subtitle controller modules add ~1100 SLOC in total, but they make it much easier to integrate the avatar into apps.
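
Below are a few sketches referenced from the list above. First, the PNG path: displaying a multipart/x-mixed-replace stream needs no decoding code at all, since the browser handles it inside an <img> element. The stream URL is a placeholder, not the actual Raven-server endpoint.

```typescript
// PNG mode: the browser decodes a multipart/x-mixed-replace stream natively.
// The URL below is a placeholder; substitute the actual Raven-server video endpoint.
const img = document.createElement("img");
img.src = "http://localhost:5100/api/avatar/video_feed";  // hypothetical endpoint
document.body.appendChild(img);
```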
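
For the QOI path, a minimal sketch of the client-side work, assuming one of the existing JS QOI decoders is wrapped behind a qoiDecode function (its real API will differ). Frame delivery (multipart parsing, WebSocket, etc.) is left out of scope here.

```typescript
// QOI mode: decode a received frame into raw RGBA bytes and blit it onto a <canvas>.
// `qoiDecode` stands in for an existing JS QOI decoder; its actual API will differ.
declare function qoiDecode(data: Uint8Array): {
  width: number;
  height: number;
  pixels: Uint8ClampedArray;  // tightly packed RGBA
};

const canvas = document.getElementById("avatar") as HTMLCanvasElement;
const ctx = canvas.getContext("2d")!;

function blitFrame(frameBytes: Uint8Array): void {
  const { width, height, pixels } = qoiDecode(frameBytes);
  if (canvas.width !== width || canvas.height !== height) {
    canvas.width = width;   // resize lazily, e.g. when upscaling settings change
    canvas.height = height;
  }
  // putImageData preserves the alpha channel, which the avatar needs.
  ctx.putImageData(new ImageData(pixels, width, height), 0, 0);
}
```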
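
The lipsync idea in code, as a sketch only: the timing data shape, the table entries, and the sendMorphOverride transport are all assumptions for illustration, not the actual protocol.

```typescript
// Lipsync driver sketch: look up a mouth morph for the phoneme currently being
// spoken (using the word-level timestamps) and send it as a morph override.
interface PhonemeEvent { phoneme: string; start: number; end: number; }  // times in seconds

// Hypothetical phoneme-to-morph table; the real one covers the full phoneme set.
const PHONEME_TO_MORPH: Record<string, Record<string, number>> = {
  "A": { mouth_aaa: 1.0 },
  "O": { mouth_ooo: 1.0 },
  "E": { mouth_eee: 1.0 },
};

// Hypothetical transport: POST (or WebSocket) the overrides to the avatar.
declare function sendMorphOverride(morphs: Record<string, number>): void;

function driveLipsync(events: PhonemeEvent[], audio: HTMLAudioElement): () => void {
  const timer = setInterval(() => {
    const t = audio.currentTime;
    const current = events.find((e) => t >= e.start && t < e.end);
    sendMorphOverride(current ? PHONEME_TO_MORPH[current.phoneme] ?? {} : {});
  }, 40);  // ~25 fps, matching the video stream

  // Returns a shutdown function: stop the timer and clear the overrides.
  return () => { clearInterval(timer); sendMorphOverride({}); };
}
```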
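
Finally, audio playback: an HTMLAudioElement fires "ended" when playback finishes and "pause" when the user stops it, which gives the lipsync driver its shutdown signal; AudioContext.outputLatency (with baseLatency as a fallback) provides a latency estimate for syncing the video. The audio URL is a placeholder.

```typescript
// Audio playback sketch: play the speech file, report an estimated output latency,
// and get notified when playback ends or is stopped by the user.
const audioCtx = new AudioContext();
const audio = new Audio("speech.wav");  // placeholder; in practice e.g. a blob URL from the TTS response
audioCtx.createMediaElementSource(audio).connect(audioCtx.destination);

// outputLatency is not supported in all browsers; baseLatency is a rough fallback.
const latencySeconds = audioCtx.outputLatency || audioCtx.baseLatency;
console.log(`estimated playback latency: ${(latencySeconds * 1000).toFixed(1)} ms`);

let finished = false;
function onPlaybackFinished(): void {
  if (finished) return;  // "pause" also fires when playback reaches the end
  finished = true;
  // Tell the lipsync driver to shut down here, so the mouth morph overrides get cleared.
}
audio.addEventListener("ended", onPlaybackFinished);
audio.addEventListener("pause", onPlaybackFinished);  // user stopped playback

void audio.play();
```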
