Add JS client / web API bindings #3

@Technologicat

Description

Having JS bindings for Raven-server's web API would be useful, especially to be able to use Raven-avatar in web apps, such as LLM frontends.

We have existing Python bindings; see raven.client.api (the avatar_* functions) and raven.client.tts (tts_prepare and tts_speak_lipsynced). See also raven.client.avatar_renderer (receiver and GUI blitter) and raven.client.avatar_controller (TTS preprocessing, playback control, and subtitling).
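
As a rough starting point, the JS bindings could mirror that surface. Below is a minimal sketch assuming a plain fetch-based wrapper; the endpoint paths, payload shapes, and function names are placeholders for illustration, not the actual Raven-server routes.

```typescript
// Sketch of a JS/TS counterpart to raven.client.api and raven.client.tts.
// All endpoint paths and payload shapes below are assumptions, not the real routes.

const BASE_URL = "http://localhost:5100";  // assumed server address

// Hypothetical helper: POST JSON to a server endpoint and return the parsed response.
async function post(endpoint: string, payload: unknown): Promise<any> {
  const response = await fetch(`${BASE_URL}${endpoint}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!response.ok) throw new Error(`${endpoint} failed: ${response.status}`);
  return response.json();
}

// Counterparts of the Python avatar_* functions (names and paths hypothetical).
export const avatarLoad = (character: string) => post("/api/avatar/load", { character });
export const avatarStart = () => post("/api/avatar/start", {});
export const avatarStop = () => post("/api/avatar/stop", {});

// Counterpart of tts_prepare: request speech audio plus word-level timestamps and phonemes.
export const ttsPrepare = (text: string, voice: string) =>
  post("/api/tts/prepare", { text, voice });
```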

Most of this should be quick and straightforward to port to JS, but there are some issues worth considering:

  • Once everything was optimized, one of the remaining speed bottlenecks of the original Talkinghead was the PNG encoder.

    Upscaling makes the issue much worse. PNG encoders are simply not fast enough for sending a 1024x1024 video stream at 25 fps.

    PNG is easy to display on the client side, since browsers natively support multipart/x-mixed-replace streams when the video frames are encoded in a browser-supported image format (a minimal display sketch follows this list).

    Since the avatar already uses a GPU on the server side, I experimented with nvImageCodec. Encoding PNG on the GPU was fast, but the library was brittle to use, and I couldn't get it to work at all with RGBA images, which the avatar absolutely needs. Also, I don't want to add a hard NVIDIA dependency (see AMD GPU support #1).

    If anyone has written a fast PNG RGBA encoder in PyTorch, using that on the server side could be one option.

    So the question is, how do we encode (and decode) the video stream quickly enough for realtime, when the server is Python and the client is JS? Almost any solution that works is fine.

  • Hence, Raven-avatar currently uses an alternative lossless RGBA format, QOI (Quite OK Image), which encodes much faster (roughly 30x) and compresses almost as tightly as PNG. This is fast enough for 1024x1024 at 25 fps.

    But AFAIK, QOI is not natively supported by web browsers. There are some JS decoders [1] [2], so theoretically we could decode each video frame into an array and then blit it onto a canvas (see the QOI sketch after this list), but I haven't benchmarked them. That also requires a bit more client-side code (although probably not more than a similar Python-based client GUI app has; for example, raven.avatar.settings_editor_app works this way).

    The avatar's send format is configurable, so in a pinch we can still use PNG, but that means no upscaling, as well as higher CPU usage, both due to the slow encoder.

  • The lipsync code is a couple hundred SLOC, but ultimately simple. Once the lipsync driver gets the word-level timestamps and phonemes from the server (together with the speech audio file), it essentially just applies a phoneme-to-morph lookup table and sends mouth morph overrides to the avatar in realtime (a lipsync sketch follows this list).

    There's no need to port the Kokoro-FastAPI part, since currently only the internal Kokoro backend in Raven-server serves the phonemes with word breaks matching those of the word-level timestamps. This is required in order not to crash on some inputs; e.g. "2025" → "twenty twenty five" is treated as a single word.

    The main question mark here is how to play audio in a web browser. We absolutely need to be notified when the audio ends (or when the user stops it), so that the lipsync driver knows to shut down (thus disabling the overrides on the avatar's mouth morphs); an audio playback sketch follows this list.

    Ideally, we would also like to get the audio playback latency from the audio player, to be able to sync the video correctly. But there is a time offset feature that can be used in a pinch.

  • The GUI renderer and TTS/subtitle controller modules add ~1100 SLOC in total, but they make it much easier to integrate the avatar into apps.
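
Below are a few sketches referenced from the list above. First, the PNG path: displaying a multipart/x-mixed-replace stream needs no decoding code at all, since the browser handles it inside an <img> element. The stream URL is a placeholder, not the actual Raven-server endpoint.

```typescript
// PNG mode: the browser decodes a multipart/x-mixed-replace stream natively.
// The URL below is a placeholder; substitute the actual Raven-server video endpoint.
const img = document.createElement("img");
img.src = "http://localhost:5100/api/avatar/video_feed";  // hypothetical endpoint
document.body.appendChild(img);
```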
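
For the QOI path, a minimal sketch of the client-side work, assuming one of the existing JS QOI decoders is wrapped behind a qoiDecode function (its real API will differ). Frame delivery (multipart parsing, WebSocket, etc.) is left out of scope here.

```typescript
// QOI mode: decode a received frame into raw RGBA bytes and blit it onto a <canvas>.
// `qoiDecode` stands in for an existing JS QOI decoder; its actual API will differ.
declare function qoiDecode(data: Uint8Array): {
  width: number;
  height: number;
  pixels: Uint8ClampedArray;  // tightly packed RGBA
};

const canvas = document.getElementById("avatar") as HTMLCanvasElement;
const ctx = canvas.getContext("2d")!;

function blitFrame(frameBytes: Uint8Array): void {
  const { width, height, pixels } = qoiDecode(frameBytes);
  if (canvas.width !== width || canvas.height !== height) {
    canvas.width = width;   // resize lazily, e.g. when upscaling settings change
    canvas.height = height;
  }
  // putImageData preserves the alpha channel, which the avatar needs.
  ctx.putImageData(new ImageData(pixels, width, height), 0, 0);
}
```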
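
The lipsync idea in code, as a sketch only: the timing data shape, the table entries, and the sendMorphOverride transport are all assumptions for illustration, not the actual protocol.

```typescript
// Lipsync driver sketch: look up a mouth morph for the phoneme currently being
// spoken (using the word-level timestamps) and send it as a morph override.
interface PhonemeEvent { phoneme: string; start: number; end: number; }  // times in seconds

// Hypothetical phoneme-to-morph table; the real one covers the full phoneme set.
const PHONEME_TO_MORPH: Record<string, Record<string, number>> = {
  "A": { mouth_aaa: 1.0 },
  "O": { mouth_ooo: 1.0 },
  "E": { mouth_eee: 1.0 },
};

// Hypothetical transport: POST (or WebSocket) the overrides to the avatar.
declare function sendMorphOverride(morphs: Record<string, number>): void;

function driveLipsync(events: PhonemeEvent[], audio: HTMLAudioElement): () => void {
  const timer = setInterval(() => {
    const t = audio.currentTime;
    const current = events.find((e) => t >= e.start && t < e.end);
    sendMorphOverride(current ? PHONEME_TO_MORPH[current.phoneme] ?? {} : {});
  }, 40);  // ~25 fps, matching the video stream

  // Returns a shutdown function: stop the timer and clear the overrides.
  return () => { clearInterval(timer); sendMorphOverride({}); };
}
```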
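
Finally, audio playback: an HTMLAudioElement fires "ended" when playback finishes and "pause" when the user stops it, which gives the lipsync driver its shutdown signal; AudioContext.outputLatency (with baseLatency as a fallback) provides a latency estimate for syncing the video. The audio URL is a placeholder.

```typescript
// Audio playback sketch: play the speech file, report an estimated output latency,
// and get notified when playback ends or is stopped by the user.
const audioCtx = new AudioContext();
const audio = new Audio("speech.wav");  // placeholder; in practice e.g. a blob URL from the TTS response
audioCtx.createMediaElementSource(audio).connect(audioCtx.destination);

// outputLatency is not supported in all browsers; baseLatency is a rough fallback.
const latencySeconds = audioCtx.outputLatency || audioCtx.baseLatency;
console.log(`estimated playback latency: ${(latencySeconds * 1000).toFixed(1)} ms`);

let finished = false;
function onPlaybackFinished(): void {
  if (finished) return;  // "pause" also fires when playback reaches the end
  finished = true;
  // Tell the lipsync driver to shut down here, so the mouth morph overrides get cleared.
}
audio.addEventListener("ended", onPlaybackFinished);
audio.addEventListener("pause", onPlaybackFinished);  // user stopped playback

void audio.play();
```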
