The new VibeVoice models can generate much longer audio outputs and support up to 4 speakers. It would be great if vox-box could wrap them so they are easier to use behind an OpenAI API-compatible TTS endpoint.
Here's their own Python reference example:
https://github.com/microsoft/VibeVoice/blob/main/demo/inference_from_file.py
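
To make the request concrete, here is a rough sketch of what a client call could look like once this is wrapped. It assumes vox-box would serve VibeVoice on the standard OpenAI `/v1/audio/speech` route. The base URL, the `vibevoice` model name, and the `speaker_1` voice name are hypothetical placeholders, and how several speakers would map onto the single `voice` field is left open.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8080",
                         model="vibevoice", voice="speaker_1"):
    """Build a POST request in the shape of OpenAI's /v1/audio/speech API.

    base_url, model, and voice are placeholders (assumptions), not
    values confirmed by vox-box or VibeVoice.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "mp3",
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, path, **kwargs):
    """Send the request and write the returned audio bytes to path."""
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

With a server running, `synthesize_to_file("Hello there.", "speech.mp3")` would save the generated audio; any existing OpenAI-compatible client should be able to send the same request.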