Add VibeVoice as a TTS backend option

The new VibeVoice models can generate much longer audio outputs and support up to 4 speakers. It would be great if vox-box could wrap them so they are easier to use behind an OpenAI API-compatible TTS endpoint.
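To make the request concrete, here is a rough sketch of what a client call could look like once this is supported. It only builds a request in the shape of the OpenAI `/v1/audio/speech` API (`model`, `input`, `voice`, `response_format`). The `build_speech_request` helper, the model name `vibevoice`, the voice name `default`, and the base URL are all assumptions for illustration, not anything vox-box currently provides.

```python
import json
import urllib.request


def build_speech_request(base_url, text, model="vibevoice", voice="default"):
    """Build (but do not send) a POST request shaped like OpenAI's /v1/audio/speech.

    The default model and voice names are hypothetical placeholders.
    """
    payload = {
        "model": model,            # hypothetical backend/model identifier
        "input": text,             # text to synthesize
        "voice": voice,            # hypothetical voice name
        "response_format": "wav",  # one of the formats the OpenAI API accepts
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: construct a request against an assumed local server address.
req = build_speech_request("http://localhost:8080", "Hello from VibeVoice.")
print(req.get_method(), req.full_url)
```

Sending it would just be `urllib.request.urlopen(req)` against a running server and writing the response bytes to a file. How multiple speakers would map onto the single `voice` field is an open design question.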

Here's their own Python reference example:

https://github.com/microsoft/VibeVoice/blob/main/demo/inference_from_file.py