Feature description
Automatic Speech Recognition would be lovely to see on burn! There are amazing models galore out there now, and it would be great to get some of them running, particularly in a streaming fashion.
Feature motivation
I would really like to be able to talk to my computer and do work via speech to text.
(Optional) Suggest a Solution
There's a ton of decent options for what to port. I'll rattle off some interesting ones that have caught my eye:
- https://github.com/QwenLM/Qwen3-ASR.git qwen3-asr with ForcedAligner.
- https://github.com/kyutai-labs/delayed-streams-modeling
- https://github.com/kyutai-labs/moshi
- https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
- https://github.com/MoonshotAI/Kimi-Audio
- https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B
- https://huggingface.co/microsoft/VibeVoice-ASR
(And many more)
Streaming, timestamps (Qwen calls this its "forced aligner"), and diarization would all be wonderful. Kimi-Audio notably seems to be very featureful in its capabilities.
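To make the "streaming fashion" ask concrete, here's a minimal sketch of what a streaming ASR interface could look like in Rust. All the names here (`StreamingAsr`, `Segment`, `push_audio`, etc.) are hypothetical, not actual burn APIs, and the implementation is a toy that just demonstrates the push-chunks/poll-finalized-segments control flow a real model port would slot into:

```rust
/// A transcribed segment with timestamps (what a forced aligner produces)
/// and an optional speaker label for diarization. Hypothetical types.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    text: String,
    start_secs: f32,
    end_secs: f32,
    speaker: Option<u32>,
}

/// Hypothetical streaming recognizer: feed audio chunks as they arrive,
/// poll back segments the model has finalized so far.
trait StreamingAsr {
    /// Push a chunk of mono PCM samples at the model's expected sample rate.
    fn push_audio(&mut self, samples: &[f32]);
    /// Drain any segments finalized since the last poll.
    fn poll_segments(&mut self) -> Vec<Segment>;
    /// Signal end of stream and flush any remaining output.
    fn finish(&mut self) -> Vec<Segment>;
}

/// Toy implementation that "recognizes" one placeholder segment per full
/// second of buffered audio, to show where real inference would go.
struct DummyAsr {
    sample_rate: usize,
    buffered: usize,
    emitted: usize,
}

impl DummyAsr {
    fn new(sample_rate: usize) -> Self {
        Self { sample_rate, buffered: 0, emitted: 0 }
    }
}

impl StreamingAsr for DummyAsr {
    fn push_audio(&mut self, samples: &[f32]) {
        self.buffered += samples.len();
    }

    fn poll_segments(&mut self) -> Vec<Segment> {
        let mut out = Vec::new();
        // Emit one placeholder segment per full second of buffered audio.
        while self.buffered >= self.sample_rate {
            self.buffered -= self.sample_rate;
            let start = self.emitted as f32;
            out.push(Segment {
                text: format!("segment {}", self.emitted),
                start_secs: start,
                end_secs: start + 1.0,
                speaker: None,
            });
            self.emitted += 1;
        }
        out
    }

    fn finish(&mut self) -> Vec<Segment> {
        self.poll_segments()
    }
}

fn main() {
    let mut asr = DummyAsr::new(16_000);
    // Simulate streaming 2.5 s of audio as 100 ms chunks.
    for _ in 0..25 {
        asr.push_audio(&[0.0; 1_600]);
    }
    let segments = asr.poll_segments();
    println!("{} segments finalized", segments.len());
}
```

A real implementation would of course run model inference (and carry decoder state) inside `poll_segments`, but the shape of the trait is the part that matters for the feature request: incremental input, incremental timestamped output.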
Notes
There are some very good community projects for speech to text already! https://github.com/tracel-ai/models?tab=readme-ov-file#community-contributions
Thanks laggui for the mention in the discussion I opened on this feature request: #4376 (comment)
vLLM ticket where they added realtime support over WebSockets (in case it's useful): vllm-project/vllm#33187