---
title: 50-Language Speech-to-Speech Translator using Whisper & mBART
emoji: 🗣️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: mit
short_description: 50-Language Speech Translator with Whisper & mBART.
---
This Hugging Face Space is a multimodal demo that performs end-to-end speech translation by chaining together speech recognition, machine translation, and text-to-speech synthesis.
It allows users to speak in one language and hear the translated speech in another, supporting 50 languages.
The application follows a linear processing pipeline:

1. **Automatic Speech Recognition (ASR)**: spoken audio is transcribed into text using Whisper (Large v3 Turbo).
2. **Neural Machine Translation (NMT)**: the transcribed text is translated into the selected target language using mBART-50, which supports 50 languages.
3. **Text-to-Speech (TTS)**: the translated text is converted back into audio using gTTS (Google Text-to-Speech).

The result is a seamless speech-to-speech translation experience.
```mermaid
flowchart TD
    Start([User Records Audio]) --> ASR[Automatic Speech Recognition<br/>openai/whisper-large-v3-turbo]
    ASR --> |Transcribed Text| NMT[Neural Machine Translation<br/>facebook/mbart-large-50-many-to-many-mmt]
    NMT --> |Translated Text<br/>Target Language| TTS[Text-to-Speech<br/>gTTS - Google Text-to-Speech]
    TTS --> Output([Audio Output])
    style Start stroke:#2563eb,stroke-width:3px
    style ASR stroke:#dc2626,stroke-width:3px
    style NMT stroke:#7c3aed,stroke-width:3px
    style TTS stroke:#059669,stroke-width:3px
    style Output stroke:#2563eb,stroke-width:3px
```
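The three stages in the diagram above can be sketched in a single function. This is a minimal sketch, not the Space's actual `app.py`: the function name `translate_speech`, the file handling, and the parameter names are illustrative, while the model IDs are the ones named in the diagram. Dependencies are imported lazily inside the function so the heavy models are only pulled when it runs.

```python
def translate_speech(audio_path: str, src_code: str, tgt_code: str, tts_lang: str) -> str:
    """Transcribe audio, translate the text, synthesize speech; returns an mp3 path.

    src_code/tgt_code are mBART-50 locale codes (e.g. "en_XX", "fr_XX");
    tts_lang is the gTTS ISO code (e.g. "fr").
    """
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration, pipeline
    from gtts import gTTS

    # 1) ASR: spoken audio -> source-language text
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")
    text = asr(audio_path)["text"]

    # 2) NMT: mBART-50 many-to-many; the output language is forced
    #    by setting the target language's BOS token.
    model_id = "facebook/mbart-large-50-many-to-many-mmt"
    tok = MBart50TokenizerFast.from_pretrained(model_id, src_lang=src_code)
    model = MBartForConditionalGeneration.from_pretrained(model_id)
    batch = tok(text, return_tensors="pt")
    generated = model.generate(**batch, forced_bos_token_id=tok.lang_code_to_id[tgt_code])
    translated = tok.batch_decode(generated, skip_special_tokens=True)[0]

    # 3) TTS: translated text -> audio file
    gTTS(translated, lang=tts_lang).save("translated.mp3")
    return "translated.mp3"
```

In the real app the pipelines would typically be loaded once at startup rather than per call, which matters for latency on Spaces hardware.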
- UI: Gradio
- Speech Recognition: Whisper
- Translation: Facebook mBART-50
- Text-to-Speech: gTTS
- Hosting: Hugging Face Spaces, Vercel
The demo supports 50 languages, including:
Arabic, Chinese, English, French, German, Hindi, Japanese, Korean, Portuguese, Spanish, Vietnamese, and many more.
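Each language needs two identifiers at runtime: an mBART-50 locale code for translation and a gTTS ISO code for synthesis. A plausible mapping for the languages listed above might look like this (the dict name `LANGUAGES` and the helper `codes_for` are illustrative, not taken from the app):

```python
# Subset of the 50 supported languages, mapped to the code each stage expects.
# mBART-50 uses locale-style codes (e.g. fr_XX); gTTS uses ISO-style codes.
LANGUAGES = {
    "Arabic":     ("ar_AR", "ar"),
    "Chinese":    ("zh_CN", "zh-CN"),
    "English":    ("en_XX", "en"),
    "French":     ("fr_XX", "fr"),
    "German":     ("de_DE", "de"),
    "Hindi":      ("hi_IN", "hi"),
    "Japanese":   ("ja_XX", "ja"),
    "Korean":     ("ko_KR", "ko"),
    "Portuguese": ("pt_XX", "pt"),
    "Spanish":    ("es_XX", "es"),
    "Vietnamese": ("vi_VN", "vi"),
}

def codes_for(language: str) -> tuple[str, str]:
    """Return (mbart_code, gtts_code) for a display-name language."""
    return LANGUAGES[language]
```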
- Models were selected to balance language coverage, latency, and availability on Hugging Face Spaces.
- Always-on or fast-loading models were preferred to avoid cold-start delays.
- The demo focuses on clarity and reliability rather than pushing the largest possible models.
- Long audio inputs may increase processing time.
- Translation quality can vary for less common language pairs.
- TTS voices depend on gTTS language support.
This project is released under the MIT License.