Audio Studio AI is an interactive application built with Python and Streamlit that lets users generate high-quality audio content locally using text-to-speech technology. It provides both a Streamlit web interface and a REST API for integration, and is well suited to creating voiceovers for videos, podcasts, and other audio content.
- Local AI-powered text-to-speech with high-quality voice synthesis
- Multiple language support with various voices for each language
- Sentence-based audio generation with customizable pauses
- Individual sentence preview and editing
- Export/Import functionality for sentence configurations
- Multiple output formats (WAV, MP3) with dynamic format support
- Customizable speech speed for each sentence
- Beautiful Streamlit web interface for easy interaction
- REST API for integration with other applications
- Local processing - no cloud dependencies required
git clone https://github.com/paulocoutinhox/audio-studio-ai.git
cd audio-studio-ai
python3 -m venv .venv
source .venv/bin/activate # macOS/Linux
.venv\Scripts\activate # Windows
pip install -r requirements.txt
Download the required model files from the model repository and place them in the models/ directory of the project. See the Model Support section for details.
The application expects the model files to be in the models/ directory. See the Model Support section for specific file requirements.
You can configure the following settings in the sidebar:
- Output format (WAV or MP3)
- Minimum and maximum pause duration between sentences
- Model and voices file paths
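As a rough illustration of how the minimum and maximum pause settings might be applied, the sketch below picks a silence length for each gap between sentences and joins the clips. That the pause is drawn at random from the configured range is an assumption; the function names `pause_samples` and `join_with_pauses` and the 24 kHz default sample rate are hypothetical and not taken from the project.

```python
import random


def pause_samples(min_pause, max_pause, sample_rate=24000, rng=None):
    """Return the number of silent samples for one gap between sentences.

    Assumption: the pause length (in seconds) is drawn uniformly between
    min_pause and max_pause; the application may choose it differently.
    """
    if min_pause < 0 or max_pause < min_pause:
        raise ValueError("expected 0 <= min_pause <= max_pause")
    rng = rng or random.Random()
    seconds = rng.uniform(min_pause, max_pause)
    return int(round(seconds * sample_rate))


def join_with_pauses(clips, min_pause, max_pause, sample_rate=24000, rng=None):
    """Concatenate lists of samples, inserting silence between clips."""
    joined = []
    for index, clip in enumerate(clips):
        if index > 0:
            gap = pause_samples(min_pause, max_pause, sample_rate, rng)
            joined.extend([0.0] * gap)  # silence is represented as zeros
        joined.extend(clip)
    return joined
```

Setting the minimum and maximum to the same value gives a fixed pause, which is convenient when the timing has to be reproducible.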
- Run the Streamlit Application:
  streamlit run app.py
- Steps in the Web UI:
  - Add sentences using the "Add Sentence" button
  - For each sentence:
    - Enter the text
    - Select the language
    - Choose a voice
    - Adjust the speech speed
  - Use the up/down arrows to reorder sentences
  - Delete sentences using the trash icon
  - Click "Generate Audio" to create the final audio
  - Preview individual sentences or download the complete audio
- Start the API Server:
  python api.py
- API Documentation:
  - Server runs on: http://localhost:8000
  - Interactive docs: http://localhost:8000/docs
  - 📚 Complete API Documentation - Examples, endpoints, and integration guides
- Quick API Example:
  curl -X POST "http://localhost:8000/generate-audio" \
    -H "Content-Type: application/json" \
    -d '{
      "sentences": [{
        "text": "Hello world!",
        "lang": "en-us",
        "voice": "af_sarah",
        "speed": 1.0
      }],
      "output_format": "mp3"
    }'
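The same request can be prepared from Python with only the standard library. The sketch below mirrors the curl example: it builds the JSON payload and the POST request but does not send it. The helper name `build_request` is hypothetical, and the shape of the server's response is not described here, so the sending step is left commented out.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/generate-audio"  # endpoint from the curl example


def build_request(sentences, output_format="mp3", url=API_URL):
    """Build (but do not send) the POST request for audio generation."""
    payload = {"sentences": sentences, "output_format": output_format}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    [{"text": "Hello world!", "lang": "en-us", "voice": "af_sarah", "speed": 1.0}]
)

# To send it, start the API server first, then uncomment:
# with urllib.request.urlopen(request) as response:
#     body = response.read()  # response format: see the API documentation
```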
The application supports multiple languages with various voices for each:
- American English (en-us)
  - Multiple voices including af_sarah, af_nova, af_river, and more
- British English (en-gb)
  - Voices like bf_alice, bf_emma, bm_daniel, and more
- Japanese (ja)
  - Voices including jf_alpha, jf_gongitsune, jm_kumo, and more
- Mandarin Chinese (zh)
  - Multiple voices like zf_xiaobei, zf_xiaoxiao, zm_yunjian, and more
- Spanish (es)
  - Voices including ef_dora, em_alex
- French (fr)
  - Voice ff_siwis
- Hindi (hi)
  - Voices including hf_alpha, hf_beta, hm_omega
- Italian (it)
  - Voices if_sara, im_nicola
- Brazilian Portuguese (pt-br)
  - Voices pf_dora, pm_alex, pm_santa
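In the voice names listed above, the first letter of the ID appears to track the language (for example `a` for en-us and `p` for pt-br). The sketch below uses that pattern to catch a voice paired with the wrong language code before a request is made. The mapping is inferred from this list only, and the helper names are hypothetical; the application itself may validate voices differently.

```python
# Prefix letter -> language code, inferred from the voice names in the list.
LANGUAGE_BY_PREFIX = {
    "a": "en-us", "b": "en-gb", "j": "ja", "z": "zh", "e": "es",
    "f": "fr", "h": "hi", "i": "it", "p": "pt-br",
}


def language_for_voice(voice):
    """Guess the language code from a voice ID such as 'af_sarah'."""
    prefix, separator, name = voice.partition("_")
    if len(prefix) != 2 or not separator or not name:
        raise ValueError(f"unexpected voice ID: {voice!r}")
    try:
        return LANGUAGE_BY_PREFIX[prefix[0]]
    except KeyError:
        raise ValueError(f"unknown language prefix in {voice!r}") from None


def voice_matches_language(voice, lang):
    """Return True when the voice ID's prefix agrees with the language code."""
    return language_for_voice(voice) == lang
```

For instance, `voice_matches_language("bm_daniel", "en-us")` is false under this mapping, because the `b` prefix corresponds to en-gb.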
This application currently supports the Kokoro TTS model for high-quality text-to-speech synthesis. To use the application:
- Download the following files from Hugging Face - Kokoro-82M:
  - kokoro-v1.0.onnx
  - voices-v1.0.bin
- Download the model files using these direct links:
- Place these files in the models/ directory of the project.
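Before launching the app, it can help to confirm that both files are where the application expects them. The following is a minimal sketch of such a check, run from the project root; the function name `missing_model_files` is hypothetical and is not part of the project.

```python
from pathlib import Path

# File names as listed in the Model Support steps.
REQUIRED_MODEL_FILES = ("kokoro-v1.0.onnx", "voices-v1.0.bin")


def missing_model_files(models_dir="models"):
    """Return the required file names that are not present in models_dir."""
    directory = Path(models_dir)
    return [name for name in REQUIRED_MODEL_FILES if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All model files found.")
```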
The Kokoro model provides:
- High-quality voice synthesis
- Support for multiple languages
- Various voice options for each language
- Fast local processing
- No cloud dependencies
audio-studio-ai/
│
├── 📝 README.md # Project documentation and guide
├── 📚 API.md # Complete REST API documentation
├── 🎯 app.py # Main Streamlit application interface
├── 🚀 api.py # FastAPI REST API server
├── ⚙️ config.py # Configuration and settings management
├── 🛠️ utils.py # Utility functions and helpers
├── 📦 requirements.txt # Project dependencies list
│
├── 🤖 models/ # AI model files directory
│   ├── 🧠 kokoro-v1.0.onnx # TTS neural network model
│   └── 🗣️ voices-v1.0.bin # Voice data and configurations
│
├── 🎵 temp/ # Temporary audio files storage
│
└── 🎨 extras/ # Additional resources
    └── 🖼️ images/ # Images, icons and assets
- Fork the repository
- Create a new branch (git checkout -b feature-xyz)
- Commit changes (git commit -m "Added new feature")
- Push to the branch (git push origin feature-xyz)
- Open a pull request
For issues or contributions, open a GitHub issue or contact: 💎 paulocoutinhox@gmail.com 🔗 GitHub
Copyright (c) 2025, Paulo Coutinho