
v1.4.0: Chunking, Predefined Voices, Enhanced Cloning & Performance



@devnen devnen released this 29 Apr 02:36

This release introduces major features for handling long text, providing consistent voices, and improving performance, along with significant enhancements to voice cloning and configuration management.

🚀 New Features:

  • Large Text Processing (Chunking): Automatically splits long text inputs based on sentence structure and speaker tags ([S1]/[S2]), enabling generation for documents of any length. Configurable via UI/API (split_text, chunk_size).
  • Predefined Voices: Added 43 ready-to-use, curated synthetic voices located in the ./voices directory. Selectable in the UI for consistent, high-quality output without cloning setup. Server automatically handles required transcripts.
  • Enhanced Voice Cloning: Improved backend pipeline with automatic reference audio processing (mono conversion, resampling, truncation) and transcript handling (prioritizes a local .txt file over the experimental Whisper fallback).
  • Whisper Integration: Added openai-whisper as an experimental fallback for automatic transcript generation during cloning if a .txt file is missing.
  • Generation Seed: Added seed parameter (UI/API) to influence generation. Using a fixed integer seed with Predefined/Cloned voices enhances consistency across chunks or separate generations.
  • API Enhancements:
    • /tts endpoint now supports transcript (for explicit clone transcript), split_text, chunk_size, and seed.
    • /v1/audio/speech endpoint now supports seed.
  • Terminal Progress: Long text generation using chunking now displays a tqdm progress bar in the terminal.
  • UI Configuration Management: Added UI section to view/edit config.yaml settings and save generation defaults.
  • Configuration System: Migrated to config.yaml for primary runtime configuration. .env is now used mainly for initial seeding or resetting defaults via the UI.
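The chunking feature above can be sketched roughly as follows. This is a simplified illustration only, not the server's actual implementation; the sentence-splitting heuristic and the character-based size limit are assumptions, while the [S1]/[S2] tag format and the chunk_size name come from the release notes.

```python
import re

def chunk_text(text: str, chunk_size: int = 300) -> list[str]:
    """Split long input into chunks, respecting [S1]/[S2] speaker tags
    and sentence boundaries. Simplified illustration only."""
    chunks = []
    # Split on speaker tags; the capturing group keeps each tag as its own item.
    parts = re.split(r"(\[S[12]\])", text)
    tag = ""
    for part in parts:
        part = part.strip()
        if not part:
            continue
        if re.fullmatch(r"\[S[12]\]", part):
            tag = part  # remember the current speaker for subsequent text
            continue
        # Split the tagged segment into sentences at ., !, or ? boundaries.
        sentences = re.split(r"(?<=[.!?])\s+", part)
        current = ""
        for sentence in sentences:
            # Start a new chunk when adding this sentence would exceed the limit.
            if current and len(current) + len(sentence) + 1 > chunk_size:
                chunks.append(f"{tag} {current}".strip())
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(f"{tag} {current}".strip())
    return chunks
```

Each chunk keeps its speaker tag, so every chunk can be synthesized independently and the results concatenated.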

🔧 Fixes & Enhancements:

  • VRAM Usage Fixed & Optimized: Resolved memory leaks and significantly reduced VRAM usage (from over 14GB down to ~7GB) through code optimizations and defaulting to BF16 precision.
  • Performance: Significant speed improvements reported (approaching 95% real-time on tested hardware: AMD Ryzen 9 9950X3D + NVIDIA RTX 3090).
  • Audio Post-Processing: Automatically applies silence trimming, internal silence fixing, and unvoiced segment removal (using Parselmouth) to improve audio quality and remove artifacts.
  • UI State Persistence: Web UI now saves/restores settings (text, mode, files, parameters) in config.yaml.
  • UI Improvements: Better loading indicators, refined chunking controls, seed input, theme toggle, dynamic preset loading from ui/presets.yaml, warning modals.
  • Cloning Workflow: Backend now handles transcript prepending automatically; UI workflow simplified.
  • Dependency Management: Added tqdm, PyYAML, openai-whisper, parselmouth.
  • Code Refactoring: Aligned internal engine code with refactored dia library structure; updated config.py to use YamlConfigManager.

Note: The configuration system has changed significantly. Settings are now primarily managed via config.yaml. See the documentation for details.
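To illustrate what the migrated configuration might look like, here is a hypothetical config.yaml fragment. The key names below are invented for illustration (only split_text, chunk_size, and seed are named in the release notes); consult the shipped config.yaml for the real schema.

```yaml
# Illustrative only - key names are hypothetical, not taken from the project.
server:
  host: 0.0.0.0
  port: 8003            # assumed port
generation_defaults:
  split_text: true      # parameter names from the release notes
  chunk_size: 300
  seed: 42              # fixed seed for consistent predefined/cloned voices
```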