This is a generative-AI audiobook creation tool that supports a growing list of text-to-speech models with zero-shot voice cloning:
- VibeVoice 1.5B
- Chatterbox TTS
- Fish OpenAudio S1-mini
- Higgs Audio V2
- Oute TTS
The app features a number of quality control measures designed to mitigate the inherently variable nature of generative text-to-speech models:
- Rational segmentation of long text at paragraph/sentence/phrase boundaries, as needed
- Detection and correction of inference errors and hallucinations using speech-to-text comparison against the source text (sketched below)
- Semantically-aware modulation of caesuras between concatenated sound segments (think "prosody")
- Industry standard loudness normalization (EBU R 128)
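To illustrate the speech-to-text check, here is a minimal sketch of the idea using faster-whisper (the transcription library the app uses); the text normalization and the 0.9 threshold are assumptions for illustration, not the app's actual logic:

```python
from difflib import SequenceMatcher

from faster_whisper import WhisperModel

# Model size is an arbitrary choice for this example
model = WhisperModel("base.en")

def transcription_matches(wav_path: str, source_text: str, threshold: float = 0.9) -> bool:
    """Transcribe a generated segment and compare it to the source text.

    A low similarity ratio signals a likely hallucination, repetition,
    or dropped phrase, and the segment can be regenerated.
    The threshold value here is a made-up example.
    """
    segments, _info = model.transcribe(wav_path)
    transcript = " ".join(segment.text for segment in segments)

    def normalize(text: str) -> list[str]:
        return "".join(c.lower() for c in text if c.isalnum() or c.isspace()).split()

    ratio = SequenceMatcher(None, normalize(source_text), normalize(transcript)).ratio()
    return ratio >= threshold
```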
Plain-vanilla interactive console interface.
The app embeds text and timing information into the metadata of the FLAC and M4A files it generates, allowing the included web app to display the audiobook's text in sync with the generated audio (similar to Kindle+Audible or the Google Play Books app). The web app can be launched directly from the HTML source (no need for a web server), or from the mapped github.io URL.
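For a sense of how this kind of embedding can work, here is a hedged sketch using mutagen to store text-plus-timing data in a FLAC file's Vorbis comments; the tag name and JSON layout are illustrative assumptions, not the app's actual schema:

```python
import json

from mutagen.flac import FLAC

def embed_sync_metadata(flac_path: str, segments: list[dict]) -> None:
    """Store text segments with start/end times as a custom Vorbis comment.

    Each item in `segments` looks like {"text": "...", "start": 1.25, "end": 3.8}.
    "SYNC_TEXT" is a hypothetical tag name, not the tool's real schema.
    """
    audio = FLAC(flac_path)
    audio["SYNC_TEXT"] = json.dumps(segments)
    audio.save()

def read_sync_metadata(flac_path: str) -> list[dict]:
    audio = FLAC(flac_path)
    return json.loads(audio["SYNC_TEXT"][0])
```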
Example outputs, all using the same source text and same 15-second voice clone sample:
- Online example using Oute
- Online example using Fish OpenAudio S1-mini
- Online example using Chatterbox
- Online example using VibeVoice 1.5B
- Online example using Higgs Audio V2
- Online example using Higgs Audio V2 (a different voice this time, high temperature)
Using speech-to-text, the app is able to embed its custom metadata into pre-existing (i.e., professionally produced) audiobook files so that they can be opened and used with the custom player/reader. Select Options > Enhance existing audiobook, and select your source audiobook file (typically M4A or M4B) and the corresponding book text. This feature is experimental.
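Conceptually, the enhancement step boils down to recovering word timings via speech-to-text and mapping them back onto the book text. A rough sketch of the first half of that idea, assuming faster-whisper word timestamps (the alignment against the book text is omitted):

```python
from faster_whisper import WhisperModel

model = WhisperModel("base.en")  # model size is an arbitrary choice

def rough_word_timings(audio_path: str) -> list[tuple[str, float, float]]:
    """Return (word, start, end) tuples for an existing audiobook file.

    A real implementation must then align these against the book text
    while tolerating transcription errors; this only collects raw timings.
    """
    segments, _info = model.transcribe(audio_path, word_timestamps=True)
    return [
        (word.word.strip(), word.start, word.end)
        for segment in segments
        for word in (segment.words or [])
    ]
```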
- Assign a working project directory.
- Select a short reference audio clip for the voice clone.
- Select the source text.
- Optionally define file split points.
- Start inferencing, and ... be prepared to do some waiting.
- Concatenate the generated audio segments to create the final FLAC or M4A file(s).
- Optionally use the aforementioned web player/reader to play/read your audiobook.
First, ffmpeg must be in your system path.
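A quick way to verify this from Python:

```python
import shutil

# Prints the resolved ffmpeg executable path, or None if ffmpeg is not on the path
print(shutil.which("ffmpeg"))
```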
Clone the repository and cd into it:
git clone tts-audiobook-tool
cd tts-audiobook-tool
A separate virtual environment must be created for each model you want to use. Perform the operations described in one or more of the sections below (and refer to the respective TTS model's GitHub project page for further guidance as needed). Model-specific options will be enabled automatically in the app based on which virtual environment is active.
In all cases, the CUDA flavor of torch requires an extra install step in the typical manner. First uninstall torch:
pip uninstall torch torchvision torchaudio -y
Then install the CUDA version of torch in its place (see the PyTorch install page).
Finally, run the app by entering:
python -m tts_audiobook_tool
Initialize a Python v3.11 (not 3.12) virtual environment named "venv-vibevoice". For example:
path\to\python3.11\python.exe -m venv venv-vibevoice
Activate the virtual environment:
venv-vibevoice\Scripts\activate.bat
Install dependencies:
pip install -r requirements-vibevoice.txt
Note that because Microsoft has (temporarily?) removed the source code from their GitHub repo, we are currently pulling from a third-party archived version.
Uninstall the vanilla version of torch:
pip uninstall torch
Install torch 2.6 for CUDA v12.6:
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu126
Finally, install Flash Attention and Triton. The procedure for doing so varies by operating system. On Windows, I'm using wheels with the following filenames:
flash_attn-2.7.4+cu126torch2.6.0cxx11abiFALSE-cp311-cp311-win_amd64.whl
triton-3.0.0-cp311-cp311-win_amd64.whl
Initialize a Python v3.12 virtual environment named "venv-chatterbox". For example:
path\to\python3.12\python.exe -m venv venv-chatterbox
Activate the virtual environment:
venv-chatterbox\Scripts\activate.bat
Install dependencies:
pip install -r requirements-chatterbox.txt
ℹ️ Note! Higgs V2 pretty much requires 24 GB of VRAM (yes, really)
Initialize a Python v3.12 virtual environment named "venv-higgs". For example:
path\to\python3.12\python.exe -m venv venv-higgs
Activate the virtual environment:
venv-higgs\Scripts\activate.bat
Install dependencies:
pip install -r requirements-higgs.txt
Note that the above requirements file draws from a personal fork of the higgs-audio library because the original repo is currently missing some needed __init__.py files.
Initialize a Python v3.12 virtual environment named "venv-fish". For example:
path\to\python3.12\python.exe -m venv venv-fish
Activate the virtual environment:
venv-fish\Scripts\activate.bat
Install dependencies:
pip install -r requirements-fish.txt
And if using CUDA on Windows, also do:
pip install triton-windows
And then, two extra steps:
You have to opt in to gain access to the Fish/OpenAudio model by visiting the FishAudio Hugging Face page, using a logged-in HF account.
Then, generate a Hugging Face access token and paste the token at the command line after entering:
huggingface-cli login
Initialize a Python v3.12 virtual environment named "venv-oute". For example:
path\to\python3.12\python.exe -m venv venv-oute
Activate the virtual environment:
venv-oute\Scripts\activate.bat
Install dependencies:
pip install -r requirements-oute.txt
Running the app optimally with Oute TTS requires extra steps because the model supports multiple backends, model sizes, and quantizations. You will need to review and hand-edit the Python file config_oute.py accordingly (a hedged example follows the recommendations below).
The OuteTTS GitHub project page documents these various options, but here are some recommendations based on my own testing...
Nvidia (CUDA) cards:
Prefer the ExLlama2 backend if at all possible: backend=outetts.Backend.EXL2 (see the example Oute config in config_oute.py). However, this requires successfully installing three extra things into the environment:
- The exllamav2 library (pip install exllamav2)
- Flash Attention
- Triton (e.g., on Windows, pip install triton-windows)
Alternatively, Backend.HF is also hardware accelerated but considerably slower. I couldn't get acceleration going using Backend.LLAMACPP, but I'm not sure if that's just me.
Mac with Apple silicon:
Use Backend.LLAMACPP.
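To make the above concrete, a backend selection in config_oute.py might look roughly like this; apart from the Backend values mentioned above, the field and enum names are assumptions drawn from the OuteTTS examples, so defer to the actual file and the OuteTTS docs:

```python
import outetts

# Illustrative only: field and enum names other than the Backend values
# are assumptions -- see config_oute.py and the OuteTTS project page.
config = outetts.ModelConfig.auto_config(
    model=outetts.Models.VERSION_1_0_SIZE_1B,  # assumed model enum
    backend=outetts.Backend.EXL2,              # CUDA: prefer EXL2
    # backend=outetts.Backend.LLAMACPP,        # Apple silicon
)
interface = outetts.Interface(config=config)
```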
The app saves its state between sessions, so you can interrupt the program at any time and resume later (important due to how long generating a full-length novel can take).
Additionally, setting chapter cut points can be useful for generating and exporting a long work in manageable chunks over time, allowing you to use early chapter files before the full text is completed.
Note too that it's possible to utilize different voices and even different models while generating the audio segments for a given project.
When prepping reference audio for voice cloning, it's worthwhile to prepare three or so different sound samples from a given source (not just one), and then test each one in turn on a short passage of the intended text, since the quality and characteristics of each voice clone (as well as the word error rate) can vary quite a bit even from the same source.
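Since ffmpeg is already a requirement, cutting a few candidate clips from a longer recording is easy to script; the filename and offsets below are arbitrary examples:

```python
import subprocess

# Cut three ~15-second candidate reference clips from one source recording.
# Offsets are arbitrary examples; pick clean, representative passages.
for i, start in enumerate(("0:30", "2:10", "5:45"), start=1):
    subprocess.run(
        ["ffmpeg", "-y", "-ss", start, "-t", "15",
         "-i", "narration.wav", f"ref-clip-{i}.wav"],
        check=True,
    )
```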
These are my anecdotal inference speeds. For inference, the app adopts each respective model's reference implementation logic as much as possible.
| TTS Model | Hardware | Speed | Notes |
|---|---|---|---|
| VibeVoice 1.5B | RTX 3080 Ti | ~120% realtime | with Flash Attention 2 enabled |
| Higgs V2 3B | RTX 4090 | 200+% realtime | inference speed is inversely proportional to voice sample duration, FYI |
| Higgs V2 3B | RTX 3080 Ti | N/A | does not fit in 12 GB VRAM |
| Fish OpenAudio S1-mini | RTX 3080 Ti | 500+% realtime | best combination of inference speed and output quality, IMO |
| Chatterbox | RTX 3080 Ti | ~130% realtime | |
| Chatterbox | MacBook Pro M1 (MPS) | ~20% realtime | |
| Oute | RTX 3080 Ti | ~85% realtime | using outetts.Backend.EXL2 |
| Oute | MacBook Pro M1 (MPS) | ~20% realtime | using outetts.Backend.LLAMACPP |
2025-09-12
Added support for VibeVoice 1.5B.
2025-08-10
Migrated from openai-whisper to faster-whisper (faster, less memory, equivalent accuracy).
2025-08-06
Added support for Higgs Audio V2 (3B base model).
2025-07-18
Added support for Fish OpenAudio S1-mini
Added logic to detect dropped phrases at the end of generated audio segments (a common occurrence with the Fish model)
Added utility to transcode and concatenate a directory of MP3 chapter files to M4A (meant for use with the "Enhance existing audiobook" tool)
2025-07-02
New feature: Real-time generation and playback (Options > Real-time generation and playback)
This serves as a quicker and more "casual" alternative to the regular Generate audio UI flow, and allows for more-or-less immediate and uninterrupted audio playback (contingent on system performance, of course). It employs the same quality control measures except for loudness normalization, and does not save its output.
2025-06-28 (many)
Generated audio segments now have silence trimmed off the ends, and in the concatenation step, stitched-together lines have pauses of varying lengths inserted at paragraph, sentence, and phrase boundaries, resulting in much improved prosody / flow.
Loudness normalization is now being applied correctly (on the final audio file instead of per audio segment)
Better detection of undesired repeating phrases (Oute especially)
Better detection and fix for spooky Chatterbox noises at the end of prompts
Short 1-2 word sentences now get grouped with adjacent sentences to minimize Chatterbox and Oute's issues with short prompts
Streamlined handling of audio data throughout
Encoding audiobook files in AAC/M4A format no longer requires intermediate FLAC step
Streamlined some UI
Some improvements to the web player/reader