🎙️ Piper TTS Forge (tested on Debian Linux variants; Windows instructions are theoretical guidelines only)
A streamlined toolkit for training custom Neural Text-to-Speech (TTS) voices using Piper.
This project automates the most painful parts of voice cloning:
- Automatic slicing and transcription using OpenAI Whisper
- Dataset formatting for Piper
- Training, checkpoint management, and export
- A real-time dashboard to listen to your model as it learns
Read this before running `2_slice_and_transcribe.py`.
Training checkpoints are large, and backups (Script 8) duplicate the entire training folder, so plan for extra disk space.
The slicer uses the Whisper “large” model by default for maximum transcription accuracy.
- Requirement: ~10 GB VRAM or more (RTX 3080 / 4070 or better).
For GPUs with less VRAM (RTX 3060, 2060, GTX 1080, etc.):
Switch Whisper to the medium model to avoid crashes:
```python
# Edit 2_slice_and_transcribe.py
# FROM:
model = whisper.load_model("large", device=device)
# TO:
model = whisper.load_model("medium", device=device)
```

The medium model is faster, uses less VRAM, and is ~95% as accurate.
Your directory must look like this before starting. You will manually create the piper/ folder in the next step.
```
.
├── piper/            # <--- YOU MUST BUILD THIS MANUALLY
│   ├── piper         # The executable file (piper.exe on Windows)
│   └── src/          # The Python source code folder
├── raw_audio/        # Put your long .wav / .mp3 files here
├── config.py         # <-- EDIT THIS FIRST
├── environment.yml
└── [1-8]_*.py        # Automation scripts
```
Linux (Ubuntu / Debian):

```
sudo apt-get install espeak-ng g++
```

Windows (theoretical guideline, not tested):
- Visual Studio C++ Build Tools
- eSpeak-NG (Install and ensure it is in your system PATH)
```
conda env create -f environment.yml
conda activate piper-trainer
```

This project requires both the engine (to run audio) and the source code (to train). You must download two separate files and merge them.
Step A: Get the Executable
- Go to the Piper Releases Page.
- Download the compressed file for your OS (e.g., `piper_windows_amd64.zip` or `piper_linux_x86_64.tar.gz`).
- Extract it. You should now have a folder named `piper` containing the executable.
- Place this `piper` folder in the root of this project.
Step B: Get the Source Code
- On the same Releases page, scroll to the Assets section.
- Download Source code (zip).
- Extract it. You will see a folder like `piper-2023.11.14-2`.
- Locate the `src` folder inside.
Step C: Merge Them
- Copy the `src` folder from Step B.
- Paste it inside your `piper` folder from Step A.
Your folder structure should now match the diagram above.
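If you prefer to script Step C, a minimal sketch is below. The extracted source folder name (`piper-2023.11.14-2`) is only an example from the release above; adjust it to whatever you actually extracted.

```python
# Minimal sketch of Step C: copy the src/ folder from the extracted source archive
# into the piper/ folder that holds the executable. Folder names are examples only.
import shutil
from pathlib import Path

source_src = Path("piper-2023.11.14-2") / "src"   # from Step B (adjust the name)
engine_dir = Path("piper")                        # from Step A

shutil.copytree(source_src, engine_dir / "src")   # fails if piper/src already exists
print("Merged into:", (engine_dir / "src").resolve())
```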
Open `config.py` and set your `VOICE_NAME`. Then run:

```
python 1_setup.py
```

This script verifies your folder structure. If you are missing the Base Model (checkpoint), it will provide the URL to download it manually.
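For reference, a rough, illustrative equivalent of those structural checks (the real `1_setup.py` may verify more, such as the base checkpoint):

```python
# Illustrative only: verify the layout from the directory diagram above.
# On Windows the executable would be piper/piper.exe instead of piper/piper.
from pathlib import Path

required = [Path("piper") / "piper", Path("piper") / "src", Path("raw_audio"), Path("config.py")]
missing = [str(p) for p in required if not p.exists()]
print("Structure OK" if not missing else f"Missing: {', '.join(missing)}")
```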
- Drop recordings into `raw_audio/`
- Format: WAV, MP3, FLAC, M4A
- Length: 15–60 minutes total (see the duration check below)
- Quality: Single speaker, no music, minimal background noise
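To check the total length before slicing, a quick sketch (assumes `pydub` and `ffmpeg` are installed; they are not necessarily part of `environment.yml`):

```python
# Hypothetical helper: sum the duration of every recording in raw_audio/.
# Requires pydub + ffmpeg, which are not part of this project's scripts.
from pathlib import Path
from pydub import AudioSegment

total_ms = 0
for f in Path("raw_audio").iterdir():
    if f.suffix.lower() in {".wav", ".mp3", ".flac", ".m4a"}:
        total_ms += len(AudioSegment.from_file(str(f)))

print(f"Total audio: {total_ms / 60000:.1f} minutes")
```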
```
python 2_slice_and_transcribe.py
```

Inspect `dataset/metadata.csv` and remove junk lines (e.g., "Copyright", "Subtitle").
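If you would rather not prune `metadata.csv` entirely by hand, a small filter like the sketch below can drop the obvious junk. It assumes the pipe-delimited `id|text` layout Piper expects, and the keyword list is only an example, so always review the result manually.

```python
# Hypothetical cleanup pass for dataset/metadata.csv (pipe-delimited lines).
# The junk keywords are examples; review the remaining lines by hand afterwards.
from pathlib import Path

path = Path("dataset/metadata.csv")
junk = ("copyright", "subtitle")

lines = path.read_text(encoding="utf-8").splitlines()
kept = [line for line in lines if not any(k in line.lower() for k in junk)]

path.with_suffix(".bak").write_text("\n".join(lines) + "\n", encoding="utf-8")  # safety copy
path.write_text("\n".join(kept) + "\n", encoding="utf-8")
print(f"Removed {len(lines) - len(kept)} junk lines")
```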
```
python 3_preprocess.py
```

Converts audio and text into Piper-ready tensors.
```
python 4_train.py
```

Press Ctrl+C to pause safely. Run the script again to resume.
While training runs in one terminal, open another and run:
```
python 5_dashboard.py
```

Generates an audio file named:
👉 preview_progress.wav 👈
Listen frequently—it updates automatically as training progresses.
Listen to `preview_progress.wav`.
If at the "Sweet Spot," stop training (Ctrl+C) and run:

```
python 8_checkpoint_manager.py
```

Select Option 1 (Backup). To restore if overfitting occurs, run the script again and choose Restore.
```
python 6_export.py
```

Final files appear in `final_models/`.
```
python 7_talk.py
```

Pipeline overview:

```mermaid
graph TD;
    A[🎤 Raw Audio] -->|Script 2| B(🔪 Slicer & Whisper);
    B -->|Generates| C[📂 Dataset & Metadata];
    C -->|Script 3| D(⚙️ Preprocessing);
    D -->|Generates| E[🔢 Tensors];
    E -->|Script 4| F(🚂 Training Loop);
    F -->|Script 5| G(📡 Dashboard / Preview);
    G -->|Stop Training| H(🛡️ Backup Manager);
    H -->|Script 6| I(📦 Export ONNX);
    I -->|Script 7| J(🗣️ Inference);
```
| Stage | Epochs (Approx) | Sound Characteristics | Action |
|---|---|---|---|
| Warmup | 0 - 500 | Muffled, skipping words, noise static | Keep Going |
| Learning | 500 - 1500 | Recognizable voice, lacks cadence | Monitor |
| Sweet Spot | 1500 - 3500 | Clear, emotional, good breathing, natural | STOP & BACKUP |
| Overfit | 4000+ | Metallic buzz, robotic pitch | Restore Backup |
- CUDA Out of Memory: Lower `BATCH_SIZE` in `config.py` (16 → 8 → 4); see the example below.
- "Piper source code not found": Ensure `piper/src/` exists. You likely forgot to merge the Source Code into the binary folder.
- Voice sounds metallic: Overfitted; restore an earlier backup.
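The `BATCH_SIZE` fix above is a one-line edit; an illustrative excerpt (the rest of your `config.py` will differ):

```python
# config.py (excerpt, illustrative): halve the batch size until training fits in VRAM.
BATCH_SIZE = 8   # was 16; drop to 4 if CUDA still runs out of memory
```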
Install Conda Env:

```
conda env create -f environment_bigvgan.yml
```

(Steps 9–11) `9_vocoder_setup.py` -> `11_inference_studio.py`
This section is optional and intended for advanced experimentation. It adds a neural vocoder stage (BigVGAN) to improve audio realism beyond Piper’s native output.
Purpose:
- To enhance breath, transient clarity, and natural timbre
Workflow:
- Export Piper’s acoustic model (mel spectrograms)
- Prepare your dataset of raw audio corresponding to the mel outputs
- Run `9_vocoder_setup.py` / `9_vocoder_setup_finetune.py` to fetch or fine-tune BigVGAN
- Train the vocoder (`10_train_vocoder.py`) while monitoring `preview_progress.wav`
- Bridge mel → waveform in inference (Script 11); see the sketch after this list
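As a rough illustration of that last bridging step (the real implementation lives in Script 11), a BigVGAN-style generator maps a mel tensor to a waveform. The names and shapes below are placeholders, not this project's actual API:

```python
import torch

# Placeholders (assumptions, not this project's API): in Script 11, `vocoder` would be
# your fine-tuned BigVGAN generator and `mel` the Piper acoustic model's output.
vocoder = torch.nn.Identity()          # stand-in so the sketch runs end to end
mel = torch.randn(1, 80, 200)          # dummy [batch, n_mels, frames] mel tensor

with torch.inference_mode():
    wav = vocoder(mel)                 # a real vocoder returns a waveform tensor here
audio = wav.squeeze().clamp(-1.0, 1.0).cpu().numpy()
```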
Notes:
- This is experimental: results may vary depending on dataset size and GPU resources
- Phase separation ensures Piper focuses on linguistic modeling, while BigVGAN learns waveform realism
- You can pause / resume fine-tuning without losing progress
- Testing shows that batch size makes a real difference in stability and quality; adjust it until you fill roughly 80% of your VRAM (the snippet below shows one way to check usage)
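One quick way to check how close you are to that 80% target while training runs in another shell (or just watch `nvidia-smi`):

```python
# Device-wide VRAM usage on GPU 0; works from a separate Python shell while training runs.
import torch

free, total = torch.cuda.mem_get_info(0)
used = total - free
print(f"Using {used / 1e9:.1f} of {total / 1e9:.1f} GB ({used / total:.0%})")
```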
Goal: Produce studio-quality, realistic TTS that surpasses Piper’s ceiling, while keeping the acoustic model intact.
This automation toolkit is open source. The Piper engine is MIT licensed (c) Rhasspy contributors.