An unofficial PyTorch implementation of "Autoregressive Speech Synthesis without Vector Quantization" paper. This repository provides a complete pipeline for training and inference of the MELLE model.
As shown in `librispeech_exp/melle/step_400000testclean_results.txt`, we used the Llama tokenizer and trained for 400,000 steps on the LibriSpeech-960h training set. On the cross-sentence evaluation of the 3–10 second subset of LibriSpeech test-clean, the model achieves a WER (Word Error Rate) of 5.69%, with speaker similarity scores of spk_sim_r = 0.5047 and spk_sim_o = 0.4890.
Sample audios can be found in the `librispeech_exp/melle/step_400000testclean_generate_samples` folder.
Vocoder link: https://huggingface.co/mechanicalsea/speecht5-tts
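To turn generated mel-spectrogram frames into audio, a SpeechT5-compatible HiFi-GAN vocoder can be used. Below is a minimal sketch, assuming the `transformers` `SpeechT5HifiGan` checkpoint (the link above hosts the original SpeechT5 TTS weights); `mel` is a placeholder for the model's output frames, not a variable from this repo:

```python
import torch
import soundfile as sf
from transformers import SpeechT5HifiGan

# Assumption: a HiFi-GAN vocoder trained on SpeechT5's 80-bin log-mel features.
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# mel: (num_frames, 80) log-mel spectrogram produced by MELLE (placeholder here).
mel = torch.zeros(200, 80)
with torch.no_grad():
    wav = vocoder(mel)  # 1-D waveform tensor at 16 kHz
sf.write("sample.wav", wav.squeeze().cpu().numpy(), 16000)
```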
NOTE: As suggested in the original paper, when training on a dataset of LibriSpeech's (relatively small) size, using a smaller vocabulary (such as phonemes) is expected to yield better results.
```bash
# Easy environment preparation
bash set_env.sh
```

Here is the detailed environment setup:

```bash
# Create conda environment
conda create -n melle python==3.10 -y
conda activate melle

# Install PyTorch and dependencies
pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip3 install -U xformers==0.0.29.post3 --index-url https://download.pytorch.org/whl/cu118

# Install additional requirements
pip install transformers numpy tqdm jiwer webrtcvad librosa
```
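A quick sanity check that the pinned versions installed correctly and that CUDA is visible:

```python
import torch
import transformers

print("torch:", torch.__version__)            # expect 2.6.0+cu118
print("cuda available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```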
Download LibriSpeech Dataset:

- Download and extract LibriSpeech train-clean-360 and train-clean-100
- Place the data in the appropriate directory structure
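If you prefer not to download the archives manually, `torchaudio`'s built-in dataset helper can fetch and extract LibriSpeech subsets. A sketch (the `data/` root is an assumption; adjust to your directory layout):

```python
import os
import torchaudio

os.makedirs("data", exist_ok=True)

# Downloads and extracts each subset under data/LibriSpeech/.
for subset in ["train-clean-100", "train-clean-360"]:
    torchaudio.datasets.LIBRISPEECH("data", url=subset, download=True)
```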
Prepare Training Data:

- The training data should be in JSONL format with `transcription` and `audio_path` fields (see the sketch after this list).
- You can use `generate_train_jsonl.py` to generate the training data from the LibriSpeech dataset.
- Example data files are provided in the `data/` directory:
  - `librispeech_train960.jsonl` - Training data (960 hours)
  - `librispeech_testclean_prompt.jsonl` - Test prompts
  - `librispeech_testclean_generate.jsonl` - Generation targets
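As referenced above, each line of the training JSONL holds one utterance. A minimal sketch of writing such a record; the exact path layout and any fields beyond the two required ones are assumptions:

```python
import json

# Hypothetical single training record with the two required fields.
record = {
    "audio_path": "LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac",
    "transcription": "chapter one missus rachel lynde is surprised",
}
with open("data/librispeech_train960.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```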
Use the provided shell script for easy training:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 --master_port=12345 \
    DDP_main.py \
    --train_json data/librispeech_train960.jsonl \
    --batch_frames 50000 \
    --save_dir librispeech_exp \
    --using_postnet \
    --norm layer \
    --transformer_activation relu \
    --prenet_activation relu \
    --postnet_activation tanh \
    --exp_name melle \
    --save_interval 50000
```

Key arguments:

- `--batch_frames`: Maximum frames per batch (controls memory usage)
- `--lr`: Learning rate (default: 5e-4)
- `--epochs`: Number of training epochs
- `--max_update_step`: Maximum training steps (default: 400,000)
- `--using_postnet`: Enable PostNet for enhanced quality
- `--using_rope`: Use RoPE instead of learned positional embeddings
- `--norm`: Normalization type (`rms` or `layer`)
- `--transformer_activation`: Activation function for the Transformer (`relu`, `silu`, `tanh`)
```python
from MELLE import MELLE
from transformers import LlamaTokenizerFast
import torch

# Load model
model = MELLE(
    using_rope=False,
    using_postnet=True,
    using_qwen2mlp=False,
    norm='layer',
    transformer_activation='relu',
    prenet_activation='relu',
    postnet_activation='tanh',
).to('cuda')

# Load checkpoint
checkpoint = torch.load('librispeech_exp/melle/step_400000.pt')
model.load_state_dict(checkpoint['model'])
model.eval()

# Generate speech
# prompt_mel: mel-spectrogram of the speaker prompt; txt: tokenized text to synthesize
outputs = model.inference(prompt_mel, txt, max_length=2000)
```
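The snippet above assumes `prompt_mel` and `txt` are already prepared. One way they might be built, as a sketch: the tokenizer checkpoint and the use of SpeechT5's 80-bin mel feature extractor (matching the linked vocoder) are assumptions, not necessarily what this repo's scripts do:

```python
import torch
import torchaudio
from transformers import LlamaTokenizerFast, SpeechT5FeatureExtractor

# Assumption: any Llama tokenizer checkpoint; the README only says "Llama tokenizer".
tokenizer = LlamaTokenizerFast.from_pretrained("meta-llama/Llama-2-7b-hf")
txt = tokenizer("prompt transcript followed by target text",
                return_tensors="pt").input_ids.to("cuda")

# Assumption: 80-bin log-mel features compatible with the SpeechT5 vocoder.
wav, sr = torchaudio.load("prompt.wav")
wav = torchaudio.functional.resample(wav, sr, 16000)
fe = SpeechT5FeatureExtractor.from_pretrained("microsoft/speecht5_tts")
feats = fe(audio_target=wav.squeeze(0).numpy(), sampling_rate=16000)
prompt_mel = torch.tensor(feats.input_values[0]).unsqueeze(0).to("cuda")  # (1, frames, 80)
```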
```bash
# Test on clean test set
python infer_test_clean.py

# Compute metrics
python infer_compute_metrics.py
```

The model supports various architectural choices:
Normalization:

- `rms`: RMSNorm (Qwen2-style)
- `layer`: LayerNorm

Activations:

- `relu`: ReLU activation
- `silu`: SiLU/Swish activation
- `tanh`: Tanh activation

Position Encoding:

- `--using_rope`: Rotary Position Embedding (RoPE)
- Default: learned positional embeddings

Architecture Options:

- `--using_postnet`: Enable convolutional PostNet
- `--using_qwen2mlp`: Use Qwen2-style MLP layers
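For example, the Qwen2-flavored combination of these options maps directly onto the constructor arguments shown in the inference example. A sketch; the defaults for any unlisted arguments are assumptions:

```python
from MELLE import MELLE

# RoPE positions, RMSNorm, SiLU activations, Qwen2-style MLP, no PostNet.
model = MELLE(
    using_rope=True,
    using_postnet=False,
    using_qwen2mlp=True,
    norm='rms',
    transformer_activation='silu',
    prenet_activation='relu',
    postnet_activation='tanh',
)
```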
We provide the results of the MELLE model on the LibriSpeech dataset in the `librispeech_exp/melle/` directory.
If you find this work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation:
```bibtex
@article{meng2024melle,
  title={Autoregressive speech synthesis without vector quantization},
  author={Meng, Lingwei and Zhou, Long and Liu, Shujie and Chen, Sanyuan and Han, Bing and Hu, Shujie and Liu, Yanqing and Li, Jinyu and Zhao, Sheng and Wu, Xixin and others},
  journal={arXiv preprint arXiv:2407.08551},
  year={2024}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
