An unofficial PyTorch implementation of "Autoregressive Speech Synthesis without Vector Quantization" paper. This repository provides a complete pipeline for training and inference of the MELLE model.
As shown in `librispeech_exp/melle/step_400000testclean_results.txt`, we used the Llama tokenizer and trained for 400,000 steps on the LibriSpeech-960h training set. On the cross-sentence evaluation of the 3–10 second subset of LibriSpeech test-clean, the model achieves a WER (Word Error Rate) of 5.69%, with speaker similarity scores of spk_sim_r = 0.5047 and spk_sim_o = 0.4890.
Sample audios can be found in the `librispeech_exp/melle/step_400000testclean_generate_samples` folder.
Vocoder link: https://huggingface.co/mechanicalsea/speecht5-tts
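To turn generated mel-spectrogram frames into audio, a SpeechT5-compatible HiFi-GAN vocoder can be used. Below is a minimal sketch, assuming the `transformers` `SpeechT5HifiGan` checkpoint (the link above hosts the original SpeechT5 TTS weights); `mel` is a placeholder for the model's output frames, not a variable from this repo:

```python
import torch
import soundfile as sf
from transformers import SpeechT5HifiGan

# Assumption: a HiFi-GAN vocoder trained on SpeechT5's 80-bin log-mel features.
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# mel: (num_frames, 80) log-mel spectrogram produced by MELLE (placeholder here).
mel = torch.zeros(200, 80)
with torch.no_grad():
    wav = vocoder(mel)  # 1-D waveform tensor at 16 kHz
sf.write("sample.wav", wav.squeeze().cpu().numpy(), 16000)
```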
NOTE: As suggested in the original paper, when training on a dataset of LibriSpeech's (relatively small) size, using a smaller vocabulary (such as phonemes) is expected to yield better results.
```bash
# Easy environment preparation
bash set_env.sh
```

Here is the detailed environment setup:

```bash
# Create conda environment
conda create -n melle python==3.10 -y
conda activate melle

# Install PyTorch and dependencies
pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip3 install -U xformers==0.0.29.post3 --index-url https://download.pytorch.org/whl/cu118

# Install additional requirements
pip install transformers numpy tqdm jiwer webrtcvad librosa
```
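A quick sanity check that the pinned versions installed correctly and that CUDA is visible:

```python
import torch
import transformers

print("torch:", torch.__version__)            # expect 2.6.0+cu118
print("cuda available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```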
Download LibriSpeech Dataset:

- Download and extract LibriSpeech train-clean-360 and train-clean-100
- Place the data in the appropriate directory structure
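If you prefer not to download the archives manually, `torchaudio`'s built-in dataset helper can fetch and extract LibriSpeech subsets. A sketch (the `data/` root is an assumption; adjust to your directory layout):

```python
import os
import torchaudio

os.makedirs("data", exist_ok=True)

# Downloads and extracts each subset under data/LibriSpeech/.
for subset in ["train-clean-100", "train-clean-360"]:
    torchaudio.datasets.LIBRISPEECH("data", url=subset, download=True)
```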
Prepare Training Data:

- The training data should be in JSONL format with `transcription` and `audio_path` fields (see the sketch after this list).
- You can use `generate_train_jsonl.py` to generate the training data from the LibriSpeech dataset.
- Example data files are provided in the `data/` directory:
  - `librispeech_train960.jsonl` - Training data (960 hours)
  - `librispeech_testclean_prompt.jsonl` - Test prompts
  - `librispeech_testclean_generate.jsonl` - Generation targets
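As referenced above, each line of the training JSONL holds one utterance. A minimal sketch of writing such a record; the exact path layout and any fields beyond the two required ones are assumptions:

```python
import json

# Hypothetical single training record with the two required fields.
record = {
    "audio_path": "LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac",
    "transcription": "chapter one missus rachel lynde is surprised",
}
with open("data/librispeech_train960.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```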
Use the provided shell script for easy training:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 --master_port=12345 \
    DDP_main.py \
    --train_json data/librispeech_train960.jsonl \
    --batch_frames 50000 \
    --save_dir librispeech_exp \
    --using_postnet \
    --norm layer \
    --transformer_activation relu \
    --prenet_activation relu \
    --postnet_activation tanh \
    --exp_name melle \
    --save_interval 50000
```

Key arguments:

- `--batch_frames`: Maximum frames per batch (controls memory usage)
- `--lr`: Learning rate (default: 5e-4)
- `--epochs`: Number of training epochs
- `--max_update_step`: Maximum training steps (default: 400,000)
- `--using_postnet`: Enable PostNet for enhanced quality
- `--using_rope`: Use RoPE instead of learned positional embeddings
- `--norm`: Normalization type (`rms` or `layer`)
- `--transformer_activation`: Activation function for the Transformer (`relu`, `silu`, `tanh`)
```python
from MELLE import MELLE
from transformers import LlamaTokenizerFast
import torch

# Load model
model = MELLE(
    using_rope=False,
    using_postnet=True,
    using_qwen2mlp=False,
    norm='layer',
    transformer_activation='relu',
    prenet_activation='relu',
    postnet_activation='tanh',
).to('cuda')

# Load checkpoint
checkpoint = torch.load('librispeech_exp/melle/step_400000.pt')
model.load_state_dict(checkpoint['model'])
model.eval()

# Generate speech
# prompt_mel: mel-spectrogram of the speaker prompt; txt: tokenized text to synthesize
outputs = model.inference(prompt_mel, txt, max_length=2000)
```
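The snippet above assumes `prompt_mel` and `txt` are already prepared. One way they might be built, as a sketch: the tokenizer checkpoint and the use of SpeechT5's 80-bin mel feature extractor (matching the linked vocoder) are assumptions, not necessarily what this repo's scripts do:

```python
import torch
import torchaudio
from transformers import LlamaTokenizerFast, SpeechT5FeatureExtractor

# Assumption: any Llama tokenizer checkpoint; the README only says "Llama tokenizer".
tokenizer = LlamaTokenizerFast.from_pretrained("meta-llama/Llama-2-7b-hf")
txt = tokenizer("prompt transcript followed by target text",
                return_tensors="pt").input_ids.to("cuda")

# Assumption: 80-bin log-mel features compatible with the SpeechT5 vocoder.
wav, sr = torchaudio.load("prompt.wav")
wav = torchaudio.functional.resample(wav, sr, 16000)
fe = SpeechT5FeatureExtractor.from_pretrained("microsoft/speecht5_tts")
feats = fe(audio_target=wav.squeeze(0).numpy(), sampling_rate=16000)
prompt_mel = torch.tensor(feats.input_values[0]).unsqueeze(0).to("cuda")  # (1, frames, 80)
```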
```bash
# Test on clean test set
python infer_test_clean.py

# Compute metrics
python infer_compute_metrics.py
```

The model supports various architectural choices:
Normalization:

- `rms`: RMSNorm (Qwen2-style)
- `layer`: LayerNorm

Activations:

- `relu`: ReLU activation
- `silu`: SiLU/Swish activation
- `tanh`: Tanh activation

Position Encoding:

- `--using_rope`: Rotary Position Embedding (RoPE)
- Default: learned positional embeddings

Architecture Options:

- `--using_postnet`: Enable convolutional PostNet
- `--using_qwen2mlp`: Use Qwen2-style MLP layers
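For example, the Qwen2-flavored combination of these options maps directly onto the constructor arguments shown in the inference example. A sketch; the defaults for any unlisted arguments are assumptions:

```python
from MELLE import MELLE

# RoPE positions, RMSNorm, SiLU activations, Qwen2-style MLP, no PostNet.
model = MELLE(
    using_rope=True,
    using_postnet=False,
    using_qwen2mlp=True,
    norm='rms',
    transformer_activation='silu',
    prenet_activation='relu',
    postnet_activation='tanh',
)
```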
We provide the results of the MELLE model on the LibriSpeech dataset in the `librispeech_exp/melle/` directory.
If you find this work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation:
```bibtex
@article{meng2024melle,
  title={Autoregressive speech synthesis without vector quantization},
  author={Meng, Lingwei and Zhou, Long and Liu, Shujie and Chen, Sanyuan and Han, Bing and Hu, Shujie and Liu, Yanqing and Li, Jinyu and Zhao, Sheng and Wu, Xixin and others},
  journal={arXiv preprint arXiv:2407.08551},
  year={2024}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
