# DiffSinger - PyTorch Implementation

PyTorch implementation of [DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis](https://arxiv.org/abs/2105.02446) (TTS Extension).

<p align="center">
  <img src="img/model_1.png" width="80%">
</p>

<p align="center">
  <img src="img/model_2.png" width="80%">
</p>

# Status (2021.06.03)
- [x] Naive Version of DiffSinger
- [ ] Shallow Diffusion Mechanism: training a boundary predictor by leveraging the pre-trained auxiliary decoder + training the denoiser with `k` as the maximum time step (see the sketch below)
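
For context on the second item: in the shallow mechanism the denoiser is trained only up to a boundary step `k` rather than the full schedule. Below is a minimal, generic sketch of the DDPM forward (noising) process with such a cap; all names (`betas`, `q_sample`) and the schedule values are illustrative assumptions, not this repository's actual diffusion module.

```python
import torch

# Standard DDPM forward (noising) process, truncated at a maximum step k.
T = 1000                                   # full number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.06, T)      # noise schedule (assumed linear)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) x_0, (1 - a_bar_t) I)."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Shallow training: draw t only from [0, k) instead of [0, T).
k = 100                                    # assumed boundary step
x0 = torch.randn(8, 80, 200)               # e.g. a batch of mel-spectrograms
t = torch.randint(0, k, (x0.size(0),))
x_t = q_sample(x0, t)                      # noisy input fed to the denoiser
```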

# Quickstart

## Dependencies
You can install the Python dependencies with
```
pip3 install -r requirements.txt
```

## Inference

You have to download the [pretrained models]() and put them in ``output/ckpt/LJSpeech/``.

For English single-speaker TTS, run
```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```
The generated utterances will be put in ``output/result/``.

## Batch Inference
Batch inference is also supported; try

```
python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```
to synthesize all utterances in ``preprocessed_data/LJSpeech/val.txt``.

## Controllability
The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios.
For example, the following command shortens the predicted durations to 80 % (i.e., faster speech) and scales the energy down to 80 % (i.e., lower volume):

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --duration_control 0.8 --energy_control 0.8
```
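
Internally, such ratios are typically applied as simple multiplicative scaling inside a FastSpeech2-style variance adaptor. Below is a minimal sketch of that idea; the function and variable names (`apply_controls`, `log_duration`, etc.) are illustrative assumptions, not this repository's exact interfaces.

```python
import torch

def apply_controls(log_duration, pitch, energy,
                   d_control=1.0, p_control=1.0, e_control=1.0):
    """Scale predicted prosody by user-given ratios (illustrative sketch)."""
    # Durations are predicted in log scale: exponentiate, scale by the ratio,
    # then round to integer frame counts per phoneme.
    duration = torch.clamp(torch.round(torch.exp(log_duration) * d_control), min=0)
    return duration.long(), pitch * p_control, energy * e_control

# Example: durations scaled to 0.8x (faster speech), energy scaled to 0.8x.
log_d = torch.randn(1, 12)    # per-phoneme log-durations
pitch = torch.randn(1, 12)    # per-phoneme pitch values
energy = torch.randn(1, 12)   # per-phoneme energy values
d, p, e = apply_controls(log_d, pitch, energy, d_control=0.8, e_control=0.8)
```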

# Training

## Datasets

The supported datasets are

- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
- (more to be added)

## Preprocessing

First, run
```
python3 prepare_align.py config/LJSpeech/preprocess.yaml
```
for some preparations.

As described in the paper, [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Alignments for the LJSpeech dataset are provided [here](https://drive.google.com/drive/folders/1DBRkALpPd6FL9gjHMmMEdHODmkgNIIK4?usp=sharing) from [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2).
You have to unzip the files into ``preprocessed_data/LJSpeech/TextGrid/``.

After that, run the preprocessing script by
```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```

Alternatively, you can align the corpus yourself.
Download the official MFA package and run
```
./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech
```
or
```
./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech
```

to align the corpus, and then run the preprocessing script:
```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```
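
For reference, the preprocessing step converts the MFA alignments into per-phoneme durations measured in mel frames. Below is a minimal sketch using the `tgt` library (as in ming024's FastSpeech2); the `phones` tier name, the sampling rate / hop size, and the example path are assumptions based on a typical LJSpeech setup and may differ from this repository's actual `preprocess.yaml`.

```python
import tgt  # pip install tgt

SAMPLING_RATE = 22050  # assumed value from config/LJSpeech/preprocess.yaml
HOP_LENGTH = 256       # assumed value from config/LJSpeech/preprocess.yaml

def get_durations(textgrid_path):
    """Convert the phone tier of a TextGrid into per-phoneme frame counts."""
    textgrid = tgt.io.read_textgrid(textgrid_path)
    tier = textgrid.get_tier_by_name("phones")
    phones, durations = [], []
    for interval in tier._objects:  # same access pattern as ming024's preprocessor
        phones.append(interval.text)
        start = int(round(interval.start_time * SAMPLING_RATE / HOP_LENGTH))
        end = int(round(interval.end_time * SAMPLING_RATE / HOP_LENGTH))
        durations.append(end - start)
    return phones, durations

# Illustrative path; adjust to wherever the TextGrid files were unzipped.
phones, durations = get_durations(
    "preprocessed_data/LJSpeech/TextGrid/LJSpeech/LJ001-0001.TextGrid"
)
```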

## Training

Train your model with
```
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

# TensorBoard

Use
```
tensorboard --logdir output/log/LJSpeech
```

to serve TensorBoard on your localhost.
The loss curves, synthesized mel-spectrograms, and audio samples are shown.
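
If you want to extend the logging, the entries above are the standard `torch.utils.tensorboard` calls. A minimal sketch with purely illustrative tags and dummy data (not the repository's actual tag names or logging code):

```python
import numpy as np
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("output/log/LJSpeech/train")
step = 1000

writer.add_scalar("Loss/mel_loss", 0.42, step)                      # loss curves
mel = np.random.rand(80, 200)                                       # dummy mel-spectrogram
writer.add_image("Spectrogram/synthesized", mel, step, dataformats="HW")
audio = np.random.uniform(-1, 1, 22050).astype(np.float32)          # 1 s of dummy audio
writer.add_audio("Audio/synthesized", audio, step, sample_rate=22050)
writer.close()
```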


# Implementation Issues

1. **Pitch extractor comparison (on LJ001-0006.wav)**

   <p align="center">
    <img src="img/pitch_extractor_comparison.png" width="100%">
   </p>

   **pyworld** is used to extract f0 (fundamental frequency) as pitch information in this implementation. Empirically, however, I found that all three methods were equally acceptable on clean datasets (e.g., LJSpeech), as shown in the figure above. Note that **pysptk** tends to work better on noisy datasets (as described in [STYLER](https://github.com/keonlee9420/STYLER)). A short extraction sketch follows this list.

2. Stack two layers of `FFTBlock` for the lyrics encoder (text encoder).
3. (Naive version) The number of learnable parameters is `34.337M`, which is larger than that of the original paper (`26.744M`). The `diffusion` module accounts for a significant portion of the total parameters.
4. I did not remove the energy prediction of FastSpeech2 since it is not critical to model training or performance (as described in [LightSpeech](https://arxiv.org/abs/2102.04040)). It should be easy to remove without any performance degradation.
5. Use **HiFi-GAN** instead of **Parallel WaveGAN (PWG)** for vocoding.
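
For item 1, here is a minimal sketch of f0 extraction with **pyworld** (the method used in this implementation); the audio path, sampling rate, and frame period are illustrative assumptions rather than values read from this repository's config:

```python
import numpy as np
import pyworld as pw
import librosa

wav_path = "raw_data/LJSpeech/LJSpeech/LJ001-0006.wav"  # illustrative path
wav, sr = librosa.load(wav_path, sr=22050)
wav = wav.astype(np.float64)                            # pyworld expects float64

frame_period = 256 / 22050 * 1000                       # hop size in ms (assumed 256 / 22050)
f0, t = pw.dio(wav, sr, frame_period=frame_period)      # coarse f0 estimation
f0 = pw.stonemask(wav, f0, t, sr)                       # refinement with StoneMask
# f0 == 0 marks unvoiced frames; interpolate or mask them downstream as needed.
```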

# Citation

```
@misc{lee2021diffsinger,
  author = {Lee, Keon},
  title = {DiffSinger},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/DiffSinger}}
}
```

# References
- Authors' codebase
- [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) (later than the 2021.02.26 version)
- [hojonathanho's diffusion](https://github.com/hojonathanho/diffusion)
- [lmnt-com's diffwave](https://github.com/lmnt-com/diffwave)