# DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
[arXiv:2105.02446](https://arxiv.org/abs/2105.02446)

This repository is the official PyTorch implementation of our AAAI-2022 [paper](https://arxiv.org/abs/2105.02446), in which we propose DiffSinger (for Singing Voice Synthesis) and DiffSpeech (for Text-to-Speech).

In addition, a more detailed and improved code framework, which contains implementations of FastSpeech 2, DiffSpeech, and our NeurIPS-2021 work [PortaSpeech](https://openreview.net/forum?id=xmJsuh8xlq), is coming soon :sparkles: :sparkles: :sparkles:.
<table style="width:100%">
  <tr>
    <th>DiffSinger/DiffSpeech at training</th>
    <th>DiffSinger/DiffSpeech at inference</th>
  </tr>
  <tr>
    <td><img src="resources/model_a.png" alt="Training" height="300"></td>
    <td><img src="resources/model_b.png" alt="Inference" height="300"></td>
  </tr>
</table>

:rocket: **News**:
 - Dec.01, 2021: DiffSinger was accepted by AAAI-2022.
 - Sep.29, 2021: Our recent work `PortaSpeech: Portable and High-Quality Generative Text-to-Speech` was accepted by NeurIPS-2021 [arXiv:2109.15166](https://arxiv.org/abs/2109.15166).
 - May.06, 2021: We submitted DiffSinger to arXiv [arXiv:2105.02446](https://arxiv.org/abs/2105.02446).

## Environments
```sh
conda create -n your_env_name python=3.8
source activate your_env_name
pip install -r requirements_2080.txt   # for GPU 2080Ti, CUDA 10.2
# or
pip install -r requirements_3090.txt   # for GPU 3090, CUDA 11.4
```
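
After installing, a quick check that PyTorch sees your GPU can save debugging time later. This is a minimal sketch; it only assumes PyTorch was installed by one of the requirements files above.

```python
# Minimal environment sanity check (assumes PyTorch was installed via the requirements file).
import torch

print(torch.__version__)                  # version pinned by the requirements file
print(torch.cuda.is_available())          # expected: True on a properly configured 2080Ti / 3090
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the name of your RTX 2080 Ti / 3090
```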

## DiffSpeech (TTS version)
### 1. Data Preparation

a) Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/), then create a link to the dataset folder: `ln -s /xxx/LJSpeech-1.1/ data/raw/`

b) Download and unzip the [ground-truth durations](https://drive.google.com/file/d/1SqwIISwaBZDiCW1MHTHx-MKX6_NQJ_f4/view?usp=sharing) extracted by [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz): `tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/`

c) Run the following script to pack the dataset for training/inference.

```sh
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml

# `data/binary/ljspeech` will be generated.
```
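
Before launching training, you may want to confirm that the packing step actually produced the binary folder. The check below is only a sketch; the exact file names inside the folder are whatever `binarize.py` writes, so we only test for presence.

```python
# Sketch: confirm the binarized LJSpeech folder was created and is non-empty.
import os

binary_dir = "data/binary/ljspeech"
assert os.path.isdir(binary_dir), f"{binary_dir} not found -- did binarize.py finish without errors?"
files = sorted(os.listdir(binary_dir))
print(f"{len(files)} entries in {binary_dir}:")
print(files)
```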

### 2. Training Example

```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name xxx --reset
```

### 3. Inference Example

```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name xxx --reset --infer
```

We also provide:
 - the pre-trained model of [DiffSpeech](https://drive.google.com/file/d/1AHRuNS379v2_lNuz4-Mjlpii7TZsfs3f/view?usp=sharing);
 - the pre-trained model of the [HifiGAN](https://drive.google.com/file/d/1Z3DJ9fvvzIci9DAf8jwchQs-Ulgpx6l8/view?usp=sharing) vocoder;
 - the individual pre-trained model of [FastSpeech 2](https://drive.google.com/file/d/1Zp45YjKkkv5vQSA7woHIqEggfyLqQdqs/view?usp=sharing) for the shallow diffusion mechanism in DiffSpeech.

Remember to put the pre-trained models in the `checkpoints` directory.

About the determination of `k` in shallow diffusion: we recommend the trick introduced in Appendix B of the paper, and we have already provided a proper `k` for the LJSpeech dataset in the config files.
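
Intuitively, the shallow step `k` is chosen near the point where the forward-diffused ground-truth mel and the forward-diffused auxiliary-decoder (FastSpeech 2) mel become hard to tell apart. The snippet below is only a toy illustration of that intuition, using hypothetical mel arrays and a made-up beta schedule; it is not the actual boundary-prediction procedure from Appendix B.

```python
# Toy illustration of the shallow-diffusion intuition (NOT the Appendix-B procedure).
# gt_mel and fs2_mel are hypothetical [T, 80] mel-spectrograms on the same scale.
import numpy as np

def rough_boundary(gt_mel, fs2_mel, num_steps=100, tol=0.05, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.06, num_steps)        # hypothetical DDPM beta schedule
    alphas_cumprod = np.cumprod(1.0 - betas)
    for t in range(num_steps):
        a = alphas_cumprod[t]
        noise = rng.standard_normal(gt_mel.shape)     # shared noise: only the signal term differs
        xt_gt = np.sqrt(a) * gt_mel + np.sqrt(1.0 - a) * noise
        xt_fs2 = np.sqrt(a) * fs2_mel + np.sqrt(1.0 - a) * noise
        if np.mean(np.abs(xt_gt - xt_fs2)) < tol:     # diffused mels roughly coincide
            return t
    return num_steps

# Example with random stand-ins for real mel-spectrograms:
gt_mel = np.random.default_rng(1).standard_normal((500, 80))
fs2_mel = gt_mel + 0.1 * np.random.default_rng(2).standard_normal((500, 80))
print(rough_boundary(gt_mel, fs2_mel))
```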


## DiffSinger (SVS version)

### 0. Data Acquisition
- [ ] WIP. We will provide a form to apply for the PopCS dataset.

### 1. Data Preparation
- [ ] WIP. Similar to DiffSpeech.

### 2. Training Example
```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6.yaml --exp_name xxx --reset
# or
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name xxx --reset
```
### 3. Inference Example
```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config xxx --exp_name xxx --reset --infer
```
The pre-trained model for SVS will be provided soon.
<!--
Besides, the original PWG-based vocoder for SVS in our paper has been used commercially, but we are working on training a better HifiGAN-based vocoder.
-->

## Tensorboard
```sh
tensorboard --logdir_spec exp_name
```
<table style="width:100%">
  <tr>
    <td><img src="resources/tfb.png" alt="Tensorboard" height="250"></td>
  </tr>
</table>

## Mel Visualization
In each figure, DiffSpeech's mel-spectrogram occupies rows [0-80] along the vertical axis and FastSpeech 2's occupies rows [80-160].

<table style="width:100%">
  <tr>
    <th>DiffSpeech vs. FastSpeech 2</th>
  </tr>
  <tr>
    <td><img src="resources/diffspeech-fs2.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
  </tr>
  <tr>
    <td><img src="resources/diffspeech-fs2-1.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
  </tr>
  <tr>
    <td><img src="resources/diffspeech-fs2-2.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
  </tr>
</table>
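
Stacked comparison plots like the ones above can be produced with something along these lines. This is only a sketch; the `.npy` file names are hypothetical placeholders for mel-spectrograms of shape `[T, 80]` saved from your own inference runs.

```python
# Sketch: stack two mel-spectrograms vertically for comparison, matching the
# layout described above (DiffSpeech in rows [0-80], FastSpeech 2 in rows [80-160]).
import numpy as np
import matplotlib.pyplot as plt

diffspeech_mel = np.load("diffspeech_mel.npy")   # hypothetical file, shape [T, 80]
fs2_mel = np.load("fs2_mel.npy")                 # hypothetical file, shape [T, 80]

stacked = np.concatenate([diffspeech_mel, fs2_mel], axis=1).T   # shape [160, T]
plt.figure(figsize=(10, 4))
plt.imshow(stacked, origin="lower", aspect="auto")
plt.ylabel("DiffSpeech: rows 0-80, FastSpeech 2: rows 80-160")
plt.xlabel("frames")
plt.tight_layout()
plt.savefig("diffspeech_vs_fs2.png")
```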

## Audio Demos
Audio samples can be found on our [demo page](https://diffsinger.github.io/).

We also put some test-set audio samples generated by DiffSpeech+HifiGAN (marked as [P]) and GT mel+HifiGAN (marked as [G]) in `resources/demos_1218`
(corresponding to the pre-trained [DiffSpeech](https://drive.google.com/file/d/1AHRuNS379v2_lNuz4-Mjlpii7TZsfs3f/view?usp=sharing) model).

## Citation
    @misc{liu2021diffsinger,
      title={DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism},
      author={Jinglin Liu and Chengxi Li and Yi Ren and Feiyang Chen and Zhou Zhao},
      year={2021},
      eprint={2105.02446},
      archivePrefix={arXiv},
    }


## Acknowledgements
Our code is based on the following repos:
* [denoising-diffusion-pytorch](https://github.com/lucidrains/denoising-diffusion-pytorch)
* [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
* [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
* [HifiGAN](https://github.com/jik876/hifi-gan)
* [espnet](https://github.com/espnet/espnet)

Also, thanks to [Keon Lee](https://github.com/keonlee9420/DiffSinger) for a fast implementation of our work.