Commit ce7abdc

first commit
0 parents  commit ce7abdc

177 files changed: 39444 additions and 0 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
.idea
*.pyc
__pycache__/

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2021 Jinglin Liu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
# DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)

This repository is the official PyTorch implementation of our AAAI-2022 [paper](https://arxiv.org/abs/2105.02446), in which we propose DiffSinger (for singing voice synthesis) and DiffSpeech (for text-to-speech).

In addition, a more detailed and improved code framework, which contains implementations of FastSpeech 2, DiffSpeech, and our NeurIPS-2021 work [PortaSpeech](https://openreview.net/forum?id=xmJsuh8xlq), is coming soon :sparkles: :sparkles: :sparkles:.
<table style="width:100%">
<tr>
<th>DiffSinger/DiffSpeech at training</th>
<th>DiffSinger/DiffSpeech at inference</th>
</tr>
<tr>
<td><img src="resources/model_a.png" alt="Training" height="300"></td>
<td><img src="resources/model_b.png" alt="Inference" height="300"></td>
</tr>
</table>

:rocket: **News**:
- Dec.01, 2021: DiffSinger was accepted by AAAI-2022.
- Sep.29, 2021: Our recent work `PortaSpeech: Portable and High-Quality Generative Text-to-Speech` was accepted by NeurIPS-2021 [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2109.15166).
- May.06, 2021: We submitted DiffSinger to arXiv [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446).

## Environments
```sh
conda create -n your_env_name python=3.8
source activate your_env_name
# For a 2080Ti GPU with CUDA 10.2:
pip install -r requirements_2080.txt
# Or, for a 3090 GPU with CUDA 11.4:
# pip install -r requirements_3090.txt
```

## DiffSpeech (TTS version)
### 1. Data Preparation

a) Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/), then create a link to the dataset folder: `ln -s /xxx/LJSpeech-1.1/ data/raw/`

b) Download and unzip the [ground-truth durations](https://drive.google.com/file/d/1SqwIISwaBZDiCW1MHTHx-MKX6_NQJ_f4/view?usp=sharing) extracted by [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz): `tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/`

c) Run the following script to pack the dataset for training and inference.

```sh
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml

# `data/binary/ljspeech` will be generated.
```

### 2. Training Example

```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name xxx --reset
```


### 3. Inference Example

```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name xxx --reset --infer
```

We also provide:
- the pre-trained model of [DiffSpeech](https://drive.google.com/file/d/1AHRuNS379v2_lNuz4-Mjlpii7TZsfs3f/view?usp=sharing);
- the pre-trained model of the [HifiGAN](https://drive.google.com/file/d/1Z3DJ9fvvzIci9DAf8jwchQs-Ulgpx6l8/view?usp=sharing) vocoder;
- the individual pre-trained model of [FastSpeech 2](https://drive.google.com/file/d/1Zp45YjKkkv5vQSA7woHIqEggfyLqQdqs/view?usp=sharing) for the shallow diffusion mechanism in DiffSpeech.

Remember to put the pre-trained models in the `checkpoints` directory.

On choosing `k` in shallow diffusion: we recommend the trick introduced in Appendix B of the paper. The config files already provide a suitable `k` for the LJSpeech dataset.
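
To make the role of `k` concrete, the sketch below shows the forward diffusion step that shallow diffusion relies on at inference: rather than starting the reverse process from pure noise at the last step, the mel-spectrogram predicted by the auxiliary decoder (FastSpeech 2 here) is noised to step `k`, and denoising then runs only from `k` down to 0. This is a pure-Python illustration, not the repository's code. The linear beta schedule, its endpoints, the number of steps, the value `k=30`, and all function names are assumptions chosen for the example.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.06):
    """Hypothetical linear noise schedule; the endpoints are illustrative only."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bar(betas, k):
    """Cumulative product of (1 - beta_t) over the first k steps."""
    prod = 1.0
    for beta in betas[:k]:
        prod *= 1.0 - beta
    return prod

def diffuse_to_step(x0, betas, k, rng):
    """Sample x_k ~ N(sqrt(alpha_bar_k) * x0, (1 - alpha_bar_k) * I)."""
    a_bar = alpha_bar(betas, k)
    mean_scale = math.sqrt(a_bar)
    noise_scale = math.sqrt(1.0 - a_bar)
    return [mean_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]

betas = linear_betas(num_steps=100)
coarse_mel = [0.2, -0.5, 1.0, 0.0]  # stand-in for a flattened auxiliary-decoder output
rng = random.Random(1234)
x_k = diffuse_to_step(coarse_mel, betas, k=30, rng=rng)
# The reverse (denoising) process would then run from step k down to 0.
```

A smaller `k` keeps more of the auxiliary decoder's output and needs fewer denoising steps; a larger `k` hands more of the work to the diffusion decoder.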


## DiffSinger (SVS version)

### 0. Data Acquisition
- [ ] WIP.
We will provide a form to apply for the PopCS dataset.

### 1. Data Preparation
- [ ] WIP.
Similar to DiffSpeech.

### 2. Training Example
```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6.yaml --exp_name xxx --reset
# or
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name xxx --reset
```
### 3. Inference Example
```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config xxx --exp_name xxx --reset --infer
```
The pre-trained model for SVS will be provided soon.
<!--
Besides, the original PWG-based vocoder for SVS in our paper has been used commercially, but we are working on training a better HifiGAN-based vocoder.
-->

## Tensorboard
```sh
tensorboard --logdir_spec exp_name
```
<table style="width:100%">
<tr>
<td><img src="resources/tfb.png" alt="Tensorboard" height="250"></td>
</tr>
</table>

## Mel Visualization
Along the vertical axis, DiffSpeech: [0-80]; FastSpeech 2: [80-160].

<table style="width:100%">
<tr>
<th>DiffSpeech vs. FastSpeech 2</th>
</tr>
<tr>
<td><img src="resources/diffspeech-fs2.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
</tr>
<tr>
<td><img src="resources/diffspeech-fs2-1.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
</tr>
<tr>
<td><img src="resources/diffspeech-fs2-2.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
</tr>
</table>
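
The vertical layout of these figures can be reproduced by concatenating two mel-spectrograms along the frequency axis before plotting. The following is a minimal sketch, assuming each spectrogram is stored as a list of 80 mel-bin rows; the helper name and the dummy data are hypothetical, and the plotting code actually used for the figures may differ.

```python
NUM_MEL_BINS = 80

def stack_for_comparison(diffspeech_mel, fastspeech2_mel):
    """Stack two [bins][frames] spectrograms so that rows 0-79 hold the
    first one and rows 80-159 hold the second one."""
    if len(diffspeech_mel) != NUM_MEL_BINS or len(fastspeech2_mel) != NUM_MEL_BINS:
        raise ValueError("each spectrogram must have 80 mel bins")
    if len(diffspeech_mel[0]) != len(fastspeech2_mel[0]):
        raise ValueError("spectrograms must have the same number of frames")
    return diffspeech_mel + fastspeech2_mel

num_frames = 5
diffspeech_mel = [[0.0] * num_frames for _ in range(NUM_MEL_BINS)]   # dummy data
fastspeech2_mel = [[1.0] * num_frames for _ in range(NUM_MEL_BINS)]  # dummy data
stacked = stack_for_comparison(diffspeech_mel, fastspeech2_mel)
```

Whether rows 0-79 appear in the lower or upper half of the image depends on the plot's origin setting.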

## Audio Demos
Audio samples can be found on our [demo page](https://diffsinger.github.io/).

We also provide some test-set audio samples generated by DiffSpeech+HifiGAN (marked as [P]) and GTmel+HifiGAN (marked as [G]) in `resources/demos_1218`.

(corresponding to the pre-trained model [DiffSpeech](https://drive.google.com/file/d/1AHRuNS379v2_lNuz4-Mjlpii7TZsfs3f/view?usp=sharing))

## Citation
@misc{liu2021diffsinger,
title={DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism},
author={Jinglin Liu and Chengxi Li and Yi Ren and Feiyang Chen and Zhou Zhao},
year={2021},
eprint={2105.02446},
archivePrefix={arXiv},}


## Acknowledgements
Our code is based on the following repos:
* [denoising-diffusion-pytorch](https://github.com/lucidrains/denoising-diffusion-pytorch)
* [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
* [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
* [HifiGAN](https://github.com/jik876/hifi-gan)
* [espnet](https://github.com/espnet/espnet)

Thanks also to [Keon Lee](https://github.com/keonlee9420/DiffSinger) for the fast implementation of our work.

configs/config_base.yaml

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# task
binary_data_dir: ''
work_dir: '' # experiment directory.
infer: false # infer
seed: 1234
debug: false
save_codes:
  - configs
  - modules
  - tasks
  - utils
  - usr

#############
# dataset
#############
ds_workers: 1
test_num: 100
valid_num: 100
endless_ds: false
sort_by_len: true

#########
# train and eval
#########
load_ckpt: ''
save_ckpt: true
save_best: true
num_ckpt_keep: 3
clip_grad_norm: 0
accumulate_grad_batches: 1
log_interval: 100
num_sanity_val_steps: 5 # steps of validation at the beginning
check_val_every_n_epoch: 10
val_check_interval: 2000
max_epochs: 1000
max_updates: 160000
max_tokens: 31250
max_sentences: 100000
max_eval_tokens: -1
max_eval_sentences: -1
test_input_dir: ''

configs/singing/base.yaml

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
base_config:
  - configs/tts/base.yaml
  - configs/tts/base_zh.yaml


datasets: []
test_prefixes: []
test_num: 0
valid_num: 0

pre_align_cls: data_gen.singing.pre_align.SingingPreAlign
binarizer_cls: data_gen.singing.binarize.SingingBinarizer
pre_align_args:
  use_tone: false # for ZH
  forced_align: mfa
  use_sox: true
hop_size: 128 # Hop size.
fft_size: 512 # FFT size.
win_size: 512 # Window size.
max_frames: 2400
fmin: 50 # Minimum freq in mel basis calculation.
fmax: 11025 # Maximum frequency in mel basis calculation.
pitch_type: frame

hidden_size: 256
mel_loss: "ssim:0.5|l1:0.5"
lambda_f0: 0.0
lambda_uv: 0.0
lambda_energy: 0.0
lambda_ph_dur: 0.0
lambda_sent_dur: 0.0
lambda_word_dur: 0.0
predictor_grad: 0.0
use_spk_embed: true
use_spk_id: false

max_tokens: 20000
max_updates: 400000
num_spk: 100
save_f0: true
use_gt_dur: true
use_gt_f0: true
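
The `mel_loss` value in this config packs several loss terms and their weights into one string. As an illustration of how such a spec could be read, the sketch below splits `"ssim:0.5|l1:0.5"` into a name-to-weight mapping. The parsing rules (terms separated by `|`, an optional `:weight` that defaults to 1.0) and the function name are assumptions made for this example, not taken from the repository.

```python
def parse_loss_spec(spec):
    """Parse 'name:weight|name:weight' into a dict of float weights."""
    weights = {}
    for term in spec.split("|"):
        term = term.strip()
        if not term:
            continue  # tolerate empty segments such as a trailing '|'
        name, _, weight = term.partition(":")
        # Assumed default: a term without an explicit weight counts as 1.0.
        weights[name] = float(weight) if weight else 1.0
    return weights

loss_weights = parse_loss_spec("ssim:0.5|l1:0.5")
```

Under these assumptions, the total mel loss would be the weighted sum of the SSIM and L1 terms, each contributing half.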

configs/singing/fs2.yaml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
base_config:
  - configs/tts/fs2.yaml
  - configs/singing/base.yaml

configs/tts/base.yaml

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# task
base_config: configs/config_base.yaml
task_cls: ''
#############
# dataset
#############
raw_data_dir: ''
processed_data_dir: ''
binary_data_dir: ''
dict_dir: ''
pre_align_cls: ''
binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer
pre_align_args:
  use_tone: true # for ZH
  forced_align: mfa
  use_sox: false
  txt_processor: en
  allow_no_txt: false
  denoise: false
binarization_args:
  shuffle: false
  with_txt: true
  with_wav: false
  with_align: true
  with_spk_embed: true
  with_f0: true
  with_f0cwt: true

loud_norm: false
endless_ds: true
reset_phone_dict: true

test_num: 100
valid_num: 100
max_frames: 1550
max_input_tokens: 1550
audio_num_mel_bins: 80
audio_sample_rate: 22050
hop_size: 256 # For 22050Hz, 275 ~= 12.5 ms (0.0125 * sample_rate)
win_size: 1024 # For 22050Hz, 1100 ~= 50 ms (If None, win_size: fft_size) (0.05 * sample_rate)
fmin: 80 # Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
fmax: 7600 # To be increased/reduced depending on data.
fft_size: 1024 # Extra window size is filled with 0 paddings to match this parameter
min_level_db: -100
num_spk: 1
mel_vmin: -6
mel_vmax: 1.5
ds_workers: 4

#########
# model
#########
dropout: 0.1
enc_layers: 4
dec_layers: 4
hidden_size: 384
num_heads: 2
prenet_dropout: 0.5
prenet_hidden_size: 256
stop_token_weight: 5.0
enc_ffn_kernel_size: 9
dec_ffn_kernel_size: 9
ffn_act: gelu
ffn_padding: 'SAME'


###########
# optimization
###########
lr: 2.0
warmup_updates: 8000
optimizer_adam_beta1: 0.9
optimizer_adam_beta2: 0.98
weight_decay: 0
clip_grad_norm: 1


###########
# train and eval
###########
max_tokens: 30000
max_sentences: 100000
max_eval_sentences: 1
max_eval_tokens: 60000
train_set_name: 'train'
valid_set_name: 'valid'
test_set_name: 'test'
vocoder: pwg
vocoder_ckpt: ''
profile_infer: false
out_wav_norm: false
save_gt: false
save_f0: false
gen_dir_name: ''
use_denoise: false
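
The audio settings in this config fix the time resolution of the mel-spectrogram. The short calculation below converts them to milliseconds and seconds using only values that appear in the file (`audio_sample_rate: 22050`, `hop_size: 256`, `win_size: 1024`, `max_frames: 1550`). The helper name is hypothetical, and reading `max_frames` as a cap on utterance length is an assumption.

```python
SAMPLE_RATE = 22050  # audio_sample_rate
HOP_SIZE = 256       # hop_size
WIN_SIZE = 1024      # win_size
MAX_FRAMES = 1550    # max_frames

def samples_to_ms(num_samples, sample_rate=SAMPLE_RATE):
    """Convert a length in samples to milliseconds."""
    return 1000.0 * num_samples / sample_rate

hop_ms = samples_to_ms(HOP_SIZE)                   # about 11.6 ms between frames
win_ms = samples_to_ms(WIN_SIZE)                   # about 46.4 ms analysis window
frames_per_second = SAMPLE_RATE / HOP_SIZE         # about 86.1 frames per second
max_seconds = MAX_FRAMES * HOP_SIZE / SAMPLE_RATE  # about 18.0 s for 1550 frames
```

These figures are slightly below the 12.5 ms and 50 ms mentioned in the inline comments, which correspond to hop and window sizes of 275 and 1100 samples.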

configs/tts/base_zh.yaml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
pre_align_args:
  txt_processor: zh_g2pM
binarizer_cls: data_gen.tts.binarizer_zh.ZhBinarizer
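
The config files above chain together through `base_config`, which is either a single path or a list of paths. The sketch below illustrates one plausible way such a chain could be resolved: bases are merged in order, the file's own keys are applied last, and nested mappings such as `pre_align_args` are merged key by key. The loader that does this in the repository is not part of the files shown here, so the merge semantics, the function names, and the in-memory `CONFIGS` stand-in (which holds only a few keys copied from the files above) are all assumptions.

```python
import copy

# In-memory stand-ins for a few keys of the YAML files shown above.
CONFIGS = {
    "configs/config_base.yaml": {"max_tokens": 31250, "seed": 1234},
    "configs/tts/base.yaml": {
        "base_config": "configs/config_base.yaml",
        "max_tokens": 30000,
        "pre_align_args": {"use_tone": True, "txt_processor": "en"},
    },
    "configs/tts/base_zh.yaml": {
        "pre_align_args": {"txt_processor": "zh_g2pM"},
    },
    "configs/singing/base.yaml": {
        "base_config": ["configs/tts/base.yaml", "configs/tts/base_zh.yaml"],
        "max_tokens": 20000,
        "pre_align_args": {"use_tone": False},
    },
}

def deep_update(target, override):
    """Merge override into target in place, recursing into nested dicts."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            deep_update(target[key], value)
        else:
            target[key] = copy.deepcopy(value)

def resolve(path, configs=CONFIGS):
    """Resolve one config: merge its bases in order, then its own keys."""
    raw = configs[path]
    bases = raw.get("base_config", [])
    if isinstance(bases, str):
        bases = [bases]
    merged = {}
    for base in bases:
        deep_update(merged, resolve(base, configs))
    own_keys = {k: v for k, v in raw.items() if k != "base_config"}
    deep_update(merged, own_keys)
    return merged

hparams = resolve("configs/singing/base.yaml")
```

Under these assumed semantics, `max_tokens` ends up as 20000 (the singing config wins over both bases), `seed` is inherited from `configs/config_base.yaml`, and `pre_align_args` combines `use_tone: false` with `txt_processor: zh_g2pM`.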
