# DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
[arXiv:2105.02446](https://arxiv.org/abs/2105.02446)

This repository is the official PyTorch implementation of our AAAI-2022 [paper](https://arxiv.org/abs/2105.02446), in which we propose DiffSinger (for Singing Voice Synthesis) and DiffSpeech (for Text-to-Speech).

In addition, a more detailed and improved code framework, which contains implementations of FastSpeech 2, DiffSpeech, and our NeurIPS-2021 work [PortaSpeech](https://openreview.net/forum?id=xmJsuh8xlq), is coming soon :sparkles: :sparkles: :sparkles:.
<table style="width:100%">
  <tr>
    <th>DiffSinger/DiffSpeech at training</th>
    <th>DiffSinger/DiffSpeech at inference</th>
  </tr>
  <tr>
    <td><img src="resources/model_a.png" alt="Training" height="300"></td>
    <td><img src="resources/model_b.png" alt="Inference" height="300"></td>
  </tr>
</table>

:rocket: **News**:
 - Dec.01, 2021: DiffSinger was accepted by AAAI-2022.
 - Sep.29, 2021: Our recent work `PortaSpeech: Portable and High-Quality Generative Text-to-Speech` was accepted by NeurIPS-2021 [arXiv:2109.15166](https://arxiv.org/abs/2109.15166).
 - May.06, 2021: We submitted DiffSinger to arXiv [arXiv:2105.02446](https://arxiv.org/abs/2105.02446).

## Environments
```sh
conda create -n your_env_name python=3.8
source activate your_env_name
pip install -r requirements_2080.txt   # for GPU 2080Ti, CUDA 10.2
# or
pip install -r requirements_3090.txt   # for GPU 3090, CUDA 11.4
```
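
After installing, a quick check that PyTorch sees your GPU can save debugging time later. This is a minimal sketch; it only assumes PyTorch was installed by one of the requirements files above.

```python
# Minimal environment sanity check (assumes PyTorch was installed via the requirements file).
import torch

print(torch.__version__)                  # version pinned by the requirements file
print(torch.cuda.is_available())          # expected: True on a properly configured 2080Ti / 3090
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the name of your RTX 2080 Ti / 3090
```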

## DiffSpeech (TTS version)
### 1. Data Preparation

a) Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/), then create a link to the dataset folder: `ln -s /xxx/LJSpeech-1.1/ data/raw/`

b) Download and unzip the [ground-truth durations](https://drive.google.com/file/d/1SqwIISwaBZDiCW1MHTHx-MKX6_NQJ_f4/view?usp=sharing) extracted by [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz): `tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/`

c) Run the following script to pack the dataset for training/inference.

```sh
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml

# `data/binary/ljspeech` will be generated.
```
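
Before launching training, you may want to confirm that the packing step actually produced the binary folder. The check below is only a sketch; the exact file names inside the folder are whatever `binarize.py` writes, so we only test for presence.

```python
# Sketch: confirm the binarized LJSpeech folder was created and is non-empty.
import os

binary_dir = "data/binary/ljspeech"
assert os.path.isdir(binary_dir), f"{binary_dir} not found -- did binarize.py finish without errors?"
files = sorted(os.listdir(binary_dir))
print(f"{len(files)} entries in {binary_dir}:")
print(files)
```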

### 2. Training Example

```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name xxx --reset
```

### 3. Inference Example

```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name xxx --reset --infer
```

We also provide:
 - the pre-trained model of [DiffSpeech](https://drive.google.com/file/d/1AHRuNS379v2_lNuz4-Mjlpii7TZsfs3f/view?usp=sharing);
 - the pre-trained model of the [HifiGAN](https://drive.google.com/file/d/1Z3DJ9fvvzIci9DAf8jwchQs-Ulgpx6l8/view?usp=sharing) vocoder;
 - the individual pre-trained model of [FastSpeech 2](https://drive.google.com/file/d/1Zp45YjKkkv5vQSA7woHIqEggfyLqQdqs/view?usp=sharing) for the shallow diffusion mechanism in DiffSpeech.

Remember to put the pre-trained models in the `checkpoints` directory.

About the determination of `k` in shallow diffusion: we recommend the trick introduced in Appendix B of the paper, and we have already provided a proper `k` for the LJSpeech dataset in the config files.
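
Intuitively, the shallow step `k` is chosen near the point where the forward-diffused ground-truth mel and the forward-diffused auxiliary-decoder (FastSpeech 2) mel become hard to tell apart. The snippet below is only a toy illustration of that intuition, using hypothetical mel arrays and a made-up beta schedule; it is not the actual boundary-prediction procedure from Appendix B.

```python
# Toy illustration of the shallow-diffusion intuition (NOT the Appendix-B procedure).
# gt_mel and fs2_mel are hypothetical [T, 80] mel-spectrograms on the same scale.
import numpy as np

def rough_boundary(gt_mel, fs2_mel, num_steps=100, tol=0.05, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.06, num_steps)        # hypothetical DDPM beta schedule
    alphas_cumprod = np.cumprod(1.0 - betas)
    for t in range(num_steps):
        a = alphas_cumprod[t]
        noise = rng.standard_normal(gt_mel.shape)     # shared noise: only the signal term differs
        xt_gt = np.sqrt(a) * gt_mel + np.sqrt(1.0 - a) * noise
        xt_fs2 = np.sqrt(a) * fs2_mel + np.sqrt(1.0 - a) * noise
        if np.mean(np.abs(xt_gt - xt_fs2)) < tol:     # diffused mels roughly coincide
            return t
    return num_steps

# Example with random stand-ins for real mel-spectrograms:
gt_mel = np.random.default_rng(1).standard_normal((500, 80))
fs2_mel = gt_mel + 0.1 * np.random.default_rng(2).standard_normal((500, 80))
print(rough_boundary(gt_mel, fs2_mel))
```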


## DiffSinger (SVS version)

### 0. Data Acquisition
- [ ] WIP. We will provide a form to apply for the PopCS dataset.

### 1. Data Preparation
- [ ] WIP. Similar to DiffSpeech.

### 2. Training Example
```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6.yaml --exp_name xxx --reset
# or
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name xxx --reset
```
### 3. Inference Example
```sh
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config xxx --exp_name xxx --reset --infer
```
The pre-trained model for SVS will be provided soon.
<!--
Besides, the original PWG-based vocoder for SVS in our paper has been used commercially, but we are working on training a better HifiGAN-based vocoder.
-->

## Tensorboard
```sh
tensorboard --logdir_spec exp_name
```
<table style="width:100%">
  <tr>
    <td><img src="resources/tfb.png" alt="Tensorboard" height="250"></td>
  </tr>
</table>

## Mel Visualization
In each figure, DiffSpeech's mel-spectrogram occupies rows [0-80] along the vertical axis and FastSpeech 2's occupies rows [80-160].

<table style="width:100%">
  <tr>
    <th>DiffSpeech vs. FastSpeech 2</th>
  </tr>
  <tr>
    <td><img src="resources/diffspeech-fs2.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
  </tr>
  <tr>
    <td><img src="resources/diffspeech-fs2-1.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
  </tr>
  <tr>
    <td><img src="resources/diffspeech-fs2-2.png" alt="DiffSpeech-vs-FastSpeech2" height="250"></td>
  </tr>
</table>
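
Stacked comparison plots like the ones above can be produced with something along these lines. This is only a sketch; the `.npy` file names are hypothetical placeholders for mel-spectrograms of shape `[T, 80]` saved from your own inference runs.

```python
# Sketch: stack two mel-spectrograms vertically for comparison, matching the
# layout described above (DiffSpeech in rows [0-80], FastSpeech 2 in rows [80-160]).
import numpy as np
import matplotlib.pyplot as plt

diffspeech_mel = np.load("diffspeech_mel.npy")   # hypothetical file, shape [T, 80]
fs2_mel = np.load("fs2_mel.npy")                 # hypothetical file, shape [T, 80]

stacked = np.concatenate([diffspeech_mel, fs2_mel], axis=1).T   # shape [160, T]
plt.figure(figsize=(10, 4))
plt.imshow(stacked, origin="lower", aspect="auto")
plt.ylabel("DiffSpeech: rows 0-80, FastSpeech 2: rows 80-160")
plt.xlabel("frames")
plt.tight_layout()
plt.savefig("diffspeech_vs_fs2.png")
```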

## Audio Demos
Audio samples can be found on our [demo page](https://diffsinger.github.io/).

We also put some test-set audio samples generated by DiffSpeech+HifiGAN (marked as [P]) and GT mel+HifiGAN (marked as [G]) in `resources/demos_1218`
(corresponding to the pre-trained [DiffSpeech](https://drive.google.com/file/d/1AHRuNS379v2_lNuz4-Mjlpii7TZsfs3f/view?usp=sharing) model).

## Citation
    @misc{liu2021diffsinger,
      title={DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism},
      author={Jinglin Liu and Chengxi Li and Yi Ren and Feiyang Chen and Zhou Zhao},
      year={2021},
      eprint={2105.02446},
      archivePrefix={arXiv},
    }


## Acknowledgements
Our code is based on the following repos:
* [denoising-diffusion-pytorch](https://github.com/lucidrains/denoising-diffusion-pytorch)
* [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
* [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
* [HifiGAN](https://github.com/jik876/hifi-gan)
* [espnet](https://github.com/espnet/espnet)

Also, thanks to [Keon Lee](https://github.com/keonlee9420/DiffSinger) for a fast implementation of our work.