
Commit 72ee2dc: Initial commit (0 parents)

File tree: 174 files changed (+275,495 -0 lines)


.gitignore

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

__pycache__
.vscode
.DS_Store

# MFA
montreal-forced-aligner/

# data, checkpoint, and models
raw_data/
output/
*.npy
TextGrid/
hifigan/*.pth.tar
*.out
deepspeaker/pretrained_models/*

CITATION.cff

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Lee"
  given-names: "Keon"
  orcid: "https://orcid.org/0000-0001-9028-1018"
title: "DiffGAN-TTS"
version: 0.1.0
doi: ___
date-released: 2022-02-21
url: "https://github.com/keonlee9420/DiffGAN-TTS"

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2021 Keon Lee

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# DiffGAN-TTS - PyTorch Implementation

PyTorch implementation of [DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs](https://arxiv.org/abs/2201.11972)

<p align="center">
    <img src="img/model_1.png" width="80%">
</p>

<p align="center">
    <img src="img/model_2.png" width="80%">
</p>

# Repository Status
- [x] Naive Version of DiffGAN-TTS
- [x] Active Shallow Diffusion Mechanism: DiffGAN-TTS (two-stage)

## Audio Samples
Audio samples are available at [/demo](https://github.com/keonlee9420/DiffGAN-TTS/tree/main/demo).

# Quickstart

***DATASET*** refers to a dataset name such as `LJSpeech` or `VCTK` in the commands below.

***MODEL*** refers to the model type (choose from '**naive**', '**aux**', '**shallow**').

## Dependencies
You can install the Python dependencies with
```
pip3 install -r requirements.txt
```

## Inference

You have to download the [pretrained models](https://drive.google.com/drive/folders/14EqKdfq3hTCg8BQ1ZTc8aJwwnpkFOAzh?usp=sharing) and put them in
- ``output/ckpt/DATASET_naive/`` for the '**naive**' model.
- ``output/ckpt/DATASET_shallow/`` for the '**shallow**' model. Please note that the checkpoint of the '**shallow**' model contains both the '**shallow**' and '**aux**' models, and the two models share all directories except results throughout the whole process.

For **single-speaker TTS**, run
```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET
```

For **multi-speaker TTS**, run
```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
```

The dictionary of learned speakers can be found at `preprocessed_data/DATASET/speakers.json`, and the generated utterances will be put in `output/result/`.
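
To find a valid `SPEAKER_ID`, you can read `speakers.json` directly. A minimal sketch, assuming the file maps speaker names to the integer IDs that `--speaker_id` expects (verify against your own preprocessed output):

```
import json

# Load the dictionary of learned speakers produced by preprocessing.
with open("preprocessed_data/VCTK/speakers.json") as f:
    speakers = json.load(f)

# Assumed format: {"p225": 0, "p226": 1, ...}
print(speakers["p225"])  # pass this value as SPEAKER_ID
```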

## Batch Inference
Batch inference is also supported; try

```
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --model MODEL --restore_step RESTORE_STEP --mode batch --dataset DATASET
```
to synthesize all utterances in ``preprocessed_data/DATASET/val.txt``.

## Controllability
The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios.
For example, you can increase the speaking rate by 20% and decrease the volume by 20% with

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8
```

Please note that this controllability originates from [FastSpeech2](https://arxiv.org/abs/2006.04558) and is not a core contribution of DiffGAN-TTS.

# Training

## Datasets

The supported datasets are

- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): a **single-speaker** English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.

- [VCTK](https://datashare.ed.ac.uk/handle/10283/3443): The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (**multi-speaker TTS**) with various accents. Each speaker reads out about 400 sentences selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive.

## Preprocessing

- For **multi-speaker TTS** with an external speaker embedder, download the [ResCNN Softmax+Triplet pretrained model](https://drive.google.com/file/d/1F9NvdrarWZNktdX9KlRYWWHDwRkip_aP) of [philipperemy's DeepSpeaker](https://github.com/philipperemy/deep-speaker) for the speaker embedding and place it in `./deepspeaker/pretrained_models/`.
- Run
```
python3 prepare_align.py --dataset DATASET
```
for some preparations.

For forced alignment, [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Pre-extracted alignments for the datasets are provided [here](https://drive.google.com/drive/folders/1fizpyOiQ1lG2UDaMlXnT3Ll4_j6Xwg7K?usp=sharing).
You have to unzip the files into `preprocessed_data/DATASET/TextGrid/`. Alternatively, you can [run the aligner by yourself](https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/workflows/index.html).
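
If you align the data yourself, you may want to sanity-check one output file. A minimal sketch using the third-party `textgrid` package (`pip install textgrid`); this is an optional check outside the pipeline, and the path below is illustrative:

```
import textgrid

# Illustrative path; point this at any unzipped alignment file.
tg = textgrid.TextGrid.fromFile(
    "preprocessed_data/LJSpeech/TextGrid/LJSpeech/LJ001-0001.TextGrid"
)

# MFA writes "words" and "phones" interval tiers by default.
phones = tg.getFirst("phones")
for interval in phones.intervals[:5]:
    print(interval.minTime, interval.maxTime, interval.mark)
```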

After that, run the preprocessing script by
```
python3 preprocess.py --dataset DATASET
```

## Training

You can train three types of model: '**naive**', '**aux**', and '**shallow**'.

- Training Naive Version ('**naive**'):

Train the naive version with
```
python3 train.py --model naive --dataset DATASET
```

- Training Basic Acoustic Model for Shallow Version ('**aux**'):

To train the shallow version, we first need a pre-trained FastSpeech2. The command below trains the FastSpeech2 modules, including the Auxiliary (Mel) Decoder.
```
python3 train.py --model aux --dataset DATASET
```

- Training Shallow Version ('**shallow**'):

To leverage the pre-trained FastSpeech2, including the Auxiliary (Mel) Decoder, you must pass `--restore_step` with the final step of the auxiliary FastSpeech2 training, as in the following command.
```
python3 train.py --model shallow --restore_step RESTORE_STEP --dataset DATASET
```
For example, if the last checkpoint was saved at 200000 steps during the auxiliary training, set `--restore_step` to `200000`. The script will then load and freeze the aux model and continue training under the active shallow diffusion mechanism (see the sketch below).
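
The following is a minimal PyTorch sketch of what "load and freeze the aux model" amounts to. The submodule name `aux_decoder` and the checkpoint key `"model"` are hypothetical; the repo's `train.py` handles this internally via `--restore_step`:

```
import torch
from torch import nn

def load_and_freeze_aux(model: nn.Module, ckpt_path: str) -> nn.Module:
    # Load the pre-trained auxiliary FastSpeech2 weights.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"], strict=False)  # "model" key is assumed

    # Freeze the aux part so only the diffusion/GAN components keep training.
    for p in model.aux_decoder.parameters():  # hypothetical submodule name
        p.requires_grad_(False)
    model.aux_decoder.eval()
    return model
```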

# TensorBoard

Use
```
tensorboard --logdir output/log/DATASET
```

to serve TensorBoard on your localhost.
The loss curves, synthesized mel-spectrograms, and audios are shown.

## Naive Diffusion

![](./img/tensorboard_loss_naive.png)
![](./img/tensorboard_spec_naive.png)
![](./img/tensorboard_audio_naive.png)

## Shallow Diffusion

![](./img/tensorboard_loss_shallow.png)
![](./img/tensorboard_spec_shallow.png)
![](./img/tensorboard_audio_shallow.png)

# Notes

- In addition to the Diffusion Decoder, the Variance Adaptor is also conditioned on speaker information.
- The unconditional and conditional outputs of the JCU discriminator are averaged in each loss calculation, as in [VocGAN](https://www.isca-speech.org/archive/pdfs/interspeech_2020/yang20_interspeech.pdf).
- Some differences in the data and preprocessing compared to the original paper:
    - VCTK (109 speakers) is used instead of the 228-speaker Mandarin Chinese corpus.
    - Following [DiffSpeech](https://github.com/keonlee9420/DiffSinger)'s audio config, e.g., the sample rate is 22,050 Hz rather than 24,000 Hz.
    - Also following [DiffSpeech](https://github.com/keonlee9420/DiffSinger)'s variance extraction and modeling.
- `lambda_fm` is fixed to a scalar value since the dynamically scaled scalar computed as L_recon/L_fm makes the model explode (see the sketch after this list).
- There are two speaker-embedding options for the **multi-speaker TTS** setting: training a speaker embedder from scratch or using a pre-trained [philipperemy's DeepSpeaker](https://github.com/philipperemy/deep-speaker) model (as [STYLER](https://github.com/keonlee9420/STYLER) did). You can toggle between them in the config (between `'none'` and `'DeepSpeaker'`).
- DeepSpeaker on the VCTK dataset shows clear identification among speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings.

<p align="center">
    <img src="./preprocessed_data/VCTK/spker_embed_tsne.png" width="40%">
</p>
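
As a minimal sketch of the `lambda_fm` point above (illustrative names and values, not the repo's exact loss code):

```
import torch

def generator_loss(l_adv: torch.Tensor, l_recon: torch.Tensor,
                   l_fm: torch.Tensor, lambda_fm: float = 10.0,
                   dynamic: bool = False) -> torch.Tensor:
    # Combine adversarial, reconstruction, and feature-matching terms.
    # lambda_fm = 10.0 is an illustrative fixed value, not the repo's setting.
    if dynamic:
        # Paper-style dynamic weight lambda_fm = L_recon / L_fm, reported
        # above to destabilize training in this implementation.
        lambda_fm = (l_recon / (l_fm + 1e-8)).detach()
    return l_adv + l_recon + lambda_fm * l_fm
```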

# Citation

Please cite this repository via the "[Cite this repository](https://github.blog/2021-08-19-enhanced-support-citations-github/)" feature in the **About** section (top right of the main page).

# References
- [keonlee9420's DiffSinger](https://github.com/keonlee9420/DiffSinger)
- [keonlee9420's Comprehensive-Transformer-TTS](https://github.com/keonlee9420/Comprehensive-Transformer-TTS)
- [LynnHo's DCGAN-LSGAN-WGAN-GP-DRAGAN-Pytorch](https://github.com/LynnHo/DCGAN-LSGAN-WGAN-GP-DRAGAN-Pytorch)
- [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)
- [Tackling the Generative Learning Trilemma with Denoising Diffusion GANs](https://arxiv.org/abs/2112.07804)
- [DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism](https://arxiv.org/abs/2105.02446)

audio/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
import audio.tools
import audio.stft
import audio.audio_processing
