
Commit 9fa5c4e

Initial commit
0 parents  commit 9fa5c4e


68 files changed: 432,027 additions, 0 deletions

.gitignore

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

__pycache__
.vscode
.DS_Store

# MFA
montreal-forced-aligner/

# data, checkpoint, and models
raw_data/
output/
*.npy
TextGrid/
hifigan/*.pth.tar
*.out

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2021 Keon Lee

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
# DiffSinger - PyTorch Implementation

PyTorch implementation of [DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis](https://arxiv.org/abs/2105.02446) (TTS Extension).

<p align="center">
<img src="img/model_1.png" width="80%">
</p>

<p align="center">
<img src="img/model_2.png" width="80%">
</p>

# Status (2021.06.03)
- [x] Naive Version of DiffSinger
- [ ] Shallow Diffusion Mechanism: Training the boundary predictor by leveraging a pre-trained auxiliary decoder + training the denoiser using `k` as the maximum time step

# Quickstart

## Dependencies
You can install the Python dependencies with
```
pip3 install -r requirements.txt
```

## Inference

You have to download the [pretrained models]() and put them in ``output/ckpt/LJSpeech/``.

For English single-speaker TTS, run
```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```
The generated utterances will be put in ``output/result/``.


## Batch Inference
Batch inference is also supported; try

```
python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```
to synthesize all utterances in ``preprocessed_data/LJSpeech/val.txt``.

## Controllability
The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios.
For example, one can increase the speaking rate by 20% and decrease the volume by 20% with

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --duration_control 0.8 --energy_control 0.8
```

# Training

## Datasets

The supported datasets are

- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
- (more to be added)

## Preprocessing

First, run
```
python3 prepare_align.py config/LJSpeech/preprocess.yaml
```
for some preparations.

As described in the paper, [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Alignments for the LJSpeech dataset are provided [here](https://drive.google.com/drive/folders/1DBRkALpPd6FL9gjHMmMEdHODmkgNIIK4?usp=sharing) from [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2).
You have to unzip the files into ``preprocessed_data/LJSpeech/TextGrid/``.

After that, run the preprocessing script with
```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```

Alternatively, you can align the corpus yourself.
Download the official MFA package and run
```
./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech
```
or
```
./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech
```

to align the corpus, and then run the preprocessing script.
```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```

## Training

Train your model with
```
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

# TensorBoard

Use
```
tensorboard --logdir output/log/LJSpeech
```

to serve TensorBoard on your localhost.
The loss curves, synthesized mel-spectrograms, and audio samples are shown.


# Implementation Issues

1. **Pitch extractor comparison (on LJ001-0006.wav)**

    <p align="center">
    <img src="img/pitch_extractor_comparison.png" width="100%">
    </p>

    **pyworld** is used to extract f0 (fundamental frequency) as pitch information in this implementation. Empirically, however, I found all three methods equally acceptable for clean datasets (e.g., LJSpeech), as the figure above shows. Note that **pysptk** may work better for noisy datasets (as described in [STYLER](https://github.com/keonlee9420/STYLER)). A minimal pyworld sketch is given after this list.
2. Stack two layers of `FFTBlock` for the lyrics encoder (text encoder).
3. (Naive version) The number of learnable parameters is `34.337M`, which is larger than the original paper (`26.744M`). The `diffusion` module accounts for a significant portion of the total parameters.
4. I did not remove the energy prediction of FastSpeech2 since it is not critical to model training or performance (as described in [LightSpeech](https://arxiv.org/abs/2102.04040)). It should be easy to remove without any performance degradation.
5. Use **HiFi-GAN** instead of **Parallel WaveGAN (PWG)** for vocoding.
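
The following is a rough, illustrative sketch of the pyworld-based f0 extraction mentioned in issue 1, not the repository's actual preprocessing code; the clip path, sampling rate, and hop size are assumptions for the example.

```python
# Illustrative f0 extraction with pyworld (DIO + StoneMask); not the repo's exact code.
import librosa
import numpy as np
import pyworld as pw

wav_path = "LJ001-0006.wav"             # hypothetical local copy of the clip
sampling_rate, hop_length = 22050, 256  # assumed values for this sketch

wav, _ = librosa.load(wav_path, sr=sampling_rate)
wav = wav.astype(np.float64)            # pyworld expects float64 input

# DIO estimates a coarse f0 track; StoneMask refines it. Unvoiced frames are 0.
f0, t = pw.dio(wav, sampling_rate, frame_period=hop_length / sampling_rate * 1000)
f0 = pw.stonemask(wav, f0, t, sampling_rate)
print(f0.shape)                         # one f0 value per analysis frame
```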

# Citation

```
@misc{lee2021diffsinger,
  author = {Lee, Keon},
  title = {DiffSinger},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/DiffSinger}}
}
```

# References
- Authors' codebase
- [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) (Later than 2021.02.26 ver.)
- [hojonathanho's diffusion](https://github.com/hojonathanho/diffusion)
- [lmnt-com's diffwave](https://github.com/lmnt-com/diffwave)

audio/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
import audio.tools
import audio.stft
import audio.audio_processing

audio/audio_processing.py

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
import torch
import numpy as np
import librosa.util as librosa_util
from scipy.signal import get_window


def window_sumsquare(
    window,
    n_frames,
    hop_length,
    win_length,
    n_fft,
    dtype=np.float32,
    norm=None,
):
    """
    # from librosa 0.6
    Compute the sum-square envelope of a window function at a given hop length.

    This is used to estimate modulation effects induced by windowing
    observations in short-time fourier transforms.

    Parameters
    ----------
    window : string, tuple, number, callable, or list-like
        Window specification, as in `get_window`

    n_frames : int > 0
        The number of analysis frames

    hop_length : int > 0
        The number of samples to advance between frames

    win_length : [optional]
        The length of the window function. By default, this matches `n_fft`.

    n_fft : int > 0
        The length of each analysis frame.

    dtype : np.dtype
        The data type of the output

    Returns
    -------
    wss : np.ndarray, shape=`(n_fft + hop_length * (n_frames - 1))`
        The sum-squared envelope of the window function
    """
    if win_length is None:
        win_length = n_fft

    n = n_fft + hop_length * (n_frames - 1)
    x = np.zeros(n, dtype=dtype)

    # Compute the squared window at the desired length
    win_sq = get_window(window, win_length, fftbins=True)
    win_sq = librosa_util.normalize(win_sq, norm=norm) ** 2
    win_sq = librosa_util.pad_center(win_sq, n_fft)

    # Fill the envelope
    for i in range(n_frames):
        sample = i * hop_length
        x[sample : min(n, sample + n_fft)] += win_sq[: max(0, min(n_fft, n - sample))]
    return x


def griffin_lim(magnitudes, stft_fn, n_iters=30):
    """
    PARAMS
    ------
    magnitudes: spectrogram magnitudes
    stft_fn: STFT class with transform (STFT) and inverse (ISTFT) methods
    """

    angles = np.angle(np.exp(2j * np.pi * np.random.rand(*magnitudes.size())))
    angles = angles.astype(np.float32)
    angles = torch.autograd.Variable(torch.from_numpy(angles))
    signal = stft_fn.inverse(magnitudes, angles).squeeze(1)

    for i in range(n_iters):
        _, angles = stft_fn.transform(signal)
        signal = stft_fn.inverse(magnitudes, angles).squeeze(1)
    return signal


def dynamic_range_compression(x, C=1, clip_val=1e-5):
    """
    PARAMS
    ------
    C: compression factor
    """
    return torch.log(torch.clamp(x, min=clip_val) * C)


def dynamic_range_decompression(x, C=1):
    """
    PARAMS
    ------
    C: compression factor used to compress
    """
    return torch.exp(x) / C
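
As a small usage sketch (not part of this commit), the helpers above can be exercised as follows; the window settings and mel shape are arbitrary example values.

```python
# Usage sketch for audio/audio_processing.py (illustrative; values are assumptions).
import torch

from audio.audio_processing import (
    dynamic_range_compression,
    dynamic_range_decompression,
    window_sumsquare,
)

# Sum-square Hann-window envelope for 100 frames with a 256-sample hop.
env = window_sumsquare("hann", 100, hop_length=256, win_length=1024, n_fft=1024)
print(env.shape)  # (1024 + 256 * 99,) == (26368,)

# Log-compression round trip on synthetic mel magnitudes; every value stays above
# the clip_val (1e-5) clamp, so decompression recovers the input up to float error.
mel = torch.rand(1, 80, 100) + 1e-3
log_mel = dynamic_range_compression(mel)
recovered = dynamic_range_decompression(log_mel)
print(torch.allclose(recovered, mel, atol=1e-6))  # True
```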
