Commit cae37f3

Update repo

128 files changed: +27451 −0 lines changed

.gitignore

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
.idea
*.pyc
__pycache__/
*.sh
local_tools/
*.ckpt
*.pth
infer_out/
*.onnx
/data/*
!/data/.gitkeep
/checkpoints/*
!/checkpoints/.gitkeep
/venv/
/artifacts/

.vscode
.ipynb_checkpoints/

README.md

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
# DiffSinger (OpenVPI maintained version)

[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)
[![downloads](https://img.shields.io/github/downloads/openvpi/DiffSinger/total.svg)](https://github.com/openvpi/DiffSinger/releases)
[![Bilibili](https://img.shields.io/badge/Bilibili-Demo-blue)](https://www.bilibili.com/video/BV1be411N7JA/)
[![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/openvpi/DiffSinger/blob/main/LICENSE)

This is a refactored and enhanced version of _DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism_ based on the original [paper](https://arxiv.org/abs/2105.02446) and [implementation](https://github.com/MoonInTheRiver/DiffSinger), which provides:

- Cleaner code structure: redundant and unused files are removed, and the rest are reorganized.
- Better sound quality: the sampling rate of synthesized audio is raised to 44.1 kHz from the original 24 kHz.
- Higher fidelity: improved acoustic models and diffusion sampling acceleration algorithms are integrated.
- More controllability: variance models and parameters are introduced for predicting and controlling pitch, energy, breathiness, etc.
- Production compatibility: functionalities are designed to match the requirements of production deployment and the SVS community.

| Overview | Variance Model | Acoustic Model |
|:-------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------:|
| <img src="docs/resources/arch-overview.jpg" alt="arch-overview" style="zoom: 60%;" /> | <img src="docs/resources/arch-variance.jpg" alt="arch-variance" style="zoom: 50%;" /> | <img src="docs/resources/arch-acoustic.jpg" alt="arch-acoustic" style="zoom: 60%;" /> |

## User Guidance

> Chinese tutorials: [Text](https://openvpi-docs.feishu.cn/wiki/KmBFwoYDEixrS4kHcTAcajPinPe), [Video](https://space.bilibili.com/179281251/channel/collectiondetail?sid=1747910)

- **Installation & basic usage**: See [Getting Started](docs/GettingStarted.md)
- **Dataset creation pipelines & tools**: See [MakeDiffSinger](https://github.com/openvpi/MakeDiffSinger)
- **Best practices & tutorials**: See [Best Practices](docs/BestPractices.md)
- **Editing configurations**: See [Configuration Schemas](docs/ConfigurationSchemas.md)
- **Deployment & production**: [OpenUTAU for DiffSinger](https://github.com/xunmengshe/OpenUtau), [DiffScope (under development)](https://github.com/openvpi/diffscope)
- **Communication groups**: [QQ Group](http://qm.qq.com/cgi-bin/qm/qr?_wv=1027&k=fibG_dxuPW5maUJwe9_ya5-zFcIwaoOR&authKey=ZgLCG5EqQVUGCID1nfKei8tCnlQHAmD9koxebFXv5WfUchhLwWxb52o1pimNai5A&noverify=0&group_code=907879266) (907879266), [Discord server](https://discord.gg/wwbu2JUMjj)

## Progress & Roadmap

- **Progress since we forked into this repository**: See [Releases](https://github.com/openvpi/DiffSinger/releases)
- **Roadmap for future releases**: See [Project Board](https://github.com/orgs/openvpi/projects/1)
- **Thoughts, proposals & ideas**: See [Discussions](https://github.com/openvpi/DiffSinger/discussions)

## Architecture & Algorithms

TBD

## Development Resources

TBD

## References

### Original Paper & Implementation

- Paper: [DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism](https://arxiv.org/abs/2105.02446)
- Implementation: [MoonInTheRiver/DiffSinger](https://github.com/MoonInTheRiver/DiffSinger)

### Generative Models & Algorithms

- Denoising Diffusion Probabilistic Models (DDPM): [paper](https://arxiv.org/abs/2006.11239), [implementation](https://github.com/hojonathanho/diffusion)
- [DDIM](https://arxiv.org/abs/2010.02502) for diffusion sampling acceleration
- [PNDM](https://arxiv.org/abs/2202.09778) for diffusion sampling acceleration
- [DPM-Solver++](https://github.com/LuChengTHU/dpm-solver) for diffusion sampling acceleration
- [UniPC](https://github.com/wl-zhao/UniPC) for diffusion sampling acceleration
- Rectified Flow (RF): [paper](https://arxiv.org/abs/2209.03003), [implementation](https://github.com/gnobitab/RectifiedFlow)

### Dependencies & Submodules

- [HiFi-GAN](https://github.com/jik876/hifi-gan) and [NSF](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf) for waveform reconstruction
- [pc-ddsp](https://github.com/yxlllc/pc-ddsp) for waveform reconstruction
- [RMVPE](https://github.com/Dream-High/RMVPE) and yxlllc's [fork](https://github.com/yxlllc/RMVPE) for pitch extraction
- [Vocal Remover](https://github.com/tsurumeso/vocal-remover) and yxlllc's [fork](https://github.com/yxlllc/vocal-remover) for harmonic-noise separation

## Disclaimer

Any organization or individual is prohibited from using any functionality included in this repository to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this provision, you could be in violation of copyright laws.

## License

This forked DiffSinger repository is licensed under the [Apache 2.0 License](LICENSE).

augmentation/spec_stretch.py

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
from copy import deepcopy

import librosa
import numpy as np
import torch

from basics.base_augmentation import BaseAugmentation, require_same_keys
from basics.base_pe import BasePE
from modules.fastspeech.param_adaptor import VARIANCE_CHECKLIST
from modules.fastspeech.tts_modules import LengthRegulator
from utils.binarizer_utils import get_mel_torch, get_mel2ph_torch
from utils.hparams import hparams
from utils.infer_utils import resample_align_curve


class SpectrogramStretchAugmentation(BaseAugmentation):
    """
    This class contains methods for frequency-domain and time-domain stretching augmentation.
    """
    def __init__(self, data_dirs: list, augmentation_args: dict, pe: BasePE = None):
        super().__init__(data_dirs, augmentation_args)
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.lr = LengthRegulator().to(self.device)
        self.pe = pe

    @require_same_keys
    def process_item(self, item: dict, key_shift=0., speed=1., replace_spk_id=None) -> dict:
        aug_item = deepcopy(item)
        waveform, _ = librosa.load(aug_item['wav_fn'], sr=hparams['audio_sample_rate'], mono=True)
        mel = get_mel_torch(
            waveform, hparams['audio_sample_rate'], num_mel_bins=hparams['audio_num_mel_bins'],
            hop_size=hparams['hop_size'], win_size=hparams['win_size'], fft_size=hparams['fft_size'],
            fmin=hparams['fmin'], fmax=hparams['fmax'],
            keyshift=key_shift, speed=speed, device=self.device
        )

        aug_item['mel'] = mel

        if speed != 1. or hparams['use_speed_embed']:
            aug_item['length'] = mel.shape[0]
            aug_item['speed'] = int(np.round(hparams['hop_size'] * speed)) / hparams['hop_size']  # real speed
            aug_item['seconds'] /= aug_item['speed']
            aug_item['ph_dur'] /= aug_item['speed']
            aug_item['mel2ph'] = get_mel2ph_torch(
                self.lr, torch.from_numpy(aug_item['ph_dur']), aug_item['length'], self.timestep, device=self.device
            ).cpu().numpy()

            f0, _ = self.pe.get_pitch(
                waveform, samplerate=hparams['audio_sample_rate'], length=aug_item['length'],
                hop_size=hparams['hop_size'], f0_min=hparams['f0_min'], f0_max=hparams['f0_max'],
                speed=speed, interp_uv=True
            )
            aug_item['f0'] = f0.astype(np.float32)

            # NOTE: variance curves are directly resampled according to speed,
            # regardless of how frequency-domain features change after the
            # augmentation. For acoustic models, this can make it somewhat
            # (but not much) harder to learn how variance curves affect the
            # mel spectrograms, since they must realize how the augmentation
            # causes the mismatch.
            #
            # This is a simple way to combine augmentation and variances.
            # However, handling variance curves like this will decrease the
            # accuracy of variance controls. In most situations, not being
            # ~100% accurate will not ruin the user experience. For example,
            # it does not matter if the energy does not exactly equal the RMS;
            # it is just fine as long as higher energy brings higher loudness
            # and strength. The neural network itself cannot be 100% accurate
            # anyway.
            #
            # There are other choices for simulating variance curves:
            # 1. Re-extract the features from resampled waveforms;
            # 2. Re-extract the features from waveforms reconstructed from
            #    the transformed mel spectrograms through the vocoder.
            # But there is actually no perfect way to make them all accurate
            # and stable.
            for v_name in VARIANCE_CHECKLIST:
                if v_name in item:
                    aug_item[v_name] = resample_align_curve(
                        aug_item[v_name],
                        original_timestep=self.timestep,
                        target_timestep=self.timestep * aug_item['speed'],
                        align_length=aug_item['length']
                    )

        if key_shift != 0. or hparams['use_key_shift_embed']:
            if replace_spk_id is None:
                aug_item['key_shift'] = key_shift
            else:
                aug_item['spk_id'] = replace_spk_id
            aug_item['f0'] *= 2 ** (key_shift / 12)

        return aug_item
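In `process_item` above, the effective ("real") speed is quantized so that the stretched hop corresponds to a whole number of audio samples: `int(np.round(hop_size * speed)) / hop_size`. A minimal sketch of that rounding, using a hypothetical hop size of 512 samples (the actual value comes from `hparams['hop_size']`):

```python
import numpy as np

def real_speed(requested_speed: float, hop_size: int) -> float:
    """Quantize a stretch factor so hop_size * speed is an integer number of samples."""
    return int(np.round(hop_size * requested_speed)) / hop_size

# With hop_size = 512, a requested speed of 1.1 snaps to the nearest
# representable value: round(512 * 1.1) / 512 = 563 / 512 = 1.099609375.
print(real_speed(1.1, 512))
```

This is why the stored `aug_item['speed']` can differ slightly from the requested `speed`, and why durations are divided by the quantized value rather than the requested one.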

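The variance loop above resamples each curve from the original frame timestep to `timestep * speed`. The sketch below is a simplified stand-in for `utils.infer_utils.resample_align_curve` (the function name `resample_curve` and the example values are hypothetical), showing the idea with plain linear interpolation:

```python
import numpy as np

def resample_curve(curve: np.ndarray, original_timestep: float,
                   target_timestep: float, align_length: int) -> np.ndarray:
    # Sample the original curve at the new frame times, producing exactly
    # align_length frames (a simplified take on resample_align_curve).
    t_old = np.arange(len(curve)) * original_timestep
    t_new = np.arange(align_length) * target_timestep
    return np.interp(t_new, t_old, curve).astype(curve.dtype)

energy = np.linspace(0.0, 1.0, 100, dtype=np.float32)  # fake variance curve
faster = resample_curve(energy, 0.01, 0.0125, 80)      # speed 1.25 -> fewer frames
```

Note that this only stretches the time axis; as the in-code comment explains, it does not account for how the augmented spectrogram itself changes, which is the accepted trade-off.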
basics/base_augmentation.py

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
from utils.hparams import hparams


class BaseAugmentation:
    """
    Base class for data augmentation.
    All methods of this class should be thread-safe.
    1. *process_item*:
        Apply augmentation to one piece of data.
    """
    def __init__(self, data_dirs: list, augmentation_args: dict):
        self.raw_data_dirs = data_dirs
        self.augmentation_args = augmentation_args
        self.timestep = hparams['hop_size'] / hparams['audio_sample_rate']

    def process_item(self, item: dict, **kwargs) -> dict:
        raise NotImplementedError()


def require_same_keys(func):
    def run(*args, **kwargs):
        item: dict = args[1]
        res: dict = func(*args, **kwargs)
        assert set(item.keys()) == set(res.keys()), 'Item keys mismatch after augmentation.\n' \
                                                    f'Before: {sorted(item.keys())}\n' \
                                                    f'After: {sorted(res.keys())}'
        return res
    return run
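The `require_same_keys` decorator enforces a contract: an augmentation must return an item with exactly the same keys it received. A self-contained sketch of that contract (the decorator is repeated here so the example runs standalone, and `ToyAugmentation` is a hypothetical class for illustration only):

```python
def require_same_keys(func):
    # Same decorator as above, copied so this sketch is self-contained.
    def run(*args, **kwargs):
        item: dict = args[1]   # args[0] is `self` on a bound method
        res: dict = func(*args, **kwargs)
        assert set(item.keys()) == set(res.keys()), 'Item keys mismatch after augmentation.'
        return res
    return run


class ToyAugmentation:
    """Hypothetical augmentation used only to demonstrate the key contract."""
    @require_same_keys
    def process_item(self, item: dict) -> dict:
        res = dict(item)
        res['f0'] = [v * 2 for v in res['f0']]  # same keys, new values: passes
        return res

out = ToyAugmentation().process_item({'f0': [220.0], 'mel': None})
```

Had `process_item` added or dropped a key, the decorator would raise an `AssertionError` instead of returning the item, which catches augmentation bugs before they corrupt the binarized dataset.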
