|
| 1 | +from copy import deepcopy |
| 2 | + |
| 3 | +import librosa |
| 4 | +import numpy as np |
| 5 | +import torch |
| 6 | + |
| 7 | +from basics.base_augmentation import BaseAugmentation, require_same_keys |
| 8 | +from basics.base_pe import BasePE |
| 9 | +from modules.fastspeech.param_adaptor import VARIANCE_CHECKLIST |
| 10 | +from modules.fastspeech.tts_modules import LengthRegulator |
| 11 | +from utils.binarizer_utils import get_mel_torch, get_mel2ph_torch |
| 12 | +from utils.hparams import hparams |
| 13 | +from utils.infer_utils import resample_align_curve |
| 14 | + |
| 15 | + |
class SpectrogramStretchAugmentation(BaseAugmentation):
    """
    Frequency-domain (key shift) and time-domain (speed) stretching augmentation.

    The mel spectrogram is re-extracted from the source waveform with the
    requested key shift and/or speed factor; durations, f0 and frame-level
    variance curves of the augmented item are adjusted to match.
    """

    def __init__(self, data_dirs: list, augmentation_args: dict, pe: BasePE = None):
        """
        :param data_dirs: raw data directories (forwarded to BaseAugmentation).
        :param augmentation_args: augmentation configuration (forwarded to BaseAugmentation).
        :param pe: pitch extractor; required only when time-domain (speed)
            augmentation is actually performed.
        """
        super().__init__(data_dirs, augmentation_args)
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        # Expands phoneme durations into the frame-level mel2ph alignment.
        self.lr = LengthRegulator().to(self.device)
        self.pe = pe

    @require_same_keys
    def process_item(self, item: dict, key_shift=0., speed=1., replace_spk_id=None) -> dict:
        """
        Build an augmented deep copy of *item*.

        :param item: a binarized data item; must contain 'wav_fn', 'seconds',
            'ph_dur' and optionally variance curves listed in VARIANCE_CHECKLIST.
            (Exact schema is defined by the binarizer — not fully visible here.)
        :param key_shift: pitch shift in semitones (frequency-domain stretch).
        :param speed: speed factor (time-domain stretch); 1.0 means unchanged.
        :param replace_spk_id: if not None, record this speaker id on the
            augmented item instead of storing the key shift value itself.
        :return: the augmented item.
        :raises ValueError: if speed augmentation is requested but no pitch
            extractor was supplied at construction time.
        """
        aug_item = deepcopy(item)
        waveform, _ = librosa.load(aug_item['wav_fn'], sr=hparams['audio_sample_rate'], mono=True)
        mel = get_mel_torch(
            waveform, hparams['audio_sample_rate'], num_mel_bins=hparams['audio_num_mel_bins'],
            hop_size=hparams['hop_size'], win_size=hparams['win_size'], fft_size=hparams['fft_size'],
            fmin=hparams['fmin'], fmax=hparams['fmax'],
            keyshift=key_shift, speed=speed, device=self.device
        )

        aug_item['mel'] = mel

        if speed != 1. or hparams['use_speed_embed']:
            if self.pe is None:
                # Fail early with an actionable message instead of an opaque
                # AttributeError from `self.pe.get_pitch(...)` below.
                raise ValueError(
                    'A pitch extractor (pe) must be provided to '
                    'SpectrogramStretchAugmentation for speed augmentation.'
                )
            aug_item['length'] = mel.shape[0]
            # Snap the requested speed to one representable with an integer
            # number of samples per hop, so frame timing stays exact.
            aug_item['speed'] = int(np.round(hparams['hop_size'] * speed)) / hparams['hop_size']  # real speed
            aug_item['seconds'] /= aug_item['speed']
            aug_item['ph_dur'] /= aug_item['speed']
            aug_item['mel2ph'] = get_mel2ph_torch(
                self.lr, torch.from_numpy(aug_item['ph_dur']), aug_item['length'], self.timestep, device=self.device
            ).cpu().numpy()

            f0, _ = self.pe.get_pitch(
                waveform, samplerate=hparams['audio_sample_rate'], length=aug_item['length'],
                hop_size=hparams['hop_size'], f0_min=hparams['f0_min'], f0_max=hparams['f0_max'],
                speed=speed, interp_uv=True
            )
            aug_item['f0'] = f0.astype(np.float32)

            # NOTE: variance curves are directly resampled according to speed,
            # despite how frequency-domain features change after the augmentation.
            # For acoustic models, this can bring more (but not much) difficulty
            # to learn how variance curves affect the mel spectrograms, since
            # they must realize how the augmentation causes the mismatch.
            #
            # This is a simple way to combine augmentation and variances. However,
            # dealing variance curves like this will decrease the accuracy of
            # variance controls. In most situations, not being ~100% accurate
            # will not ruin the user experience. For example, it does not matter
            # if the energy does not exactly equal the RMS; it is just fine
            # as long as higher energy can bring higher loudness and strength.
            # The neural networks itself cannot be 100% accurate, though.
            #
            # There are yet other choices to simulate variance curves:
            # 1. Re-extract the features from resampled waveforms;
            # 2. Re-extract the features from re-constructed waveforms using
            #    the transformed mel spectrograms through the vocoder.
            # But there are actually no perfect ways to make them all accurate
            # and stable.
            for v_name in VARIANCE_CHECKLIST:
                if v_name in item:
                    aug_item[v_name] = resample_align_curve(
                        aug_item[v_name],
                        original_timestep=self.timestep,
                        target_timestep=self.timestep * aug_item['speed'],
                        align_length=aug_item['length']
                    )

        if key_shift != 0. or hparams['use_key_shift_embed']:
            if replace_spk_id is None:
                aug_item['key_shift'] = key_shift
            else:
                # The shift is baked into a dedicated (virtual) speaker id
                # rather than stored as an explicit key_shift value.
                aug_item['spk_id'] = replace_spk_id
            # Shift f0 by the same amount as the spectrogram: one semitone
            # is a factor of 2^(1/12) in frequency.
            aug_item['f0'] *= 2 ** (key_shift / 12)

        return aug_item
0 commit comments