# Releases: Kouon-Project/DSRX

## ver0.0.2 – Multi-GPU preprocessing, Flash Attention, RefineGAN & batch inference
This version builds upon the initial LoRA-enabled release (ver0.0.1 for DiffSinger acoustic fine-tuning). It focuses on faster preprocessing and inference, optional Flash Attention support, and better training instrumentation.
## Highlights
- **Parallel preprocessing across multiple GPUs**
  - The binarization pipeline can now detect multiple CUDA devices and split the workload across them.
  - A new configuration parameter `workers_per_gpu` (in `base.yaml`) controls how many workers are spawned per GPU.
  - When `num_workers > 0` and more than one GPU is available, DSRX automatically assigns device IDs and runs preprocessing in parallel, significantly reducing data preparation time on multi-GPU systems.
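The device-assignment logic described above can be sketched as follows. This is a minimal illustration of splitting a binarization workload across GPUs; the function names and layout are assumptions for illustration, not DSRX's actual API.

```python
# Hypothetical sketch of multi-GPU preprocessing planning: one CUDA device id
# per worker, and dataset items distributed round-robin across workers.

def plan_workers(num_gpus: int, workers_per_gpu: int) -> list:
    """Return one CUDA device id per worker, e.g. 2 GPUs x 2 workers -> [0, 0, 1, 1]."""
    return [gpu for gpu in range(num_gpus) for _ in range(workers_per_gpu)]

def split_items(items: list, num_workers: int) -> list:
    """Round-robin the dataset items across workers so loads stay balanced."""
    chunks = [[] for _ in range(num_workers)]
    for i, item in enumerate(items):
        chunks[i % num_workers].append(item)
    return chunks

# Example: 2 GPUs, workers_per_gpu = 2, five items to binarize.
devices = plan_workers(num_gpus=2, workers_per_gpu=2)        # [0, 0, 1, 1]
chunks = split_items(["a", "b", "c", "d", "e"], len(devices))
# Each (device_id, chunk) pair would then be handed to a preprocessing subprocess.
```

In a real pipeline each pair would be dispatched via `multiprocessing`, with every worker setting its assigned CUDA device before loading data.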
- **Experimental batch inference front-end**
  - An experimental inference front-end has been added to speed up inference.
  - In local tests on an RTX 3060, synthesizing a ~4-minute song (~75 acoustic phrases):
    - OpenUTAU + dsinfer (DirectML + ONNX Runtime, default OpenUTAU infer settings) took about 2–3 minutes.
    - The original Python inference script took about 20–30 seconds.
    - The new batch inference backend with `batch_size = 4` finished in about 8–15 seconds.
  - Note: increasing the batch size can slightly worsen the accent of cross-lingual synthesis in some cases, so there is a speed–quality trade-off when tuning `batch_size`.
  - The front-end includes a timeout/VRAM-based model unloader that frees GPU memory when idle and reloads models on demand.
  - As soon as new data arrives, the front-end starts inference and writes logs for easier monitoring.
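The timeout-based unload/reload behaviour can be sketched like this. The class and method names are illustrative assumptions, not DSRX's actual front-end API, and a real implementation would also consult VRAM usage before unloading.

```python
# Hypothetical sketch of an idle-timeout model slot: lazily loads a model,
# and drops the reference after `timeout` seconds of inactivity so GPU
# memory can be reclaimed; the next request reloads it on demand.
import time

class ModelSlot:
    def __init__(self, load_fn, timeout=300.0):
        self.load_fn = load_fn      # callable that loads the model
        self.timeout = timeout      # idle seconds before unloading
        self.model = None
        self.last_used = 0.0

    def get(self):
        if self.model is None:
            self.model = self.load_fn()          # reload on demand
        self.last_used = time.monotonic()
        return self.model

    def maybe_unload(self):
        """Called periodically; frees the model if it has been idle too long."""
        if self.model is not None and time.monotonic() - self.last_used > self.timeout:
            self.model = None                    # drop reference so memory can be freed

slot = ModelSlot(load_fn=lambda: "acoustic-model", timeout=0.05)
model = slot.get()   # first use triggers the load
```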
- **RefineGAN vocoder integration**
  - RefineGAN is now available as a vocoder option. Enable it via:

    ```yaml
    vocoder: RefineGAN
    ```

  - The generator checks whether the product of the downsample/upsample rates matches the configured `hop_length`. If not, it automatically adjusts the last rate to match; if the hop length is not divisible, it raises a descriptive error instead of silently misconfiguring the model.
  - This fixes previous issues where the `hop_length=512` setting was effectively ignored.
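The rate-consistency check described above can be sketched as follows, assuming the generator's total stride is the product of its upsample rates. The function name is an illustrative assumption, not RefineGAN/DSRX's actual code.

```python
# Sketch of a hop-length consistency check: adjust the last upsample rate so
# that prod(rates) == hop_length, or fail loudly when that is impossible.
from math import prod

def fit_rates_to_hop(rates, hop_length):
    total = prod(rates)
    if total == hop_length:
        return list(rates)
    partial = prod(rates[:-1])
    if hop_length % partial != 0:
        # Descriptive error instead of silently mis-configuring the model.
        raise ValueError(
            f"hop_length={hop_length} is not divisible by prod(rates[:-1])={partial}; "
            "cannot auto-adjust the last upsample rate"
        )
    return list(rates[:-1]) + [hop_length // partial]

print(fit_rates_to_hop([8, 8, 4], hop_length=512))  # -> [8, 8, 8]
```

With `hop_length=512` and rates `[8, 8, 4]` (product 256), the last rate is bumped to 8 so the generator's stride actually matches the configured hop length.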
- **Flash Attention for FastSpeech 2**
  - Optional Flash Attention support has been added to the FastSpeech 2 encoder.
  - A new configuration option:

    ```yaml
    enc_attention_type: normal  # or 'flash'
    ```

    lets you switch between the standard multi-head attention (`normal`) and Flash Attention (`flash`).
  - In our tests, enabling Flash Attention during training reduced GPU memory usage by roughly 5% and brought a small but noticeable speedup. The gain is modest on its own, but it can be more useful when combined with the batch inference front-end in high-throughput scenarios.
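One way such a switch is typically wired up is a small dispatcher that validates the config value and falls back when flash kernels are unavailable. This is a hedged sketch; the function name and fallback policy are assumptions, not DSRX's actual implementation.

```python
# Hypothetical dispatch for the enc_attention_type option. In practice the
# flash path would call an SDPA/flash kernel and the normal path the standard
# multi-head attention; here we only model the selection logic.

def resolve_attention(enc_attention_type: str, flash_available: bool) -> str:
    """Validate the config value; fall back to 'normal' when flash is unavailable."""
    if enc_attention_type not in ("normal", "flash"):
        raise ValueError(f"Unknown enc_attention_type: {enc_attention_type!r}")
    if enc_attention_type == "flash" and not flash_available:
        print("Flash attention requested but unavailable; falling back to 'normal'.")
        return "normal"
    return enc_attention_type
```

Falling back (rather than erroring) keeps configs portable between machines with and without flash-capable GPUs.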
- **Multiple logging backends**
  - Training can now log to external services such as WandB or SwanLab.
  - A helper function builds the logger instance based on the `logger` field in `base.yaml`.
  - If no external logger is configured, TensorBoard remains the default backend.
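A logger factory keyed on that config field might look like the sketch below. The backend names come from the release notes; the function name and dict-based config access are illustrative assumptions (a real helper would construct the actual logger objects).

```python
# Hypothetical sketch of a build_logger helper: pick a logging backend from
# the `logger` field of the parsed base.yaml, defaulting to TensorBoard.

def build_logger(config: dict) -> str:
    backend = config.get("logger", "tensorboard").lower()
    if backend not in ("tensorboard", "wandb", "swanlab"):
        raise ValueError(f"Unsupported logger backend: {backend!r}")
    # A real implementation would instantiate e.g. a TensorBoardLogger,
    # WandbLogger, or SwanLab run here; we return the chosen name only.
    return backend
```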
- **Configuration and documentation updates**
  - Configuration templates for acoustic, variance, pitch, and duration models have been updated.
  - New options include:
    - `enc_attention_type` – choose between standard and Flash attention.
    - `workers_per_gpu` – control parallel preprocessing on multi-GPU setups.
    - `lang_phoneme_separator` (default `/`) – used to separate language and phoneme IDs in multilingual dictionaries.
  - `GettingStarted.md` and `base_config` paths were revised to match the new structure.
  - `README.md` has been updated to mention Flash Attention, RefineGAN, and the new multi-GPU / logging features.
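To illustrate how `lang_phoneme_separator` could be applied, here is a minimal sketch of splitting a language-tagged phoneme token. The token format (`ja/a`) and the function name are assumptions for illustration, not DSRX's actual dictionary code.

```python
# Hypothetical use of lang_phoneme_separator (default '/'): split a
# multilingual dictionary token into (language, phoneme).

def split_lang_phoneme(token: str, separator: str = "/"):
    """Split 'ja/a' into ('ja', 'a'); tokens without a separator get no language tag."""
    if separator in token:
        lang, phoneme = token.split(separator, 1)
        return lang, phoneme
    return None, token

print(split_lang_phoneme("ja/a"))  # -> ('ja', 'a')
print(split_lang_phoneme("SP"))    # -> (None, 'SP')
```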
- **Repository cleanup and bug fixes**
  - The repository structure has been reorganized and unnecessary assets removed, making the project layout cleaner.
  - A bug in `phoneme_utils` was fixed: redundant docstrings were removed and the language separator logic was simplified.
  - Community contributions updated the acoustic/variance configuration templates and improved multi-dictionary settings.
  - Another RefineGAN bug related to downsample/upsample rates vs `hop_length` was fixed to avoid silent misalignment.
## Notes and recommendations
- The experimental LoRA fine-tuning mechanism from ver0.0.1 is still available. It is not recommended to combine LoRA fine-tuning with “forgetting / frozen layer” fine-tuning at the same time.
- To benefit from Flash Attention and multi-GPU preprocessing, you should use PyTorch 2.0+ and GPUs that support scaled-dot-product (Flash) attention kernels.
- For best throughput, we recommend combining:
- the batch inference front-end and
- the RefineGAN vocoder.
If you run into issues or have feature requests, feel free to open an issue or send a pull request.
Kouon Project Team
January 28, 2026
## ver0.0.1: Added usable LoRA fine-tuning (experimental, tested only on Acoustic models)
This is the first release of DSRX. Based on the openvpi fork (November 2024 version), it introduces an experimental LoRA fine-tuning mechanism.
LoRA fine-tuning places a low-rank decomposition matrix alongside the main model. By keeping the pre-trained model parameters frozen while introducing trainable low-rank matrices, it approximates the effect of full-parameter fine-tuning while significantly reducing training resource consumption and duration. In our tests, using only a 1.5-minute Japanese voice dataset (from "Blue Archive" – Yutori Natsu), with a Chinese–Japanese bilingual model (160k steps, 512×20 network size, WaveNet backbone, batch size 64) as the base model and LoRA parameters of rank 8, alpha 16, injected into the `fs2` and `diffusion` modules with bias training enabled, we achieved excellent singing synthesis and cross-lingual performance after only 500 training steps.
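The core LoRA computation can be written down in a few lines, assuming the standard formulation y = Wx + (alpha / rank) · B(Ax) with W frozen and A, B trainable. This is a pure-Python numerical sketch for intuition; the names are illustrative, not DSRX's actual layer code.

```python
# Minimal sketch of a LoRA-adapted linear layer: frozen base weight W plus a
# trainable low-rank update B @ A, scaled by alpha / rank.

def matvec(M, x):
    """Matrix-vector product on plain lists of rows."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_linear(W, A, B, x, alpha, rank):
    base = matvec(W, x)                   # frozen pre-trained path
    low_rank = matvec(B, matvec(A, x))    # A: rank x in_dim, B: out_dim x rank
    scale = alpha / rank                  # LoRA amplification ratio
    return [b + scale * l for b, l in zip(base, low_rank)]

# With a zero-initialised adapter (B is usually zero at the start of training),
# the layer exactly reproduces the frozen base model.
W = [[1.0, 0.0], [0.0, 1.0]]              # 2x2 identity "pre-trained" weight
y = lora_linear(W, A=[[0.0, 0.0]], B=[[0.0], [0.0]], x=[3.0, 4.0], alpha=16, rank=8)
print(y)  # -> [3.0, 4.0]
```

The same scale factor is what `lora.merge_before_export` would fold back into W (W' = W + (alpha / rank) · B A) so the exported ONNX model needs no adapter at inference time.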
To use LoRA fine-tuning, please utilize configs/acoustic-lora.yaml for preprocessing and training. Acoustic model inference and ONNX export have been tested and can be properly integrated with OpenUtau.
Note: We have not yet tested LoRA fine-tuning combined with forgetting/frozen-layer fine-tuning, and we currently advise against enabling both concurrently.
If you encounter bugs during use, please submit a PR or issue to this repo. We will evaluate and improve them as soon as possible.
Kouon Project Team
September 13, 2025, late at night
Appendix: brief explanation of the new parameters in `configs/acoustic-lora.yaml` (based on the openvpi fork, November 2024 version):
- `binarization_args.allow_missing_phonemes`: accepts a boolean value, defaulting to `True` in LoRA preprocessing and training. When enabled, preprocessing skips phonemes missing from the dictionary, prints missing-phoneme warnings during preprocessing, and creates `missing_phonemes.json` in the preprocessing folder. This option is strongly recommended only for forgetting/frozen-layer or LoRA fine-tuning.
- `lora`:
  a. `lora.enabled`: accepts a boolean value, defaulting to `True`; indicates that LoRA fine-tuning is enabled.
  b. `lora.rank`: accepts an integer value as the low-rank parameter, controlling adapter capacity and GPU memory consumption, defaulting to 8. Higher rank values increase capacity but consume more GPU memory.
  c. `lora.alpha`: accepts an integer value as the scaling coefficient, equivalent to setting the LoRA amplification ratio, defaulting to 16.
  d. `lora.target_modules`: accepts a `list[str]` of regex patterns matching the modules where LoRA should be injected (applied to linear layers), defaulting to `'^fs2\.'` and `'^diffusion\.'`.
  e. `lora.base_ckpt`: accepts the file path to the base model for LoRA.
  f. `lora.train_bias`: accepts a boolean value, defaulting to `True`; trains the biases of all linear layers alongside LoRA. Recommended for small datasets (<5 minutes).
  g. `lora.merge_before_export`: accepts a boolean value, defaulting to `True`; indicates whether to merge LoRA components into the base model before ONNX export.
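To make the `lora.target_modules` behaviour concrete, the sketch below filters module names with the default regex patterns. The helper name and module names are hypothetical examples, not DSRX's actual module tree.

```python
# Illustrative selection of LoRA injection targets: a module receives an
# adapter if its dotted name matches any of the target_modules patterns.
import re

def match_targets(module_names, patterns=(r"^fs2\.", r"^diffusion\.")):
    """Return the module names matched by any of the target regex patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [name for name in module_names if any(c.match(name) for c in compiled)]

modules = ["fs2.encoder.attn.proj", "diffusion.denoiser.lin1", "vocoder.head"]
print(match_targets(modules))  # -> ['fs2.encoder.attn.proj', 'diffusion.denoiser.lin1']
```

Note that the patterns are anchored with `^` and the dots are escaped, so `fs2.encoder` matches but a hypothetical `my_fs2x` module would not.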