


@colstone colstone released this 28 Jan 12:17
81496ce

ver0.0.2 – Multi-GPU preprocessing, Flash Attention, RefineGAN & batch inference

This version builds upon the initial LoRA-enabled release (ver0.0.1 for DiffSinger acoustic fine-tuning). It focuses on faster preprocessing and inference, optional Flash Attention support, and better training instrumentation.


Highlights

  1. Parallel preprocessing across multiple GPUs

    • The binarization pipeline can now detect multiple CUDA devices and split the workload across them.
    • A new configuration parameter workers_per_gpu (in base.yaml) controls how many workers are spawned per GPU.
    • When num_workers > 0 and more than one GPU is available, DSRX automatically assigns device IDs and runs preprocessing in parallel, significantly reducing data preparation time on multi-GPU systems.
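
    The device-assignment step can be sketched roughly as follows. This is an illustrative, pure-Python sketch under assumed names (`assign_preprocess_workers` and the bucket layout are hypothetical, not DSRX's actual API): items are dealt round-robin across `(gpu_id, worker_id)` slots so each worker process can pin itself to one CUDA device.

    ```python
    # Hypothetical sketch of splitting binarization work across GPUs.
    # Names are illustrative only; the real pipeline lives in DSRX's
    # preprocessing code.

    def assign_preprocess_workers(num_gpus: int, workers_per_gpu: int, items: list) -> dict:
        """Round-robin data items across (gpu_id, worker_id) slots.

        Returns {(gpu_id, worker_id): [items...]} so each worker process
        can be pinned to a single CUDA device via its gpu_id.
        """
        slots = [(g, w) for g in range(num_gpus) for w in range(workers_per_gpu)]
        buckets = {slot: [] for slot in slots}
        for i, item in enumerate(items):
            buckets[slots[i % len(slots)]].append(item)
        return buckets

    # Example: 2 GPUs with workers_per_gpu = 2 gives 4 worker slots.
    buckets = assign_preprocess_workers(2, 2, list(range(10)))
    ```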
  2. Experimental batch inference frontend

    • An experimental inference front-end has been added to accelerate synthesis.
    • In local tests on an RTX 3060, synthesizing a ~4-minute song (~75 acoustic phrases):
      • OpenUTAU + dsinfer (DirectML + ONNX Runtime, default OpenUTAU infer settings) took about 2–3 minutes.
      • The original Python inference script took about 20–30 seconds.
      • The new batch inference backend with batch_size = 4 finished in about 8–15 seconds.
    • Note: increasing the batch size can slightly worsen the accent of cross-lingual synthesis in some cases, so there is a speed–quality trade-off when tuning batch_size.
    • The front-end includes a timeout/VRAM-based model unloader that frees GPU memory when idle and reloads models on demand.
    • As soon as new data arrives, the front-end starts inference and writes logs for easier monitoring.
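
    The idle unloader described above can be sketched like this. The class and method names are hypothetical; the real front-end also reacts to VRAM pressure and would empty the CUDA cache on unload.

    ```python
    import time

    # Illustrative sketch of a timeout-based model unloader. All names
    # are hypothetical, not the front-end's actual API.

    class IdleModelManager:
        def __init__(self, load_fn, idle_timeout_s: float = 300.0):
            self._load_fn = load_fn      # callable that loads/builds the model
            self._timeout = idle_timeout_s
            self._model = None
            self._last_used = 0.0

        def get(self):
            """Load the model on demand and refresh the idle timer."""
            if self._model is None:
                self._model = self._load_fn()
            self._last_used = time.monotonic()
            return self._model

        def maybe_unload(self) -> bool:
            """Drop the model if it has been idle longer than the timeout."""
            if self._model is not None and time.monotonic() - self._last_used > self._timeout:
                self._model = None       # real code would also free the CUDA cache
                return True
            return False

    mgr = IdleModelManager(load_fn=lambda: "model", idle_timeout_s=0.0)
    mgr.get()
    time.sleep(0.01)                     # simulate an idle gap
    unloaded = mgr.maybe_unload()
    ```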
  3. RefineGAN vocoder integration

    • RefineGAN is now available as a vocoder option. Enable it via:
      vocoder: RefineGAN
    • The generator checks whether the product of downsample/upsample rates matches the configured hop_length. If not, it automatically adjusts the last rate to match; if it is not divisible, it raises a descriptive error instead of silently mis-configuring the model.
    • This fixes previous issues where the hop_length=512 setting was effectively ignored.
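
    The described check amounts to the following arithmetic (a sketch with an illustrative function name, not RefineGAN's actual code): if the rate product differs from hop_length, the last rate is recomputed when possible, otherwise a descriptive error is raised.

    ```python
    import math

    # Sketch of the upsample-rate vs hop_length consistency check.
    # Function name is illustrative.

    def fit_rates_to_hop_length(rates: list, hop_length: int) -> list:
        """Adjust the last rate so prod(rates) == hop_length, or raise."""
        if math.prod(rates) == hop_length:
            return rates
        head = math.prod(rates[:-1])
        if hop_length % head != 0:
            raise ValueError(
                f"hop_length={hop_length} is not divisible by the product "
                f"of the leading rates ({head}); cannot auto-adjust {rates}"
            )
        return rates[:-1] + [hop_length // head]

    # hop_length=512 with rates [8, 8, 4] (product 256): last rate becomes 8.
    adjusted = fit_rates_to_hop_length([8, 8, 4], 512)
    ```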
  4. Flash Attention for FastSpeech 2

    • Optional Flash Attention support has been added to the FastSpeech 2 encoder.
    • A new configuration option lets you switch between the standard multi-head attention (normal) and Flash Attention (flash):
      enc_attention_type: normal  # or 'flash'
    • In our tests, enabling Flash Attention during training reduced GPU memory usage by roughly 5% and brought a small, but noticeable, speedup. The gain is modest on its own, but it can be more useful when combined with the batch inference front-end for high-throughput scenarios.
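
    For context, both settings compute the same scaled dot-product attention; Flash Attention produces the same result while avoiding materializing the full attention matrix in GPU memory, which is where the savings come from. A minimal pure-Python reference of that math (illustrative only, not the project's implementation):

    ```python
    import math

    # Reference of the scaled dot-product attention both settings compute.
    # q, k, v are [seq][d] lists; output is [seq][d].

    def sdpa(q, k, v):
        d = len(q[0])
        out = []
        for qi in q:
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
            z = sum(exps)
            weights = [e / z for e in exps]
            out.append([sum(w * vj[t] for w, vj in zip(weights, v)) for t in range(d)])
        return out

    out = sdpa([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
    ```

    The `flash` path fuses these steps into one kernel so the `seq × seq` score matrix never has to be stored, which is why the memory saving grows with sequence length.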
  5. Multiple logging backends

    • Training can now log to external services such as WandB or SwanLab.
    • A helper function builds the logger instance based on the logger field in base.yaml.
    • If no external logger is configured, TensorBoard remains the default backend.
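
    The backend-selection logic can be sketched as a small dispatch on the `logger` field, with TensorBoard as the fallback. The helper name and the validation are illustrative; the actual builder returns the corresponding logger instances rather than strings.

    ```python
    # Illustrative sketch of the logger-selection helper. Backend names
    # mirror the release notes; the helper name is hypothetical.

    SUPPORTED_BACKENDS = {"wandb", "swanlab", "tensorboard"}

    def resolve_logger_backend(config: dict) -> str:
        """Pick the logging backend from base.yaml's `logger` field,
        falling back to TensorBoard when none is configured."""
        backend = str(config.get("logger") or "tensorboard").lower()
        if backend not in SUPPORTED_BACKENDS:
            raise ValueError(f"Unknown logger backend: {backend!r}")
        return backend

    default_backend = resolve_logger_backend({})            # nothing configured
    wandb_backend = resolve_logger_backend({"logger": "WandB"})
    ```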
  6. Configuration and documentation updates

    • Configuration templates for acoustic, variance, pitch, and duration models have been updated.
    • New options include:
      • enc_attention_type – choose between standard and Flash attention.
      • workers_per_gpu – control parallel preprocessing on multi-GPU setups.
      • lang_phoneme_separator (default /) – used to separate language and phoneme IDs in multilingual dictionaries.
    • GettingStarted.md and base_config paths were revised to match the new structure.
    • README.md has been updated to mention Flash Attention, RefineGAN, and the new multi-GPU / logging features.
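
    To illustrate lang_phoneme_separator, a language-qualified phoneme ID splits on the separator into a language tag and a phoneme, while unqualified tokens pass through. This is a hedged sketch; the helper name is hypothetical and the real logic lives in phoneme_utils.

    ```python
    # Illustrative split on lang_phoneme_separator (default '/').
    # The function name is hypothetical.

    def split_lang_phoneme(token: str, separator: str = "/"):
        """Return (language, phoneme); language is None for unqualified tokens."""
        if separator in token:
            lang, phoneme = token.split(separator, 1)
            return lang, phoneme
        return None, token

    pair = split_lang_phoneme("zh/a")    # language-qualified phoneme
    bare = split_lang_phoneme("SP")      # unqualified token (e.g. silence)
    ```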
  7. Repository cleanup and bug fixes

    • The repository structure has been reorganized and unnecessary assets removed, making the project layout cleaner.
    • phoneme_utils was fixed and cleaned up: redundant docstrings were removed and the language-separator logic was simplified.
    • Community contributions updated the acoustic/variance configuration templates and improved multi-dictionary settings.
    • Another RefineGAN bug related to downsample/upsample rates vs hop_length was fixed to avoid silent misalignment.

Notes and recommendations

  • The experimental LoRA fine-tuning mechanism from ver0.0.1 is still available. It is not recommended to combine LoRA fine-tuning with “forgetting / frozen layer” fine-tuning at the same time.
  • To benefit from Flash Attention and multi-GPU preprocessing, use PyTorch 2.0+ and a GPU that supports scaled dot-product (Flash) attention kernels.
  • For best throughput, we recommend combining:
    • the batch inference front-end and
    • the RefineGAN vocoder.

If you run into issues or have feature requests, feel free to open an issue or send a pull request.

Kouon Project Team
January 28, 2026