PyTorch Implementation of DistillW2N: A Lightweight One-Shot Whisper to Normal Voice Conversion Model Using Distillation of Self-Supervised Features

- Create a Python environment, e.g. with conda:
  conda create --name distillw2n python=3.10.12 --yes
- Activate the new environment:
  conda activate distillw2n
- Install torch and torchaudio:
  pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
- Install the required system packages:
  sudo apt-get update && sudo apt-get install -y libsndfile1 ffmpeg
- Install the Python requirements:
  pip install -r requirements.txt
- Download the pretrained models using the links given in the txt file.
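After completing the steps above, you can sanity-check the environment before running anything else. The snippet below is a minimal sketch (the helper name `missing_packages` and the exact package list are assumptions, not part of this repo):

```python
import importlib.util

def missing_packages(names):
    """Return the packages from `names` that are not importable here."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Check the core dependencies installed in the steps above.
# `soundfile` is assumed here as the Python binding for libsndfile1.
print(missing_packages(["torch", "torchaudio", "soundfile"]))
```

An empty list means the installation succeeded; otherwise, rerun the corresponding install step.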
- For QuickVC and WESPER, run:
  python compare_infer.py
- For our models, run:
  python infer.py
- To train, run:
  python u2ss2u.py
- You only need to download the datasets under YOURPATH.
- Dataset Download
- For the libritts, ljspeech, and timit datasets, datahelper will automatically download them if they are not found at YOURPATH.
- For the wtimit dataset, you will need to request it via email. Follow the appropriate procedures to obtain access and download the dataset to YOURPATH.
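Before training, it can help to verify that the expected corpus folders exist under YOURPATH. This is a sketch only; the folder names and the helper `missing_datasets` are assumptions, not part of datahelper:

```python
from pathlib import Path

# Assumed per-corpus folder names under YOURPATH.
EXPECTED = ["libritts", "ljspeech", "timit", "wtimit"]

def missing_datasets(root):
    """Return the expected corpus folders not present under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).is_dir()]

# Anything reported here would either be auto-downloaded by datahelper
# (libritts, ljspeech, timit) or must be requested by email (wtimit).
print(missing_datasets("YOURPATH"))
```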
- Dataset Preparation (Optional)
- datapreper offers options for ppw (Pseudo-whisper) and vad (Voice Activity Detection) versions. You can choose to apply these processing steps according to your project's requirements.
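datapreper's exact VAD method is not specified here; the following is a minimal energy-based sketch of the idea (the function `simple_vad` and its thresholds are hypothetical, not the repo's implementation):

```python
def simple_vad(samples, frame_len=400, hop=160, threshold=1e-3):
    """Flag each frame as speech when its mean energy exceeds a threshold.

    Frame sizes assume 16 kHz audio (25 ms frames, 10 ms hop).
    """
    flags = []
    for start in range(0, max(len(samples) - frame_len, 0) + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        flags.append(energy > threshold)
    return flags

print(simple_vad([0.0] * 800))  # silence -> [False, False, False]
print(simple_vad([0.5] * 800))  # loud signal -> [True, True, True]
```

A real pipeline would use a trained VAD; this illustrates only what the vad option conceptually does to the data.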
This implementation builds on:
- SoundStream for the training pipeline.

TODO:
- Add Seed-VC inference samples for comparison.
- Train the SoundStream decoder on a larger dataset of high-quality audio (training in progress with limited resources).