Voice conversion is performed using just linear regression on self-supervised speech features. The work is described in:
- H. Kamper, B. van Niekerk, J. Zaïdi, and M-A. Carbonneau, "LinearVC: Linear transformations of self-supervised features through the lens of voice conversion," in Interspeech, 2025.
Samples: https://www.kamperh.com/linearvc/
Install the dependencies listed in `environment.yml`, e.g. by running:

```bash
conda env create -f environment.yml
```

Then check that everything installed correctly; a minimal sanity check is sketched below. The steps below are also illustrated in the demo notebook.
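As a quick check that the main dependencies are importable (a minimal sketch; nothing repo-specific is assumed, only the PyTorch stack):

```python
import torch
import torchaudio

# Quick sanity check of the installed environment
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```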
```python
import torch
import torchaudio

import linearvc

device = "cuda"  # or "cpu"

# Load all the required models
wavlm = torch.hub.load(
    "bshall/knn-vc",
    "wavlm_large",
    trust_repo=True,
    progress=True,
    device=device,
)
hifigan, _ = torch.hub.load(
    "bshall/knn-vc",
    "hifigan_wavlm",
    trust_repo=True,
    prematched=True,
    progress=True,
    device=device,
)
linearvc_model = linearvc.LinearVC(wavlm, hifigan, device)

# Lists of source and target audio files
source_wavs = [
    "<filename of audio from source speaker 1>.wav",
    "<filename of audio from source speaker 2>.wav",
    ...,
]
target_wavs = [
    "<filename of audio from target speaker 1>.wav",
    "<filename of audio from target speaker 2>.wav",
    ...,
]

# Extract features for the source input utterance
input_features = linearvc_model.get_features("<filename>.wav")

# Estimate the voice conversion projection matrix
W = linearvc_model.get_projmat(
    source_wavs,
    target_wavs,
    parallel=True,  # enable if source and target utterances are parallel
    vad=False,
)

# Project the input features and vocode the result
output_wav = linearvc_model.project_and_vocode(input_features, W)
torchaudio.save("output.wav", output_wav[None], 16000)
```
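Conceptually, estimating the projection matrix amounts to linear regression from source to target features. As a rough sketch of the idea (not the repo's exact implementation; `X` and `Y` are assumed to be frame-aligned feature matrices):

```python
import numpy as np

def estimate_projection(X, Y):
    """Least-squares estimate of W such that X @ W approximates Y.

    A sketch of the linear regression behind LinearVC, not the repo's
    exact code. X and Y are (num_frames, feature_dim) arrays of
    frame-aligned source and target features.
    """
    W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    return W
```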
If `parallel=True`, utterances with the same filename are paired up. If `parallel=False`, the utterances don't have to be aligned, but then more data is needed (around 3 minutes per speaker is good; more than that doesn't help much).
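To illustrate the pairing in the `parallel=True` case, here is a sketch of how utterances could be matched on filename (the function name is illustrative and not part of the repo's API):

```python
from pathlib import Path

def pair_by_filename(source_dir, target_dir, extension=".wav"):
    # Pair up source/target utterances that share a filename (a sketch
    # of the parallel pairing described above, not the repo's code)
    pairs = []
    for source_wav in sorted(Path(source_dir).glob(f"*{extension}")):
        target_wav = Path(target_dir) / source_wav.name
        if target_wav.exists():
            pairs.append((str(source_wav), str(target_wav)))
    return pairs
```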
Perform LinearVC by finding all the source and target audio files in the given directories:

```bash
./linearvc.py \
    --extension .flac \
    ~/LibriSpeech/dev-clean/1272/ \
    ~/LibriSpeech/dev-clean/1462/ \
    ~/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac \
    output.wav
```
When parallel utterances are available, much less data is needed. Running the script with `--parallel` as below scans two directories and pairs up all utterances with the same filename. E.g., below it finds `002.wav`, `003.wav`, etc. in the `p225/` source directory and then pairs these up with the same filenames in the `p226/` directory.
```bash
./linearvc.py \
    --parallel \
    data/vctk_demo/p225/ \
    data/vctk_demo/p226/ \
    data/vctk_demo/p225/067.wav \
    output2.wav
```
Full script details:

```
usage: linearvc.py [-h] [--parallel] [--lasso LASSO] [--vad]
                   [--extension {.flac,.wav}]
                   source_wav_dir target_wav_dir input_wav output_wav

Perform voice conversion with linear regression.

positional arguments:
  source_wav_dir        directory with source speaker speech
  target_wav_dir        directory with target speaker speech
  input_wav             input speech filename
  output_wav            output speech filename

options:
  -h, --help            show this help message and exit
  --parallel            whether source and target utterances are parallel, in
                        which case the filenames in the two directories should
                        match
  --lasso LASSO         lasso is applied with this alpha value
  --vad                 voice activity detection is applied to the start of
                        the utterance
  --extension {.flac,.wav}
                        source and target audio file extension (default:
                        '.wav')
```
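The `--lasso` option replaces the plain linear regression fit with lasso regression, which drives many entries of the projection matrix to zero. A sketch of the idea using scikit-learn (an illustration under assumptions: the repo may implement lasso differently, and the alpha value here is arbitrary):

```python
from sklearn.linear_model import Lasso

def estimate_projection_lasso(X, Y, alpha=0.01):
    # Lasso yields a sparse projection matrix: many entries of W become
    # exactly zero. X, Y: (num_frames, feature_dim) arrays.
    model = Lasso(alpha=alpha, fit_intercept=False)
    model.fit(X, Y)
    return model.coef_.T  # transpose so that X @ W approximates Y
```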
These experiments are described in Kamper et al. (2025).
Extract WavLM features:
```bash
./extract_wavlm_libri.py \
    --exclude data/eval_inputs_dev-clean.txt \
    ~/endgame/datasets/librispeech/LibriSpeech/dev-clean/ \
    ~/scratch/dev-clean/wavlm_exclude/

./extract_wavlm_libri.py \
    --exclude data/eval_inputs_test-clean.txt \
    ~/endgame/datasets/librispeech/LibriSpeech/test-clean/ \
    ~/scratch/test-clean/wavlm_exclude/
```
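The `--exclude` flag skips the listed evaluation inputs during extraction. A sketch of what the filtering might look like (an assumption: the exact format of the exclude list and the script's logic may differ):

```python
from pathlib import Path

# Skip utterances named in the exclude list (illustrative sketch)
exclude = set(Path("data/eval_inputs_dev-clean.txt").read_text().split())
flac_paths = [
    path
    for path in sorted(Path("LibriSpeech/dev-clean").rglob("*.flac"))
    if path.stem not in exclude
]
```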
Experiments with all utterances:
```bash
jupyter lab experiments_libri.ipynb
```
These experiments are not described in the paper but are still interesting.
Downsample speech to 16 kHz:

```bash
# Development set
./resample_vad.py \
    data/vctk_scottish.txt \
    ~/endgame/datasets/VCTK-Corpus/wav48/ \
    ~/scratch/vctk/wav/scottish/

# Test set
./resample_vad.py \
    data/vctk_english.txt \
    ~/endgame/datasets/VCTK-Corpus/wav48/ \
    ~/scratch/vctk/wav/english/
```
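For reference, the resampling step on a single file looks roughly like this with torchaudio (the filename is illustrative, and the repo's script presumably also applies voice activity detection, given its name):

```python
import torchaudio

# Downsample a single 48 kHz VCTK utterance to 16 kHz (sketch only)
wav, sr = torchaudio.load("p225_001.wav")
wav_16k = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
torchaudio.save("p225_001_16k.wav", wav_16k, 16000)
```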
Create the evaluation dataset (which is already included in the `data/` directory released with the repo):

```bash
./evalcsv_vctk.py \
    data/vctk_scottish.txt \
    /home/kamperh/scratch/vctk/wav/scottish/ \
    data/speakersim_vctk_scottish_2024-09-16.csv

./evalcsv_vctk.py \
    data/vctk_english.txt \
    /home/kamperh/scratch/vctk/wav/english/ \
    data/speakersim_vctk_english_2024-09-16.csv
```
Extract features for particular parallel utterances (for baselines):
```bash
./extract_wavlm_vctk.py --utterance 008 \
    ~/scratch/vctk/wav/english/ ~/scratch/vctk/english/wavlm_008/
```
Experiments with parallel utterances:
```bash
jupyter lab experiments_vctk.ipynb
```