Evaluation suite for zero-shot Singing Voice Synthesis (SVS) systems, covering aesthetic appeal, signal quality, pronunciation accuracy, speaker similarity, and melodic precision.
We incorporate two MOS (Mean Opinion Score) prediction models to evaluate the subjective appeal of synthesized singing.
- SingMOS-Pro: A specialized MOS predictor for singing voice, focusing on professional vocal attributes.
- Sheet-SSQA: A Speech Subjective Quality Assessment (SSQA) model from SHEET, the Speech Human Evaluation Estimation Toolkit.
- Mel Cepstral Distortion (MCD): Measures the spectral distance between predicted and ground-truth audio.
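Once both utterances are represented as frame-aligned mel-cepstra, the distortion reduces to a short formula. A minimal numpy sketch, assuming feature extraction and alignment (e.g., via DTW) happen upstream:

```python
# Minimal sketch of the MCD formula over frame-aligned mel-cepstra.
# Feature extraction and DTW alignment are assumed to happen upstream;
# c0 (the energy term) is conventionally excluded.
import numpy as np

def mel_cepstral_distortion(mc_ref: np.ndarray, mc_gen: np.ndarray) -> float:
    """mc_ref, mc_gen: (n_frames, n_coeffs) aligned mel-cepstral sequences."""
    diff = mc_ref[:, 1:] - mc_gen[:, 1:]           # drop c0
    frame_dist = np.sqrt((diff ** 2).sum(axis=1))  # per-frame Euclidean distance
    return float((10.0 / np.log(10)) * np.sqrt(2.0) * frame_dist.mean())
```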
- WER/CER: Evaluates pronunciation accuracy using ASR models.
  - English: Whisper Large v3.
  - Chinese: Paraformer-large.
  - Note: Punctuation is removed before computation.
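As a rough illustration of the scoring step, here is a sketch assuming the `jiwer` package and a simple regex for punctuation stripping; the repo's own text normalization may differ:

```python
# Hedged sketch: strip punctuation, then score WER/CER with jiwer.
import re
import jiwer

def normalize(text: str) -> str:
    # Remove punctuation before computation, as noted above.
    return re.sub(r"[^\w\s]", "", text).lower().strip()

ref_text = "Twinkle, twinkle, little star."
hyp_text = "twinkle twinkle little star"
print(jiwer.wer(normalize(ref_text), normalize(hyp_text)))  # 0.0
print(jiwer.cer(normalize(ref_text), normalize(hyp_text)))  # 0.0
```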
- Speaker-Sim (Cosine Similarity): Computes cosine similarity between speaker embeddings of the prompt and generated voices.
  - Model: WavLM-base-plus-sv. You can pass a local path or model id via `model_path_or_id` when initializing `SVPipeline`.
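For illustration, a standalone version of this computation with the reference `microsoft/wavlm-base-plus-sv` checkpoint via `transformers` might look like the sketch below; `SVPipeline` presumably wraps similar logic, and the wav paths are placeholders:

```python
# Sketch: cosine similarity between speaker embeddings of two wavs.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv").eval()

def embed(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)  # mono, 16 kHz
    inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).embeddings.squeeze(0)

sim = torch.nn.functional.cosine_similarity(
    embed("prompt.wav"), embed("generated.wav"), dim=0
)
print(float(sim))
```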
- FFE / GPE / VDE: F0 Frame Error, Gross Pitch Error, and Voicing Decision Error.
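These metrics have standard definitions over frame-aligned F0 tracks. A minimal numpy sketch, assuming 0 marks unvoiced frames and the conventional 20% deviation threshold for GPE:

```python
# Sketch of FFE / GPE / VDE over frame-aligned F0 tracks (0 = unvoiced).
import numpy as np

def pitch_metrics(f0_ref: np.ndarray, f0_gen: np.ndarray):
    voiced_ref, voiced_gen = f0_ref > 0, f0_gen > 0
    both_voiced = voiced_ref & voiced_gen
    n = len(f0_ref)
    # VDE: fraction of frames whose voicing decision disagrees.
    voicing_err = voiced_ref != voiced_gen
    vde = voicing_err.sum() / n
    # GPE: fraction of mutually voiced frames whose pitch deviates by >20%.
    gross = both_voiced & (np.abs(f0_gen - f0_ref) > 0.2 * f0_ref)
    gpe = gross.sum() / max(both_voiced.sum(), 1)
    # FFE: fraction of frames wrong in either voicing or pitch.
    ffe = (voicing_err | gross).sum() / n
    return float(ffe), float(gpe), float(vde)
```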
```bash
conda create -n soulx-singer-eval python=3.10
conda activate soulx-singer-eval
pip install -r requirements.txt
```

Before running evaluation, download the following files and place them under the `ckpt/` directory:
- `all7-sslmos-mdf-2337-config.yml`: https://github.com/unilight/sheet/releases/download/v0.1.0/all7-sslmos-mdf-2337-config.yml
- `all7-sslmos-mdf-2337-checkpoint-86000steps.pkl`: https://github.com/unilight/sheet/releases/download/v0.1.0/all7-sslmos-mdf-2337-checkpoint-86000steps.pkl
- `ft_wav2vec2_large_ll60k_mdf_p1_200epochs_all_192epochs.pth`: https://github.com/South-Twilight/SingMOS/releases/download/ckpt_v3/ft_wav2vec2_large_ll60k_mdf_p1_200epochs_all_192epochs.pth
Note: If HuggingFace is unreachable in your environment, `s3prl` may fail to download SSL checkpoints because its URLs are hard-coded. You can patch `s3prl` to use the `hf-mirror` domain by replacing `https://huggingface.co/` with `https://hf-mirror.com/` in the s3prl source (`s3prl/upstream/wav2vec2/hubconf.py`).
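A one-off patch of the installed package could look like the sketch below; it simply performs the string replacement described above (back up the file first):

```python
# Sketch: rewrite the hard-coded HuggingFace URLs in the installed
# s3prl package to point at the hf-mirror domain.
from pathlib import Path
import s3prl

hubconf = Path(s3prl.__file__).parent / "upstream" / "wav2vec2" / "hubconf.py"
text = hubconf.read_text(encoding="utf-8")
hubconf.write_text(
    text.replace("https://huggingface.co/", "https://hf-mirror.com/"),
    encoding="utf-8",
)
```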
Due to the absence of a widely adopted SVS benchmark, we provide two complementary evaluation datasets, GMO-SVS and SoulX-Singer-Eval, to assess systems under both open-source and zero-shot conditions.
HuggingFace: https://huggingface.co/datasets/Soul-AILab/SoulX-Singer-Eval-Dataset
GMO-SVS is built upon three public SVS corpora: GTSinger, M4Singer, and Opencpop. For M4Singer and Opencpop, we adopt their official test splits. GTSinger contributes English and Mandarin songs from multiple singers with diverse techniques and styles. In total, GMO-SVS contains 802 samples.
For each song, the first sentence is used as the acoustic prompt, and the remaining content is synthesized by evaluated models. Ground-truth recordings of the prompt singers are preserved to evaluate pronunciation accuracy, prosodic consistency, and overall synthesis quality. None of these open-source datasets are used in SoulX-Singer training, ensuring fair evaluation.
SoulX-Singer-Eval is a newly collected dataset for zero-shot generalization on unseen speakers. It contains 100 singing segments from 50 distinct individuals (25 Mandarin and 25 English speakers), with 2 segments per speaker. Mandarin data are collected from recruited professional and amateur singers who consented to open-source their voice data for academic purposes. English segments are sliced and filtered from the multitrack Mixing Secrets dataset. All segments are manually annotated with precise melodies to meet the prompt requirements of zero-shot SVS models.
Target lyrics and melodies for synthesis are randomly selected from 15 Mandarin and 15 English tracks in GMO-SVS. This introduces speakers unseen by baseline models and provides a rigorous benchmark for timbre cloning and style transfer.
Follow the structure in `examples/summary.json`. Each line is a JSON record with:

- `txt`: reference transcript
- `ref_fn`: reference wav path
- `gen_fn`: generated wav path
- `prompt_fn`: prompt wav path
- `language`: Chinese or English
- `prompt_language`: language of the prompt
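For reference, appending one (hypothetical) record in JSON Lines format might look like:

```python
# Sketch: write one summary.json record; all paths are placeholders.
import json

record = {
    "txt": "reference transcript of the target sentence",
    "ref_fn": "data/ref/song_0001.wav",
    "gen_fn": "data/gen/song_0001.wav",
    "prompt_fn": "data/prompt/song_0001.wav",
    "language": "English",
    "prompt_language": "English",
}

with open("summary.json", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```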
Start the evaluation server:

```bash
bash eva_server_run.sh
```

Edit `eva_client_run.sh` and set `infer_dir` to the folder that contains a `summary.json` file, then run:

```bash
bash eva_client_run.sh
```

The script will generate:

- `result_zh.json` / `result_en.json`
- `merged_zh.json` / `merged_en.json`
Alternatively, run the client directly on a single summary file:

```bash
python eva_client.py --input_file examples/summary.json --output_dir examples
```

Results will be written to:

- `examples/result_zh.json`
- `examples/result_en.json`
Then aggregate:
```bash
python average.py --input_file examples/result_zh.json --result_file examples/merged_zh.json
```

This project integrates components from the following repositories: