This document guides you through the process of preparing the datasets required for training and evaluating the SEED model.
cd ./datasets # Current path: ./datasets

First, download and organize the speech datasets and supplementary data (noise, reverberation, etc.) needed for model training.
# 1.1. Download Libri-Light (https://github.com/facebookresearch/libri-light/blob/main/data_preparation/README.md)
mkdir -p ./Libri-Light
cd ./Libri-Light
wget https://dl.fbaipublicfiles.com/librilight/data/small.tar
tar -xvf small.tar
rm -rf small.tar
cd ../
# 1.2. Download LibriTTS-R (https://www.openslr.org/141/)
mkdir -p ./LibriTTS-R
cd ./LibriTTS-R
wget https://www.openslr.org/resources/141/train_clean_100.tar.gz
wget https://www.openslr.org/resources/141/train_clean_360.tar.gz
tar -xvf train_clean_100.tar.gz
rm -rf train_clean_100.tar.gz
tar -xvf train_clean_360.tar.gz # Adjust if the filename is train-clean-360.tar.gz
rm -rf train_clean_360.tar.gz # Adjust if the filename is train-clean-360.tar.gz
cd ../
# 1.3. Download MUSAN (Music, Noise, Speech sound effects) (https://www.openslr.org/17/)
wget https://www.openslr.org/resources/17/musan.tar.gz
tar -xvf musan.tar.gz
rm -rf musan.tar.gz
# 1.4. Download RIRs (Room Impulse Responses) (https://www.openslr.org/28/)
wget https://www.openslr.org/resources/28/rirs_noises.zip
unzip rirs_noises.zip -d rirs_noises
rm -rf rirs_noises.zip
mkdir -p ./RIRS_NOISES
mv rirs_noises/RIRS_NOISES/simulated_rirs/* ./RIRS_NOISES/ # The zip extracts a top-level RIRS_NOISES/ directory; move the simulated RIRs into ./RIRS_NOISES
rm -rf rirs_noises

Expected directory structure after downloading training datasets:
datasets/
├── Libri-Light/small/
├── LibriTTS-R/
│ ├── train_clean_100/
│ └── train_clean_360/
├── musan/
│ ├── music/
│ ├── noise/
│ └── speech/
└── RIRS_NOISES/
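A quick way to confirm this layout before moving on (a small helper sketch, not part of the repo; directory names taken from the tree above):

```shell
# Report OK/MISSING for each expected directory.
check_dirs() {
  for d in "$@"; do
    [ -d "$d" ] && echo "OK $d" || echo "MISSING $d"
  done
}

check_dirs \
  ./Libri-Light/small \
  ./LibriTTS-R/train_clean_100 \
  ./LibriTTS-R/train_clean_360 \
  ./musan/music ./musan/noise ./musan/speech \
  ./RIRS_NOISES
```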
Generate the data list file required for SEED model training using the downloaded training datasets.
# 2.1. Create data list for SEED training
python ./make_seed_trainset.py \
--target_dirs ./Libri-Light/small/ \
./LibriTTS-R/train_clean_100/ \
./LibriTTS-R/train_clean_360/ \
--output_filename ./manifests/train_libritts+light_1000h.txt \
--extensions .wav

Note: You may need to convert the training audio to 16 kHz, mono. This README does not provide a script execution command for this conversion; if necessary, use a separate script or prepare one yourself (e.g., using sox or ffmpeg). We use the command ffmpeg -y -i "input_path" -ac 1 -ar 16000 -acodec pcm_s16le "output_path" to convert the audio format.
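The conversion described in the note can be batch-applied with a small script. This is only a sketch (the helper name and dry-run flag are ours; the ffmpeg flags match the command above):

```shell
# Hypothetical helper (not part of the repo): convert every .wav/.flac under
# src_dir to 16 kHz mono 16-bit PCM under dst_dir, mirroring the tree.
# Set DRY_RUN=1 to print the ffmpeg commands instead of running them.
convert_to_16k_mono() {
  src_dir="$1"
  dst_dir="$2"
  find "$src_dir" -type f \( -name '*.wav' -o -name '*.flac' \) | while read -r in_path; do
    rel_path="${in_path#"$src_dir"/}"
    out_path="$dst_dir/${rel_path%.*}.wav"
    mkdir -p "$(dirname "$out_path")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo ffmpeg -y -i "$in_path" -ac 1 -ar 16000 -acodec pcm_s16le "$out_path"
    else
      ffmpeg -y -i "$in_path" -ac 1 -ar 16000 -acodec pcm_s16le "$out_path"
    fi
  done
}
```

For example, `DRY_RUN=1 convert_to_16k_mono ./LibriTTS-R ./LibriTTS-R-16k` previews the commands without running ffmpeg.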
Prepare the VoxCeleb1 and VoxConverse test datasets that will be used for model evaluation.
# 3.1. Download VoxCeleb1
# Download and prepare the VoxCeleb1 dataset according to the instructions at the following link:
# https://github.com/clovaai/voxceleb_trainer
# After downloading, it should be located in the ./voxceleb1/ directory.
# 3.2. Create VCMix Test Set (using VoxCeleb1 and VoxConverse)
# The VCMix dataset combines the VoxCeleb1 and VoxConverse test SV datasets.
# Related paper: "Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification" (https://arxiv.org/abs/2309.14741)
# The script below uses the VCMix dataset generation code provided by the first author (Hee-Soo Heo) of the paper.
python ./make_vcmix_testset.py \
--voxceleb1_path ./voxceleb1/ \
--voxconverse_path ./voxconverse_test_SV/ \
--voxconverse_test ./voxconverse_test_SV/trials_wo_overlap.txt \
--output_filename ./manifests/vcmix_test.txt \
--download_voxconverse_test_SV # This option automatically downloads and prepares the VoxConverse test SV dataset.

Note: The --download_voxconverse_test_SV option is optional. If you do not have the VoxConverse dataset, using this option to download the re-prepared VoxConverse test SV dataset is the fastest way to generate the vcmix_test.txt file.
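After the script finishes, it may be worth confirming the generated trial list is non-empty (a small helper sketch, not part of the repo):

```shell
# Print the line count of a manifest, or warn if it is missing/empty.
check_manifest() {
  if [ -s "$1" ]; then
    printf '%s: %s lines\n' "$1" "$(wc -l < "$1")"
  else
    printf '%s: missing or empty\n' "$1"
  fi
}

check_manifest ./manifests/vcmix_test.txt
```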
Expected directory structure (partial) after preparing evaluation datasets:
datasets/
├── Libri-Light/small/
├── LibriTTS-R/
│ ├── train_clean_100/
│ └── train_clean_360/
├── musan/
│ ├── music/
│ ├── noise/
│ └── speech/
├── RIRS_NOISES/
├── voxceleb1/ # Location of VoxCeleb1 data
├── voxconverse_test_SV/ # Location of VoxConverse data
└── manifests/
├── train_libritts+light_1000h.txt
├── vox1-O.txt # VoxCeleb1 evaluation list (can be generated by make_vcmix_testset.py or a separate script)
└── vcmix_test.txt
To easily use various evaluation datasets within the code, consolidate the evaluation dataset directories into a single common directory (vox1-evals) using symbolic links.
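As a rough sketch of what this consolidation does (the actual merge_eval_directories.sh may differ; the link names here are assumptions based on the directory tree in this README):

```shell
# Link each evaluation corpus into a common vox1-evals/ directory.
mkdir -p ./vox1-evals
ln -sfn "$(pwd)/voxceleb1" ./vox1-evals/voxceleb1
ln -sfn "$(pwd)/voxconverse_test_SV" ./vox1-evals/voxconverse_test_SV
```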
sh ./merge_eval_directories.sh

After completing all steps, your datasets directory should have the following structure:
datasets/
├── Libri-Light/small/ # For training
├── LibriTTS-R/ # For training
│ ├── train_clean_100/
│ └── train_clean_360/
├── musan/ # For training (data augmentation)
│ ├── music/
│ ├── noise/
│ └── speech/
├── RIRS_NOISES/ # For training (data augmentation)
├── voxceleb1/ # For evaluation
├── voxconverse_test_SV/ # For evaluation
├── vox1-evals/ # For evaluation (directory consolidated with symbolic links)
└── manifests/ # Data list files
├── train_libritts+light_1000h.txt
├── vox1-O.txt
└── vcmix_test.txt