Open
Description
Describe the bug
When loading an audiofolder dataset whose audio file names do not match the file_name entries in metadata.csv, the resulting error message is unclear and does not help with debugging:
ValueError: Instruction "train" corresponds to no data!
Steps to reproduce the bug
Assume an audiofolder containing the audio files filename1.mp3, filename2.mp3, etc., and a metadata.csv file with the columns file_name and sentence. The file_name values are formatted like filename1.mp3, filename2.mp3, etc.
Load the dataset:
from datasets import load_dataset
load_dataset("audiofolder", data_dir='/path/to/audiofolder')
When the file_name values in the CSV do not match the file names in the audiofolder, we get this error message:
File /opt/conda/lib/python3.12/site-packages/datasets/arrow_reader.py:251, in BaseReader.read(self, name, instructions, split_infos, in_memory)
249 if not files:
250 msg = f'Instruction "{instructions}" corresponds to no data!'
--> 251 raise ValueError(msg)
252 return self.read_files(files=files, original_instructions=instructions, in_memory=in_memory)
ValueError: Instruction "train" corresponds to no data!
Note that no split argument was passed to load_dataset; "train" is presumably the default split name used when none is specified.
Expected behavior
It would be better to get an error message along the lines of:
The file names listed in metadata.csv do not match the files in the data directory.
It would have saved me 4 hours of debugging.
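For anyone hitting the same error, a check along these lines can expose the mismatch before calling load_dataset. This is only a minimal sketch using the standard library, not part of datasets: the helper name find_metadata_mismatches and the AUDIO_EXTENSIONS set are hypothetical, and it assumes metadata.csv sits at the top of the data directory with file_name values relative to that directory.

```python
import csv
from pathlib import Path

# Assumption: an illustrative set of extensions, not the list datasets itself uses.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg"}


def find_metadata_mismatches(data_dir, metadata_name="metadata.csv", column="file_name"):
    """Compare the file_name column of metadata.csv with the audio files on disk.

    Hypothetical helper. Returns two sorted lists:
    (names listed in the CSV but missing on disk,
     audio files on disk that the CSV does not list).
    """
    data_dir = Path(data_dir)

    # Collect every file name the metadata file refers to.
    with open(data_dir / metadata_name, newline="", encoding="utf-8") as f:
        listed = {row[column] for row in csv.DictReader(f)}

    # Collect audio files actually present, as paths relative to data_dir.
    on_disk = {
        p.relative_to(data_dir).as_posix()
        for p in data_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    }

    return sorted(listed - on_disk), sorted(on_disk - listed)
```

Calling find_metadata_mismatches('/path/to/audiofolder') and printing both lists shows which entries are out of sync; two empty lists mean the names agree under the assumptions above.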
Environment info
- datasets version: 3.2.0
- Platform: Linux-5.14.0-427.40.1.el9_4.x86_64-x86_64-with-glibc2.39
- Python version: 3.12.8
- huggingface_hub version: 0.27.0
- PyArrow version: 18.1.0
- Pandas version: 2.2.3
- fsspec version: 2024.9.0