
Importing dataset gives unhelpful error message when filenames in metadata.csv are not found in the directory #7369

@svencornetsdegroot


Describe the bug

When importing an audiofolder dataset whose audio file names don't match the file names listed in metadata.csv, the resulting error message is unclear and does not help with debugging:

ValueError: Instruction "train" corresponds to no data!

Steps to reproduce the bug

Assume an audiofolder containing the audio files filename1.mp3, filename2.mp3, etc., and a metadata.csv file with the columns file_name and sentence. The file_name values are formatted like filename1.mp3, filename2.mp3, etc.

Load the dataset:

from datasets import load_dataset
load_dataset("audiofolder", data_dir='/path/to/audiofolder')

When the file_name values in the CSV do not match the file names in the audiofolder, we get this error message:

File /opt/conda/lib/python3.12/site-packages/datasets/arrow_reader.py:251, in BaseReader.read(self, name, instructions, split_infos, in_memory)
    249 if not files:
    250     msg = f'Instruction "{instructions}" corresponds to no data!'
--> 251     raise ValueError(msg)
    252 return self.read_files(files=files, original_instructions=instructions, in_memory=in_memory)

ValueError: Instruction "train" corresponds to no data!

Note that I never passed split explicitly; "train" appears to be the split the data is assigned to by default.

Expected behavior

It would be better to get an error message along the lines of:

The file names in metadata.csv do not match the files in the data directory.

It would have saved me 4 hours of debugging.
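
Until the message improves, a pre-check along these lines would have caught the mismatch before calling load_dataset. This is only a minimal sketch, not how datasets works internally. It assumes a single metadata.csv at the top of the data directory and file_name values relative to that directory. The helper names find_missing_metadata_files and check_metadata_matches_files are hypothetical.

```python
import csv
from pathlib import Path


def find_missing_metadata_files(data_dir):
    """Return the file_name entries in metadata.csv that are not files under data_dir."""
    root = Path(data_dir)
    missing = []
    with open(root / "metadata.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Assumption: file_name is relative to the directory holding metadata.csv.
            if not (root / row["file_name"]).is_file():
                missing.append(row["file_name"])
    return missing


def check_metadata_matches_files(data_dir):
    """Raise a descriptive error if metadata.csv lists files that do not exist."""
    missing = find_missing_metadata_files(data_dir)
    if missing:
        preview = ", ".join(missing[:5])
        raise FileNotFoundError(
            f"{len(missing)} file_name entries in metadata.csv were not found "
            f"in {data_dir}, e.g.: {preview}"
        )
```

Calling check_metadata_matches_files('/path/to/audiofolder') before load_dataset raises a FileNotFoundError that names the offending entries instead of the generic "corresponds to no data" message.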

Environment info

  • datasets version: 3.2.0
  • Platform: Linux-5.14.0-427.40.1.el9_4.x86_64-x86_64-with-glibc2.39
  • Python version: 3.12.8
  • huggingface_hub version: 0.27.0
  • PyArrow version: 18.1.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.9.0
