Features of IterableDataset set to None by remove column #5284

@sanchit-gandhi

Describe the bug

The remove_columns method of IterableDataset sets the dataset's features to None.

Steps to reproduce the bug

from datasets import load_dataset

# load LibriSpeech (LS) in streaming mode
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

# check original features
print("Original features: ", dataset.features.keys())

# define features to remove: we KEEP audio and text
COLUMNS_TO_REMOVE = ['chapter_id', 'speaker_id', 'file', 'id']

dataset = dataset.remove_columns(COLUMNS_TO_REMOVE)

# check processed features, uh-oh!
print("Processed features: ", dataset.features)

# streaming the first audio sample still works
print("First sample:", next(iter(dataset)))

Print Output:

Original features:  dict_keys(['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'])
Processed features:  None
First sample: {'audio': {'path': '2277-149896-0000.flac', 'array': array([ 0.00186157,  0.0005188 ,  0.00024414, ..., -0.00097656,
       -0.00109863, -0.00146484]), 'sampling_rate': 16000}, 'text': "HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE"}

Expected behavior

After calling remove_columns, the features should be the ones that were not removed, i.e. audio and text.
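To make the expectation concrete, here is a minimal sketch of the feature bookkeeping I would expect. It uses a plain dict as a stand-in for dataset.features, with placeholder type strings rather than real feature objects, and the helper name expected_features_after_removal is hypothetical, not part of the datasets API.

```python
# Plain-dict stand-in for dataset.features (assumption: the values are
# placeholder strings, not real datasets feature objects).
original_features = {
    "file": "string",
    "audio": "Audio(sampling_rate=16000)",
    "text": "string",
    "speaker_id": "int64",
    "chapter_id": "int64",
    "id": "string",
}

COLUMNS_TO_REMOVE = ["chapter_id", "speaker_id", "file", "id"]


def expected_features_after_removal(features, columns_to_remove):
    """Hypothetical helper: keep every feature whose column was not removed."""
    removed = set(columns_to_remove)
    return {name: feature for name, feature in features.items() if name not in removed}


expected = expected_features_after_removal(original_features, COLUMNS_TO_REMOVE)
print("Expected features:", list(expected))  # ['audio', 'text']
```

In other words, removing columns should only drop the matching keys from the features, not discard the feature information altogether.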

Environment info

  • datasets version: 2.7.1
  • Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.15
  • PyArrow version: 9.0.0
  • Pandas version: 1.3.5

(Running on Google Colab for a blog post: https://colab.research.google.com/drive/1ySCQREPZEl4msLfxb79pYYOWjUZhkr9y#scrollTo=8pRDGiVmH2ml)

cc @polinaeterna @lhoestq
