Closed
Description
Describe the bug
Calling the remove_columns method on an IterableDataset sets the dataset's features to None, even though the remaining columns still stream correctly.
Steps to reproduce the bug
from datasets import Audio, load_dataset
# load LS in streaming mode
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
# check original features
print("Original features: ", dataset.features.keys())
# define features to remove: we KEEP audio and text
COLUMNS_TO_REMOVE = ['chapter_id', 'speaker_id', 'file', 'id']
dataset = dataset.remove_columns(COLUMNS_TO_REMOVE)
# check processed features, uh-oh!
print("Processed features: ", dataset.features)
# streaming the first audio sample still works
print("First sample:", next(iter(dataset)))
Print Output:
Original features: dict_keys(['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'])
Processed features: None
First sample: {'audio': {'path': '2277-149896-0000.flac', 'array': array([ 0.00186157, 0.0005188 , 0.00024414, ..., -0.00097656,
-0.00109863, -0.00146484]), 'sampling_rate': 16000}, 'text': "HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE"}
Expected behavior
The features should be the ones not removed by the remove_columns method, i.e. audio and text.
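To make the expected behavior concrete, here is a minimal sketch in plain Python. It is not the datasets implementation: expected_features_after_removal is a hypothetical helper, and the feature types are placeholder strings standing in for the real feature objects. It only shows that the features after removal should be the original mapping minus the removed columns.

```python
from typing import Dict, Iterable, Optional


def expected_features_after_removal(
    features: Optional[Dict[str, object]],
    columns_to_remove: Iterable[str],
) -> Optional[Dict[str, object]]:
    """Hypothetical helper: drop the removed columns from a features mapping."""
    if features is None:
        # If the features were unknown to begin with, they stay unknown.
        return None
    to_remove = set(columns_to_remove)
    # Keep the original column order, skipping only the removed names.
    return {name: feature for name, feature in features.items() if name not in to_remove}


# Placeholder feature types (assumption); the real values would be datasets feature objects.
original = {
    "file": "string",
    "audio": "Audio",
    "text": "string",
    "speaker_id": "int64",
    "chapter_id": "int64",
    "id": "string",
}
remaining = expected_features_after_removal(
    original, ["chapter_id", "speaker_id", "file", "id"]
)
print(list(remaining))  # ['audio', 'text']
```

With the column names from the report, only audio and text remain, which is what dataset.features would be expected to show after remove_columns.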
Environment info
- datasets version: 2.7.1
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.15
- PyArrow version: 9.0.0
- Pandas version: 1.3.5
(Running on Google Colab for a blog post: https://colab.research.google.com/drive/1ySCQREPZEl4msLfxb79pYYOWjUZhkr9y#scrollTo=8pRDGiVmH2ml)