Open
Labels
enhancement: New feature or request
Description
Feature request
Expose the `cast()` keyword arguments (`**cast_kwargs`) through `cast_column()`.
datasets/src/datasets/arrow_dataset.py, line 2205 in 0feb65d:

    return self.cast(features)
Motivation
`cast_column()` wraps `cast()` without exposing any of the `cast()` arguments. This is a problem for large multi-modal datasets, e.g.

    # a dataset with list[{"bytes": b'', ...}], many more than one image
    load_dataset("MLLM-CL/VTCBench").cast_column("images", List(Image(decode=False)))

This fails because of #6206 and #7167: the default batch size of 1000 in `cast()` is too large and causes `pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays`.
datasets/src/datasets/arrow_dataset.py, lines 2164 to 2205 in 0feb65d:

    @fingerprint_transform(inplace=False)
    def cast_column(self, column: str, feature: FeatureType, new_fingerprint: Optional[str] = None) -> "Dataset":
        """Cast column to feature for decoding.

        Args:
            column (`str`):
                Column name.
            feature (`FeatureType`):
                Target feature.
            new_fingerprint (`str`, *optional*):
                The new fingerprint of the dataset after transform.
                If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.

        Returns:
            [`Dataset`]

        Example:

        ```py
        >>> from datasets import load_dataset, ClassLabel
        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
        >>> ds.features
        {'label': ClassLabel(names=['neg', 'pos']),
         'text': Value('string')}
        >>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good']))
        >>> ds.features
        {'label': ClassLabel(names=['bad', 'good']),
         'text': Value('string')}
        ```
        """
        feature = _fix_for_backward_compatible_features(feature)
        if hasattr(feature, "decode_example"):
            dataset = copy.deepcopy(self)
            dataset._info.features[column] = feature
            dataset._fingerprint = new_fingerprint
            dataset._data = dataset._data.cast(dataset.features.arrow_schema)
            dataset._data = update_metadata_with_features(dataset._data, dataset.features)
            return dataset
        else:
            features = self.features
            features[column] = feature
            return self.cast(features)
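Until something like this is exposed, a possible user-side workaround is to replicate the `else` branch above and call `cast()` directly with a smaller batch size. The sketch below is only an illustration: `cast_column_with_kwargs` is a hypothetical helper name, not part of the `datasets` API, and it assumes the object passed in has a `features` mapping and a `cast()` method that accepts keyword arguments such as `batch_size`, as `Dataset.cast()` does.

```python
from typing import Any


def cast_column_with_kwargs(dataset: Any, column: str, feature: Any, **cast_kwargs: Any) -> Any:
    """Hypothetical helper (not a `datasets` API): cast one column via `cast()`.

    Any extra keyword arguments (e.g. `batch_size`) are forwarded to
    `dataset.cast()` unchanged.
    """
    # Copy the feature mapping so the caller's dataset is not mutated.
    features = dataset.features.copy()
    # Swap in the target feature for the requested column only.
    features[column] = feature
    # Delegate to cast(), which is where batch_size and similar options live.
    return dataset.cast(features, **cast_kwargs)
```

On a single `Dataset` split this could be called as `cast_column_with_kwargs(ds, "images", List(Image(decode=False)), batch_size=16)`, where `16` is an arbitrary value chosen for illustration. Note that the helper always goes through `cast()`, so it skips the metadata-only path that `cast_column()` takes for features with `decode_example`.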
Your contribution