Support cast_kwargs in cast_columns #7909

@Moenupa

Feature request

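Expose the keyword arguments of cast() (for example batch_size) through cast_column(), e.g. as **cast_kwargs. At the moment cast_column() ends with a bare call: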

return self.cast(features)

Motivation

cast_column() wraps cast() without exposing any of its arguments. This is a problem for large multi-modal datasets, e.g.

from datasets import Image, List, load_dataset
# each row holds a list of image dicts, [{"bytes": b"..."}, ...], usually far more than one image
load_dataset("MLLM-CL/VTCBench").cast_column("images", List(Image(decode=False)))

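This fails because of #6206 and #7167: the default batch size of 1000 in cast() is too large for such rows and raises pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays. Since cast_column() does not forward a batch size to cast(), the caller cannot lower it. The current implementation: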

@fingerprint_transform(inplace=False)
def cast_column(self, column: str, feature: FeatureType, new_fingerprint: Optional[str] = None) -> "Dataset":
    """Cast column to feature for decoding.

    Args:
        column (`str`):
            Column name.
        feature (`FeatureType`):
            Target feature.
        new_fingerprint (`str`, *optional*):
            The new fingerprint of the dataset after transform.
            If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.

    Returns:
        [`Dataset`]

    Example:

    ```py
    >>> from datasets import load_dataset, ClassLabel
    >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
    >>> ds.features
    {'label': ClassLabel(names=['neg', 'pos']),
     'text': Value('string')}
    >>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good']))
    >>> ds.features
    {'label': ClassLabel(names=['bad', 'good']),
     'text': Value('string')}
    ```
    """
    feature = _fix_for_backward_compatible_features(feature)
    if hasattr(feature, "decode_example"):
        dataset = copy.deepcopy(self)
        dataset._info.features[column] = feature
        dataset._fingerprint = new_fingerprint
        dataset._data = dataset._data.cast(dataset.features.arrow_schema)
        dataset._data = update_metadata_with_features(dataset._data, dataset.features)
        return dataset
    else:
        features = self.features
        features[column] = feature
        return self.cast(features)
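
To make the request concrete, here is a minimal sketch of the behavior being asked for, written as a standalone helper rather than as a patch to the library. The name cast_column_with_kwargs is hypothetical and not part of datasets; it only mirrors the else branch above (it does not reproduce the decode_example fast path or the fingerprint handling), and it assumes the object passed in offers a features mapping and a cast(features, **kwargs) method, as Dataset does.

```python
def cast_column_with_kwargs(dataset, column, feature, **cast_kwargs):
    """Hypothetical helper: cast one column and forward extra kwargs to cast().

    Mirrors only the final branch of cast_column() quoted above. Any keyword
    argument (for example batch_size) is passed through to dataset.cast()
    unchanged.
    """
    # Work on a copy so the caller's features mapping is left untouched.
    features = dataset.features.copy()
    features[column] = feature
    # Forward the caller's options (e.g. a smaller batch_size) to cast().
    return dataset.cast(features, **cast_kwargs)
```

With a single split loaded as ds, a call such as cast_column_with_kwargs(ds, "images", List(Image(decode=False)), batch_size=10) would then go through cast() with the smaller batch size; the value 10 is only illustrative, not a recommendation.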

Your contribution

#7910

Labels: enhancement (New feature or request)
