Skip to content

Iterating over values of a column in the IterableDataset #7381

@TopCoder2K

Description

@TopCoder2K

Feature request

I would like to be able to iterate (and re-iterate if needed) over a column of an IterableDataset instance. The following example shows the supposed API:

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)
texts = ds["text"]

for v in texts:
    print(v)  # Prints "Good" and "Bad"

for v in texts:
    print(v)  # Prints "Good" and "Bad" again

Motivation

In the real world problems, huge NNs like Transformer are not always the best option, so there is a need to conduct experiments with different methods. While 🤗Datasets is perfectly adapted to 🤗Transformers, it may be inconvenient when being used with other libraries. The ability to retrieve a particular column is the case (e.g., gensim's FastText requires only lists of strings, not dictionaries).
While there are ways to achieve the desired functionality, they are not good (forum). It would be great if there was a built-in solution.

Your contribution

Theoretically, I can submit a PR, but I have very little knowledge of the internal structure of 🤗Datasets, so some help may be needed.
Moreover, I can only work on weekends, since I have a full-time job. However, the feature does not seem to be popular, so there is no need to implement it as fast as possible.

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions