HuggingFace dataset streaming support

To run an eval on a very large dataset that fits neither in memory nor in the on-disk cache, we have to use the `streaming` option of the HF dataset loader. Currently, Inspect AI's HF dataset wrapper calls `dataset.save_to_disk` by default, which is not supported on the iterable dataset that streaming returns. Additionally, `MemoryDataset` assumes a fully loaded dataset for slicing, shuffling, etc.
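To make the request concrete, here is a rough sketch of the behaviour I have in mind: shuffling and limiting done lazily over an iterator instead of over a fully loaded dataset. This is not Inspect AI code. `buffered_shuffle`, `stream_samples`, and the `to_sample` callback are hypothetical names used only for illustration, and the buffer-based shuffle is only approximate (it is not a uniform shuffle of the whole dataset).

```python
import random
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, Optional


def buffered_shuffle(rows: Iterable[Any], buffer_size: int, seed: int) -> Iterator[Any]:
    """Approximately shuffle a stream while holding at most buffer_size rows."""
    rng = random.Random(seed)
    buffer: list = []
    for row in rows:
        if len(buffer) < buffer_size:
            buffer.append(row)
            continue
        # Buffer is full: emit a random buffered row and reuse its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = row
    # Source exhausted: drain the remaining rows in random order.
    rng.shuffle(buffer)
    yield from buffer


def stream_samples(
    rows: Iterable[Any],
    to_sample: Callable[[Any], Any],  # hypothetical record-to-sample converter
    limit: Optional[int] = None,
    shuffle_buffer: int = 0,
    seed: int = 0,
) -> Iterator[Any]:
    """Lazily shuffle, limit, and convert rows without materializing them."""
    if shuffle_buffer > 0:
        rows = buffered_shuffle(rows, shuffle_buffer, seed)
    if limit is not None:
        rows = islice(rows, limit)
    for row in rows:
        yield to_sample(row)
```

With HF, `rows` would be whatever `load_dataset(..., streaming=True)` returns, which yields one record dict at a time. That iterable dataset also has its own `shuffle(seed=..., buffer_size=...)` and `take(n)` methods, so the wrapper could delegate to those instead of reimplementing them.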

I would love to be able to run evals on large datasets, so I wanted to file this as a feature request.

Much <3 to the maintainers of this repo.


HuggingFace dataset streaming support #2978
