Simplify seed dataset creation from DataFrames & Datasets #90

@eric-tramel

Is your feature request related to a problem? Please describe.
Nope

Describe the solution you'd like

Presently, if one wants to use a seed dataset that exists as a DataFrame in memory, there are a few hoops to jump through which seem like they could be simplified in the context of local execution. In the example below, I'm doing a common pattern where I'm loading a dataset from HF (but it could also come from anywhere else).

import pandas as pd
from datasets import load_dataset

"""Loading A Large Dataset

In this example, I want to load records from the wikipedia dataset, which
is quite large, and I don't want to load it all into RAM. So I'm using
streaming=True.
"""
doc_iterator = load_dataset(
    "wikimedia/wikipedia",
    "20231101.en",
    split="train",
    streaming=True
)

"""Cast to a DataFrame...

Now, to use with DD today, I need to cast to a fully materialized DataFrame.
This means that I must materialize all the data I require up front, and I cannot,
for instance, progressively generate records from an iterator, like `datasets.IterableDataset`.
"""
num_samples = 1_000  # placeholder value: how many records to pull from the stream
df_documents = pd.DataFrame.from_records(
    list(doc_iterator.take(num_samples))
)

"""Load into config

Next, I've got to put this into the config builder. However, this requires me to
make a separate call to a dd.DataDesigner classmethod (??), pass it the DF,
and also provide a filename -- even though I'm not interested in writing this
to disk.
"""
config_builder.with_seed_dataset(
    dataset_reference=dd.DataDesigner.make_seed_reference_from_dataframe(
        df_documents,
        "wiki.csv"
    )
)

# ... continue with config generation

Instead, the desired alternative, as a north star, would be to simply do something like the following, and have it work for DataFrames, Datasets, IterableDatasets, or just generic iterators that yield dictionaries.

doc_iterator = load_dataset(
    "wikimedia/wikipedia",
    "20231101.en",
    split="train",
    streaming=True
)

config_builder.with_seed_dataset(doc_iterator)
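
For illustration only, here is a minimal sketch of the kind of normalization such a call might do internally. The helper name `to_seed_dataframe` and its `num_records` parameter are hypothetical and are not part of the current Data Designer API. The sketch assumes that anything other than a DataFrame can be iterated to yield dict-like records, which holds for `datasets.Dataset`, `datasets.IterableDataset`, and plain generators.

```python
from itertools import islice
from typing import Any, Iterable, Mapping, Optional, Union

import pandas as pd

# Hypothetical type alias: a DataFrame, or any iterable of dict-like records.
SeedSource = Union[pd.DataFrame, Iterable[Mapping[str, Any]]]


def to_seed_dataframe(
    source: SeedSource, num_records: Optional[int] = None
) -> pd.DataFrame:
    """Hypothetical helper: normalize a seed source into a DataFrame.

    If num_records is given, at most that many records are read, so a
    streaming source is never fully consumed.
    """
    if isinstance(source, pd.DataFrame):
        # Already materialized; optionally truncate to the first rows.
        return source if num_records is None else source.head(num_records)

    # Everything else is treated as an iterable of dict-like records.
    # With num_records=None the whole iterable is consumed, so callers
    # should always pass a limit for unbounded or very large streams.
    records = source if num_records is None else islice(source, num_records)
    return pd.DataFrame.from_records(list(records))


def fake_stream():
    """Stand-in for a streaming dataset: lazily yields dict records."""
    for i in range(1_000_000):
        yield {"id": i, "text": f"doc {i}"}


# Only the first three records are pulled from the generator.
df = to_seed_dataframe(fake_stream(), num_records=3)
```

With something along these lines behind `with_seed_dataset`, the caller would not need to build the DataFrame or supply a filename. Truly progressive generation from an iterator would need more than this, since the sketch still materializes the selected records in memory.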
