Is your feature request related to a problem? Please describe.
Nope
Describe the solution you'd like
Presently, if one wants to use a seed dataset that exists as a DataFrame in memory, there are a few hoops to jump through that seem like they could be simplified in the context of local execution. The example below follows a common pattern: loading a dataset from HF (though it could come from anywhere else).
"""Loading A Large Dataset
In this example, I want to load records from the wikipedia dataset, which
is quite large, and I don't want to load it all into RAM. So I'm using
streaming=True.
"""
from datasets import load_dataset

doc_iterator = load_dataset(
"wikimedia/wikipedia",
"20231101.en",
split="train",
streaming=True
)
"""Cast to a DataFrame...
Now, to use with DD today, I need to cast to a fully materialized DataFrame.
This means that I must load materialize all the data I require now, and I cannot,
for instance, progressively generate records from an iterator, like `datasets.IteratedDataset`.
"""
import pandas as pd

num_samples = 1_000  # illustrative sample size
df_documents = pd.DataFrame.from_records(
list(doc_iterator.take(num_samples))
)
"""Load into config
Next, I've got to put this into the config builder. However, this requires me to
make a separate call to a dd.DataDesigner classmethod (??), then give the DF,
and then also, provide a filename -- even though I'm not interested in writing this
to disk.
"""
config_builder.with_seed_dataset(
dataset_reference=dd.DataDesigner.make_seed_reference_from_dataframe(
df_documents,
"wiki.csv"
)
)
# ... continue with config generation

Instead, the desired north star would be to simply do something like the following, and have it work for DataFrames, Datasets, IterableDatasets, or just generic iterators that yield dictionaries.
doc_iterator = load_dataset(
"wikimedia/wikipedia",
"20231101.en",
split="train",
streaming=True
)
config_builder.with_seed_dataset(doc_iterator)
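For concreteness, here is a minimal sketch of the kind of input normalization `with_seed_dataset` could do internally to support all of those source types. Everything here is hypothetical (not existing DD API); the pandas and HF branches are duck-typed so the sketch stays dependency-free:

```python
from itertools import islice
from typing import Any, Dict, Iterator, Optional

def iter_seed_records(source: Any, limit: Optional[int] = None) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: lazily yield record dicts from any supported seed source.

    Accepts a pandas DataFrame, a HF Dataset / IterableDataset, or any plain
    iterable of dicts, without forcing full materialization of the source.
    """
    if hasattr(source, "to_dict") and hasattr(source, "columns"):
        # Duck-typed pandas DataFrame: convert rows to a list of dicts
        records: Iterator[Dict[str, Any]] = iter(source.to_dict(orient="records"))
    else:
        # HF Dataset, IterableDataset, generators, and lists are all iterable
        records = iter(source)
    if limit is not None:
        # Bound how many records we consume, without loading the rest
        records = islice(records, limit)
    yield from records

# Usage with a generic iterator of dicts (stand-in for a streaming dataset)
seed = ({"title": f"doc {i}"} for i in range(5))
rows = list(iter_seed_records(seed, limit=2))
# rows == [{"title": "doc 0"}, {"title": "doc 1"}]
```

A dispatch like this would let `with_seed_dataset(doc_iterator)` accept the streaming dataset directly, deferring consumption until generation actually needs the records.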