
Better integration with the Hugging Face datasets ecosystem #160

@casenave

Description


The following code produces https://huggingface.co/datasets/fabiencasenave/Tensile2d_test, which integrates fully with the HF datasets ecosystem.

The features can be exported, but for the complete CGNS trees (with tags, BCs, and so on) we still need to resort to binary-like samples. `Sequence(Value("float64"))` columns are possible (though their efficiency is unclear), and HDF5 integration seems to be underway: huggingface/datasets#7743. We cannot use `Array2D` from `datasets.features` for these types, since it imposes fixed-size arrays. The HDF5 integration also appears limited to fixed-size arrays, since that work seems to rely on `Array2D` and friends: https://github.com/klamike/datasets/blob/2c4bfba70d525b0f9336b8e36b299d73d4a2f3e4/tests/packaged_modules/test_hdf5.py
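To illustrate why the fixed-shape `Array2D` type does not fit here, a minimal sketch (hypothetical field values, standing in for per-node fields of meshes with different sizes): fixed-shape stacking fails on ragged data, while variable-length storage, which is what `Sequence(Value("float64"))` maps to, keeps each sample's own length.

```python
import numpy as np

# Two samples whose per-node fields have different lengths
# (different mesh sizes). Hypothetical values for illustration.
field_a = [0.0, 1.0, 2.0]        # sample with 3 nodes
field_b = [0.0, 1.0, 2.0, 3.0]   # sample with 4 nodes

# Fixed-shape stacking (what an Array2D-style type requires) fails:
try:
    np.stack([field_a, field_b])
    fixed_ok = True
except ValueError:
    fixed_ok = False

# Ragged storage keeps each sample's own length:
ragged = [np.asarray(field_a), np.asarray(field_b)]
print(fixed_ok, [len(r) for r in ragged])  # False [3, 4]
```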

```python
from plaid.bridges import huggingface_bridge
import datasets
from datasets import Features, Value, Sequence

# Load the PLAID dataset from the Hub and convert it to a plaid dataset
hf_dataset = datasets.load_dataset("PLAID-datasets/Tensile2d", split="all_samples")
dataset, pb_def = huggingface_bridge.huggingface_dataset_to_plaid(
    hf_dataset, processes_number=12, verbose=True
)

# Keep only the named features (scalars and fields)
all_feat_ids = dataset.get_all_features_identifiers()
all_feat_ids = [k for k in all_feat_ids if "name" in k.keys()]

# Build the HF feature schema: scalars as float64 values,
# fields as variable-length float64 sequences
features = {}
for feat_id in all_feat_ids:
    if feat_id["type"] == "scalar":
        features[feat_id["name"]] = Value("float64")
    elif feat_id["type"] == "field":
        features[feat_id["name"]] = Sequence(Value("float64"))

# Export each split through a generator and assemble a DatasetDict
_dict = {}
for split in ["train_500", "test", "OOD"]:

    def generator():
        for sample_id in pb_def.get_split(split):
            sample = {}
            for feat_id in all_feat_ids:
                sample[feat_id["name"]] = dataset[sample_id].get_feature_from_identifier(feat_id)
            yield sample

    ds = datasets.Dataset.from_generator(
        generator,
        features=Features(features),
        num_proc=1,
        writer_batch_size=1,
        split=datasets.splits.NamedSplit(split),
    )
    _dict[split] = ds

datasets.DatasetDict(_dict).push_to_hub("fabiencasenave/Tensile2d_test")
```
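The per-split generator pattern above can be checked with a small stand-in that needs no `datasets` dependency (the `fake_splits` and `fake_samples` names below are hypothetical, for illustration only): each generator yields one dict per sample id of its split, with a scalar and a variable-length field.

```python
# Stand-in for pb_def.get_split(...): maps split names to sample ids.
fake_splits = {"train_500": [0, 1, 2], "test": [3], "OOD": [4, 5]}
# Stand-in for per-sample features: a scalar "p" and a
# variable-length field "U" whose size varies per sample.
fake_samples = {i: {"p": float(i), "U": [0.0] * (i + 1)} for i in range(6)}

def make_generator(split_name):
    # Binding split_name as an argument avoids any late-binding
    # surprise when generators for several splits are built in a loop.
    def generator():
        for sample_id in fake_splits[split_name]:
            yield fake_samples[sample_id]
    return generator

rows = list(make_generator("OOD")())
print([len(r["U"]) for r in rows])  # [5, 6]
```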

Labels: enhancement (New feature or request)