
Better integration with the Hugging Face datasets ecosystem #160

@casenave

Description


The following code produces https://huggingface.co/datasets/fabiencasenave/Tensile2d_test, which integrates fully with the HF datasets ecosystem.

The features can be exported, but for the complete CGNS trees (with tags, BCs, and so on) we still need to resort to binary-like samples. `Sequence(Value("float64"))` columns are possible (though their efficiency is unclear), and HDF5 integration seems to be underway: huggingface/datasets#7743. We cannot use `Array2D` from `datasets.features` for these types, since it imposes fixed-size arrays. The HDF5 integration also appears limited to fixed-size arrays, since that work seems to rely on `Array2D` and friends: https://github.com/klamike/datasets/blob/2c4bfba70d525b0f9336b8e36b299d73d4a2f3e4/tests/packaged_modules/test_hdf5.py
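To illustrate why the fixed-shape `Array2D` type does not fit here, a minimal sketch (hypothetical field values, standing in for per-node fields of meshes with different sizes): fixed-shape stacking fails on ragged data, while variable-length storage, which is what `Sequence(Value("float64"))` maps to, keeps each sample's own length.

```python
import numpy as np

# Two samples whose per-node fields have different lengths
# (different mesh sizes). Hypothetical values for illustration.
field_a = [0.0, 1.0, 2.0]        # sample with 3 nodes
field_b = [0.0, 1.0, 2.0, 3.0]   # sample with 4 nodes

# Fixed-shape stacking (what an Array2D-style type requires) fails:
try:
    np.stack([field_a, field_b])
    fixed_ok = True
except ValueError:
    fixed_ok = False

# Ragged storage keeps each sample's own length:
ragged = [np.asarray(field_a), np.asarray(field_b)]
print(fixed_ok, [len(r) for r in ragged])  # False [3, 4]
```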

```python
from plaid.bridges import huggingface_bridge
import datasets
from datasets import Features, Value, Sequence

# Load the PLAID dataset from the Hub and convert it to a plaid dataset
hf_dataset = datasets.load_dataset("PLAID-datasets/Tensile2d", split="all_samples")
dataset, pb_def = huggingface_bridge.huggingface_dataset_to_plaid(
    hf_dataset, processes_number=12, verbose=True
)

# Keep only the named features (scalars and fields)
all_feat_ids = dataset.get_all_features_identifiers()
all_feat_ids = [k for k in all_feat_ids if "name" in k.keys()]

# Build the HF feature schema: scalars as float64 values,
# fields as variable-length float64 sequences
features = {}
for feat_id in all_feat_ids:
    if feat_id["type"] == "scalar":
        features[feat_id["name"]] = Value("float64")
    elif feat_id["type"] == "field":
        features[feat_id["name"]] = Sequence(Value("float64"))

# Export each split through a generator and assemble a DatasetDict
_dict = {}
for split in ["train_500", "test", "OOD"]:

    def generator():
        for sample_id in pb_def.get_split(split):
            sample = {}
            for feat_id in all_feat_ids:
                sample[feat_id["name"]] = dataset[sample_id].get_feature_from_identifier(feat_id)
            yield sample

    ds = datasets.Dataset.from_generator(
        generator,
        features=Features(features),
        num_proc=1,
        writer_batch_size=1,
        split=datasets.splits.NamedSplit(split),
    )
    _dict[split] = ds

datasets.DatasetDict(_dict).push_to_hub("fabiencasenave/Tensile2d_test")
```
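The per-split generator pattern above can be checked with a small stand-in that needs no `datasets` dependency (the `fake_splits` and `fake_samples` names below are hypothetical, for illustration only): each generator yields one dict per sample id of its split, with a scalar and a variable-length field.

```python
# Stand-in for pb_def.get_split(...): maps split names to sample ids.
fake_splits = {"train_500": [0, 1, 2], "test": [3], "OOD": [4, 5]}
# Stand-in for per-sample features: a scalar "p" and a
# variable-length field "U" whose size varies per sample.
fake_samples = {i: {"p": float(i), "U": [0.0] * (i + 1)} for i in range(6)}

def make_generator(split_name):
    # Binding split_name as an argument avoids any late-binding
    # surprise when generators for several splits are built in a loop.
    def generator():
        for sample_id in fake_splits[split_name]:
            yield fake_samples[sample_id]
    return generator

rows = list(make_generator("OOD")())
print([len(r["U"]) for r in rows])  # [5, 6]
```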

Labels: enhancement (New feature or request)