HDF5 dataloaders using a Python generator #255

@swpenninga


Any dataloaders that use the H5Generator are effectively single-threaded and CPU-bound:

def iterator(self):

Unless I am completely misunderstanding the situation, as far as I know:

tf.data.Dataset.from_generator(image_extractor, ...)

means that next() runs in the Python interpreter, under the GIL. This gives the pipeline:

GPU waits → Python → h5py → Python → TF → GPU

and num_workers, AUTOTUNE, prefetch, and batch have no effect on throughput, since every element must still come out of the single Python generator.
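To make the bottleneck concrete, here is a minimal pure-Python sketch (not zea or TensorFlow code; `slow_batch_loader` and `prefetched` are hypothetical names) of the mildest mitigation: driving the generator from a background thread with a bounded queue, so that loading overlaps with compute. This can help when the time is spent in raw I/O (h5py generally releases the GIL around HDF5 calls), but it does not fix the case where Python-side decoding itself is the CPU-bound step.

```python
import queue
import threading
import time

def slow_batch_loader(n_batches, load_time=0.01):
    """Stand-in for the HDF5 generator; sleep() simulates the disk read."""
    for i in range(n_batches):
        time.sleep(load_time)
        yield i

def prefetched(generator, buffer_size=4):
    """Drive `generator` from a background thread, buffering ready batches
    so loading overlaps with whatever the consumer does per batch."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks end of the stream

    def worker():
        for item in generator:
            q.put(item)  # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# "Training loop": batches arrive in order, but loading of batch i+1
# proceeds while the consumer is still working on batch i.
batches = list(prefetched(slow_batch_loader(8)))
```

Because everything still funnels through one generator, this only hides latency up to the buffer size; it does not add read parallelism.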

Currently I'm training an RF-data VAE where loading a batch takes 7 s but the forward and backward pass takes only 120 ms, which is why I cannot use the zea dataloader.


I understand that fixing this is quite a large task, as there are many dependencies, but it would be good to look into this at some point.
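For what a real fix might look like, the usual pattern is true multi-process reading: shard the index list across worker processes, each of which opens the HDF5 file itself (h5py file handles cannot be shared across processes). A minimal sketch under those assumptions, with the h5py call commented out so it stays self-contained; `read_shard` and `parallel_load` are hypothetical names, not zea API, and the fork start method assumed here is Linux/macOS-only:

```python
from multiprocessing import get_context

def read_shard(path, indices, out_q):
    """Hypothetical per-worker read; each process opens the HDF5 file
    itself, since h5py handles cannot be shared across processes."""
    # with h5py.File(path, "r") as f:            # real version (assumption)
    #     data = [(i, f["data"][i]) for i in indices]
    data = [(i, i * i) for i in indices]          # placeholder payload
    out_q.put(data)

def parallel_load(path, indices, num_workers=2):
    """Shard the index list round-robin and read shards in parallel."""
    ctx = get_context("fork")                     # fork: targets are inherited
    q = ctx.Queue()
    procs = [
        ctx.Process(target=read_shard, args=(path, indices[w::num_workers], q))
        for w in range(num_workers)
    ]
    for p in procs:
        p.start()
    results = [item for _ in procs for item in q.get()]
    for p in procs:
        p.join()
    # Shards finish in arbitrary order, so restore the original index order.
    return [value for _, value in sorted(results)]
```

Wiring something like this under tf.data (e.g. building the dataset from indices rather than a generator) is presumably where most of the dependency work would be.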

Metadata

Labels

data format (Related to the zea data format saving and loading), efficiency (Improvements made regarding code or tests efficiency)
