
Tame the memory hungriness of the dataconverter and NOMAD parser for large datasets #737

@mkuehbach

Description

  • Observed for a 15.5 GiB NeXus/HDF5 NXem dataset in which essentially all content sits in one 3D array that is 32 GiB uncompressed. Although the array is stored chunked, parsing allocates the uncompressed payload multiple times, which causes parsing to break on small systems.

Suggestions: we should check for the following inefficiencies, which become costly when a dataset is large:

  • Compute summary statistics and finiteness checks chunk-by-chunk, taking advantage of the chunked layout
  • It could be useful to skip computing statistics for contiguously stored datasets that are larger than some fraction of the maximum memory available to the user in the deployment, i.e. "drop on load"
  • Access HDF5 object metadata directly rather than via np.shape() calls on loaded data
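The first bullet could be sketched as follows. This is a minimal illustration, not code from the parser: it streams over fixed-size blocks of a NumPy array so that only one block is resident at a time, maintaining running aggregates. For an actual chunked HDF5 dataset one would iterate `dset.iter_chunks()` from h5py and read each chunk selection individually; the block slicing below stands in for that.

```python
import numpy as np

def blockwise_stats(data, block_size=1_000_000):
    """Running min/max/mean/finiteness over flat blocks of `data`,
    keeping at most `block_size` values in memory at a time.
    With h5py, iterate `dset.iter_chunks()` and read one chunk
    selection per iteration instead of slicing a loaded array."""
    flat = data.reshape(-1)
    total = 0.0
    vmin, vmax = np.inf, -np.inf
    all_finite = True
    for start in range(0, flat.size, block_size):
        block = flat[start:start + block_size]
        # accumulate in float64 to limit rounding drift across blocks
        total += block.sum(dtype=np.float64)
        vmin = min(vmin, float(block.min()))
        vmax = max(vmax, float(block.max()))
        all_finite = all_finite and bool(np.isfinite(block).all())
    return {"min": vmin, "max": vmax,
            "mean": total / flat.size, "finite": all_finite}
```

Peak memory is then bounded by one block (or one HDF5 chunk) rather than the full uncompressed payload.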
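The second and third bullets can be combined into a metadata-only gate; the function name, the 25% default fraction, and the caller-supplied memory budget below are illustrative assumptions, not existing parser API. The point is that the uncompressed size is computable from `shape` and `dtype` alone (both available directly on an h5py `Dataset` object), so the "drop on load" decision never touches the payload.

```python
import numpy as np

def should_compute_stats(shape, dtype, memory_budget_bytes, max_fraction=0.25):
    """Decide from metadata alone whether full statistics should be
    computed for a dataset. If the uncompressed size would exceed
    `max_fraction` of the memory budget, skip it ("drop on load").
    `shape`/`dtype` come straight from the HDF5 object's metadata,
    so no data is read and np.shape() on loaded arrays is avoided."""
    nbytes = int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize
    return nbytes <= max_fraction * memory_budget_bytes
```

For the reported case, a 2000 x 2000 x 2000 float32 array (32 GB uncompressed) would be rejected under an 8 GiB budget, while small metadata-sized datasets pass.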
