Proposal: New API to replace existing arrays in npz files #68

@zhihaoy

Currently, dump_npz either destroys all existing arrays in an npz file or appends arrays whose names already exist in the file as duplicate entries. These are rather strange semantics, especially since loading individual arrays with load_npz doesn't follow them.

More reasonable semantics would be to replace existing arrays by name. This requires a context object with a role similar to HighFive::File.

NumPy doesn't support this. It can only overwrite all arrays at once.

I checked out libzippp and libzip++; neither works with streams. The main reason is that dump_npy_stream and libzip are both "push" interfaces (each needs to run the main loop), so they cannot work together without a special stream class that serves as a pipe from which libzip can pull data...

I think there is a rather simple way to support replacing semantics given a context object. When an npz_file is opened for update, append as usual while keeping the central directory as an in-memory data structure. When closing, write a temporary file containing only the up-to-date arrays, finish that file with the central directory, and do an atomic move to replace the old file. (It's possible to shrink a file in place, but I guess that would invalidate too many I/O buffers.)
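To make the idea concrete, here is a rough sketch of the bookkeeping, not a real implementation. Everything in it is hypothetical: the class name `npz_update_sketch`, its `put`/`close` members, and the helper `read_all` are made up for illustration. It also does not write the ZIP format: payloads are raw bytes and the "central directory" is a plain-text trailer. For brevity it always starts from an empty file; a real version would read the existing central directory on open.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <map>
#include <string>
#include <utility>

namespace fs = std::filesystem;

// Where the latest copy of an array lives in the append-only file.
struct entry { std::uint64_t offset; std::uint64_t size; };

// Hypothetical context object (assumption, not an existing API).
class npz_update_sketch {
public:
    explicit npz_update_sketch(fs::path path)
        : path_(std::move(path)),
          out_(path_, std::ios::binary | std::ios::trunc) {}

    // Append as usual; a repeated name only repoints the directory entry,
    // so the older bytes become dead space until close().
    void put(const std::string& name, const std::string& bytes) {
        out_.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
        dir_[name] = entry{end_, bytes.size()};
        end_ += bytes.size();
    }

    // Copy only the live entries to a temporary file, finish it with the
    // directory, then move it over the original.
    void close() {
        out_.close();
        fs::path tmp = path_;
        tmp += ".tmp";
        {
            std::ifstream in(path_, std::ios::binary);
            std::ofstream compact(tmp, std::ios::binary | std::ios::trunc);
            std::map<std::string, entry> new_dir;
            std::uint64_t pos = 0;
            for (const auto& [name, e] : dir_) {
                std::string buf(e.size, '\0');
                in.seekg(static_cast<std::streamoff>(e.offset));
                in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
                compact.write(buf.data(), static_cast<std::streamsize>(buf.size()));
                new_dir[name] = entry{pos, e.size};
                pos += e.size;
            }
            // Stand-in for the ZIP central directory: "name offset size".
            for (const auto& [name, e] : new_dir)
                compact << name << ' ' << e.offset << ' ' << e.size << '\n';
        }  // both streams are closed here, before the rename
        // Replaces the old file; atomic on POSIX, less certain elsewhere.
        fs::rename(tmp, path_);
    }

private:
    fs::path path_;
    std::ofstream out_;
    std::map<std::string, entry> dir_;  // in-memory directory, last write wins
    std::uint64_t end_ = 0;             // current end of the append-only file
};

// Helper for inspecting the result: read a whole file into a string.
inline std::string read_all(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

With this sketch, calling `put("a", ...)`, `put("b", ...)`, and then `put("a", ...)` again, followed by `close()`, leaves a file that holds only the second copy of `a` plus `b`. Readers never observe a half-compacted file because the old file stays intact until the rename.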
