Persisting metadata cache #179

@tpietzsch

In the context of BigStitcher, we have the problem that opening a container with many (thousands of) datasets (representing tiles) in BigDataViewer takes a long time. The reason is that we need to load the metadata for all tiles to know bounding boxes etc. This happens before any actual data is loaded. Data loading is asynchronous, but loading the metadata is not. We need the metadata (dataset dimensions and DataType) to create the lazy CachedCellImgs, so that we can even decide which data should be lazy-loaded asynchronously.

N5 has N5JsonCache to cache metadata (attributes, group lists, etc). After the cache is populated, querying dataset dimensions is very fast, because we don't need any reads from the backend. But initially, to populate the cache, a lot of backend operations need to happen (reading attributes.json files, listing group contents).

To speed up this initial loading in BDV, I created this hack: bigdataviewer/bigdataviewer-core#213
The idea is to put an attributes_cache.json file in the root of the container that collects all attributes.json files into a single file (currently only those; this should be extended to .zgroup, .zattrs, etc.). This is basically a dump of a fully populated N5JsonCache. If that file exists when opening a container, I read it and populate the N5JsonCache from it. This is very fast. The file can be several MB, but reading, parsing, and injecting it into the N5JsonCache still takes only a few milliseconds.
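
To make the cold-versus-warm difference concrete, here is a minimal sketch. It assumes a hypothetical `MetadataCacheSketch` class with a simple path-to-JSON map and a `Function` standing in for the backend; none of these names are part of N5, and the real N5JsonCache is structured differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a metadata cache; this is NOT the real N5JsonCache.
final class MetadataCacheSketch {
    private final Map<String, String> entries = new HashMap<>();
    private final Function<String, String> backend; // assumed: path -> raw attributes JSON
    private int backendReads = 0;

    MetadataCacheSketch(Function<String, String> backend) {
        this.backend = backend;
    }

    // Returns the attributes for a path, reading from the backend only on a cache miss.
    String getAttributes(String path) {
        String cached = entries.get(path);
        if (cached == null) {
            backendReads++;
            cached = backend.apply(path);
            entries.put(path, cached);
        }
        return cached;
    }

    // Pre-populates an entry (e.g. from a parsed attributes_cache.json) without a backend read.
    void inject(String path, String attributesJson) {
        entries.put(path, attributesJson);
    }

    // Number of times the backend was actually consulted.
    int backendReads() {
        return backendReads;
    }
}
```

With a cold cache, the first `getAttributes` call for each path costs one backend read. If every path is first passed to `inject`, `backendReads()` stays at zero, which is the effect the attributes_cache.json file is meant to achieve.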

Do we want to build this out and make this more standard?

The attributes_cache.json is organized as a tree of JSON objects corresponding to the directory structure of the container.
Every object has a children property that lists all subgroups, an isDataset boolean property, an isGroup property, and each of the attributes files as a JSON object (e.g., one child named attributes.json containing the JsonObject of the parsed attributes.json file). That is, this is exactly the information needed to populate an N5JsonCache.N5CacheInfo entry.
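
A rough sketch of one such tree node follows. The property names `children`, `isDataset`, `isGroup`, and `attributes.json` are the ones described above; the class `CacheNodeSketch`, the property order, and the hand-rolled serializer are assumptions for illustration only (the serializer does not escape names and expects attribute values to already be valid JSON text).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of one node in attributes_cache.json; not part of N5.
final class CacheNodeSketch {
    final boolean isDataset;
    final boolean isGroup;
    // Subgroup name -> node for that subgroup.
    final Map<String, CacheNodeSketch> children = new LinkedHashMap<>();
    // Attributes file name (e.g. "attributes.json") -> its content as raw JSON text.
    final Map<String, String> attributeFiles = new LinkedHashMap<>();

    CacheNodeSketch(boolean isDataset, boolean isGroup) {
        this.isDataset = isDataset;
        this.isGroup = isGroup;
    }

    // Serializes this node and its subtree. Assumes names need no JSON escaping.
    String toJson() {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"isDataset\":").append(isDataset);
        sb.append(",\"isGroup\":").append(isGroup);
        for (Map.Entry<String, String> e : attributeFiles.entrySet()) {
            sb.append(",\"").append(e.getKey()).append("\":").append(e.getValue());
        }
        sb.append(",\"children\":{");
        boolean first = true;
        for (Map.Entry<String, CacheNodeSketch> e : children.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(e.getKey()).append("\":").append(e.getValue().toJson());
        }
        return sb.append("}}").toString();
    }
}
```

Here `children` is modeled as an object keyed by subgroup name, so a child can be looked up directly by name.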

Where it differs from the N5JsonCache structure is that it is actually a nested tree structure. The path to the setup0/timepoint0/s0/attributes.json content would be under the JSON path "/children/setup0/children/timepoint0/children/s0/attributes.json".
The nice thing about this is that we could put attributes_cache.json not only in the container root but also into subgroups, where they would collect everything under that subgroup (with paths relative to the subgroup). This could be used to quickly rebuild the root attributes_cache.json when something changes. To collect all information, we would only have to descend into subgroups until an attributes_cache.json is found; that file could then be included completely, and we wouldn't need to recurse deeper.
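
The path mapping and the early-stopping rebuild can be sketched as follows. Everything here is hypothetical: `SubtreeCollectorSketch`, its in-memory `Group` stand-in for the container, and the `nestedCache` field are illustration only, and the isDataset/isGroup properties are omitted for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of rebuilding the root cache from nested caches; not part of N5.
final class SubtreeCollectorSketch {

    // In-memory stand-in for a group in the container.
    static final class Group {
        final Map<String, Group> children = new LinkedHashMap<>();
        String attributesJson;           // raw attributes.json text, or null if absent
        Map<String, Object> nestedCache; // parsed attributes_cache.json of this group, or null
    }

    // Counts how many groups collect() had to look at.
    int visitedGroups = 0;

    // Maps a group path such as "setup0/timepoint0/s0" to the JSON path of one of its files.
    static String jsonPath(String groupPath, String fileName) {
        StringBuilder sb = new StringBuilder();
        for (String part : groupPath.split("/")) {
            if (!part.isEmpty()) {
                sb.append("/children/").append(part);
            }
        }
        return sb.append('/').append(fileName).toString();
    }

    // Builds the cache tree for a group, stopping wherever a nested cache already exists.
    Map<String, Object> collect(Group group) {
        visitedGroups++;
        if (group.nestedCache != null) {
            return group.nestedCache; // include the whole subtree, no deeper recursion
        }
        Map<String, Object> node = new LinkedHashMap<>();
        if (group.attributesJson != null) {
            node.put("attributes.json", group.attributesJson);
        }
        Map<String, Object> kids = new LinkedHashMap<>();
        for (Map.Entry<String, Group> e : group.children.entrySet()) {
            kids.put(e.getKey(), collect(e.getValue()));
        }
        node.put("children", kids);
        return node;
    }
}
```

Because nested caches use paths relative to their own subgroup, a nested tree can be attached under the parent's `children` entry unchanged. After `collect`, `visitedGroups` shows how many groups were actually inspected.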

Anyway. Do we want to add something like that directly in the N5 core?

If not, could we open the N5JsonCache API a bit more? (Currently I'm using reflection to populate N5JsonCache.N5CacheInfo entries without triggering unwanted backend reads.)
