7 changes: 7 additions & 0 deletions .cspell/custom-dictionary.txt
@@ -16,11 +16,13 @@ ELECTRONANALYZER
Emminger
Florian
Forschungsgemeinschaft
GPFS
GROUPNAME
Ginzburg
Globar
González
Grundmann
HFIVEPY
Heiko
Hetaba
Hildebrandt
@@ -124,6 +126,7 @@ instancename
isdt
isscalar
issubdtype
itemsize
itemsizing
iufc
iupac
@@ -151,13 +154,16 @@ mydatareader
mynxdl
namefit
namefitting
nbytes
ndarray
ndataconverter
ndims
nexpy
nexusapp
nexusformat
nodemixin
nonvariadic
nslots
nsmap
nxcollection
nxdata
Expand All @@ -175,6 +181,7 @@ printoptions
punx
pynxtools
raman
rdcc
redef
reqs
requiredness
3 changes: 2 additions & 1 deletion docs/index.md
@@ -69,6 +69,7 @@ We are offering a small guide to getting started with NeXus, `pynxtools`, and NO
- [Data conversion in `pynxtools`](learn/pynxtools/dataconverter-and-readers.md)
- [Validation of NeXus files](learn/pynxtools/nexus-validation.md)
- [The `MultiFormatReader` as a reader superclass](learn/pynxtools/multi-format-reader.md)
- [Using and tailoring compression](learn/pynxtools/compression.md)

</div>
<div markdown="block">
@@ -102,4 +103,4 @@ For questions or suggestions:

<h2>Project and community</h2>

The work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - [460197019 (FAIRmat)](https://gepris.dfg.de/gepris/projekt/460197019?language=en).
109 changes: 109 additions & 0 deletions docs/learn/pynxtools/compression.md
@@ -0,0 +1,109 @@
# Using compression with HDF5

## Approach

Data compression covers methods that reduce the size of a dataset or portions of it. These methods are either lossless or lossy, and an entire research field is devoted to developing algorithms and implementations for both. Because `pynxtools` writes its content to HDF5 files, we use the functionality that the HDF5 library provides and offer compression as an optional feature that can be enabled for each dataset individually. To preserve the original data, we support only lossless compression algorithms. Strings and scalar datasets are currently not compressed.

Specifically, we use the built-in [gzip (`deflate`) compression filter](https://support.hdfgroup.org/documentation/hdf5-docs/hdf5_topics/UsingCompressionInHDF5.html) because it is widely supported across the most frequently used programming languages. Users should be aware of the trade-off of using `deflate` instead of more modern algorithms: as of now, there is no efficient multi-threaded implementation of this compression filter within the HDF5 library. Compression can therefore take a substantial share of the total execution time of the HDF5 file writing part of the `dataconverter`.

## How to use compression

Developers of plugins for `pynxtools` instruct the writing of data by adding variables, such as NumPy arrays or xarray objects, like the one in the following example

```
import numpy as np

array = np.zeros((1000,), np.float64)
```

to specific places in the `template` dictionary object that the `dataconverter` provides:

```
template["/ENTRY[entry1]/numpy_array"] = array
```

For such an instruction, the `dataconverter` creates an HDF5 dataset instance that uses the so-called contiguous data storage layout. This dataset is stored uncompressed.

Alternatively, compression can be requested through a slight modification of the previous example:

```
template["/ENTRY[entry1]/array"] = {
    "compress": array,
    "strength": 9,
}
```

Wrapping `array` in a dictionary instructs the `dataconverter` to store a losslessly compressed version of the same dataset.
The dictionary has one mandatory key, `compress`. An optional key, `strength`, overrides the
default compression strength, which allows trading processing time against file size reduction at the granularity of an individual dataset.
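
The trade-off that `strength` controls can be illustrated without HDF5. The following minimal sketch uses the `zlib` module from the Python standard library, which implements the same `deflate` algorithm. It assumes that `strength` behaves like the usual deflate level, where higher values spend more time searching for a smaller result; how `pynxtools` passes `strength` on internally is not shown here, and the payload is made-up illustrative data.

```python
import time
import zlib

# Made-up payload: a repetitive byte pattern standing in for a dataset buffer.
payload = b"".join((i % 251).to_bytes(2, "little") for i in range(200_000))


def deflate(data: bytes, level: int) -> tuple[int, float]:
    """Return the compressed size in bytes and the time spent compressing."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Lossless: decompressing must give back the exact input.
    assert zlib.decompress(compressed) == data
    return len(compressed), elapsed


results = {level: deflate(payload, level) for level in (1, 6, 9)}
for level, (size, elapsed) in results.items():
    print(f"level {level}: {size} bytes in {elapsed * 1e3:.2f} ms")
```

The measured sizes and timings depend on the data and the machine, so it is worth repeating such a comparison with a representative dataset before choosing a value.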

Using compression forces the HDF5 library to use a different, so-called chunked data storage layout.
In a chunked layout, the dataset is split internally into chunks, pieces that are compressed individually,
typically one after another.
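
The difference between the two layouts can be inspected with `h5py` directly. The sketch below is not the code path of the `dataconverter`; it only assumes that `h5py` and NumPy are installed, and the file and dataset names are illustrative.

```python
import h5py
import numpy as np

array = np.zeros((1000,), np.float64)

# An in-memory HDF5 file (core driver without backing store) leaves nothing on disk.
with h5py.File("layout_demo.h5", "w", driver="core", backing_store=False) as h5w:
    # Plain write: contiguous layout, no compression filter.
    plain = h5w.create_dataset("plain", data=array)
    # gzip (deflate) write: h5py switches to a chunked layout and guesses a chunk shape.
    packed = h5w.create_dataset(
        "packed", data=array, compression="gzip", compression_opts=9
    )

    plain_chunks, packed_chunks = plain.chunks, packed.chunks
    packed_filter = (packed.compression, packed.compression_opts)

print(plain_chunks)   # None for a contiguous dataset
print(packed_chunks)  # a tuple chosen by the h5py heuristic
print(packed_filter)
```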

## Benefits for users

The compression filters in HDF5 work in two directions: compression and decompression. Decompression is typically faster than compression.
Thanks to functionality offered by `h5py`, users can work with HDF5 files just as conveniently regardless of whether these combine contiguous and chunked datasets in the same file.

Compared to wrapping the entire HDF5 file into an archive, e.g., by `zip`ping it, compressing individual datasets offers more fine-grained control.
Note that, especially for use in research data management systems like NOMAD, combining both approaches, i.e., wrapping an HDF5 file
that has internally compressed datasets into a zip file, often brings little additional benefit unless the HDF5 file has a considerable
number of groups and additional internal datasets for which the HDF5 library added many padding bytes when writing the file.

The benefit of compression for users is that, depending on the entropy of the data, the HDF5 file size can shrink substantially, which saves
storage space and data transfer time without losing precision. A downside of compression is that data must be decompressed before
they can be accessed and worked with, e.g., in NOMAD. Thanks to `h5py`, this happens automatically.
The chunked storage layout is useful in that it enables selective decompression of only those portions of the dataset that are required.
This is an effective but advanced lever for building more efficient data processing pipelines, especially when not all data are used,
and it works irrespective of the research data management system or downstream applications that consume the HDF5 file.
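
The automatic and selective decompression can be sketched with `h5py` as follows. The image stack, the chunk shape of one image per chunk, and the file name are made-up choices for illustration and are not defaults of `pynxtools`.

```python
import os
import tempfile

import h5py
import numpy as np

# Made-up image stack: 64 images of 128 x 128 pixels with sparse counts.
rng = np.random.default_rng(seed=0)
stack = rng.poisson(0.05, size=(64, 128, 128)).astype(np.uint16)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stack.h5")
    with h5py.File(path, "w") as h5w:
        # One image per chunk, so reading one image touches exactly one chunk.
        h5w.create_dataset(
            "stack", data=stack, chunks=(1, 128, 128), compression="gzip"
        )
    with h5py.File(path, "r") as h5r:
        dset = h5r["stack"]
        # Slicing reads and decompresses only the chunks that overlap the selection.
        image = dset[10]
        # Bytes that HDF5 allocated in the file for this dataset.
        stored_bytes = dset.id.get_storage_size()

print(image.shape, stored_bytes, stack.nbytes)
```

The reading code is the same as for an uncompressed, contiguous dataset; the filter is applied behind the slicing call.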

## Expectation management

Compression is often very effective for images and spectra whose data are stored as integer values, because many bins or pixels record no counts, or
the number of counts is substantially lower than the maximum value that the integer type can represent.
Compression is often less effective when applied to floating-point data. This is frequently the case for measurements or simulations
where physically insignificant changes in the last digits must still be stored when using lossless compression schemes.
The precision that is required, or that a measurement physically offers, is often smaller than the maximum precision of the datatype,
i.e., its discretization. This gap is the motivation behind developing lossy compression methods and using lower-precision floating-point numbers,
e.g., in the field of machine learning and artificial intelligence.
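
A rough feeling for this difference can be obtained with the standard library alone. The sketch below compresses two made-up buffers with `zlib` (the same `deflate` algorithm): sparse integer counts and floating-point values whose trailing digits are random. The data are synthetic assumptions, so the ratios only illustrate the trend.

```python
import random
import zlib
from array import array

random.seed(0)
n = 100_000

# Made-up detector-like data: unsigned integers, most bins empty, a few small counts.
counts = array(
    "I", (random.randint(1, 9) if random.random() < 0.05 else 0 for _ in range(n))
)
# Made-up simulation-like data: 64-bit floats whose trailing digits are effectively noise.
noisy = array("d", (1.0 + 1e-3 * random.random() for _ in range(n)))


def ratio(values: array) -> float:
    """Uncompressed size divided by deflate-compressed size."""
    raw = values.tobytes()
    return len(raw) / len(zlib.compress(raw, 9))


int_ratio, float_ratio = ratio(counts), ratio(noisy)
print(f"sparse integer counts: {int_ratio:.1f}x")
print(f"noisy floats:          {float_ratio:.1f}x")
```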


## Configure the chunking

While chunks are efficient and effective for compression, the chunked design also has drawbacks that can affect read and write
performance. Consequently, the efficiency and speed with which chunked and compressed data can be used in downstream processing
and visualization depend on how the chunks are configured. Compromises need to be made here, especially for large datasets.
The `h5py` library implements a heuristic that tries to construct chunks with a shape similar to that of the original dataset.
Each chunk is effectively a shrunken (hyper-)rectangle or (hyper-)cuboid of the original dataset. This suits users who need to slice about equally
often along all perpendicular viewing directions.

However, there is often a clear bias towards slicing along one prioritized direction, paired with the expectation
that displaying should be fast when slicing perpendicular to this direction. In this case, it can be useful to override the `h5py` heuristic
with another one that favors that direction by shaping the chunks differently.

As an example, assume you wish to inspect an image stack with 100,000 images, each having 1024 x 1024 pixels.
Assume further, for simplicity, that these pixels are organized in a three-dimensional array that is 100,000 images deep.
Depending on the data layout in memory, the values for pixels of the same image lie closer together in memory than those for pixels of neighboring images.
Assume now that you primarily wish to inspect one image at a time, i.e., you slice perpendicular to the
image id axis. In this case, it would be ideal to load only the 1024 x 1024 pixels you need, and ideally these should be in the same chunk.
Speculatively loading neighboring images, or portions of them, is what modern hardware does and what sophisticated visualization software offers,
as it brings advantages when navigating forwards and backwards along the slicing direction.
Assume another user who is interested in the contrast changes along the image id direction, i.e., who narrows in on a single pixel column
and wants to display an array with 100,000 entries. That user would like to have all contrast values, again ideally, in one chunk and
read out in one operation. Such use cases can conflict, which is why the optional functionality of `pynxtools` to customize the
chunk settings should be used when there is a clear bias towards one particular viewing direction.
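
The conflict between the two users can be made concrete by counting how many chunks each request touches. The sketch below is plain Python arithmetic on the image stack from the example; the two chunk shapes are hypothetical choices with the same number of elements per chunk, and `chunks_touched` is an illustrative helper, not a `pynxtools` or `h5py` function.

```python
import math

# Image stack from the example: 100,000 images of 1024 x 1024 pixels; axis 0 is the image id.
shape = (100_000, 1024, 1024)


def chunks_touched(selection, chunk_shape):
    """Count the chunks that overlap a hyper-rectangular selection.

    `selection` holds one (start, stop) pair per axis.
    """
    return math.prod(
        (stop - 1) // size - start // size + 1
        for (start, stop), size in zip(selection, chunk_shape)
    )


# Two hypothetical chunk shapes with the same number of elements (2**20) per chunk.
image_wise = (1, 1024, 1024)  # favors viewing one image at a time
trace_wise = (4096, 16, 16)   # favors following one pixel along the image id axis

one_image = ((500, 501), (0, shape[1]), (0, shape[2]))
one_trace = ((0, shape[0]), (7, 8), (7, 8))

for name, chunk_shape in (("image-wise", image_wise), ("trace-wise", trace_wise)):
    print(
        name,
        chunks_touched(one_image, chunk_shape),
        chunks_touched(one_trace, chunk_shape),
    )
```

With image-wise chunks, one image needs 1 chunk but one pixel trace needs 100,000 chunks; with trace-wise chunks, one image needs 4,096 chunks and one pixel trace needs 25. Every touched chunk has to be read and decompressed as a whole, which is why neither shape serves both users well.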

We observed that relying exclusively on the heuristic of `h5py` frequently produced chunks that were too small, which increased loading and display times
in H5Web within the NOMAD research data management system for HDF5 files generated with `pynxtools`. This motivated adding the
customization option described here.

Customizing the chunking heuristic has an additional level of hardware-dependent complexity, though. Specifically, the actual
read-out performance of chunked HDF5 content can depend heavily on the file system architecture and its settings. It is important to understand
that the chunk configuration is defined when a dataset is written into the HDF5 file and cannot be changed thereafter.
The `src/pynxtools/dataconverter/chunk_cache.py` configuration makes explicit typical default values that can serve as a starting point for analyses
on the different file systems used for deploying NOMAD. By default, we follow the default of the `h5py` library, which tries to achieve a
performance compromise that is tailored towards operations on a single storage device, as on servers and laptops.

Developers who customize for Lustre- or GPFS-based hardware and NOMAD deployments can use the `chunk_cache` settings to explore further
optimization routes and make the most of their NeXus/HDF5-file-based RDM pipeline in NOMAD.
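
How such a configuration could be turned into concrete settings is sketched below. This is an assumption-laden illustration, not the implementation in `pynxtools`: it assumes that `byte_size` is the target size of one chunk in bytes, and both `split_config` and `rows_per_chunk` are hypothetical helpers. Only the key names `rdcc_nbytes`, `rdcc_nslots`, and `rdcc_w0` are taken from the configuration, where they match the keyword arguments of `h5py.File`.

```python
# Copy of the GPFS entry from chunk_cache.py, repeated to keep the sketch self-contained.
CHUNK_CONFIG_GPFS = {
    "byte_size": 8 * 1024 * 1024,
    "rdcc_nbytes": 256 * 1024 * 1024,
    "rdcc_nslots": 521,
    "rdcc_w0": 0.75,
}


def split_config(config):
    """Separate the chunk cache settings (rdcc_*) from the target chunk size.

    Hypothetical helper, not part of pynxtools.
    """
    cache_kwargs = {
        key: value for key, value in config.items() if key.startswith("rdcc_")
    }
    return cache_kwargs, config["byte_size"]


def rows_per_chunk(row_shape, itemsize, byte_size):
    """Number of complete rows (e.g. images) along axis 0 that fit into byte_size bytes.

    Hypothetical helper; assumes byte_size is the target chunk size in bytes.
    """
    row_bytes = itemsize
    for extent in row_shape:
        row_bytes *= extent
    return max(1, byte_size // row_bytes)


cache_kwargs, byte_size = split_config(CHUNK_CONFIG_GPFS)
# 1024 x 1024 images of 2-byte integers take 2 MiB each, so 4 images fit into 8 MiB.
chunk_shape = (rows_per_chunk((1024, 1024), 2, byte_size), 1024, 1024)
print(cache_kwargs)
print(chunk_shape)
```

With `h5py`, the cache settings could then be passed as `h5py.File(path, "w", **cache_kwargs)` and the chunk shape as `chunks=chunk_shape` to `create_dataset`.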


1 change: 1 addition & 0 deletions mkdocs.yaml
@@ -35,6 +35,7 @@ nav:
- learn/pynxtools/dataconverter-and-readers.md
- learn/pynxtools/nexus-validation.md
- learn/pynxtools/multi-format-reader.md
- learn/pynxtools/compression.md
- Reference:
- reference/definitions.md
- reference/cli-api.md
18 changes: 18 additions & 0 deletions src/pynxtools/dataconverter/__init__.py
@@ -1,3 +1,21 @@
#
# Copyright The NOMAD Authors.
#
# This file is part of NOMAD. See https://nomad-lab.eu for further info.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from pynxtools.dataconverter import helpers, validation

helpers.validate_data_dict = validation.validate_data_dict # type: ignore
57 changes: 57 additions & 0 deletions src/pynxtools/dataconverter/chunk_cache.py
@@ -0,0 +1,57 @@
#
These three new files should be combined into one (e.g. chunk.py) since they all address the same functionality.

Alternatively, as discussed in #739, they should go to helpers.py.

# Copyright The NOMAD Authors.
#
# This file is part of NOMAD. See https://nomad-lab.eu for further info.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""Use-case-specific configurations to optimize performance for chunked storage."""

# https://github.com/h5py/h5py/blob/master/docs/high/file.rst

# e.g. for h5py v3.15.1 https://github.com/h5py/h5py/blob/fad034c16f595cb24f4393bbd0dcd23c53bc9a33/h5py/tests/test_file2.py#L111
CHUNK_CONFIG_HFIVEPY: dict[str, int | float] = {
    "byte_size": 1 * 1024 * 1024,
    "rdcc_nbytes": 1 * 1024 * 1024,  # 1 MiB before HDF5 2.0, will be 8 MiB for HDF5 2.0
    "rdcc_nslots": 521,
    "rdcc_w0": 0.75,
}

CHUNK_CONFIG_SSD_NVM: dict[str, int | float] = {
    "byte_size": 1 * 1024 * 1024,
    "rdcc_nbytes": 128 * 1024 * 1024,
    "rdcc_nslots": 4093,
    "rdcc_w0": 0.75,
}

CHUNK_CONFIG_HDD: dict[str, int | float] = {
    "byte_size": 4 * 1024 * 1024,
    "rdcc_nbytes": 256 * 1024 * 1024,
    "rdcc_nslots": 1021,
    "rdcc_w0": 0.75,
}

CHUNK_CONFIG_GPFS: dict[str, int | float] = {
    "byte_size": 8 * 1024 * 1024,
    "rdcc_nbytes": 256 * 1024 * 1024,
    "rdcc_nslots": 521,
    "rdcc_w0": 0.75,
}

CHUNK_CONFIG_LUSTRE: dict[str, int | float] = {
    # set stripe size before creating a file!
    "byte_size": 8 * 1024 * 1024,
    "rdcc_nbytes": 256 * 1024 * 1024,
    "rdcc_nslots": 521,
    "rdcc_w0": 0.75,
}

CHUNK_CONFIG_DEFAULT = CHUNK_CONFIG_HFIVEPY