11 changes: 11 additions & 0 deletions .cspell/custom-dictionary.txt
@@ -16,11 +16,13 @@ ELECTRONANALYZER
Emminger
Florian
Forschungsgemeinschaft
GPFS
GROUPNAME
Ginzburg
Globar
González
Grundmann
HFIVEPY
Heiko
Hetaba
Hildebrandt
@@ -104,6 +106,7 @@ fairmat
fillvalue
flatfield
fluence
fourd
fxcef
getlink
getroottree
@@ -124,6 +127,7 @@ instancename
isdt
isscalar
issubdtype
itemsize
itemsizing
iufc
iupac
@@ -151,19 +155,23 @@ mydatareader
mynxdl
namefit
namefitting
nbytes
ndarray
ndataconverter
ndims
nexpy
nexusapp
nexusformat
nodemixin
nonvariadic
nslots
nsmap
nxcollection
nxdata
nxdl
nxdls
nxentry
oned
optionalities
orcid
otherfile
@@ -175,6 +183,7 @@ printoptions
punx
pynxtools
raman
rdcc
redef
reqs
requiredness
@@ -186,10 +195,12 @@ showlegend
straße
submoduled
superproject
threed
tnxdl
tofile
tommaso
tracebacklimit
twod
underload
uniquify
unitless
1 change: 1 addition & 0 deletions docs/index.md
@@ -69,6 +69,7 @@ We are offering a small guide to getting started with NeXus, `pynxtools`, and NO
- [Data conversion in `pynxtools`](learn/pynxtools/dataconverter-and-readers.md)
- [Validation of NeXus files](learn/pynxtools/nexus-validation.md)
- [The `MultiFormatReader` as a reader superclass](learn/pynxtools/multi-format-reader.md)
- [Using and tailoring compression](learn/pynxtools/compression.md)

</div>
<div markdown="block">
102 changes: 102 additions & 0 deletions docs/learn/pynxtools/compression.md
@@ -0,0 +1,102 @@
# Using compression with HDF5

## Approach

Data compression covers methods that reduce the size of a dataset or portions of it; lossless and lossy methods are distinguished. Given that `pynxtools` writes its content to HDF5 files, we decided to use the compression filters that the HDF5 library provides. Compression in `pynxtools` is optional and can be enabled at the level of individual datasets. To preserve the original data exactly, we decided to support only lossless compression algorithms, and not to compress strings and scalar datasets.

Specifically, we use the built-in [`deflate` compression filter](https://support.hdfgroup.org/documentation/hdf5-docs/hdf5_topics/UsingCompressionInHDF5.html) due to its wide support across the most frequently used programming languages. Users should be aware that `deflate`, unlike more modern algorithms, currently has no efficient multi-threaded implementation within the HDF5 library. Compression can therefore take a substantial share of the total execution time of the HDF5 file writing step of the `dataconverter`.

## How to use compression

Developers of plugins for `pynxtools` instruct the writing of data by assigning variables such as NumPy arrays or xarrays, like

```
import numpy as np

array = np.zeros((1000,), np.float64)
```

to specific places in the [`template`](https://fairmat-nfdi.github.io/pynxtools/how-tos/pynxtools/build-a-plugin.html#the-reader-template-dictionary) object that the `dataconverter` provides:

```
template["/ENTRY[entry1]/numpy_array"] = array
```

Given such a template entry, the `dataconverter` creates an HDF5 dataset that uses the so-called contiguous data storage layout, i.e., the dataset is stored uncompressed.

As an alternative, compression can be instructed via a slight modification of the previous example:

```
template["/ENTRY[entry1]/array"] = {
    "compress": array,
    "strength": 9,
}
```

Wrapping the `array` into a dictionary instructs the `dataconverter` to store a losslessly compressed version of the same dataset.
The dictionary has one mandatory keyword, `compress`. The optional keyword `strength` overrides the default compression strength,
trading processing time against file size reduction at the granularity of an individual dataset.
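How such a template entry maps onto the HDF5 API can be illustrated with plain `h5py` (a sketch, not the actual `dataconverter` writer code; the file and dataset names are made up). HDF5's `deflate` filter is exposed in `h5py` as `compression="gzip"`, and the strength corresponds to `compression_opts`:

```python
import h5py
import numpy as np

array = np.zeros((1000,), np.float64)

with h5py.File("compressed_example.h5", "w") as f:
    dset = f.create_dataset(
        "entry1/array",
        data=array,
        compression="gzip",   # the deflate filter
        compression_opts=9,   # strength: 0 (fastest) .. 9 (smallest)
    )
    print(dset.compression, dset.compression_opts)  # gzip 9
```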

Using compression internally forces the HDF5 library to use a different, so-called chunked data storage layout.
A chunked layout splits the dataset internally into chunks, i.e., pieces that are compressed individually,
typically one after another.
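That compression implies the chunked layout can be verified directly with `h5py` (a small sketch; file and dataset names are hypothetical):

```python
import h5py
import numpy as np

data = np.zeros((1000,), np.float64)
with h5py.File("layouts.h5", "w") as f:
    plain = f.create_dataset("plain", data=data)
    packed = f.create_dataset("packed", data=data, compression="gzip")
    # Contiguous datasets report no chunk shape; compressed ones do.
    print(plain.chunks)   # None -> contiguous layout
    print(packed.chunks)  # an auto-chosen chunk shape, e.g. (1000,)
```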

## Benefits for users

The compression filters in HDF5 work in two directions, compression and decompression; decompression is typically faster.
Using the functionality offered by `h5py`, users can work just as conveniently with HDF5 files even when contiguous and chunked datasets are combined in the same file.

Compared to wrapping the entire HDF5 file into an archive, e.g., `zip`ping it up, per-dataset compression offers more fine-grained control.
When uploading content to research data management systems like NOMAD, users often wrap their file(s) into a `zip` or other type of compressed archive. Using compression as described above can make this obsolete, as the additional gain from archiving an already-compressed file is usually insignificant. A relevant exception where `zip`ping an HDF5 file is still useful, even though much of its internal content is already compressed, is when the file contains a considerable number of groups plus substantial remaining padding bytes.

Using compression can significantly reduce the size of HDF5 files, depending on the entropy of the data. This saves storage and speeds up data transfer without any loss of numerical precision. A downside of using compression is that data must be decompressed before it can be accessed and worked with, e.g., in NOMAD; `h5py` does this automatically.
The chunked storage layout is useful in that it enables selective decompression of only those portions of the dataset that are required.
This provides an effective, if more advanced, mechanism for improving data processing pipelines, particularly when only subsets of the data are needed, independently of the research data management system or downstream application that accesses the HDF5 file.
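Selective decompression is transparent in `h5py`: slicing a compressed dataset only touches the chunks that overlap the requested region. A sketch with made-up data:

```python
import h5py
import numpy as np

stack = np.arange(64, dtype=np.int32).reshape(4, 4, 4)
with h5py.File("selective.h5", "w") as f:
    # One chunk per 4 x 4 image: slicing one image decompresses one chunk.
    f.create_dataset("stack", data=stack,
                     compression="gzip", chunks=(1, 4, 4))

with h5py.File("selective.h5", "r") as f:
    one_image = f["stack"][2]  # only the chunk holding image 2 is decompressed
    print(one_image.shape)     # (4, 4)
```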

## Expectation management

Compression is often very effective for images and spectra stored as integer values, since many bins or pixels record no counts at all or
counts substantially lower than the maximum value the integer type can represent.
Compression is typically less effective on floating-point data. This is frequently the case for measurements or simulations
where physically insignificant changes in the last digits still demand storage when lossless compression schemes are used.
The gap between the precision a measurement actually offers and the maximum precision of the data type,
i.e., discretization, is the motivation behind developing lossy compression methods and using lower-precision floating-point numbers,
e.g., in the field of machine learning and artificial intelligence.
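This effect can be demonstrated with the same `deflate` algorithm from Python's standard `zlib` module, on made-up data: sparse integer counts versus noisy floating-point values:

```python
import zlib
import numpy as np

rng = np.random.default_rng(42)
# Sparse detector-like counts: mostly zeros, compresses very well.
counts = rng.poisson(0.05, size=100_000).astype(np.uint16)
# Dense float64 noise: random mantissa bits compress poorly.
floats = rng.normal(size=100_000)

for name, arr in (("counts", counts), ("floats", floats)):
    raw = arr.tobytes()
    packed = zlib.compress(raw, 9)  # deflate, strength 9
    print(f"{name}: {len(raw)} -> {len(packed)} bytes")
```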

## Tailoring the chunking

While efficient and effective for compression, chunking also has drawbacks that can affect performance.
The processing speed that compressed data offers in downstream processing and visualization depends on the configuration
of the chunks, and compromises are needed especially for large datasets. The `h5py` library uses a heuristic to generate
chunk shapes that resemble the dimensions of the original dataset. This serves users who slice a dataset approximately equally
often across all perpendicular viewing directions.

However, often one viewing direction is prioritized, and users expect displaying to be fastest when
slicing perpendicular to that direction. In this case, it can be useful to override the `h5py` heuristic
with one that favors that direction by shaping the chunks differently.

As an example, assume you wish to inspect an image stack of 100,000 images, each with 1024 x 1024 pixels.
Assume further, for simplicity, that these pixels are organized in a three-dimensional array that is 100,000 images deep.
Depending on the data layout in memory, the pixel values of one image pack closer together than the pixels of neighboring images.
Assume now that you primarily inspect one image at a time, i.e., you slice perpendicular to the
image-id axis. In this case, it would be ideal to load only the 1024 x 1024 pixels you need, and ideally these should lie in the same chunk.
Speculatively loading neighboring images, or portions of them, is what modern hardware does and sophisticated visualization software offers,
as it speeds up navigating forwards and backwards along the slicing direction.
Now assume another user who is interested in contrast changes along the image-id direction, i.e., who narrows in on a single pixel column
and wants to display an array with 100,000 entries. That user would again like to have all contrast values ideally in one chunk,
read out in one operation. Such use cases collide, which is why the optional functionality of `pynxtools` to customize the
chunk settings should be used when there is a clear bias towards one particular viewing direction.
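The two colliding layouts can be sketched at a scaled-down size (dimensions reduced so the example runs quickly; file and dataset names are made up):

```python
import h5py
import numpy as np

# 100 "images" of 64 x 64 pixels instead of 100,000 of 1024 x 1024.
stack = np.zeros((100, 64, 64), dtype=np.uint16)
with h5py.File("viewing.h5", "w") as f:
    # Favors slicing one whole image at a time: one chunk per image.
    f.create_dataset("per_image", data=stack,
                     compression="gzip", chunks=(1, 64, 64))
    # Favors per-pixel traces along the image-id axis: each chunk spans
    # the full depth of the stack for an 8 x 8 pixel patch.
    f.create_dataset("per_trace", data=stack,
                     compression="gzip", chunks=(100, 8, 8))
```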

We observed that relying exclusively on the `h5py` heuristic frequently produced chunks that were too small, which increased loading and
display times when HDF5 files generated with `pynxtools` were viewed with H5Web in the NOMAD research data management system. This
motivated adding the customization option described here.

Customizing the chunking heuristic adds a level of hardware-dependent complexity, though. Specifically, the actual
read-out performance of chunked HDF5 content can depend heavily on the file system architecture and its settings. It is important to understand
that the chunk configuration is defined when the dataset is written into the HDF5 file and cannot be changed thereafter.
The `src/pynxtools/dataconverter/chunk.py` configuration makes explicit typical default values one can use as a starting point
on the different file systems used for deploying NOMAD. By default, we follow the default of the `h5py` library, which aims at a
performance compromise tailored towards single-storage operation, as on servers and laptops.

Developers who customize for Lustre- or GPFS-based hardware and NOMAD deployments can use the chunk-cache settings to explore further
optimization routes and make the most of their NeXus/HDF5-file-based RDM pipeline in NOMAD.
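One such route is the per-file chunk cache that `h5py` exposes at file-open time. A sketch (the `rdcc_*` keywords are part of the `h5py` API; the chosen values are illustrative, not recommendations for any specific file system):

```python
import h5py
import numpy as np

# Create a small demo file so the example is self-contained.
with h5py.File("cache_demo.h5", "w") as f:
    f.create_dataset("data", data=np.zeros((256, 256)),
                     compression="gzip", chunks=(32, 256))

# Enlarge the raw-data chunk cache for read-heavy access patterns.
f = h5py.File(
    "cache_demo.h5", "r",
    rdcc_nbytes=64 * 1024**2,  # 64 MiB chunk cache (h5py default: 1 MiB)
    rdcc_nslots=100_003,       # prime number of hash-table slots
    rdcc_w0=0.75,              # prefer evicting fully read chunks
)
print(f["data"].chunks)  # (32, 256)
f.close()
```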
1 change: 1 addition & 0 deletions mkdocs.yaml
@@ -35,6 +35,7 @@ nav:
- learn/pynxtools/dataconverter-and-readers.md
- learn/pynxtools/nexus-validation.md
- learn/pynxtools/multi-format-reader.md
- learn/pynxtools/compression.md
- Reference:
- reference/definitions.md
- reference/cli-api.md
18 changes: 18 additions & 0 deletions src/pynxtools/dataconverter/__init__.py
@@ -1,3 +1,21 @@
#
# Copyright The NOMAD Authors.
#
# This file is part of NOMAD. See https://nomad-lab.eu for further info.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from pynxtools.dataconverter import helpers, validation

helpers.validate_data_dict = validation.validate_data_dict # type: ignore