11 changes: 11 additions & 0 deletions .cspell/custom-dictionary.txt
@@ -16,11 +16,13 @@ ELECTRONANALYZER
Emminger
Florian
Forschungsgemeinschaft
GPFS
GROUPNAME
Ginzburg
Globar
González
Grundmann
HFIVEPY
Heiko
Hetaba
Hildebrandt
@@ -104,6 +106,7 @@ fairmat
fillvalue
flatfield
fluence
fourd
fxcef
getlink
getroottree
@@ -124,6 +127,7 @@ instancename
isdt
isscalar
issubdtype
itemsize
itemsizing
iufc
iupac
@@ -151,19 +155,23 @@ mydatareader
mynxdl
namefit
namefitting
nbytes
ndarray
ndataconverter
ndims
nexpy
nexusapp
nexusformat
nodemixin
nonvariadic
nslots
nsmap
nxcollection
nxdata
nxdl
nxdls
nxentry
oned
optionalities
orcid
otherfile
@@ -175,6 +183,7 @@ printoptions
punx
pynxtools
raman
rdcc
redef
reqs
requiredness
@@ -186,10 +195,12 @@ showlegend
straße
submoduled
superproject
threed
tnxdl
tofile
tommaso
tracebacklimit
twod
underload
uniquify
unitless
1 change: 1 addition & 0 deletions docs/index.md
@@ -69,6 +69,7 @@ We are offering a small guide to getting started with NeXus, `pynxtools`, and NO
- [Data conversion in `pynxtools`](learn/pynxtools/dataconverter-and-readers.md)
- [Validation of NeXus files](learn/pynxtools/nexus-validation.md)
- [The `MultiFormatReader` as a reader superclass](learn/pynxtools/multi-format-reader.md)
- [Using and tailoring compression](learn/pynxtools/compression.md)

</div>
<div markdown="block">
102 changes: 102 additions & 0 deletions docs/learn/pynxtools/compression.md
@@ -0,0 +1,102 @@
# Using compression with HDF5

## Approach

Data compression covers methods that reduce the size of a dataset or portions of it; lossless and lossy methods are distinguished. Given that `pynxtools` writes its content to HDF5 files, we decided to use the compression filters that the HDF5 library provides. Compression in `pynxtools` is optional and can be enabled at the level of individual datasets. To preserve the original data exactly, we decided to support only lossless compression algorithms, and not to compress strings and scalar datasets.

Specifically, we use the built-in [`deflate` compression filter](https://support.hdfgroup.org/documentation/hdf5-docs/hdf5_topics/UsingCompressionInHDF5.html) due to its wide support across the most frequently used programming languages. Users should be aware that `deflate`, unlike more modern algorithms, currently has no efficient multi-threaded implementation within the HDF5 library. Compression can therefore take a substantial share of the total execution time of the HDF5 file writing step of the `dataconverter`.

## How to use compression

Developers of plugins for `pynxtools` instruct the writing of data by assigning variables such as NumPy arrays or xarrays, like

```
import numpy as np

array = np.zeros((1000,), np.float64)
```

to specific places in the [`template`](https://fairmat-nfdi.github.io/pynxtools/how-tos/pynxtools/build-a-plugin.html#the-reader-template-dictionary) object that the `dataconverter` provides:

```
template["/ENTRY[entry1]/numpy_array"] = array
```

Given such a template entry, the `dataconverter` creates an HDF5 dataset that uses the so-called contiguous data storage layout, i.e., the dataset is stored uncompressed.

As an alternative, compression can be instructed via a slight modification of the previous example:

```
template["/ENTRY[entry1]/array"] = {
    "compress": array,
    "strength": 9,
}
```

Wrapping the `array` into a dictionary instructs the `dataconverter` to store a losslessly compressed version of the same dataset.
The dictionary has one mandatory keyword, `compress`. The optional keyword `strength` overrides the default compression strength,
trading processing time against file size reduction at the granularity of an individual dataset.
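How such a template entry maps onto the HDF5 API can be illustrated with plain `h5py` (a sketch, not the actual `dataconverter` writer code; the file and dataset names are made up). HDF5's `deflate` filter is exposed in `h5py` as `compression="gzip"`, and the strength corresponds to `compression_opts`:

```python
import h5py
import numpy as np

array = np.zeros((1000,), np.float64)

with h5py.File("compressed_example.h5", "w") as f:
    dset = f.create_dataset(
        "entry1/array",
        data=array,
        compression="gzip",   # the deflate filter
        compression_opts=9,   # strength: 0 (fastest) .. 9 (smallest)
    )
    print(dset.compression, dset.compression_opts)  # gzip 9
```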

Using compression internally forces the HDF5 library to use a different, so-called chunked data storage layout.
A chunked layout splits the dataset internally into chunks, i.e., pieces that are compressed individually,
typically one after another.
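That compression implies the chunked layout can be verified directly with `h5py` (a small sketch; file and dataset names are hypothetical):

```python
import h5py
import numpy as np

data = np.zeros((1000,), np.float64)
with h5py.File("layouts.h5", "w") as f:
    plain = f.create_dataset("plain", data=data)
    packed = f.create_dataset("packed", data=data, compression="gzip")
    # Contiguous datasets report no chunk shape; compressed ones do.
    print(plain.chunks)   # None -> contiguous layout
    print(packed.chunks)  # an auto-chosen chunk shape, e.g. (1000,)
```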

## Benefits for users

The compression filters in HDF5 work in two directions, compression and decompression; decompression is typically faster.
Using the functionality offered by `h5py`, users can work just as conveniently with HDF5 files even when contiguous and chunked datasets are combined in the same file.

Compared to wrapping the entire HDF5 file into an archive, e.g., `zip`ping it up, per-dataset compression offers more fine-grained control.
When uploading content to research data management systems like NOMAD, users often wrap their file(s) into a `zip` or other type of compressed archive. Using compression as described above can make this obsolete, as the additional gain from archiving an already-compressed file is usually insignificant. A relevant exception where `zip`ping an HDF5 file is still useful, even though much of its internal content is already compressed, is when the file contains a considerable number of groups plus substantial remaining padding bytes.

Using compression can significantly reduce the size of HDF5 files, depending on the entropy of the data. This saves storage and speeds up data transfer without any loss of numerical precision. A downside of using compression is that data must be decompressed before it can be accessed and worked with, e.g., in NOMAD; `h5py` does this automatically.
The chunked storage layout is useful in that it enables selective decompression of only those portions of the dataset that are required.
This provides an effective, if more advanced, mechanism for improving data processing pipelines, particularly when only subsets of the data are needed, independently of the research data management system or downstream application that accesses the HDF5 file.
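Selective decompression is transparent in `h5py`: slicing a compressed dataset only touches the chunks that overlap the requested region. A sketch with made-up data:

```python
import h5py
import numpy as np

stack = np.arange(64, dtype=np.int32).reshape(4, 4, 4)
with h5py.File("selective.h5", "w") as f:
    # One chunk per 4 x 4 image: slicing one image decompresses one chunk.
    f.create_dataset("stack", data=stack,
                     compression="gzip", chunks=(1, 4, 4))

with h5py.File("selective.h5", "r") as f:
    one_image = f["stack"][2]  # only the chunk holding image 2 is decompressed
    print(one_image.shape)     # (4, 4)
```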

## Expectation management

Compression is often very effective for images and spectra stored as integer values, since many bins or pixels record no counts at all or
counts substantially lower than the maximum value the integer type can represent.
Compression is typically less effective on floating-point data. This is frequently the case for measurements or simulations
where physically insignificant changes in the last digits still demand storage when lossless compression schemes are used.
The gap between the precision a measurement actually offers and the maximum precision of the data type,
i.e., discretization, is the motivation behind developing lossy compression methods and using lower-precision floating-point numbers,
e.g., in the field of machine learning and artificial intelligence.
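This effect can be demonstrated with the same `deflate` algorithm from Python's standard `zlib` module, on made-up data: sparse integer counts versus noisy floating-point values:

```python
import zlib
import numpy as np

rng = np.random.default_rng(42)
# Sparse detector-like counts: mostly zeros, compresses very well.
counts = rng.poisson(0.05, size=100_000).astype(np.uint16)
# Dense float64 noise: random mantissa bits compress poorly.
floats = rng.normal(size=100_000)

for name, arr in (("counts", counts), ("floats", floats)):
    raw = arr.tobytes()
    packed = zlib.compress(raw, 9)  # deflate, strength 9
    print(f"{name}: {len(raw)} -> {len(packed)} bytes")
```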

## Tailoring the chunking

While efficient and effective for compression, chunking also has drawbacks that can affect performance.
The processing speed that compressed data offers in downstream processing and visualization depends on the configuration
of the chunks, and compromises are needed especially for large datasets. The `h5py` library uses a heuristic to generate
chunk shapes that resemble the dimensions of the original dataset. This serves users who slice a dataset approximately equally
often across all perpendicular viewing directions.

However, often one viewing direction is prioritized, and users expect displaying to be fastest when
slicing perpendicular to that direction. In this case, it can be useful to override the `h5py` heuristic
with one that favors that direction by shaping the chunks differently.

As an example, assume you wish to inspect an image stack of 100,000 images, each with 1024 x 1024 pixels.
Assume further, for simplicity, that these pixels are organized in a three-dimensional array that is 100,000 images deep.
Depending on the data layout in memory, the pixel values of one image pack closer together than the pixels of neighboring images.
Assume now that you primarily inspect one image at a time, i.e., you slice perpendicular to the
image-id axis. In this case, it would be ideal to load only the 1024 x 1024 pixels you need, and ideally these should lie in the same chunk.
Speculatively loading neighboring images, or portions of them, is what modern hardware does and sophisticated visualization software offers,
as it speeds up navigating forwards and backwards along the slicing direction.
Now assume another user who is interested in contrast changes along the image-id direction, i.e., who narrows in on a single pixel column
and wants to display an array with 100,000 entries. That user would again like to have all contrast values ideally in one chunk,
read out in one operation. Such use cases collide, which is why the optional functionality of `pynxtools` to customize the
chunk settings should be used when there is a clear bias towards one particular viewing direction.
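The two colliding layouts can be sketched at a scaled-down size (dimensions reduced so the example runs quickly; file and dataset names are made up):

```python
import h5py
import numpy as np

# 100 "images" of 64 x 64 pixels instead of 100,000 of 1024 x 1024.
stack = np.zeros((100, 64, 64), dtype=np.uint16)
with h5py.File("viewing.h5", "w") as f:
    # Favors slicing one whole image at a time: one chunk per image.
    f.create_dataset("per_image", data=stack,
                     compression="gzip", chunks=(1, 64, 64))
    # Favors per-pixel traces along the image-id axis: each chunk spans
    # the full depth of the stack for an 8 x 8 pixel patch.
    f.create_dataset("per_trace", data=stack,
                     compression="gzip", chunks=(100, 8, 8))
```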

We observed that relying exclusively on the `h5py` heuristic frequently produced chunks that were too small, which increased loading and
display times when HDF5 files generated with `pynxtools` were viewed with H5Web in the NOMAD research data management system. This
motivated adding the customization option described here.

Customizing the chunking heuristic adds a level of hardware-dependent complexity, though. Specifically, the actual
read-out performance of chunked HDF5 content can depend heavily on the file system architecture and its settings. It is important to understand
that the chunk configuration is defined when the dataset is written into the HDF5 file and cannot be changed thereafter.
The `src/pynxtools/dataconverter/chunk.py` configuration makes explicit typical default values one can use as a starting point
on the different file systems used for deploying NOMAD. By default, we follow the default of the `h5py` library, which aims at a
performance compromise tailored towards single-storage operation, as on servers and laptops.

Developers who customize for Lustre- or GPFS-based hardware and NOMAD deployments can use the chunk-cache settings to explore further
optimization routes and make the most of their NeXus/HDF5-file-based RDM pipeline in NOMAD.
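One such route is the per-file chunk cache that `h5py` exposes at file-open time. A sketch (the `rdcc_*` keywords are part of the `h5py` API; the chosen values are illustrative, not recommendations for any specific file system):

```python
import h5py
import numpy as np

# Create a small demo file so the example is self-contained.
with h5py.File("cache_demo.h5", "w") as f:
    f.create_dataset("data", data=np.zeros((256, 256)),
                     compression="gzip", chunks=(32, 256))

# Enlarge the raw-data chunk cache for read-heavy access patterns.
f = h5py.File(
    "cache_demo.h5", "r",
    rdcc_nbytes=64 * 1024**2,  # 64 MiB chunk cache (h5py default: 1 MiB)
    rdcc_nslots=100_003,       # prime number of hash-table slots
    rdcc_w0=0.75,              # prefer evicting fully read chunks
)
print(f["data"].chunks)  # (32, 256)
f.close()
```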
1 change: 1 addition & 0 deletions mkdocs.yaml
@@ -35,6 +35,7 @@ nav:
- learn/pynxtools/dataconverter-and-readers.md
- learn/pynxtools/nexus-validation.md
- learn/pynxtools/multi-format-reader.md
- learn/pynxtools/compression.md
- Reference:
- reference/definitions.md
- reference/cli-api.md
18 changes: 18 additions & 0 deletions src/pynxtools/dataconverter/__init__.py
@@ -1,3 +1,21 @@
#
# Copyright The NOMAD Authors.
#
# This file is part of NOMAD. See https://nomad-lab.eu for further info.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from pynxtools.dataconverter import helpers, validation

helpers.validate_data_dict = validation.validate_data_dict # type: ignore