# Making explicit chunk cache handling policies and enabling a custom autochunker with a fallback to h5py's built-in autochunker (#741)
mkuehbach wants to merge 15 commits into `master` from `modifiable_hfive_chunking`.
Commits (15, authored by atomprobe-tc):

- `48d45e5` Design chunking configurations for different systems
- `61866d2` Skeleton implementation replacement of the auto-chunking
- `cbab717` documentation
- `474f5a1` Added explicit setting of rdcc config upon file creation
- `ad06a38` Most up-to-date link in docs
- `cf721a9` Merge branch 'master' into modifiable_hfive_chunking
- `94b7bc9` sync up with latest state from cenem feature branch, cherry-picking o…
- `179814a` Add missing config_defaults from refactoring_compression branch
- `744e501` Add documentation specific for chunking
- `0076b2c` Merge branch 'master' into modifiable_hfive_chunking
- `4e12e5e` skeleton for tests, pulled together changes in one file
- `66433ae` remove unused
- `5109383` remove dead code and move imports
- `5d82133` docstring
- `99b166e` fix handling of awkward cases, tests completed
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,109 @@ | ||
# Using compression with HDF5

## Approach

Data compression covers methods to reduce the size of a dataset, or portions of it, effectively. Lossless and lossy methods are distinguished, and an entire research field works on developing algorithms and implementations for each. Given that `pynxtools` writes its content to HDF5 files, we decided to use the functionality this library provides and to offer compression as an optional feature, configurable down to the level of an individual dataset. To preserve the original data, we support only lossless compression algorithms. We also currently do not compress strings and scalar datasets.

Specifically, we use the built-in [gzip (`deflate`) compression filter](https://support.hdfgroup.org/documentation/hdf5-docs/hdf5_topics/UsingCompressionInHDF5.html) due to its wide support across the most frequently used programming languages. Users should be aware that using `deflate` instead of more modern algorithms comes with a trade-off: as of now, there is no efficient multi-threaded implementation of this compression filter within the HDF5 library. Therefore, compression can take a substantial share of the total execution time of the HDF5 file writing part of the `dataconverter`.

## How to use compression

Developers of plugins for `pynxtools` instruct the writing of data by adding variables, such as numpy arrays or xarrays, like in the following example

```
array = np.zeros((1000,), np.float64)
```

to specific places in the `template` dictionary object that the `dataconverter` provides:

```
template["/ENTRY[entry1]/numpy_array"] = array
```

For such an instruction, the `dataconverter` creates an HDF5 dataset that uses the so-called contiguous data storage layout. This dataset is stored uncompressed.

As an alternative, compression can be requested via a slight modification of the previous example:

```
template["/ENTRY[entry1]/array"] = {
    "compress": array,
    "strength": 9,
}
```

Wrapping the `array` into a dictionary instructs the `dataconverter` to store a losslessly compressed version of the same dataset. The dictionary has one mandatory keyword, `compress`. An additional, optional keyword, `strength`, can be used to overwrite the default compression strength and thus trade processing time against file size reduction at the granularity of an individual dataset.

Using compression internally forces the HDF5 library to use a different, so-called chunked data storage layout. A chunked layout can be understood as an internal splitting of the dataset into chunks, i.e. pieces that get compressed individually, typically one after another.

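The effect of chunk-wise compression can be sketched with plain `zlib`, which implements the same `deflate` algorithm that the HDF5 gzip filter uses. The chunk size of 250 elements is an arbitrary illustration, not what `h5py` or `pynxtools` would pick:

```python
import zlib

import numpy as np

# A highly compressible dataset, as in the example above.
array = np.zeros((1000,), np.float64)

# Split the flat array into four chunks of 250 elements each and
# deflate-compress every chunk individually at strength 9.
chunk_elems = 250
chunks = [array[i : i + chunk_elems] for i in range(0, array.size, chunk_elems)]
compressed = [zlib.compress(chunk.tobytes(), 9) for chunk in chunks]

raw_bytes = array.nbytes  # 1000 float64 values = 8000 bytes uncompressed
packed_bytes = sum(len(blob) for blob in compressed)
print(raw_bytes, packed_bytes)
```

For data this uniform, the per-chunk compressed size collapses to a few dozen bytes, which is why compression pays off most for low-entropy content.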
## Benefits for users

The compression filters in HDF5 work in two directions: compression and decompression. Decompression is typically faster than compression. Thanks to functionality offered by `h5py`, users can work equally conveniently with HDF5 files regardless of whether these combine contiguous and chunked datasets in the same file.

Compared to wrapping the entire HDF5 file into an archive, e.g. `zip`ping it up, compressing individual datasets offers more fine-grained control. Note that, especially for usage in research data management systems like NOMAD, combining both approaches, i.e. wrapping an HDF5 file with internally compressed datasets into a zip file, is often not additionally effective, unless the HDF5 file has a considerable number of groups and additional internal datasets for which the HDF5 library added many padding bytes when writing the file.

The benefit of compression for users is that, depending on the entropy of the data, a substantial reduction of the HDF5 file size, and thus savings in storage space and data transfer times, is possible without losing precision. Clearly, a downside is that the data must be decompressed before they can be accessed and worked with, e.g. in NOMAD; thanks to `h5py`, this happens automatically. The chunked storage layout is useful in that it enables selective decompression of only those portions of the dataset that are required. This is an effective but advanced lever when implementing more efficient data processing pipelines, especially when not all data are used, and it works irrespective of the research data management system or downstream application that consumes the HDF5 file.

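Selective decompression can be illustrated in the same spirit: to read a single value, only the chunk containing it needs to be inflated, not the whole dataset. This is a plain-`zlib` sketch of the mechanism, not the actual `h5py` code path:

```python
import zlib

import numpy as np

rng = np.random.default_rng(seed=42)
array = rng.integers(0, 10, size=1000).astype(np.int64)

# Compress the dataset in four chunks of 250 elements each.
chunk_elems = 250
compressed = [
    zlib.compress(array[i : i + chunk_elems].tobytes(), 9)
    for i in range(0, array.size, chunk_elems)
]

# Read element 600: only chunk 600 // 250 == 2 has to be decompressed.
index = 600
chunk_id, offset = divmod(index, chunk_elems)
chunk = np.frombuffer(zlib.decompress(compressed[chunk_id]), dtype=np.int64)
value = chunk[offset]
```

The other three chunks stay compressed, which is the essence of why chunked layouts allow partial reads.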
## Expectation management

Compression is often very effective for images and spectra where data are stored as integer values, because for many bins or pixels no counts are recorded, or the number of counts is substantially lower than the maximum value the integer type can represent. Compression is often observed to be less effective when applied to floating-point data. Frequently this is the case for measurements or simulations where physically insignificant changes in the last digits still demand storage when using lossless compression schemes. The often lower precision requirement, or the physical precision offered by a measurement, in relation to the maximum precision of the datatype is the motivation behind developing lossy compression methods and using lower-precision floating-point numbers, e.g. in the field of machine learning and artificial intelligence.

## Configure the chunking

While efficient and effective for compression tasks, the design of using chunks also has drawbacks that can affect read and write performance. Consequently, the efficiency and speed with which chunked and compressed data can be used in downstream processing and visualization depends on the configuration of the chunks. This is a field where compromises need to be made, especially for large datasets. The `h5py` library implements a heuristic that tries to construct chunks with a shape similar to that of the original dataset. Each chunk is effectively a shrunk (hyper-)rectangle or cuboid of the original dataset. This suits users who need to slice about equally frequently across all perpendicular viewing directions.

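The idea behind such a heuristic can be sketched as follows. This is a simplified stand-in, not the actual `h5py` algorithm, which additionally rounds dimensions and enforces minimum and maximum chunk byte sizes:

```python
from math import prod


def proportional_chunks(
    shape: tuple[int, ...], itemsize: int, target_bytes: int = 1024 * 1024
) -> tuple[int, ...]:
    """Shrink the dataset shape roughly proportionally, halving the largest
    dimension until the chunk fits into the target byte budget."""
    chunk = list(shape)
    while prod(chunk) * itemsize > target_bytes:
        axis = chunk.index(max(chunk))
        chunk[axis] = max(1, chunk[axis] // 2)
    return tuple(chunk)


# A float64 image stack: the resulting chunk keeps a shape roughly
# proportional to the dataset instead of favoring one axis.
print(proportional_chunks((1000, 1024, 1024), 8))
```

Because every axis is shrunk, a read along any single direction touches many chunks, which is exactly the "equally fair to all directions" compromise described above.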
However, there are often cases with a clear bias towards slicing along a prioritized direction, paired with the expectation that displaying should be fast when slicing perpendicular to this direction. In such cases it can be useful to overwrite the `h5py` heuristic with one that favors that direction by shaping the chunks differently.

As an example, assume you wish to inspect an image stack of 100,000 images, each with 1024 x 1024 pixels. Assume further, for simplicity, that these pixels are organized in a three-dimensional array that is 100,000 images deep. Depending on the data layout in memory, the pixel values of the same image pack closer together in memory than those of neighboring images. Assume now that you wish to inspect primarily one image at a time, i.e. you slice perpendicular to the image id axis. In this case, it would be ideal to load only the 1024 x 1024 pixels you need, and ideally these should sit in the same chunk. Speculatively loading neighboring images, or portions of them, is what modern hardware does and sophisticated visualization software offers, as it brings advantages when navigating forwards and backwards along the slicing direction. Assume another user who is interested in the contrast changes along the image id direction, i.e. who narrows in on a single pixel column and wants to display an array with 100,000 entries. That user would like to have all contrast values ideally in one chunk, read out in one operation. Such use cases can collide, which substantiates why the optional functionality of `pynxtools` to customize the chunk settings should be used when there is a clear bias towards one particular viewing direction.

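The two access patterns above can be made concrete with a little index arithmetic. The chunk shapes below are illustrative choices, not what `h5py` or `pynxtools` would compute:

```python
from math import prod


def chunks_touched(read_shape: tuple[int, ...], chunk_shape: tuple[int, ...]) -> int:
    """Number of chunks a contiguous read of `read_shape` elements must open,
    assuming the read starts at the origin and chunks tile the dataset."""
    return prod(-(-r // c) for r, c in zip(read_shape, chunk_shape))


# One whole chunk per image: slicing one image opens exactly one chunk,
# but a 100,000-entry pixel column cuts through every image's chunk.
per_image = (1, 1024, 1024)
assert chunks_touched((1, 1024, 1024), per_image) == 1
assert chunks_touched((100_000, 1, 1), per_image) == 100_000

# The opposite bias: a full pixel column fits one chunk, while reading
# a single image now opens 32 * 32 chunks.
per_column = (100_000, 32, 32)
assert chunks_touched((100_000, 1, 1), per_column) == 1
assert chunks_touched((1, 1024, 1024), per_column) == 32 * 32
```

Neither shape serves both users well, which is the collision of use cases described above.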
We observed that relying exclusively on the heuristic of `h5py` frequently delivered chunks that were too small, which increased loading and display times in H5Web for HDF5 files generated with `pynxtools` in the NOMAD research data management system. This motivated adding the customization option described here.

The customization of the chunking heuristic has an additional level of hardware-dependent complexity, though. Specifically, the actual read-out performance of chunked HDF5 content can depend heavily on the file system architecture and its settings. It is important to understand that the chunk configuration is fixed when the dataset is written into the HDF5 file and cannot be changed thereafter. The `src/pynxtools/dataconverter/chunk_cache.py` configuration makes explicit typical default values one can use as a starting point on the different file systems used for deploying NOMAD. By default, we follow the default of the `h5py` library, which aims at a performance compromise tailored towards single-storage-device operation, as on servers and laptops.

Developers who customize for Lustre- or GPFS-based hardware and NOMAD deployments can use the chunk cache settings to explore further optimization routes and make the most out of their NeXus/HDF5-file-based RDM pipeline in NOMAD.
Changed file (3 → 21 lines):
```python
#
# Copyright The NOMAD Authors.
#
# This file is part of NOMAD. See https://nomad-lab.eu for further info.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from pynxtools.dataconverter import helpers, validation

helpers.validate_data_dict = validation.validate_data_dict  # type: ignore
```
New file (+57 lines):
```python
#
# Copyright The NOMAD Authors.
#
# This file is part of NOMAD. See https://nomad-lab.eu for further info.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""Use-case-specific configurations to optimize performance for chunked storage."""

# https://github.com/h5py/h5py/blob/master/docs/high/file.rst

# e.g. for h5py v3.15.1 https://github.com/h5py/h5py/blob/fad034c16f595cb24f4393bbd0dcd23c53bc9a33/h5py/tests/test_file2.py#L111
CHUNK_CONFIG_HFIVEPY: dict[str, int | float] = {
    "byte_size": 1 * 1024 * 1024,
    "rdcc_nbytes": 1 * 1024 * 1024,  # 1 MiB before HDF2.0, will be 8 MiB for HDF2.0
    "rdcc_nslots": 521,
    "rdcc_w0": 0.75,
}

CHUNK_CONFIG_SSD_NVM: dict[str, int | float] = {
    "byte_size": 1 * 1024 * 1024,
    "rdcc_nbytes": 128 * 1024 * 1024,
    "rdcc_nslots": 4093,
    "rdcc_w0": 0.75,
}

CHUNK_CONFIG_HDD: dict[str, int | float] = {
    "byte_size": 4 * 1024 * 1024,
    "rdcc_nbytes": 256 * 1024 * 1024,
    "rdcc_nslots": 1021,
    "rdcc_w0": 0.75,
}

CHUNK_CONFIG_GPFS: dict[str, int | float] = {
    "byte_size": 8 * 1024 * 1024,
    "rdcc_nbytes": 256 * 1024 * 1024,
    "rdcc_nslots": 521,
    "rdcc_w0": 0.75,
}

CHUNK_CONFIG_LUSTRE: dict[str, int | float] = {
    # set the stripe size before creating a file!
    "byte_size": 8 * 1024 * 1024,
    "rdcc_nbytes": 256 * 1024 * 1024,
    "rdcc_nslots": 521,
    "rdcc_w0": 0.75,
}

CHUNK_CONFIG_DEFAULT = CHUNK_CONFIG_HFIVEPY
```