There are cases where users wish to compose a NeXus/HDF5 file with data from multiple sources. Typical examples include:

- A file should contain multiple `NXentry` instances where each instance applies a different application definition.
- Content within `NXentry` instances is generated by executing a specific `pynxtools` reader plugin and may be supplemented
  with additional data provided by external software components, including those implemented outside of Python.

Enabling such use cases while minimizing data copying is the idea behind the append mode of the dataconverter. It is activated by
passing the `--append` flag during command line invocation (see [Tutorial -> Converting your research data to NeXus](../../tutorial/converting-data-to-nexus.md)).

Take this tutorial and its NXxps case study as an example. It composes the HDF5 file from two input files:
`EX439_S718_Au.sle`, a file with proprietary formatting, and `eln_data_sle.yaml`, a NOMAD-specific metadata exchange file.
Instead of passing both inputs in one go as the tutorial does, one could first process only the proprietary file
(without `--append`) and thereafter process the YAML file (with `--append`). The minimal command line calls
read as follows.

```
dataconverter EX439_S718_Au.sle --reader xps --nxdl NXxps --output Au_25_mbar_O2_no_align.nxs
dataconverter eln_data_sle.yaml --reader xps --nxdl NXxps --append --output Au_25_mbar_O2_no_align.nxs
```

When processing both the `*.sle` and the `*.yaml` file in one call, adding `--append` has no effect on the composition, i.e.,
`pynxtools` proceeds as if `--append` were absent; mind, however, that adding the flag deactivates the verification.

Users who wish to use a `params.yaml` parameters file, as shown in the tutorial,
should add the `append` flag like this:

```
dataconverter:
  reader: xps
  nxdl: NXxps
  input-file:
    - EX439_S718_Au.sle
    - eln_data_sle.yaml
  output: Au_25_mbar_O2_no_align.nxs
  append: True
```

Users who wish to call the dataconverter as a step in other Python code or Jupyter notebooks may find
the following variation a useful snippet to include in their batch pipeline:

```
from pynxtools.dataconverter.convert import convert

# the file "Au_25_mbar_O2_no_align.nxs" exists already, e.g., because it
# was instantiated with "EX439_S718_Au.sle" as mentioned above

_ = convert(
    input_file=("eln_data_sle.yaml",),
    reader="xps",
    nxdl="NXxps",
    append=True,
    output="Au_25_mbar_O2_no_align.nxs",
)

# modify the tuple[str] input_file to include the actual files you wish to convert
# modify the str output to customize the output file path and name
```

## Possibilities and limitations

**The append mode is not a functionality that allows overwriting existing data!**
We are convinced that written data should be immutable. Therefore, using the append mode requires accepting the following assumptions:

- Only groups, datasets, or attributes that do not yet exist can be added in append mode.
  The implementation catches attempts to overwrite existing HDF5 objects,
  emitting respective logging messages.
- When in append mode, the internal validation of the `template` dictionary is switched off,
  irrespective of whether `--skip-verify` is passed or not. Instead, users should validate the HDF5 file
  (see [How-tos -> pynxtools -> Validation of NeXus files](../../how-tos/pynxtools/validate-nexus-files.md)) after they have completed composing the file.
- The HDF5 library's functionality to reshape existing HDF5 datasets is not supported by `pynxtools`.
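
The add-only rule can be illustrated with a small conceptual sketch. The function below is not the actual `pynxtools` implementation; it merely mimics an add-only merge on plain dictionaries that stand in for HDF5 paths and values, and all names in it are hypothetical:

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)


def append_only_merge(existing: dict, additions: dict) -> dict:
    """Add new path/value pairs; never overwrite existing ones (conceptual sketch)."""
    merged = dict(existing)
    for path, value in additions.items():
        if path in merged:
            # mirrors the append-mode behavior: skip and log instead of overwriting
            logger.warning("Refusing to overwrite existing HDF5 object at %s", path)
            continue
        merged[path] = value
    return merged


existing = {"/entry/definition": "NXxps"}
additions = {"/entry/definition": "NXmpes", "/entry/user/name": "Jane Doe"}
merged = append_only_merge(existing, additions)
# "/entry/definition" keeps its original value; only the new path is added
```

The key point is that a collision is reported and skipped rather than resolved in favor of the newer value.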

## Interpreting root level attributes

Note that `pynxtools` sets several attributes at the root level of a NeXus/HDF5 file. These values are defined by whichever tool writes them first.
Subsequent writing to the HDF5 file in append mode does not modify them. This makes the interpretation of the following attributes ambiguous:
`NeXus_repository`, `NeXus_release`, `HDF5_Version`, `h5py_version`, `creator`, `creator_version`, `file_time`, and `file_update_time`.
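
This first-writer-wins behavior can be sketched as follows. The helper and the attribute values below are hypothetical and only illustrate why these attributes describe the tool that created the file, not the one that appended to it:

```python
def set_root_attributes(root_attrs: dict, candidate_attrs: dict) -> dict:
    """Set root-level attributes only if not yet present (conceptual sketch)."""
    updated = dict(root_attrs)
    for name, value in candidate_attrs.items():
        # an appending tool never modifies attributes the first writer has set
        updated.setdefault(name, value)
    return updated


# first writer creates the file and records its identity (values hypothetical)
attrs = set_root_attributes({}, {"creator": "pynxtools", "creator_version": "0.10"})
# a later append run cannot change what the first writer recorded,
# but it can still add attributes that did not exist before
attrs = set_root_attributes(attrs, {"creator": "other_tool", "append_mode": "True"})
```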

When in append mode, `pynxtools` adds the root level attribute `append_mode = "True"`, which flags the file as an artifact that was composed
by at least one `pynxtools` tool or reader plugin running in append mode. Note that the absence of this flag does not guarantee that the file
was written by `pynxtools` or its plugins, since other software could also have written the NeXus file.

Until the NeXus standard allows users to link or define these attributes separately at the HDF5 object level, i.e., for groups, datasets, and attributes,
we advise against mixing tools that write content adhering to different versions of the NeXus definitions. Note that the `validate` functionality
of `pynxtools` does not provide a mechanism to determine which specific NeXus or tool version was used to generate individual objects within an HDF5 file.
The validation concludes from the combination of the `ENTRY/definition`, `ENTRY/definition/@version`, and `/@NeXus_version` attributes.

## Time-stamped HDF5 objects

Note that the HDF5 library has the low-level feature to timestamp individual HDF5 objects. By default, though, this feature is deactivated,
as per a decision of the HDF5 Consortium. The choice was made to prevent changing timestamp values from changing the hash of the entire file content.
Note that the `pynxtools-em` plugin includes a [`hfive_base` parser](https://github.com/FAIRmat-NFDI/pynxtools-em/blob/main/src/pynxtools_em/parsers/hfive_base.py)
that can compute hashes from the content of individual HDF5 objects. Users are advised to blacklist timestamp attributes like `file_time` and `file_update_time`
when comparing the binary content of two HDF5 files using this parser.
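
The blacklisting idea can be sketched independently of the actual `hfive_base` implementation: hash each object's content after dropping timestamp attributes, so that two files differing only in timestamps compare as equal. Plain dictionaries stand in for HDF5 objects here, and all names are hypothetical:

```python
import hashlib

TIMESTAMP_BLACKLIST = {"file_time", "file_update_time"}


def content_hash(objects: dict, blacklist: set = TIMESTAMP_BLACKLIST) -> str:
    """Hash object content while ignoring blacklisted attributes (conceptual sketch)."""
    digest = hashlib.sha256()
    for path in sorted(objects):  # deterministic iteration order
        attrs = objects[path]
        for name in sorted(attrs):
            if name in blacklist:
                continue  # timestamps would make otherwise identical files differ
            digest.update(f"{path}/@{name}={attrs[name]}".encode())
    return digest.hexdigest()


file_a = {"/": {"creator": "pynxtools", "file_time": "2024-01-01T00:00:00"}}
file_b = {"/": {"creator": "pynxtools", "file_time": "2025-06-30T12:00:00"}}
# identical content apart from timestamps -> identical hashes
same = content_hash(file_a) == content_hash(file_b)
```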