Commit 06620f0

add flag append_mode, complete documentation, add tests accordingly
1 parent 71650ef commit 06620f0

File tree: 4 files changed, +70 −19 lines changed

docs/index.md

Lines changed: 2 additions & 2 deletions

@@ -69,7 +69,7 @@ We are offering a small guide to getting started with NeXus, `pynxtools`, and NO
 - [Data conversion in `pynxtools`](learn/pynxtools/dataconverter-and-readers.md)
 - [Validation of NeXus files](learn/pynxtools/nexus-validation.md)
 - [The `MultiFormatReader` as a reader superclass](learn/pynxtools/multi-format-reader.md)
-- [Append mode `dataconverter`](learn/pynxtools/dataconverter-append-mode.md)
+- [Append mode for the dataconverter](learn/pynxtools/dataconverter-append-mode.md)

 </div>
 <div markdown="block">
@@ -103,4 +103,4 @@ For questions or suggestions:

 <h2>Project and community</h2>

-The work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - [460197019 (FAIRmat)](https://gepris.dfg.de/gepris/projekt/460197019?language=en).
+The work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - [460197019 (FAIRmat)](https://gepris.dfg.de/gepris/projekt/460197019?language=en).

docs/learn/pynxtools/dataconverter-append-mode.md

Lines changed: 64 additions & 15 deletions
@@ -3,45 +3,94 @@
 There are cases where users wish to compose a NeXus/HDF5 file with data from multiple sources. Typical examples include:

 - A file should contain multiple `NXentry` instances where each instance applies a different application definition.
-- Content under `NXentry` instances is composed from running a specific pynxtools parser plugin plus additional content
-  that is injected via software other than `pynxtools` or not even software that is written in Python.
+- Content within `NXentry` instances is generated by executing a specific `pynxtools` reader plugin and may be supplemented
+  with additional data provided by external software components, including those implemented outside of Python.

 Enabling such use cases while minimizing data copying is the idea behind the append mode of the dataconverter. It is activated by
-passing the `--append` flag during [command line invocation](../../tutorial/converting-data-to-nexus.md).
+passing the `--append` flag during command line invocation (see [Tutorial -> Converting your research data to NeXus](../../tutorial/converting-data-to-nexus.md)).
+
+Take this tutorial and its NXxps case study as an example. It composes the HDF5 file with content from two input files:
+`EX439_S718_Au.sle`, a file with proprietary formatting, and `eln_data_sle.yaml`, a NOMAD-specific metadata exchange file.
+Instead of passing both inputs in one go, one could first process only the proprietary file
+(without `--append`) and thereafter process the YAML file (with `--append`). The minimal command line calls
+read as follows.
+
+```
+dataconverter EX439_S718_Au.sle --reader xps --nxdl NXxps --output Au_25_mbar_O2_no_align.nxs
+dataconverter eln_data_sle.yaml --reader xps --nxdl NXxps --append --output Au_25_mbar_O2_no_align.nxs
+```
+
+When processing both the `*.sle` and `*.yaml` file in one call, adding `--append` has no effect, i.e.,
+`pynxtools` proceeds as if `--append` were absent, but mind that adding the flag deactivates the verification.
+
+Users who wish to use a `params.yaml` parameters file, as shown in the tutorial,
+should add the `append` flag like this:
+
+```
+dataconverter:
+  reader: xps
+  nxdl: NXxps
+  input-file:
+    - EX439_S718_Au.sle
+    - eln_data_sle.yaml
+  output: Au_25_mbar_O2_no_align.nxs
+  append: True
+```
+
+Users who wish to call the dataconverter as a step in other Python code or Jupyter notebooks may find
+the following code a useful snippet to include in their batch pipeline:
+
+```
+from pynxtools.dataconverter.convert import convert
+
+# file "Au_25_mbar_O2_no_align.nxs" exists already, e.g., when it
+# was instantiated with "EX439_S718_Au.sle" as mentioned above
+
+_ = convert(
+    input_file=("eln_data_sle.yaml",),
+    reader="xps",
+    nxdl="NXxps",
+    append=True,
+    output="Au_25_mbar_O2_no_align.nxs",
+)
+
+# modify tuple[str] input_file to include the actual files you wish to convert
+# modify output: str to customize output file path and name
+```

 ## Possibilities and limitations

-**The append mode must not be understood as a functionality that allows an overwriting of existent data.**
+**The append mode is not a functionality that allows overwriting existent data!**
 We are convinced that written data should be immutable. Therefore, using the append mode demands to accept the following assumptions:

 - Only groups, datasets, or attributes not yet existent can be added when in append mode.
   The implementation catches attempts of overwriting existent HDF5 objects,
   emitting respective logging messages.
 - When in append mode, the internal validation of the `template` dictionary is switched off,
-  irrespective if `--skip-verify` is passed or not.
-  Instead, users should validate [the HDF5 file](../../how-tos/pynxtools/validate-nexus-files.md) when having the file compositing completed.
+  irrespective of whether `--skip-verify` is passed or not. Instead, users should validate the HDF5 file
+  (see [How-tos -> pynxtools -> Validation of NeXus files](../../how-tos/pynxtools/validate-nexus-files.md)) after they have completed the file compositing.
 - The HDF5 library's functionality to reshape existent HDF5 datasets is not supported by `pynxtools`.

 ## Interpreting root level attributes

 Note that `pynxtools` sets several attributes at the root level of a NeXus/HDF5 file. These values are defined by whichever tool writes them first.
-A subsequent writing to the HDF5 file in append mode does not modify these. This makes the interpretation of the following attributes ambiguous
-`NeXus_repository`, `NeXus_release`, `HDF5_Version`, `h5py_version`, `creator`, `creator_version`, `file_time` and `file_update_time`.
+A subsequent writing to the HDF5 file in append mode does not modify these. This makes the interpretation of the following attributes ambiguous:
+`NeXus_repository`, `NeXus_release`, `HDF5_Version`, `h5py_version`, `creator`, `creator_version`, `file_time`, and `file_update_time`.

 When in append mode, `pynxtools` adds the root level attribute `append_mode = "True"` which flags the file as an artifact that was composed
-from at least one pynxtools tool running in append mode. Note that the absence of this flag does not guarantee that the file was written
-by `pynxtools` or its plugins, as also other software could have written the NeXus file.
+from at least one `pynxtools` tool or reader plugin running in append mode. Note that the absence of this flag does not guarantee that the file
+was written by `pynxtools` or its plugins, as other software could also have written the NeXus file.

-Until the NeXus standard allows users to link or define these attributes at the HDF5 object level, i.e. for groups, datasets, and attributes, separately,
-we advise to no mix tools that write content that adheres to different versions of the NeXus definitions. Note that the `validate` functionality
-of `pynxtools` can currently not detect which objects within an HDF5 file were written with which NeXus or tool version. The validation concludes from
-the combination of the `ENTRY/definition`, `ENTRY/definition/@version`, and `/@NeXus_version` attributes.
+Until the NeXus standard allows users to link or define these attributes at the HDF5 object level, i.e., for groups, datasets, and attributes, separately,
+we advise not to mix tools that write content that adheres to different versions of the NeXus definitions. Note that the `validate` functionality
+of `pynxtools` does not provide a mechanism to determine which specific NeXus or tool version was used to generate individual objects within an HDF5 file.
+The validation concludes from the combination of the `ENTRY/definition`, `ENTRY/definition/@version`, and `/@NeXus_version` attributes.

 ## Time-stamped HDF5 objects

 Note that the HDF5 library has the low-level feature to timestamp individual HDF5 objects. By default though, this feature is deactivated
 as per decision of the HDF5 Consortium. The choice was made to prevent that changing timestamp values change the hash of the entire file content.
 Note that the `pynxtools-em` plugin includes a [`hfive_base` parser](https://github.com/FAIRmat-NFDI/pynxtools-em/blob/main/src/pynxtools_em/parsers/hfive_base.py)
-that can compute hashes from the content of individual HDF5 objects. Users are advised to blacklist timestamp attributes like `file_time`, and `file_update_time`
+that can compute hashes from the content of individual HDF5 objects. Users are advised to blacklist timestamp attributes like `file_time` and `file_update_time`
 when comparing the binary content of two HDF5 files using this parser.
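The no-overwriting guarantee described in this documentation bottoms out in h5py/HDF5 behavior: creating a group or dataset under an already-existent name raises a `ValueError`. The following sketch (with a hypothetical file name and entry layout, not taken from pynxtools) illustrates the pattern the append mode relies on:

```python
import h5py

# First pass: "w" creates (or truncates) the file and writes an initial entry.
with h5py.File("composed.nxs", "w") as h5:
    h5.create_dataset("entry1/title", data="first pass")

# Second pass: open in append mode, as the dataconverter does with --append.
with h5py.File("composed.nxs", "a") as h5:
    h5.create_group("entry2")  # not yet existent, so it is added freely
    try:
        # Re-creating an existent object raises ValueError; pynxtools
        # catches such attempts and logs them instead of overwriting.
        h5.create_dataset("entry1/title", data="second pass")
    except ValueError:
        print("existent dataset kept unchanged")
```

The first pass here stands in for any tool that wrote the file earlier; the second pass shows why previously written objects survive an append-mode run untouched.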

src/pynxtools/dataconverter/convert.py

Lines changed: 3 additions & 1 deletion
@@ -344,7 +344,9 @@ def main_cli():
     "--skip-verify",
     is_flag=True,
     default=False,
-    help="Skips the verification routine during conversion.",
+    help="Skips the verification routine during conversion. "
+    "When --append is used, the verification is always skipped, "
+    "irrespective of whether --skip-verify is used or not.",
 )
 @click.option(
     "--mapping",
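The multi-line `help` string above relies on Python's implicit concatenation of adjacent string literals, which inserts no separator; each fragment therefore needs an explicit trailing space, or words at the fragment boundaries fuse together. A stdlib-only illustration:

```python
# Adjacent string literals are concatenated at compile time with no
# separator, so fragments lacking trailing spaces glue words together.
broken = (
    "Skips the verification routine during conversion."
    "When --append is used the verification is always skipped"
)
fixed = (
    "Skips the verification routine during conversion. "
    "When --append is used, the verification is always skipped."
)
print("conversion.When" in broken)  # True: words fused at the boundary
print("conversion. When" in fixed)  # True: space preserved
```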

src/pynxtools/dataconverter/writer.py

Lines changed: 1 addition & 1 deletion
@@ -225,7 +225,7 @@ def __init__(
         self.data = data
         self.nxdl_f_path = nxdl_f_path
         self.output_path = output_path
-        self.output_nexus = h5py.File(self.output_path, "r+" if append else "w")
+        self.output_nexus = h5py.File(self.output_path, "a" if append else "w")
         # using "r+" or "a" allow resizing a dataset that uses chunked data storage layout
         # we currently do not implement this resizing though
         # create_{group,dataset} with an existent name throws a ValueError
