---
title: 'pynxtools: A framework for generating NeXus files from formats across disciplines'
tags:
- Python
- NeXus
bibliography: paper.bib
---

# Summary

Scientific data across physics, materials science, and materials engineering often lacks adherence to FAIR principles [@Wilkinson:2016; @Jacobsen:2020; @Barker:2022; @Wilkinson:2025] due to incompatible instrument-specific formats and diverse standardization practices. `pynxtools` is a Python software development framework with a command line interface (CLI) that standardizes the conversion of data from materials characterization experiments to the NeXus format [@Koennecke:2015; @Koennecke:2006; @Klosowski:1997] across diverse scientific domains. NeXus defines data storage specifications for different experimental techniques through application definitions. `pynxtools` provides a fixed, versioned set of NeXus application definitions that ensures convergence and alignment of data specifications across atom probe tomography, electron microscopy, optical spectroscopy, photoemission spectroscopy, scanning probe microscopy, and X-ray diffraction. Through its modular plugin architecture, `pynxtools` maps instrument-specific raw data and electronic lab notebook metadata to these unified definitions, while performing validation to ensure data correctness and NeXus compliance. By simplifying the adoption of standardized application definitions, the framework enables true data interoperability and FAIR data management across multiple experimental techniques.

# Statement of need

Achieving the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles in experimental physics and materials science requires consistent implementation of standardized data formats. NeXus provides comprehensive data specifications for the structured storage of scientific data. `pynxtools` simplifies the use of NeXus for developers and researchers by providing guided workflows and automated validation to ensure complete compliance. Existing tools [@Koennecke:2024; @Jemian:2025] provide solutions with individual capabilities, but none offers a comprehensive end-to-end workflow for proper NeXus adoption. `pynxtools` addresses this critical gap by providing a framework that enforces complete NeXus application definition compliance through automated validation, detailed error reporting for missing required data points, and clear implementation pathways via configuration files and extensible plugins. This approach turns NeXus from a complex specification into a practical solution, enabling researchers to achieve true data interoperability without deep technical expertise in the underlying standards.

# Dataconverter and validation

The `dataconverter`, the core module of `pynxtools`, combines instrument output files and data from electronic lab notebooks into NeXus-compliant HDF5 files. The converter performs three key operations: extracting experimental data through specialized readers, validating against NeXus application definitions to ensure compliance with existence, shape, and format constraints, and writing valid NeXus/HDF5 output files.

The `dataconverter` provides a CLI for producing NeXus files: users can choose one of the built-in readers for generic functionality or a technique-specific reader plugin; the plugins are distributed as separate Python packages.

For developers, the `dataconverter` provides an abstract `reader` class for building plugins that process experiment-specific formats and populate the NeXus specification. It passes a `Template`, a subclass of Python’s dictionary, to the `reader` as a form to fill. The `Template` ensures structural compliance with the chosen NeXus application definition and organizes data by NeXus's required, recommended, and optional levels.
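For illustration, the reader contract can be sketched in a few lines; the class, method, and path names below are simplified stand-ins, not the exact `pynxtools` API:

```python
from abc import ABC, abstractmethod

class Template(dict):
    """Stand-in for pynxtools' Template: a dict keyed by paths
    from the chosen NeXus application definition."""

class BaseReader(ABC):
    """Stand-in for the abstract reader class that plugins subclass."""
    @abstractmethod
    def read(self, template: Template, file_paths: list) -> Template:
        ...

class DemoReader(BaseReader):
    """Hypothetical plugin for one instrument format."""
    def read(self, template: Template, file_paths: list) -> Template:
        # A real reader would parse the files in file_paths; here two
        # illustrative NeXus paths are filled in directly.
        template["/ENTRY[entry]/title"] = "demo measurement"
        template["/ENTRY[entry]/INSTRUMENT[instrument]/name"] = "demo spectrometer"
        return template

filled = DemoReader().read(Template(), [])
```

In the real framework, the filled `Template` is then handed back to the `dataconverter` for validation and writing.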

The `dataconverter` validates `reader` output against the selected NeXus application definition, checking for instances of required concepts, complex dependencies (like inheritance and nested group rules), and data integrity (type, shape, constraints). It reports errors for invalid required concepts and emits CLI warnings for unmatched or invalid data, aiding practical NeXus file creation.
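The flavor of these checks can be conveyed with a toy validator; the requiredness table and messages below are hypothetical illustrations, not `pynxtools`' actual rule set:

```python
# Hypothetical, simplified requiredness spec for one application
# definition: each path maps to (requiredness level, expected type).
SPEC = {
    "/ENTRY[entry]/title": ("required", str),
    "/ENTRY[entry]/INSTRUMENT[instrument]/name": ("required", str),
    "/ENTRY[entry]/experiment_description": ("optional", str),
}

def validate(template: dict) -> list:
    """Return error strings for required concepts that are missing or of
    the wrong type; unknown keys only produce CLI-style warnings."""
    errors = []
    for path, (level, typ) in SPEC.items():
        if path not in template:
            if level == "required":
                errors.append(f"ERROR: required concept {path} is missing")
        elif not isinstance(template[path], typ):
            errors.append(
                f"ERROR: {path} has type "
                f"{type(template[path]).__name__}, expected {typ.__name__}"
            )
    for path in template:
        if path not in SPEC:
            print(f"WARNING: {path} does not match the application definition")
    return errors

report = validate({"/ENTRY[entry]/title": 42})  # wrong type, missing field
```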

All reader plugins are tested with the `pynxtools.testing` suite, which runs automatically in GitHub CI to ensure compatibility with the `dataconverter`, the NeXus specification, and integration across plugins.

The `dataconverter` includes an electronic lab notebook (ELN) generator that creates either a fillable `YAML` file or a `NOMAD` [@Scheidgen:2023] ELN schema based on a selected NeXus application definition.
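The idea behind such a generator can be sketched as follows; the concept paths and the flat-YAML layout are illustrative assumptions, not the schema `pynxtools` actually emits:

```python
# Hypothetical excerpt of concept paths from an application definition,
# grouped by requiredness; a real generator derives these from the NXDL.
PATHS = {
    "required": ["/ENTRY[entry]/title", "/ENTRY[entry]/definition"],
    "recommended": ["/ENTRY[entry]/start_time"],
}

def eln_skeleton(paths: dict) -> str:
    """Emit a fillable YAML skeleton, one commented section per level."""
    lines = []
    for level, keys in paths.items():
        lines.append(f"# --- {level} ---")
        lines.extend(f"{key}: null" for key in keys)
    return "\n".join(lines)

skeleton = eln_skeleton(PATHS)
```

A user would fill in the `null` values and pass the file to the `dataconverter` alongside the instrument data.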

# NeXus reader and annotator

`read_nexus` enables semantic access to NeXus files by linking data items to NeXus concepts, allowing applications to locate relevant data without hardcoding file paths. It supports concept-based queries that return all data items associated with a specific NeXus vocabulary term. Each data item is annotated by traversing its group path and resolving its corresponding NeXus concept, including inherited definitions.

Items not part of the NeXus schema are explicitly marked as such, aiding in validation and debugging. Targeted documentation of individual data items is supported through path-specific annotation. The tool also identifies and summarizes the file’s default plottable data based on the `NXdata` definition.
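A concept-based query of this kind can be sketched on a mock file; the nested-dict layout and `@NX_class` marker mimic, in simplified form, how HDF5 groups carry their NeXus class as an attribute:

```python
# A NeXus/HDF5 file mocked as nested dicts; "@NX_class" stands in for the
# HDF5 attribute naming each group's NeXus class (hypothetical layout).
MOCK_FILE = {
    "entry": {
        "@NX_class": "NXentry",
        "instrument": {
            "@NX_class": "NXinstrument",
            "name": "demo spectrometer",
        },
        "data": {"@NX_class": "NXdata", "intensity": [1.0, 2.0]},
    }
}

def find_concept(node, nx_class, path=""):
    """Recursively collect the paths of all groups annotated with nx_class."""
    hits = []
    if isinstance(node, dict):
        if node.get("@NX_class") == nx_class:
            hits.append(path or "/")
        for key, child in node.items():
            if not key.startswith("@"):
                hits.extend(find_concept(child, nx_class, f"{path}/{key}"))
    return hits

print(find_concept(MOCK_FILE, "NXdata"))  # -> ['/entry/data']
```

The real tool resolves classes against the NeXus class hierarchy rather than matching a single attribute, which is how inherited definitions are picked up.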

# `NOMAD` integration

While `pynxtools` works as a standalone tool, it can also be integrated directly into Research Data Management Systems (RDMS). Out of the box, the package functions as a plugin within the `NOMAD` platform [@Scheidgen:2023; @Draxl:2019]. This enables data in the NeXus format to be integrated into `NOMAD`'s metadata model, making it searchable and interoperable with other data from theory and experiment. The plugin consists of several key components (so-called entry points):

`pynxtools` extends `NOMAD`'s data schema (called `Metainfo` [@Ghiringhelli:2017]) by integrating NeXus definitions as a `NOMAD` `Schema Package`, adding NeXus-specific quantities and enabling interoperability through links to other standardized data representations in `NOMAD`. The `dataconverter` is integrated into `NOMAD`, making the conversion of data to NeXus accessible via the `NOMAD` GUI. The `dataconverter` also processes manually entered `NOMAD` ELN data during the conversion.

The `NOMAD` Parser module in `pynxtools` (`NexusParser`) extracts structured data from NeXus HDF5 files to populate `NOMAD` with `Metainfo` object instances as defined by the `pynxtools` schema package. This enables ingestion of NeXus data directly into `NOMAD`. Parsed data is post-processed using `NOMAD`'s `Normalization` pipeline. This includes automatic handling of units, linking references (including sample and instrument identifiers defined elsewhere in `NOMAD`), and populating derived quantities needed for advanced search and visualization.

`pynxtools` contains an integrated `Search Application` for NeXus data within `NOMAD`, powered by `Elasticsearch` [@elasticsearch:2025]. This provides a search dashboard where users can efficiently filter uploaded data by parameters such as experiment type, upload timestamp, and domain- and technique-specific quantities. The entire `pynxtools` workflow (conversion, parsing, and normalization) is exemplified in a representative `NOMAD` `Example Upload` that is shipped with the package. This example helps new users understand the workflow and serves as a template for adapting the plugin to new NeXus applications.

# Funding
The work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project 460197019 (FAIRmat).

# Acknowledgements

We acknowledge the software packages that `pynxtools` depends on: `h5py` [@H5py:2008], `numpy` [@Harris:2020], `click` [@Click:2014], `CFF` [@Druskat:2021], `xarray` [@Hoyer:2017; @Hoyer:2025], `pandas` [@Pandas:2020; @McKinney:2010], `lxml` [@Behnel:2005], `mergedeep` [@Clarke:2019], the Atomic Simulation Environment [@Hjorth:2017], and `pint` [@Pint:2012].

# References