Home
This page is a Work-in-Progress Proposal for how to build a ''Next Generation XAFS Data Library''. The main idea is to build an XAFS Data Library that will:
- facilitate the exchange of XAFS spectra, particularly on model compounds, such as those found in Model Compound Libraries.
- store and manage multiple spectra, but still easily export to plain ASCII files.
- allow users to have public and private libraries of spectra.
- make it easy to share sets of XAFS spectra.
To date, there have been several Web-based Databases. While these all organize XAFS spectra on model compounds, they all have shortcomings such as incomplete data, difficulty of adding new data, and incompatible formats.
The aim here is to create a way to store multiple XAFS spectra in a manner that can be used within dedicated applications and embedded into existing data processing software with minimal effort. This will also alleviate many of the problems associated with the current databases.
Storing and Exchanging XAFS Data is a common need for everyone using XAFS. In particular, retrieving data on ''Standards'' or ''Model Compounds'' is a continuing need for both XANES and EXAFS analysis. Additionally, data taken on samples at different facilities and beamlines need to be compared and analyzed together. At this point, there is no commonly accepted data format for XAFS data. There have been a few attempts to standardize the traditional "ASCII Column File". To be sure, ASCII Column Files have some strong appeal. Most data collection software saves such files, and most processing and analysis software uses some variation of this format. In addition, such files may be easily read by humans and used in a wide variety of third party applications. Still, most ASCII Column Files require some intimate knowledge of the data layout before the data can be used. The lack of a standard ASCII format is a serious problem. The work here runs parallel to the efforts to come up with such a standard, and will be able to convert data into such standard file formats.
For the effort here, in which storing multiple spectra is a key requirement, ASCII Column Data Files offer no standard way to hold multiple spectra. In order to facilitate the sharing of data between beamlines and the exchange (and recognition) of high-quality data on standards, it is useful to have Data Libraries or Repositories which a scientist can use at will and exchange with others. A key point is that the data should not only be well-formatted and vetted for quality, but should also be easy to view as part of a Suite of Spectra.
A Data Library is often envisioned as a single centralized Library (see http://xafs.org/Databases) which is meant to house ''Standard Data'' with some assurance of quality and the expectation that the data is to be shared with the entire community. In contrast, many researchers keep their own set of data closely guarded and do not want to share their data. Another model for a Data Library is a set of spectra being shared between collaborators, but then possibly allowed to have a wider distribution (say, after some paper has been published). We are considering all of these as legitimate use cases, and are more interested here in creating a format that can be used for all these cases.
With the motivation from the previous section, the goals for the format and library are:
- store spectra for exchange, especially for model compounds. Raw data direct from the beamline will probably need to be converted to this format.
- store information about the sample, measurement conditions, etc.
- store multiple spectra, either on the same sample or multiple samples, and possibly taken at many facilities.
- provide programming libraries and simple standalone applications that can read, write, and manage such data libraries. Programming libraries would have to support multiple languages.
There are a few reasonable ways to solve this problem. What follows below is a method that makes heavy use of ''relational databases'' and SQL. The principal argument here is that relational databases offer a well-understood, proven way to store data with extensible meta-data. The use of SQL also makes the programming libraries simpler, as they can rely on tested SQL syntax to access the underlying data store.
As the XAS Data Library is being developed, code and examples will be available at https://github.com/XraySpectroscopy/XASDataLibrary
I propose using SQLite, a widely used, free relational database engine, as the primary store for the XAFS Data Library. A key feature of SQLite is that it needs no external server or configuration -- the database is contained in a single disk file. SQLite databases can be accessed with a variety of tools, for example DB Browser for SQLite and the SQLite Manager add-on for Firefox.
SQL-based relational databases may not be the most obvious choice for storing scientific data composed of arrays of related data. One obvious limitation is that relational databases don't store array data very well. Thus storing array data in a portable way within the confines of an SQL database needs special attention. The approach adopted here is to use JSON, which can encapsulate an array, or other complex data structure, into a string.
JSON -- JavaScript Object Notation -- provides a standard, easy-to-use method for encapsulating complex data structures into strings that can be parsed back into the original data by a large number of programming languages. In this respect, the requirements for the XAS Data Library -- numerical arrays of data -- are fairly modest. Storing array data in strings is, of course, what ASCII Column Files have done for years, only not with the benefit of a standard programming interface to read them. As an example, an array of data [8000, 8001.0, 8002.0] would be encoded in JSON as
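The single-file, no-server property can be illustrated with Python's standard `sqlite3` module. This is only a minimal sketch: the `spectrum` table, its columns, the file name, and the helper functions `create_library` and `list_names` are hypothetical placeholders chosen for illustration, not the schema of the XAS Data Library.

```python
import os
import sqlite3
import tempfile


def create_library(path):
    """Create a single-file SQLite library at `path` with one example table.

    The table layout is a hypothetical placeholder, not the proposed schema.
    """
    conn = sqlite3.connect(path)  # creates the file if it does not exist
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS spectrum ("
                "id INTEGER PRIMARY KEY, name TEXT NOT NULL, notes TEXT)"
            )
            conn.execute(
                "INSERT INTO spectrum (name, notes) VALUES (?, ?)",
                ("Cu foil", "placeholder entry, not real data"),
            )
    finally:
        conn.close()


def list_names(path):
    """Reopen the library file and return the stored spectrum names."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute("SELECT name FROM spectrum ORDER BY id").fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()


if __name__ == "__main__":
    # Work in a temporary directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "example_library.db")
        create_library(path)
        print(os.path.isfile(path), list_names(path))
```

Because the whole library lives in one ordinary file, sharing a set of spectra with a collaborator amounts to copying that file.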
'[8000, 8001.0, 8002.0]'
This is considerably easier and lighter weight than using XML to encode array data.
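As a rough sketch of how such an encoded array might be stored in and recovered from an SQLite text column, the following uses Python's standard `json` and `sqlite3` modules. The `spectrum` table, the `energy` column, and the helpers `encode_array` and `decode_array` are assumptions made for this example only.

```python
import json
import sqlite3


def encode_array(values):
    """Encode a sequence of numbers as a JSON string."""
    return json.dumps(list(values))


def decode_array(text):
    """Decode a JSON string back into a Python list of numbers."""
    return json.loads(text)


# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spectrum (id INTEGER PRIMARY KEY, name TEXT, energy TEXT)"
)
conn.execute(
    "INSERT INTO spectrum (name, energy) VALUES (?, ?)",
    ("example", encode_array([8000, 8001.0, 8002.0])),
)

# The array is stored as plain text and decoded after retrieval.
stored = conn.execute(
    "SELECT energy FROM spectrum WHERE name = ?", ("example",)
).fetchone()[0]
energy = decode_array(stored)
print(stored)
print(energy)
```

Any language with a JSON parser can decode the same column, which is the portability argument made above.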
In addition to encoding numerical arrays, JSON can also encode an associative array (also known as a Hash Table, Dictionary, Record, or Key/Value List). This can be a very useful construct for storing attribute information. It might be tempting to use such Associative Arrays for many pieces of data inside the database, but this would prevent those data from being used in SQL SELECT and other statements: such data would not be available for making relations. But, as Associative Arrays can be so useful and extensible, several of the tables in the database include an attributes column that is always stored as text. This column is expected to hold a JSON-encoded Associative Array that may be useful to complement the corresponding notes column. This data cannot be used directly in searching the database, but may be useful to particular applications.
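The trade-off described here can be sketched in Python as follows. The `sample` table, the attribute keys, the example rows, and the helpers `add_sample` and `find_by_attribute` are all made up for illustration. The point is that regular columns take part in SQL queries, while the contents of the attributes column are decoded and inspected by the application.

```python
import json
import sqlite3

# Hypothetical table with a regular notes column and a JSON attributes column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample ("
    "id INTEGER PRIMARY KEY, name TEXT, notes TEXT, attributes TEXT)"
)


def add_sample(conn, name, notes, attributes):
    """Insert a sample, storing the attributes dict as JSON text."""
    with conn:
        conn.execute(
            "INSERT INTO sample (name, notes, attributes) VALUES (?, ?, ?)",
            (name, notes, json.dumps(attributes)),
        )


def find_by_attribute(conn, key, value):
    """Return names of samples whose decoded attributes have key == value.

    The filtering happens in Python after decoding, not in the SQL query.
    """
    matches = []
    query = "SELECT name, attributes FROM sample ORDER BY id"
    for name, attr_text in conn.execute(query):
        attrs = json.loads(attr_text) if attr_text else {}
        if attrs.get(key) == value:
            matches.append(name)
    return matches


# Made-up example rows, not real measurements.
add_sample(conn, "Fe2O3", "powder sample", {"temperature_K": 300})
add_sample(conn, "Fe foil", "reference foil", {"temperature_K": 77})

# A regular column can be searched directly with SQL...
by_name = conn.execute(
    "SELECT notes FROM sample WHERE name = ?", ("Fe foil",)
).fetchone()[0]
# ...while an attribute is matched by the application after decoding.
cold = find_by_attribute(conn, "temperature_K", 77)
print(by_name, cold)
```

Anything that needs to be searched or related across tables therefore belongs in a real column, with the attributes column reserved for extensible, application-specific details.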