require the use of proprietary software to read it. This is not simply
because of cost, though that can be a major barrier, but is also a
matter of historical preservation - it may be impossible to locate or
run old enough software. Imagine having a really useful data set
except it's only readable by a program that only runs on a 35 year old
version of MacOS. Yes, that really happens. If we design our
(meta)data for interoperability then we can take that 35+ year old
dataset and work with it using modern tools.

* I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
column is conductance". What is needed is a way to say "the third
column is conductance, it's an electronic measure, it's related to
resistance, and here is how". It's a lot of work, but luckily for us
most of it has been done and we can reuse it - the Resource
Description Framework (RDF) is made for this job. We'll talk more
about this in a later section.

### Reusable

them. 28520 has about 250 people.

As mentioned above, metadata is simply information that describes your
data. Hidden behind that word "simply" is the slight complication that
metadata can be anything from really simple to really complicated.
Where it falls on that continuum is situational. Small amounts of
simple data will probably have simple metadata.

### Metadata representation and searching

Without a standard means of representation and agreed-upon meanings,
the metadata we assemble might be more of a hindrance than a help.
Choosing which of our data's properties to record is in many cases
domain-specific. A very general set of attributes is available at
[schema.org](https://schema.org/). Other ontologies exist, of course,
and the choice among them narrows as your domain becomes more
specialized.

#### RDF

RDF is the "Resource Description Framework" and is a broadly used
concept even outside of FAIR. Fundamentally, RDF's purpose is to
describe the world in terms of triples: subject, predicate, and
object. From these building blocks we can construct graph structures
(in the "discrete math" sense of the term). As they grow, they can
represent linkages between related items, and that is when the real
power of RDF starts to be exploited. With a small collection of
metadata there is little choice but to match search terms
directly. Once the collection expands, new kinds of queries become
possible. For instance, different researchers might submit data to an
archive. If we know the possible range of metadata descriptors and
what each one means, we can traverse the RDF graph: enter at a
starting term and follow along, hop by hop, expanding the search by
adding related terms.
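To make the hop-by-hop idea concrete, here is a minimal sketch in
plain Python (not a real RDF library; terms like `conductance` and
`relatedTo` are made-up examples, not drawn from a real ontology). It
stores a few triples and expands a search term by following
`relatedTo` edges:

```python
# A toy triple store: plain Python tuples standing in for RDF triples.
# All terms here ("conductance", "relatedTo", ...) are made-up examples.
triples = [
    ("dataset42", "type", "Dataset"),
    ("dataset42", "measures", "conductance"),
    ("conductance", "relatedTo", "resistance"),
    ("resistance", "relatedTo", "ohms_law"),
]

def related_terms(start, predicate="relatedTo"):
    """Follow `predicate` edges hop by hop, collecting every reachable term."""
    found, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for subj, pred, obj in triples:
            if subj == node and pred == predicate and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found

print(related_terms("conductance"))
```

Starting from "conductance", the traversal picks up "resistance" and
then "ohms_law" - exactly the kind of search expansion described
above, just on a toy scale.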

#### JSON and JSON-LD

For any data to be usable by a computer, it has to be represented in a
way that the computer can understand. Metadata is no exception. RDF is
both a conceptual layout for in-memory processing and a defined way of
writing out the structure. There are just two problems. One is that
the format is a lot to digest when you first start working with
it. The other is that, depending on the language you're using, you
might end up having to write your own parser for it. Luck is smiling
on us, though, in the form of alternative ways to represent that graph
structure. A very common representation, and one that is becoming
increasingly popular, is JSON (JavaScript Object Notation). It has its
roots in JavaScript, but it has spread far and wide, and support for
it is now nearly ubiquitous across dozens of languages. Take a look at
an example:

```
{
  "first_name": "John",
  "last_name": "Smith",
  "is_alive": true,
  "age": 27,
  "address": {
    "street_address": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postal_code": "10021-3100"
  },
  "phone_numbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [
    "Catherine",
    "Thomas",
    "Trevor"
  ],
  "spouse": null
}
```

JSON is good for representing lots of structured data, but there needs
to be something that can go beyond that and hold references to the
schemas that apply to certain fields. For this, there is JSON-LD. The
LD stands for "Linked Data". JSON-LD stores additional data in the
form of "@context" fields, and these fields can contain URLs to
outside sites storing official, curated ontologies:

```
{
  "@context": "https://json-ld.org/contexts/person.jsonld",
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "born": "1940-10-09",
  "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}
```
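One practical consequence is worth noting: a JSON-LD record is still
plain JSON, so any standard JSON parser can read it. A short Python
sketch using only the standard library:

```python
import json

# A JSON-LD record is still plain JSON; "@context" is just another key.
record = json.loads("""
{
  "@context": "https://json-ld.org/contexts/person.jsonld",
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "born": "1940-10-09",
  "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}
""")

print(record["name"])      # ordinary field access
print(record["@context"])  # the link out to the shared ontology
```

A JSON-LD-aware tool would go further and dereference "@context" to
interpret the other fields; a plain parser simply sees a URL string.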

## FOXDEN - a Pilot, Prototype Example

Linux, in that it provides a collection of tools that work together in
a modular fashion. It is possible to use all or some of the components.

Access to the FOXDEN modules is via either a web page for each module
or by using a command-line tool. Besides being a perfectly reasonable
way to use the tools, the command-line tool is also well suited to use
in scripts.

### The Modules

FOXDEN's modular architecture makes it easy to select which components
you'd like to use, and even makes it possible to substitute your own
software if you have specialized needs. The FOXDEN documentation has a
walkthrough of basic use, and you're encouraged to work through its
"Quick Start Guide". This document will instead give some brief
background on each component.

#### Frontend service: web interface

As you would (likely) expect, the Frontend service generates the web
pages through which users can easily interact with the
system. Initially, the user is presented with a login page. Once past
that, access to the other modules is a click or two away. Of
particular interest is the "docs" button toward the upper-right
corner. Having the documentation close at hand will prove... handy.

#### Command line (CLI) tool

The command line tool ("foxden") is both an alternative way for users
to access the system and a means to interface shell scripts to the
system for automating common tasks. The "foxden" command by itself,
with no arguments, displays a list of the available commands, a link
to the documentation, and a reminder of how to get more detailed help.

#### Authentication and authorization service

Even in a purely open research environment, it's still necessary to
keep track of who is making changes. This is both for proper
attribution and for non-repudiation (perhaps less of a factor in X-Ray
Science than in other disciplines, but the system is built to be
versatile). Web users will see a familiar-looking login screen. CLI
users will need to authenticate via Kerberos - don't worry, it's well
described in the introductory documentation. FOXDEN mercifully
provides a way to use Kerberos that is simpler than the old-school
way.

#### Data Discovery service

The Discovery service provides a way to query the underlying
"management database" that tracks the movement of files and the
metadata associated with them. The query language is the same one that
MongoDB uses.
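As an illustrative sketch of the MongoDB query style (plain Python
rather than MongoDB itself, and the field names like `beamline` are
hypothetical), a query is just a document whose key/value pairs
matching records must share:

```python
# Plain-Python sketch of a MongoDB-style equality query.
# Field names ("beamline", "sample") are hypothetical examples.
records = [
    {"beamline": "ID3B", "sample": "steel"},
    {"beamline": "ID1A3", "sample": "alloy"},
]

def find(query, docs):
    """Return the docs whose fields match every key/value pair in query."""
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]

print(find({"beamline": "ID3B"}, records))
```

Real MongoDB adds operators for ranges, regular expressions, and so
on, but the query-by-example document shape is the same.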

#### MetaData service

The MetaData service is one of the critical components. This module
can not only query metadata, for instance finding matching schemas,
but also create new schemas and manipulate existing ones.

#### Provenance service

The Provenance service provides a lot of functionality. The tracking
of "provenance" is not just something art historians do. The term
refers to the tracking of every movement of the data we're managing,
what tools were used to transform it and under what circumstances, and
where the data came from. This last element could be, say, "from this
instrument on this beamline" or it could be "Dr. J. Doe's Sept 13th
dataset, reduced by this lump of MATLAB code".

#### Data Management service

The Data Management service abstracts data movement in and out of the
underlying Object Store (AWS S3 or compatible). Functions are provided
both to manage the Object Store and to move data in, move it out, or
delete it.

#### Publication service

The Publication service has two major sections. The first handles the
creation and assignment of DOIs (Digital Object Identifiers - you've
seen these in "References" sections) and the association of that
identifier with metadata and data. The second section provides a means
to interact with Zenodo in a manner consistent with the rest of
FOXDEN.

#### SpecScan service

SpecScan is pretty specific: it is used to create and manipulate
records for spec scans. It does what it says on the tin.

#### MLHub

The MLHub service allows the user to run various Machine Learning (ML)
algorithms directly in the FOXDEN environment. TensorFlow is directly
supported. Running ML inside of FOXDEN seems odd at first, but it
follows the paradigm of "moving the compute to the data", preventing
time-consuming retrievals.

#### CHESS Analysis Pipeline (CHAP)

The CHAP service simplifies running the CHESS-developed CHAP
algorithms on data stored in FOXDEN.

#### CHAP Notebook

Designed for novice programmers, the CHAP Notebook service simplifies
data analysis by giving users a Jupyter-like interface for writing
code modules that are inserted into pre-defined workflows. These
modules are also deposited in a code repository for future
dissemination.