
Commit 89012f2

authored
Merge pull request #113 from RENCI-NRIG/escott-5
finally finished with FAIR, I hope.
2 parents 0630e41 + af76d72 commit 89012f2

File tree

1 file changed: +196 −10 lines


theme5/CF100/domain-metadata-standards.md

Lines changed: 196 additions & 10 deletions
@@ -127,9 +127,9 @@ require the use of proprietary software to read it. This is not simply
 because of cost, though that can be a major barrier, but is also a
 matter of historical preservation - it may be impossible to locate or
 run old enough software. Imagine having a really useful data set
-except it's only readable by a program that only runs on a 25 year old
+except it's only readable by a program that only runs on a 35 year old
 version of MacOS. Yes, that really happens. If we design our
-(meta)data for interoperability then we can take that 25+ year old
+(meta)data for interoperability then we can take that 35+ year old
 dataset and work with it using modern tools.

 * I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
@@ -143,7 +143,8 @@ column is conductance". What is needed is a way to say "the third
 column is conductance, it's an electronic measure, it's related to
 resistance, and here is how". It's a lot of work, but luckily for us
 most of it has been done and we can reuse it - the Resource Description
-Framework (RDF). We'll talk more about this in a later section.
+Framework (RDF) is made for this job. We'll talk more about this in a
+later section.

 ### Reusable

@@ -169,19 +170,99 @@ them. 28520 has about 250 people.

 As mentioned above, metadata is simply information that describes your
 data. Hidden behind that word "simply" is the slight complication that
-metadata can be really simple or it can be really complicated. Where
+metadata can be anything from really simple to really complicated. Where
 it falls on that continuum is situational. Small amounts of simple
-data will probably have simple metadata
-
-### What Metadata is and why we care
+data will probably have simple metadata.

 ### Metadata representation and searching

-#### JSON
+Without a standard means of representation and agreed-upon meanings,
+the metadata we assemble might be more of a hindrance than a
+help. Selecting which of our data's properties to record is often
+domain-specific. A very general set of attributes is available at
+[schema.org](https://schema.org/). Other ontologies exist, of course,
+and the choice among them narrows as your domain becomes more
+specialized.

 #### RDF

-### The future: AI-generated Metadata
+RDF is the "Resource Description Framework" and is a broadly used
+standard even outside of FAIR. Fundamentally, RDF describes the world
+in terms of triples: subject, predicate, and object. From these
+building blocks we can construct graph structures (in the "discrete
+math" sense of the term). As they grow, these graphs can represent
+linkages between related items, and that is when the real power of
+RDF can be exploited. With a small collection of metadata, a search
+can do little more than match literal terms. Once the collection
+expands, new kinds of queries become possible. For instance, different
+researchers might submit data to an archive. We know the possible
+range of metadata descriptors and we know what each one means. At
+that point we can traverse the RDF graph: enter at a starting node
+and follow along, hop by hop, expanding the search by adding related
+terms.
+
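The hop-by-hop expansion described above can be sketched in a few lines of Python. This is a toy illustration, not part of any real system: the triples, terms, and the `relatedTo` predicate are all invented for the example.

```python
from collections import deque

# A toy triple store: (subject, predicate, object).
# All terms and predicates here are invented for illustration.
triples = [
    ("conductance", "relatedTo", "resistance"),
    ("resistance", "relatedTo", "impedance"),
    ("conductance", "unit", "siemens"),
    ("dataset42", "measures", "conductance"),
]

def expand_terms(start, predicate="relatedTo"):
    """Hop along `predicate` edges, collecting every reachable term."""
    seen, queue = {start}, deque([start])
    while queue:
        term = queue.popleft()
        for s, p, o in triples:
            if s == term and p == predicate and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen

print(sorted(expand_terms("conductance")))
# A search for "conductance" now also covers resistance and impedance.
```

A real RDF store (and its query language, SPARQL) does this kind of traversal at scale, but the principle is the same breadth-first walk.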
+#### JSON and JSON-LD
+
+For any data to be usable by a computer, it has to be represented in a
+way that can be understood. Metadata is no exception. RDF is both a
+conceptual model for in-memory processing and a defined way of writing
+out the structure. There are just two problems. One is that the format
+is a lot to digest when you first start working with it. The other is
+that, depending on the language you're using, you might end up having
+to write your own parser for it. Luck is smiling on us, though, in the
+form of alternative ways to represent that graph structure. A very
+common representation, and one that is becoming increasingly popular,
+is JSON (JavaScript Object Notation). It has its roots in JavaScript,
+but it has spread far and wide; support for it is now nearly
+ubiquitous across dozens of languages. Take a look at an example:
+
+```
+{
+  "first_name": "John",
+  "last_name": "Smith",
+  "is_alive": true,
+  "age": 27,
+  "address": {
+    "street_address": "21 2nd Street",
+    "city": "New York",
+    "state": "NY",
+    "postal_code": "10021-3100"
+  },
+  "phone_numbers": [
+    {
+      "type": "home",
+      "number": "212 555-1234"
+    },
+    {
+      "type": "office",
+      "number": "646 555-4567"
+    }
+  ],
+  "children": [
+    "Catherine",
+    "Thomas",
+    "Trevor"
+  ],
+  "spouse": null
+}
+```
+
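As a quick demonstration of that ubiquity, Python's standard library can parse such a record with no extra installation (the snippet reuses a fragment of the example above):

```python
import json

# A fragment of the example record above, parsed with the stdlib.
record = json.loads("""
{
  "first_name": "John",
  "last_name": "Smith",
  "age": 27,
  "children": ["Catherine", "Thomas", "Trevor"]
}
""")

print(record["first_name"], record["age"])   # John 27
print(len(record["children"]))               # 3
```

Nearly every mainstream language ships an equivalent parser, which is exactly what makes JSON attractive as a metadata carrier.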
+JSON is good for representing lots of structured data, but there needs
+to be something that can go beyond that and hold references to the
+schemas that apply to certain fields. For this, there is JSON-LD. The
+LD stands for "Linked Data". JSON-LD stores additional data in the
+form of "@context" fields, and these fields can contain URLs to
+outside sites storing official, curated ontologies:
+
+```
+{
+  "@context": "https://json-ld.org/contexts/person.jsonld",
+  "@id": "http://dbpedia.org/resource/John_Lennon",
+  "name": "John Lennon",
+  "born": "1940-10-09",
+  "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
+}
+```

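To see what the "@context" buys us, here is a small Python sketch. The context is inlined rather than fetched from a URL, and the expansion is a deliberate simplification of what a real JSON-LD processor does: it maps each short key to a full ontology IRI (schema.org terms here), so records from different sources can be compared on equal footing.

```python
import json

# An inline @context standing in for one fetched from a curated
# ontology site; the IRIs are schema.org terms.
doc = json.loads("""
{
  "@context": {
    "name": "https://schema.org/name",
    "born": "https://schema.org/birthDate"
  },
  "name": "John Lennon",
  "born": "1940-10-09"
}
""")

def expand(document):
    """Replace each short key with the full IRI from @context.
    (A toy version of JSON-LD expansion, not the full algorithm.)"""
    ctx = document["@context"]
    return {ctx.get(k, k): v for k, v in document.items() if k != "@context"}

expanded = expand(doc)
print(expanded["https://schema.org/name"])  # John Lennon
```

After expansion, two archives that used different local field names but the same ontology would agree on every key.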
 ## FOXDEN - a Pilot, Prototype Example

@@ -192,4 +273,109 @@ Linux, in that it provides a collection of tools that work together in
 a modular fashion. It is possible to use all or some of the components.

 Access to the FOXDEN modules is via either a web page for each module
-or by using a command-line tool. The comma
+or by using a command-line tool. Besides being a perfectly reasonable
+way to use the tools, the command-line tool is also well suited to use
+in scripts.
+
+### The Modules
+
+FOXDEN's modular architecture makes it easy to select which components
+you'd like to use and even makes it possible to substitute your own
+software if you have specialized needs. The FOXDEN documentation has a
+walkthrough of basic use; this document will instead give some brief
+background on each component. You're encouraged to work through the
+"Quick Start Guide".
+
+#### Frontend service: web interface
+
+As you would (likely) expect, the Frontend service generates the web
+pages through which users can easily interact with the system.
+Initially, the user is presented with a login page. Once past that,
+access to the other modules is a click or two away. Of particular
+interest is the "docs" button toward the upper-right corner. Having
+the documentation close at hand will prove... handy.
+
+#### Command line (CLI) tool
+
+The command-line tool ("foxden") is both an alternative way for users
+to access the system and a means of interfacing shell scripts to the
+system for automating common tasks. The "foxden" command by itself,
+with no arguments, displays a list of the available commands, a link
+to the documentation, and a reminder of how to get more detailed help.
+
+#### Authentication and authorization service
+
+Even in a purely open research environment, it's still necessary to
+keep track of who is making changes. This is both for proper
+attribution and for non-repudiation (perhaps less of a factor in
+X-Ray Science than in other disciplines, but the system is built to
+be versatile). Web users will see a familiar-looking login screen.
+CLI users will need to authenticate via Kerberos - don't worry, it's
+well described in the introductory documentation, and FOXDEN
+mercifully provides a way to use Kerberos that is simpler than the
+old-school way.
+
+#### Data Discovery service
+
+The Discovery service provides a way to query the underlying
+"management database" that tracks the movement of files and the
+metadata associated with them. The query language is the same one
+MongoDB uses (Mongo QL).
+
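For flavor, MongoDB-style queries are themselves JSON-like documents. The sketch below implements a tiny subset of that matching logic in plain Python - a toy matcher, not the Discovery service's actual code, and the field names and values are invented for the example.

```python
def matches(doc, query):
    """Match a document against a tiny subset of MongoDB query syntax:
    exact equality plus the $gt and $lt comparison operators."""
    for field, cond in query.items():
        if isinstance(cond, dict):
            for op, val in cond.items():
                if op == "$gt" and not doc.get(field, float("-inf")) > val:
                    return False
                if op == "$lt" and not doc.get(field, float("inf")) < val:
                    return False
        elif doc.get(field) != cond:
            return False
    return True

# Hypothetical metadata records, invented for illustration.
records = [
    {"beamline": "ID3B", "energy_keV": 23.5},
    {"beamline": "ID4B", "energy_keV": 11.2},
]
hits = [r for r in records if matches(r, {"energy_keV": {"$gt": 20}})]
print(hits)  # only the ID3B record
```

A real MongoDB query like `{"energy_keV": {"$gt": 20}}` follows exactly this shape, which is why the query language carries over naturally to a JSON-centric metadata store.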
+#### MetaData service
+
+The MetaData service is one of the critical components. This module
+can not only query metadata, for instance finding matching schemas,
+but can also create new schemas and manipulate existing ones.
+
+#### Provenance service
+
+The Provenance service provides a lot of functionality. The tracking
+of "provenance" is not just something art historians do. The term
+refers to tracking every movement of the data we're managing, what
+tools were used to transform it and under what circumstances, and
+where the data came from. This last element could be, say, "from this
+instrument on this beamline" or it could be "Dr. J. Doe's Sept 13th
+dataset, reduced by this lump of MATLAB code".
+
+#### Data Management service
+
+The Data Management service abstracts data movement in and out of the
+underlying Object Store (AWS S3 or compatible). Functions are provided
+both to manage the Object Store and to move data in, move it out, or
+delete it.
+
+#### Publication service
+
+The Publication service has two major sections. The first handles the
+creation and assignment of DOIs (Digital Object Identifiers - you've
+seen these in "References" sections) and the association of that
+identifier with metadata and data. The second section provides a
+means to interact with Zenodo in a manner consistent with the rest of
+FOXDEN.
+
+#### SpecScan service
+
+SpecScan is pretty specific: it is used to create and manipulate
+records for spec scans. It does what it says on the tin.
+
+#### MLHub
+
+The MLHub service allows the user to run various Machine Learning
+(ML) algorithms directly in the FOXDEN environment; TensorFlow is
+supported. Running ML inside of FOXDEN seems odd at first, but it
+follows the paradigm of "moving the compute to the data", avoiding
+time-consuming retrievals.
+
+#### CHESS Analysis Pipeline (CHAP)
+
+The CHAP service simplifies running the CHESS-developed CHAP
+algorithms on data stored in FOXDEN.
+
+#### CHAP Notebook
+
+Designed for novice programmers, the CHAP Notebook service simplifies
+data analysis by giving users a Jupyter-like interface for writing
+code modules that are inserted into pre-defined workflows. These
+modules are also deposited in a code repository for future
+dissemination.
