Commit 6fb85a6

Merge pull request #217 from NESCent/traitdb-inception
Adds legacy docs of early conceptualization
2 parents 79d7eb7 + 54a40f8

File tree

5 files changed: +1122 -0 lines changed

doc/legacy/README.txt

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
The documentation in this directory represents requirements studies
and related artifacts from very early in the conceptualization of a
trait data management application at NESCent.

Contents:

* needs.txt: Needs Assessment
* vision.txt: Product Vision
* traitstudies.txt: Descriptions of data collected in a few trait-based studies.
  - 2 studies with detailed descriptions
  - 2 studies with brief descriptions
* features-sink.txt: Kitchen sink of features, requirements, open questions, etc.

This is very unpolished documentation, and much of it has been superseded
or has subsequently evolved. It is kept here primarily for archival purposes.

doc/legacy/features-sink.txt

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
-*- mode: outline -*-

This is a "kitchen sink" of potential features, requirements, design
ideas, open questions, etc. Some effort was applied to group them into
large groups.

This can be useful for further steps in the requirements process.

* Possible features and ideas

** Data representation

*** Taxon names
Freely-definable names vs. identifiers from an authoritative
database (e.g., uBIO, Catalog of Life)?

*** Taxonomies and phylogenies
- Is taxonomic or phylogenetic subordination of taxa important
  for any aspect of data collection?
- E.g., is it a basis for aggregations? (See next section.)
- Is it useful to support multiple alternative taxonomies within
  one study?

*** Degree of source fidelity
-- Require that data from source papers be preserved in the
   system as is (and linked to its interpretation/cleansing as done
   by the collector),
vs.
-- Allow the collector to perform interpretive work outside the
   system and enter already-cleansed data?

*** Semantically different "NULLs"
-- "This data point was not reported in the paper"
-- "This data point was not yet looked for (maybe it is in the paper)"

*** Ad hoc extensibility of controlled vocabularies
-- No two papers report data in the same way ==> controlled
   vocabularies are not clear ahead of time: need flexibility for
   incorporating unexpected info [and tracking changes in data
   collection policy, implying new collection passes over papers?]
-- A student must have a pre-determined controlled vocabulary for
   entering data, but also the ability to complement or override it with
   free text in exceptional cases. Such entries should be
   re-processable into controlled values later, when the controlled
   vocabulary is extended.
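
One way to accommodate this (a Python sketch; the trait and term names are
made up for illustration) is to store either a controlled term or a free-text
override, so overrides can be re-coded once the vocabulary is extended:

    from dataclasses import dataclass
    from typing import Optional

    # Controlled vocabulary for a hypothetical diet trait.
    DIET_TERMS = {"herbivore", "carnivore", "omnivore"}

    @dataclass
    class VocabEntry:
        controlled_term: Optional[str] = None  # value drawn from the vocabulary
        free_text: Optional[str] = None        # override used when no term fits

    def record_diet(raw: str) -> VocabEntry:
        """Use a controlled term when possible; otherwise keep the raw text
        so it can be re-processed later."""
        term = raw.strip().lower()
        if term in DIET_TERMS:
            return VocabEntry(controlled_term=term)
        return VocabEntry(free_text=raw)

    print(record_diet("Omnivore"))               # controlled value
    print(record_diet("facultative scavenger"))  # free text, re-code later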

*** Integration with EndNote
- Linking of citations to EndNote would be useful -- instead of
  copying them into each cell.

*** Support for BLOBs
-- Maintaining all data about species in a uniform system
   (descriptions, image files, x-rays, superimposed coordinate files,
   all associated metadata). Currently, it is all in files, and
   often important metadata is encoded in file names (e.g., species
   name, specimen location, who took the picture and when, etc.) --
   making it almost impossible to do searches.
-- Storing PDFs of source papers may also fall here.

*** Versioning
Is it important to establish snapshots of the database at different
points in time and be able to revert to them? How about branching
and merging of versions?


** Data management/processing

*** Resolution of obsolete taxon names
Taxonomic identifiers evolve over time. Is support for resolving
obsolete identifiers from old publications into modern ones
needed? One approach would be references through vouchers. Is
there a database that keeps track of the corresponding links? If
not, would community effort along these lines be useful, if
facilitated by a protocol within our "trait bank"?
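
As an illustration only (the synonym data below is a single well-known
example, not a proposed dataset), resolution could start as a simple
synonym table consulted at data-entry time:

    # Map from obsolete name to currently accepted name.
    SYNONYMS = {
        "Felis concolor": "Puma concolor",
    }

    def resolve_taxon(name: str) -> str:
        """Return the accepted name, or the input unchanged if no mapping exists."""
        return SYNONYMS.get(name, name)

    assert resolve_taxon("Felis concolor") == "Puma concolor"
    assert resolve_taxon("Puma concolor") == "Puma concolor"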

*** Aggregation alongside taxonomic subordinations
An intended study is to be done in terms of a fixed taxonomic
rank (about species, about genera, or about families). The source
literature may provide data for the suitable rank, or for lower
ranks. In the latter case, the data needs to be aggregated. The
provenance of the aggregation must be preserved, to help with
possible revisions of the aggregation methodology.
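
A sketch (Python; the records and field names are invented for illustration)
of aggregating species-level values up to a higher rank while keeping the
inputs and the method next to the result, so the aggregation can be redone
under a revised methodology:

    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class Aggregate:
        taxon: str     # the higher-rank taxon the value describes
        value: float   # the aggregated trait value
        method: str    # how the inputs were combined
        sources: list  # the lower-rank records the value was derived from

    # Hypothetical species-level body masses (grams) within one genus.
    species_records = [("Species A", 12.0), ("Species B", 18.0)]

    genus_mass = Aggregate(
        taxon="Genus X",
        value=mean(v for _, v in species_records),
        method="arithmetic mean of species values",
        sources=species_records,
    )
    print(genus_mass)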

*** Reconciliation of contradictions
-- Papers may contradict one another.
-- Must record all info, as well as reconciliation decisions.

*** Revisions
-- Data entry errors can be discovered at any time.
-- Need support for tracking and correcting data that depends
   on the corrected data.

*** Provenance must be tracked
-- Exact raw data and where it came from.
-- How data was translated into the common representation.
-- How data was aggregated.
-- Who performed each of collection, translation, and aggregation.
-- Who performed data corrections, and why.
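
To make this concrete, a provenance record could carry all of these facets
with every stored value (a Python sketch; the field names are assumptions,
not an agreed schema):

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class Provenance:
        source: str          # citation of the source paper
        raw_value: str       # value exactly as reported in the source
        translation: str     # how it was mapped to the common representation
        aggregation: str     # how it was combined with other values, if at all
        collected_by: str
        translated_by: str
        aggregated_by: str
        corrections: list = field(default_factory=list)  # (who, why, when) tuples

    p = Provenance(
        source="Author 1998, Table 2",
        raw_value="body mass 12 g",
        translation="parsed to grams",
        aggregation="none",
        collected_by="student A",
        translated_by="student A",
        aggregated_by="PI",
    )
    p.corrections.append(("PI", "typo in mass", datetime(2009, 1, 15)))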

** User interface

*** Record-like interface for initial data harvesting (to reduce the
    risk of entering data into the wrong cells)

*** Automatic parsing of highlighted PDFs for "semi-bulk" data entry,
    to avoid the re-typing errors that are more likely with
    "field-by-field" entry.

** Collaboration and sharing

- Data contributed by one team member should soon (immediately?) be
  visible to everyone.
- There should be no technical restrictions on carving out areas of work
  to be assigned to members. "Stepping on each other's feet" should
  not lead to chaos.
- Traceability of who did what.

*** Collaboration
-- Undergrads read papers and enter data.
-- The scientist merges data and maintains the master DB.
-- Fine-grained collaboration is possible in the future.

*** Sharing
-- Maintaining integrity of the curated data collection vs. enabling
   other researchers to do the same kinds of data manipulation and
   annotation the original researcher did.

*** Audit trail
-- who changed what, when

*** Backups
-- keep dated snapshots of the data


** Overall architecture

*** Dissemination approach: "publications"
Users can publish these kinds of contributions:
- Static raw data collections -- as extracted from literature
- Dynamic compilations
- "Signed" snapshots of dynamic compilations
- Corrections to the above

The goal: TraitBank itself does not maintain "authoritativeness"
of data -- the community takes over this task by valuing some pubs
over others. TraitBank (or TraitWeb?) just provides an
environment where this happens.

*** Evolution of infrastructure via mining of text annotations

The system should allow researchers to perform manual overrides for
most automatic resolutions (e.g., write in a species name different
from the one looked up by the system; write in a summary number
different from the one calculated by the system).

It should also encourage the researcher to leave behind a textual
explanation of the override (a reference to a more up-to-date
taxonomy; a more appropriate summary formula).

Mining of these explanations will be a great source of system
improvement ideas for the IT team.

*** Data warehousing?
Are the data cleansing, aggregation, and slicing tasks in the TraitDB
workflow similar to those in data warehousing?
-- A clear difference: many more judgment-intensive, non-automatable
   operations.
-- Abstracting from that, can data warehousing approaches be
   transplanted here?

The core similarity is the inverted conceptual perspective: instead
of a schematic container (table) filled with data, we work with a
"soup" of data values where each value is (or appears to be) annotated
with lots of meta-information about its meaning.
- "Real data" is numeric and coded values.
- Each data value is accompanied by meta-information about its meaning.
- The metadata serves as coordinates that can be used to place
  the accompanying value into the proper cell of a table that a
  researcher may want to construct from the values classified
  w.r.t. the same "metadata dimensions".

The tricky thing is this tension:
- To support effective searching, the information must be seen
  as the soup of annotated data points.
- For meaningful research work, the data must be arranged into
  neat tables (possibly with drill-down) along selected
  metadata dimensions.
- Efficient physical storage and interchange is yet another
  matter that must find a sweet spot somewhere between the
  extremes dictated by the other two.

Does BioNumbers.org do something along the lines of storing the
soup of data values? (Do they have dimensions any more structured
than textual descriptions? Do they have facilities for arranging
data points into tables?)
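
A toy illustration (Python; the dimension names and data are made up) of the
"soup" view and of pivoting it into a table along two chosen metadata
dimensions:

    # Each data point is a value plus metadata "coordinates".
    soup = [
        {"value": 12.0, "taxon": "Sp. A", "trait": "body_mass_g", "source": "paper 1"},
        {"value": 3.0,  "taxon": "Sp. A", "trait": "litter_size", "source": "paper 2"},
        {"value": 18.0, "taxon": "Sp. B", "trait": "body_mass_g", "source": "paper 1"},
    ]

    def pivot(points, row_dim, col_dim):
        """Arrange the soup into a nested dict: row -> column -> value."""
        table = {}
        for p in points:
            table.setdefault(p[row_dim], {})[p[col_dim]] = p["value"]
        return table

    # Rows are taxa, columns are traits -- the table a researcher would analyze.
    print(pivot(soup, "taxon", "trait"))
    # {'Sp. A': {'body_mass_g': 12.0, 'litter_size': 3.0}, 'Sp. B': {'body_mass_g': 18.0}}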


*** SQL as the query language

Based on the "schemas" and actual data, the system automatically
synthesises the most appropriate relational schema, informs the user
of the schema, and loads clean data into it. The user can then query
this read-only data directly with SQL, creating table views of
interest, essentially ready to be exported for analysis.
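
A minimal sketch of the idea using Python's built-in sqlite3 (the schema below
is only an assumption about what the system might synthesize, not a
specification):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # A relational schema the system might synthesize from the collected data.
    conn.execute(
        "CREATE TABLE trait_value (taxon TEXT, trait TEXT, value REAL, source TEXT)"
    )
    conn.executemany(
        "INSERT INTO trait_value VALUES (?, ?, ?, ?)",
        [
            ("Sp. A", "body_mass_g", 12.0, "paper 1"),
            ("Sp. B", "body_mass_g", 18.0, "paper 1"),
        ],
    )

    # The researcher queries the read-only data directly with SQL.
    query = (
        "SELECT taxon, AVG(value) FROM trait_value "
        "WHERE trait = 'body_mass_g' GROUP BY taxon"
    )
    for row in conn.execute(query):
        print(row)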

0 commit comments
