|
| 1 | +# Principles |
| 2 | + |
| 3 | +## Theoretical Foundations |
| 4 | +*DataJoint Core* implements a systematic framework for the joint management of structured scientific data and its associated computations. |
| 5 | +The framework builds on the theoretical foundations of the [Relational Model](https://en.wikipedia.org/wiki/Relational_model) and |
| 6 | +the [Entity-Relationship Model](https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model), |
| 7 | +introducing a number of critical clarifications for the effective use of databases as scientific data pipelines. |
| 8 | +Notably, DataJoint introduces the concept of *computational dependencies* as a native first-class citizen of the data model. |
| 9 | +This integration of data structure and computation into a single model defines a new class of *computational scientific databases*. |
| 10 | + |
| 11 | +This page defines the key principles of this model without attachment to a specific implementation, while |
| 12 | +a more complete description of the model can be found in [Yatsenko et al., 2018](https://doi.org/10.48550/arXiv.1807.11104). |
| 13 | + |
| 14 | +DataJoint developers are formalizing these principles into an |
| 15 | +[open standard](https://en.wikipedia.org/wiki/Open_standard) to allow multiple alternative implementations. |
| 16 | + |
| 17 | +## Data Representation |
| 18 | + |
| 19 | +### Tables = Entity Sets |
| 20 | + |
| 21 | +DataJoint uses only one data structure in all its operations—the *entity set*. |
| 22 | + |
| 23 | +1. All data are represented in the form of *entity sets*, *i.e.* ordered collections of *entities*. |
| 24 | +2. All entities of an entity set belong to the same well-defined entity class and have the same set of named attributes. |
| 25 | +3. Each attribute in an entity set has a *data type* (or *domain*), representing the set of its valid values. |
| 26 | +4. Each entity in an entity set provides the *attribute values* for all of the attributes of its entity class. |
| 27 | +5. Each entity set has a *primary key*, *i.e.* a subset of attributes that, jointly, uniquely identify any entity in the set. |
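The principles in the list above can be illustrated with a small in-memory sketch. This is not DataJoint's API: the `EntitySet` class, its `insert` method, and the example attribute names are hypothetical, and Python types stand in for attribute domains.

```python
from dataclasses import dataclass, field


@dataclass
class EntitySet:
    """Hypothetical in-memory entity set (illustration only, not DataJoint's API)."""

    attributes: dict      # attribute name -> Python type standing in for its domain
    primary_key: tuple    # names of the attributes that form the primary key
    _rows: dict = field(default_factory=dict, init=False)

    def insert(self, entity: dict) -> None:
        # Every entity must provide a value for exactly the attributes of its class.
        if set(entity) != set(self.attributes):
            raise ValueError("entity must provide exactly the declared attributes")
        # Every value must belong to the domain (data type) of its attribute.
        for name, domain in self.attributes.items():
            if not isinstance(entity[name], domain):
                raise TypeError(f"{name!r} must be of type {domain.__name__}")
        # The primary key values must jointly identify the entity uniquely.
        key = tuple(entity[name] for name in self.primary_key)
        if key in self._rows:
            raise KeyError(f"duplicate primary key {key}")
        self._rows[key] = dict(entity)

    def __len__(self) -> int:
        return len(self._rows)


# Usage: a second entity with the same primary key would be rejected.
staff = EntitySet(attributes={"employee_id": int, "name": str},
                  primary_key=("employee_id",))
staff.insert({"employee_id": 1, "name": "Ada"})
```

A real implementation would delegate these checks to the database engine; the sketch only makes the three constraints explicit.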
| 28 | + |
| 29 | +These formal terms have more common (even if less precise) variants: |
| 30 | + |
| 31 | +| formal | common | |
| 32 | +|:-:|:--:| |
| 33 | +| entity set | *table* | |
| 34 | +| attribute | *column* | |
| 35 | +| attribute value | *field* | |
| 36 | + |
| 37 | +A collection of *stored tables* makes up a *database*. |
| 38 | +*Derived tables* are formed through *query expressions*. |
| 39 | + |
| 40 | +### Table Definition |
| 41 | +DataJoint introduces a streamlined syntax for defining a stored table. |
| 42 | + |
| 43 | +Each line in the definition defines an attribute with its name, data type, an optional default value, and an optional comment in the format: |
| 44 | +``` |
| 45 | +name [=default] : type [# comment] |
| 46 | +``` |
| 47 | + |
| 48 | +Primary attributes come first and are separated from the rest of the attributes with the divider `---`. |
| 49 | + |
| 50 | +For example, the following code defines the entity set for entities of class `Employee`: |
| 51 | + |
| 52 | +``` |
| 53 | +employee_id : int |
| 54 | +--- |
| 55 | +ssn = null : int # optional social security number |
| 56 | +date_of_birth : date |
| 57 | +gender : enum('male', 'female', 'other') |
| 58 | +home_address="" : varchar(1000) |
| 59 | +primary_phone="" : varchar(12) |
| 60 | +``` |
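To make the line format concrete, the following is a simplified sketch of how such a definition could be split into its parts. It is not DataJoint's parser: `Attribute`, `parse_definition`, and `EMPLOYEE_DEFINITION` are hypothetical names, and the sketch assumes that `#` and `:` never appear inside a default value or a type.

```python
from typing import NamedTuple, Optional


class Attribute(NamedTuple):
    name: str
    type: str
    default: Optional[str]   # None when no default is declared
    comment: str
    in_primary_key: bool


def parse_definition(definition: str) -> list:
    """Split a table definition into attributes (simplified, hypothetical parser)."""
    attributes = []
    in_primary_key = True  # attributes above the `---` divider are primary
    for raw_line in definition.strip().splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("---"):
            in_primary_key = False
            continue
        # name [=default] : type [# comment]
        declaration, _, comment = line.partition("#")
        left, _, type_ = declaration.partition(":")
        name, has_default, default = left.partition("=")
        attributes.append(Attribute(
            name=name.strip(),
            type=type_.strip(),
            default=default.strip() if has_default else None,
            comment=comment.strip(),
            in_primary_key=in_primary_key,
        ))
    return attributes


EMPLOYEE_DEFINITION = """
employee_id : int
---
ssn = null : int # optional social security number
date_of_birth : date
gender : enum('male', 'female', 'other')
home_address="" : varchar(1000)
primary_phone="" : varchar(12)
"""

employee_attributes = parse_definition(EMPLOYEE_DEFINITION)
primary_key = [a.name for a in employee_attributes if a.in_primary_key]
```

Here `primary_key` collects the names declared above the divider, which for the `Employee` example is the single attribute `employee_id`.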
| 61 | + |
| 62 | + |
| 63 | +### Data Tiers |
| 64 | +Each stored table is assigned to one of four *tiers*, indicating how its data originate. |
| 65 | + |
| 66 | +| table tier | data origin | |
| 67 | +| --- | --- | |
| 68 | +| lookup | contents are part of the table definition, defined *a priori* rather than entered externally. Typically stores general facts, parameters, options, *etc.* |
| 69 | +| manual | contents are populated by external mechanisms such as manual entry through web apps or by data ingest scripts |
| 70 | +| imported | contents are populated automatically by pipeline computations accessing data from upstream in the pipeline **and** from external data sources such as raw data stores. |
| 71 | +| computed | contents are populated automatically by pipeline computations accessing data from upstream in the pipeline. | |
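The distinctions drawn in the tier table can be summarized in a few lines of code. This is only a sketch of the classification: the `Tier` enum and the two helper functions are hypothetical names, not part of any DataJoint implementation.

```python
from enum import Enum


class Tier(Enum):
    """The four table tiers, named as in the table above (hypothetical enum)."""
    LOOKUP = "lookup"
    MANUAL = "manual"
    IMPORTED = "imported"
    COMPUTED = "computed"


def is_auto_populated(tier: Tier) -> bool:
    # Imported and computed tables are filled by pipeline computations.
    return tier in (Tier.IMPORTED, Tier.COMPUTED)


def reads_external_sources(tier: Tier) -> bool:
    # Only imported tables combine upstream data with external data sources.
    return tier is Tier.IMPORTED
```

The difference between the imported and computed tiers is therefore only whether the computation reaches outside the pipeline for its inputs.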
| 72 | + |
| 73 | +### Object Serialization |
| 74 | + |
| 75 | +### Data Normalization |
| 76 | +A collection of data is considered normalized when organized into a collection of entity sets, |
| 77 | +where each entity set represents a well-defined entity class with all its attributes applicable |
| 78 | +to each entity in the set and the same primary key identifying |
| 79 | + |
| 80 | +The normalization procedure often includes splitting data from one table into several tables, |
| 81 | +one for each proper entity set. |
| 82 | + |
| 83 | + |
| 84 | + |
| 85 | +### Databases and Schemas |
| 86 | +Stored tables are named and grouped into namespaces called *schemas*. |
| 87 | +A collection of schemas makes up a *database*. |
| 88 | +A *database* has a globally unique address or name. |
| 89 | +A *schema* has a unique name within its database. |
| 90 | +Within a *connection* to a particular database, a stored table is identified as `schema.Table`. |
| 91 | +A schema typically groups tables that are logically related. |
| 92 | + |
| 93 | + |
| 94 | +## Dependencies |
| 95 | +Entity sets can form referential dependencies that express and |
| 96 | + |
| 97 | + |
| 98 | + |
| 99 | + |
| 100 | +### Diagramming |
| 101 | + |
| 102 | +## Data Integrity |
| 103 | + |
| 104 | +### Entity Integrity |
| 105 | +*Entity integrity* is the guarantee made by the data management process of the 1:1 mapping between |
| 106 | +real-world entities and their digital representations. |
| 107 | +In practice, entity integrity is ensured when it is made clear |
| 108 | + |
| 109 | +### Referential Integrity |
| 110 | + |
| 111 | +### Group Integrity |
| 112 | + |
| 113 | +## Data Manipulations |
| 114 | + |
| 115 | +## Data Queries |
| 116 | + |
| 117 | +### Query Operators |
| 118 | + |
| 119 | +## Pipeline Computations |
| 120 | + |