Commit d93a4c5

Add concept pages

1 parent 67e8838 commit d93a4c5

2 files changed: +141 -0 lines changed
docs/src/concepts/glossary.md

Lines changed: 21 additions & 0 deletions
<!-- markdownlint-disable MD013 -->

# Glossary

We have taken care to use terminology consistently throughout the documentation.

<!-- Contributors: Please keep this table in alphabetical order -->

| Term | Definition |
| --- | --- |
| <span id="DAG">DAG</span> | A directed acyclic graph (DAG) is a set of nodes connected by a set of directed edges that form no cycles. This means that, following the directed edges, there is never a path back to a node after passing through it. Formal workflow management systems represent workflows in the form of DAGs. |
| <span id="data-pipeline">data pipeline</span> | A sequence of data transformation steps from data sources through multiple intermediate structures. More generally, a data pipeline is a directed acyclic graph. In DataJoint, each step is represented by a table in a relational database. |
| <span id="datajoint">DataJoint</span> | A software framework for database programming directly from MATLAB and Python. Thanks to its support of automated computational dependencies, DataJoint serves as a workflow management system. |
| <span id="datajoint-elements">DataJoint Elements</span> | Software modules implementing portions of experiment workflows, designed for ease of integration into diverse custom workflows. |
| <span id="datajoint-pipeline">DataJoint pipeline</span> | The data schemas and transformations underlying a DataJoint workflow. DataJoint allows defining code that specifies both the workflow and the data pipeline, so the words "pipeline" and "workflow" are used almost interchangeably. |
| <span id="datajoint-schema">DataJoint schema</span> | A software module implementing a portion of an experiment workflow. Includes database table definitions, dependencies, and associated computations. |
| <span id="djhub">djHub</span> | Our team's internal platform for delivering cloud-based infrastructure to support online training resources, validation studies, and collaborative projects. |
| <span id="foreign-key">foreign key</span> | A field that is linked to another table's primary key. |
| <span id="primary-key">primary key</span> | The subset of table attributes that uniquely identify each entity in the table. |
| <span id="secondary-attribute">secondary attribute</span> | Any field in a table that is not part of the primary key. |
| <span id="workflow">workflow</span> | A formal representation of the steps for executing an experiment from data collection to analysis, as well as the software configured for performing these steps. A typical workflow is composed of tables with inter-dependencies and processes to compute and insert data into the tables. |

docs/src/concepts/principles.md

Lines changed: 120 additions & 0 deletions
# Principles

## Theoretical Foundations

*DataJoint Core* implements a systematic framework for the joint management of structured scientific data and its associated computations.
The framework builds on the theoretical foundations of the [Relational Model](https://en.wikipedia.org/wiki/Relational_model) and
the [Entity-Relationship Model](https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model),
introducing a number of critical clarifications for the effective use of databases as scientific data pipelines.
Notably, DataJoint introduces *computational dependencies* as a native, first-class citizen of the data model.
This integration of data structure and computation into a single model defines a new class of *computational scientific databases*.

This page defines the key principles of the model without attachment to any specific implementation;
a more complete description can be found in [Yatsenko et al., 2018](https://doi.org/10.48550/arXiv.1807.11104).

The DataJoint developers are codifying these principles into an
[open standard](https://en.wikipedia.org/wiki/Open_standard) to allow multiple alternative implementations.

## Data Representation

### Tables = Entity Sets

DataJoint uses only one data structure in all its operations—the *entity set*.

1. All data are represented in the form of *entity sets*, i.e. unordered collections of *entities*.
2. All entities of an entity set belong to the same well-defined entity class and have the same set of named attributes.
3. Each attribute in an entity set has a *data type* (or *domain*), representing the set of its valid values.
4. Each entity in an entity set provides the *attribute values* for all of the attributes of its entity class.
5. Each entity set has a *primary key*, *i.e.* a subset of attributes that, jointly, uniquely identify any entity in the set.

These formal terms have more common (if less precise) equivalents:

| formal | common |
| :-: | :-: |
| entity set | *table* |
| attribute | *column* |
| attribute value | *field* |
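The primary-key property above can be illustrated with a small sketch in plain Python (an illustration only, not DataJoint code): given an entity set represented as a list of attribute dictionaries, a candidate key is valid only if its attributes jointly identify each entity uniquely.

```python
def is_primary_key(entities, key_attrs):
    """Check that key_attrs jointly identify each entity uniquely.

    entities: list of dicts, all sharing the same named attributes.
    key_attrs: candidate subset of attribute names.
    """
    seen = set()
    for entity in entities:
        key = tuple(entity[a] for a in key_attrs)
        if key in seen:
            return False  # two entities share the same key value
        seen.add(key)
    return True

# A toy entity set (attribute names are illustrative):
employees = [
    {"employee_id": 1, "name": "Ada"},
    {"employee_id": 2, "name": "Ada"},  # duplicate name, distinct id
]
```

Here `["employee_id"]` qualifies as a primary key for `employees`, while `["name"]` does not, since two distinct entities share the same name.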
A collection of *stored tables* makes up a *database*.
*Derived tables* are formed through *query expressions*.

### Table Definition

DataJoint introduces a streamlined syntax for defining a stored table.

Each line in the definition declares an attribute with its name, data type, an optional default value, and an optional comment, in the format:

```
name [=default] : type [# comment]
```

Primary key attributes come first and are separated from the rest of the attributes by the divider `---`.

For example, the following code defines the entity set for entities of class `Employee`:

```
employee_id : int
---
ssn = null : int # optional social security number
date_of_birth : date
gender : enum('male', 'female', 'other')
home_address="" : varchar(1000)
primary_phone="" : varchar(12)
```
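To make the `name [=default] : type [# comment]` format concrete, here is a minimal parser sketch (an illustration only, not DataJoint's actual parser) that splits a single attribute line into its four parts:

```python
import re

# One attribute line: name [=default] : type [# comment]
ATTRIBUTE_LINE = re.compile(
    r"^\s*(?P<name>\w+)\s*"             # attribute name
    r"(?:=\s*(?P<default>[^:]*?)\s*)?"  # optional default value
    r":\s*(?P<type>[^#]+?)\s*"          # data type
    r"(?:#\s*(?P<comment>.*?)\s*)?$"    # optional trailing comment
)

def parse_attribute(line):
    """Split one definition line into name, default, type, and comment.

    Blank lines and the `---` divider are not handled here.
    """
    match = ATTRIBUTE_LINE.match(line)
    if match is None:
        raise ValueError(f"not a valid attribute line: {line!r}")
    return match.groupdict()
```

For instance, `parse_attribute("ssn = null : int # optional social security number")` yields name `ssn`, default `null`, type `int`, and the comment text; attributes without a default or comment yield `None` for those parts.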

### Data Tiers

Stored tables are designated into one of four *tiers*, indicating how their data originate.

| table tier | data origin |
| --- | --- |
| lookup | Contents are part of the table definition, specified *a priori* rather than entered externally. Typically stores general facts, parameters, options, *etc.* |
| manual | Contents are populated by external mechanisms such as manual entry through web apps or data ingest scripts. |
| imported | Contents are populated automatically by pipeline computations accessing data from upstream in the pipeline **and** from external data sources such as raw data stores. |
| computed | Contents are populated automatically by pipeline computations accessing data only from upstream in the pipeline. |
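The tiers differ along two axes: whether the pipeline itself populates the table, and whether data originates outside the database. A toy sketch summarizing the table above (illustrative names, not DataJoint API):

```python
# Origin of each tier's contents, per the table above.
TIERS = {
    "lookup":   {"populated_by_pipeline": False, "uses_external_data": False},  # part of the definition
    "manual":   {"populated_by_pipeline": False, "uses_external_data": True},   # entered externally
    "imported": {"populated_by_pipeline": True,  "uses_external_data": True},   # pipeline + raw data stores
    "computed": {"populated_by_pipeline": True,  "uses_external_data": False},  # pipeline only
}

def is_auto_populated(tier):
    """True when pipeline computations fill the table's contents."""
    return TIERS[tier]["populated_by_pipeline"]
```

Only `imported` and `computed` tables are auto-populated; they differ in whether external data sources are also consulted.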

### Object Serialization

### Data Normalization

A collection of data is considered normalized when it is organized into a collection of entity sets,
where each entity set represents a well-defined entity class, with all its attributes applicable
to each entity in the set and the same primary key identifying each entity throughout the set.

The normalization procedure often includes splitting data from one table into several tables,
one for each proper entity set.
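As a worked illustration of this splitting step, consider hypothetical subject/session records in plain Python (not database code): subject attributes repeat on every session row, so the rows mix two entity classes and should be split.

```python
# Denormalized rows: "species" belongs to the subject, not the session,
# so it repeats on every session row for the same subject.
rows = [
    {"subject_id": 1, "species": "mouse", "session": 1, "duration_min": 30},
    {"subject_id": 1, "species": "mouse", "session": 2, "duration_min": 45},
    {"subject_id": 2, "species": "rat",   "session": 1, "duration_min": 60},
]

def normalize(rows):
    """Split into two proper entity sets:
    Subject (primary key: subject_id) and Session (primary key: subject_id, session)."""
    subjects = {}
    sessions = []
    for r in rows:
        subjects[r["subject_id"]] = {"subject_id": r["subject_id"],
                                     "species": r["species"]}
        sessions.append({"subject_id": r["subject_id"],
                         "session": r["session"],
                         "duration_min": r["duration_min"]})
    return list(subjects.values()), sessions
```

After the split, each entity set has a well-defined entity class, and the repeated `species` values are stored exactly once per subject.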

### Databases and Schemas

Stored tables are named and grouped into namespaces called *schemas*.
A collection of schemas makes up a *database*.
A *database* has a globally unique address or name.
A *schema* has a unique name within its database.
Within a *connection* to a particular database, a stored table is identified as `schema.Table`.
A schema typically groups tables that are logically related.

## Dependencies

Entity sets can form referential dependencies that express and enforce relationships between their entities.

### Diagramming

## Data integrity

### Entity integrity

*Entity integrity* is the guarantee, made by the data management process, of a 1:1 mapping between
real-world entities and their digital representations.
In practice, entity integrity is ensured when it is made clear how entities are distinguished and uniquely identified.

### Referential integrity

### Group integrity

## Data manipulations

## Data queries

### Query Operators

## Pipeline computations
