Commit 2f7bbe6

Document dj.Top() and add missing pages
1 parent eef7e59 commit 2f7bbe6

File tree

14 files changed: +1253 -88 lines changed


.vscode/settings.json

Lines changed: 1 addition & 1 deletion
@@ -17,5 +17,5 @@
   "[dockercompose]": {
     "editor.defaultFormatter": "disable"
   },
-  "files.autoSave": "off"
+  "files.autoSave": "afterDelay"
 }

docs/src/concepts/data-model.md

Lines changed: 93 additions & 38 deletions
@@ -2,11 +2,23 @@

 ## What is a data model?

-A **data model** refers to a conceptual framework for thinking about data and about
-operations on data.
-A data model defines the mental toolbox of the data scientist; it has less to do with
-the architecture of the data systems, although architectures are often intertwined with
-data models.
+A **data model** is a conceptual framework that defines how data is organized,
+represented, and transformed. It gives us the components for creating blueprints for the
+structure and operations of data management systems, ensuring consistency and efficiency
+in data handling.
+
+Data management systems are built to accommodate these models, allowing us to manage
+data according to the principles laid out by the model. If you’re studying data science
+or engineering, you’ve likely encountered different data models, each providing a unique
+approach to organizing and manipulating data.
+
+A data model is defined by considering the following key aspects:
+
++ What are the fundamental elements used to structure the data?
++ What operations are available for defining, creating, and manipulating the data?
++ What mechanisms exist to enforce the structure and rules governing valid data interactions?
+
 ## Types of data models

 Among the most familiar data models are those based on files and folders: data of any
 kind are lumped together into binary strings called **files**, files are collected into
@@ -24,17 +36,16 @@ objects in memory with properties and methods for transformations of such data.
 ## Relational data model

 The **relational model** is a way of thinking about data as sets and operations on sets.
-Formalized almost a half-century ago
-([Codd, 1969](https://dl.acm.org/citation.cfm?doid=362384.362685)), the relational data
-model provides the most rigorous approach to structured data storage and the most
-precise approach to data querying.
-The model is defined by the principles of data representation, domain constraints,
-uniqueness constraints, referential constraints, and declarative queries as summarized
-below.
+Formalized almost a half-century ago ([Codd,
+1969](https://dl.acm.org/citation.cfm?doid=362384.362685)), the relational data model is
+one of the most powerful and precise ways to store and manage structured data. At its
+core, this model organizes all data into tables (representing mathematical
+relations), where each table consists of rows (representing mathematical tuples) and
+columns (often called attributes).

 ### Core principles of the relational data model

-**Data representation**
+**Data representation:**
 Data are represented and manipulated in the form of relations.
 A relation is a set (i.e. an unordered collection) of entities of values for each of
 the respective named attributes of the relation.
@@ -43,26 +54,26 @@ below.
 A collection of base relations with their attributes, domain constraints, uniqueness
 constraints, and referential constraints is called a schema.

-**Domain constraints**
-Attribute values are drawn from corresponding attribute domains, i.e. predefined sets
-of values.
-Attribute domains may not include relations, which keeps the data model flat, i.e.
-free of nested structures.
+**Domain constraints:**
+Each attribute (column) in a table is associated with a specific attribute domain (or
+datatype, a set of possible values), ensuring that the data entered is valid.
+Attribute domains may not include relations, which keeps the data model
+flat, i.e. free of nested structures.

-**Uniqueness constraints**
+**Uniqueness constraints:**
 Entities within relations are addressed by values of their attributes.
 To identify and relate data elements, uniqueness constraints are imposed on subsets
 of attributes.
 Such subsets are then referred to as keys.
 One key in a relation is designated as the primary key used for referencing its elements.

-**Referential constraints**
+**Referential constraints:**
 Associations among data are established by means of referential constraints with the
 help of foreign keys.
 A referential constraint on relation A referencing relation B allows only those
 entities in A whose foreign key attributes match the key attributes of an entity in B.

-**Declarative queries**
+**Declarative queries:**
 Data queries are formulated through declarative, as opposed to imperative,
 specifications of sought results.
 This means that query expressions convey the logic for the result rather than the
@@ -86,23 +97,26 @@ Similar to spreadsheets, relations are often visualized as tables with *attribut
 corresponding to *columns* and *entities* corresponding to *rows*.
 In particular, SQL uses the terms *table*, *column*, and *row*.

-## DataJoint is a refinement of the relational data model
-
-DataJoint is a conceptual refinement of the relational data model offering a more
-expressive and rigorous framework for database programming
-([Yatsenko et al., 2018](https://arxiv.org/abs/1807.11104)).
-The DataJoint model facilitates clear conceptual modeling, efficient schema design, and
-precise and flexible data queries.
-The model has emerged over a decade of continuous development of complex data pipelines
-for neuroscience experiments
-([Yatsenko et al., 2015](https://www.biorxiv.org/content/early/2015/11/14/031658)).
-DataJoint has allowed researchers with no prior knowledge of databases to collaborate
-effectively on common data pipelines sustaining data integrity and supporting flexible
-access.
-DataJoint is currently implemented as client libraries in MATLAB and Python.
-These libraries work by transpiling DataJoint queries into SQL before passing them on
-to conventional relational database systems that serve as the backend, in combination
-with bulk storage systems for storing large contiguous data objects.
+## The DataJoint Model
+
+DataJoint is a conceptual refinement of the relational data model offering a more
+expressive and rigorous framework for database programming ([Yatsenko et al.,
+2018](https://arxiv.org/abs/1807.11104)). The DataJoint model facilitates conceptual
+clarity, efficiency, workflow management, and precise and flexible data
+queries. By enforcing entity normalization,
+simplifying dependency declarations, offering a rich query algebra, and visualizing
+relationships through schema diagrams, DataJoint makes relational database programming
+more intuitive and robust for complex data pipelines.
+
+The model has emerged over a decade of continuous development of complex data
+pipelines for neuroscience experiments ([Yatsenko et al.,
+2015](https://www.biorxiv.org/content/early/2015/11/14/031658)). DataJoint has allowed
+researchers with no prior knowledge of databases to collaborate effectively on common
+data pipelines sustaining data integrity and supporting flexible access. DataJoint is
+currently implemented as client libraries in MATLAB and Python. These libraries work by
+transpiling DataJoint queries into SQL before passing them on to conventional relational
+database systems that serve as the backend, in combination with bulk storage systems for
+storing large contiguous data objects.

 DataJoint comprises:

@@ -115,3 +129,44 @@ modeled entities
 The key refinement of DataJoint over other relational data models and their
 implementations is DataJoint's support of
 [entity normalization](../design/normalization.md).
+
+### Core principles of the DataJoint model
+
+**Entity Normalization**
+DataJoint enforces entity normalization, ensuring that every entity set (table) is
+well-defined, with each element belonging to the same type, sharing the same
+attributes, and distinguished by the same primary key. This principle reduces
+redundancy and avoids data anomalies, similar to Boyce-Codd Normal Form, but with a
+more intuitive structure than traditional SQL.
+
+**Simplified Schema Definition and Dependency Management**
+DataJoint introduces a schema definition language that is more expressive and less
+error-prone than SQL. Dependencies are explicitly declared using arrow notation
+(`->`), making referential constraints easier to understand and visualize. The
+dependency structure is enforced as an acyclic directed graph, which simplifies
+workflows by preventing circular dependencies.
+
+**Integrated Query Operators Producing a Relational Algebra**
+DataJoint introduces five query operators (restrict, join, project, aggregate, and
+union) with algebraic closure, allowing them to be combined seamlessly. These
+operators are designed to maintain operational entity normalization, ensuring query
+outputs remain valid entity sets.
+
+**Diagramming Notation for Conceptual Clarity**
+DataJoint’s schema diagrams simplify the representation of relationships between
+entity sets compared to ERM diagrams. Relationships are expressed as dependencies
+between entity sets, which are visualized using solid or dashed lines for primary
+and secondary dependencies, respectively.
+
+**Unified Logic for Binary Operators**
+DataJoint simplifies binary operations by requiring attributes involved in joins or
+comparisons to be homologous (i.e., sharing the same origin). This avoids the
+ambiguity and pitfalls of natural joins in SQL, ensuring more predictable query
+results.
+
+**Optimized Data Pipelines for Scientific Workflows**
+DataJoint treats the database as a data pipeline where each entity set defines a
+step in the workflow. This makes it ideal for scientific experiments and complex
+data processing, such as in neuroscience. Its MATLAB and Python libraries transpile
+DataJoint queries into SQL, bridging the gap between scientific programming and
+relational databases.
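The algebraic closure of these query operators can be illustrated with a small plain-Python sketch. This is a conceptual toy, not DataJoint code or its API: relations are modeled as lists of dict rows, and the `subject`/`session` data, helper names, and filter condition are invented for illustration.

```python
# Toy relational operators with algebraic closure: each function takes
# relations (lists of dict rows) and returns a relation, mirroring
# DataJoint's restrict (&), join (*), and project (.proj()) conceptually.

def restrict(relation, cond):
    """Keep only the rows satisfying the condition."""
    return [row for row in relation if cond(row)]

def join(a, b):
    """Natural join on shared attribute names."""
    shared = set(a[0]) & set(b[0]) if a and b else set()
    return [{**ra, **rb} for ra in a for rb in b
            if all(ra[k] == rb[k] for k in shared)]

def project(relation, *attrs):
    """Keep only the named attributes."""
    return [{k: row[k] for k in attrs} for row in relation]

subject = [{"subject_id": 1, "species": "mouse"},
           {"subject_id": 2, "species": "rat"}]
session = [{"subject_id": 1, "session": 1, "duration": 30.0},
           {"subject_id": 1, "session": 2, "duration": 45.0}]

# Operators compose freely because each output is again a relation.
result = project(restrict(join(subject, session),
                          lambda r: r["duration"] > 40),
                 "subject_id", "session")
print(result)  # [{'subject_id': 1, 'session': 2}]
```

Because every operator returns a valid relation, the output of one expression can feed the next, which is the closure property the section describes.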

docs/src/concepts/data-pipelines.md

Lines changed: 7 additions & 7 deletions
@@ -157,10 +157,10 @@ with external groups.
 ## Summary of DataJoint features

 1. A free, open-source framework for scientific data pipelines and workflow management
-1. Data hosting in cloud or in-house
-1. MySQL, filesystems, S3, and Globus for data management
-1. Define, visualize, and query data pipelines from MATLAB or Python
-1. Enter and view data through GUIs
-1. Concurrent access by multiple users and computational agents
-1. Data integrity: identification, dependencies, groupings
-1. Automated distributed computation
+2. Data hosting in cloud or in-house
+3. MySQL, filesystems, S3, and Globus for data management
+4. Define, visualize, and query data pipelines from MATLAB or Python
+5. Enter and view data through GUIs
+6. Concurrent access by multiple users and computational agents
+7. Data integrity: identification, dependencies, groupings
+8. Automated distributed computation

docs/src/design/alter.md

Lines changed: 52 additions & 0 deletions
@@ -1 +1,53 @@
 # Altering Populated Pipelines
+
+Tables can be altered after they have been declared and populated. This is useful when
+you want to add new secondary attributes or change the data type of existing attributes.
+Users can use the `definition` property to update a table's attributes and then use
+`alter` to apply the changes in the database. Currently, `alter` does not support
+changes to primary key attributes.
+
+Let's say we have a table `Student` with the following attributes:
+
+```python
+@schema
+class Student(dj.Manual):
+    definition = """
+    student_id: int
+    ---
+    first_name: varchar(40)
+    last_name: varchar(40)
+    home_address: varchar(100)
+    """
+```
+
+We can modify the table to include a new attribute `email`:
+
+```python
+Student.definition = """
+student_id: int
+---
+first_name: varchar(40)
+last_name: varchar(40)
+home_address: varchar(100)
+email: varchar(100)
+"""
+Student.alter()
+```
+
+The `alter` method will update the table in the database to include the new attribute
+`email` added by the user in the table's `definition` property.
+
+Similarly, you can modify the data type or length of an existing attribute. For example,
+to alter the `home_address` attribute to have a length of 200 characters:
+
+```python
+Student.definition = """
+student_id: int
+---
+first_name: varchar(40)
+last_name: varchar(40)
+home_address: varchar(200)
+email: varchar(100)
+"""
+Student.alter()
+```

docs/src/design/tables/blobs.md

Lines changed: 26 additions & 1 deletion
@@ -1 +1,26 @@
-# Work in progress
+# Overview
+
+DataJoint provides functionality for serializing and deserializing complex data types
+into binary blobs for efficient storage and compatibility with MATLAB's mYm
+serialization. This includes support for:
+
++ Basic Python data types (e.g., integers, floats, strings, dictionaries).
++ NumPy arrays and scalars.
++ Specialized data types like UUIDs, decimals, and datetime objects.
+
+## Serialization and Deserialization Process
+
+Serialization converts Python objects into a binary representation for efficient storage
+within the database. Deserialization converts the binary representation back into the
+original Python object.
+
+Blobs over 1 KiB are compressed using the zlib library to reduce storage requirements.
+
+## Supported Data Types
+
+DataJoint supports the following data types for serialization:
+
++ Scalars: Integers, floats, booleans, strings.
++ Collections: Lists, tuples, sets, dictionaries.
++ NumPy: Arrays, structured arrays, and scalars.
++ Custom Types: UUIDs, decimals, datetime objects, MATLAB cell and struct arrays.
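The 1 KiB compression rule added here can be sketched in a few lines of stdlib Python. This is an illustrative sketch only: the one-byte flag format below is invented for the example, while DataJoint's real blobs carry mYm protocol headers and type tags instead.

```python
import zlib

KIB = 1024  # blobs larger than 1 KiB get compressed

def pack_blob(payload: bytes) -> bytes:
    # A one-byte flag records whether the body is compressed.
    # (Flag format invented for this sketch; not DataJoint's wire format.)
    if len(payload) > KIB:
        return b"\x01" + zlib.compress(payload)
    return b"\x00" + payload

def unpack_blob(blob: bytes) -> bytes:
    # Inspect the flag and reverse whichever branch pack_blob took.
    flag, body = blob[:1], blob[1:]
    return zlib.decompress(body) if flag == b"\x01" else body

small, large = b"x" * 100, b"y" * 10_000
assert unpack_blob(pack_blob(small)) == small  # stored uncompressed
assert unpack_blob(pack_blob(large)) == large  # round-trips intact
assert len(pack_blob(large)) < len(large)      # zlib shrank the large blob
```

The size threshold avoids paying compression overhead on tiny payloads while keeping large repetitive arrays compact.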

docs/src/faq.md

Lines changed: 12 additions & 11 deletions
@@ -4,17 +4,18 @@

 It is common to enter data during experiments using a graphical user interface.

-1. [DataJoint LabBook](https://github.com/datajoint/datajoint-labbook) is an open
-source project for data entry.
+1. The [DataJoint Works](https://works.datajoint.com) platform is a web-based, fully
+managed service to host and execute data pipelines.

-2. The DataJoint Works platform is set up as a fully managed service to host and
-execute data pipelines.
+2. [DataJoint LabBook](https://github.com/datajoint/datajoint-labbook) is an open
+source project for data entry but is no longer actively maintained.

 ## Does DataJoint support other programming languages?

-DataJoint [Python](https://datajoint.com/docs/core/datajoint-python/) and
-[Matlab](https://datajoint.com/docs/core/datajoint-matlab/) APIs are both actively
-supported. Previous projects implemented some DataJoint features in
+DataJoint [Python](https://datajoint.com/docs/core/datajoint-python/) is the most
+up-to-date version, and all future development will focus on the Python API. The
+[Matlab](https://datajoint.com/docs/core/datajoint-matlab/) API was actively developed
+through 2023. Previous projects implemented some DataJoint features in
 [Julia](https://github.com/BrainCOGS/neuronex_workshop_2018/tree/julia/julia) and
 [Rust](https://github.com/datajoint/datajoint-core). DataJoint's data model and data
 representation are largely language independent, which means that any language with a

@@ -92,15 +93,15 @@ The entry of metadata can be manual, or it can be an automated part of data acqu
 into the database).

 Depending on their size and contents, raw data files can be stored in a number of ways.
-In the simplest and most common scenario, raw data continue to be stored in either a 
+In the simplest and most common scenario, raw data continue to be stored in either a
 local filesystem or in the cloud as collections of files and folders.
 The paths to these files are entered in the database (again, either manually or by
 automated processes).
 This is the point at which the notion of a **data pipeline** begins.
 Below these "manual tables" that contain metadata and file paths are a series of tables
 that load raw data from these files, process it in some way, and insert derived or
 summarized data directly into the database.
-For example, in an imaging application, the very large raw .TIFF stacks would reside on
+For example, in an imaging application, the very large raw `.TIFF` stacks would reside on
 the filesystem, but the extracted fluorescent trace timeseries for each cell in the
 image would be stored as a numerical array directly in the database.
 Or the raw video used for animal tracking might be stored in a standard video format on

@@ -163,8 +164,8 @@ This brings us to the final important question:

 ## How do I get my data out?

-This is the fun part. See [queries](query/operators.md) for details of the DataJoint 
-query language directly from MATLAB and Python.
+This is the fun part. See [queries](query/operators.md) for details of the DataJoint
+query language directly from Python.

 ## Interfaces

docs/src/internal/transpilation.md

Lines changed: 7 additions & 7 deletions
@@ -34,7 +34,7 @@ restriction appending the new condition to the input's restriction.

 Property `support` represents the `FROM` clause and contains a list of either
 `QueryExpression` objects or table names in the case of base queries.
-The joint operator `*` adds new elements to the `support` attribute.
+The join operator `*` adds new elements to the `support` attribute.

 At least one element must be present in `support`. Multiple elements in `support`
 indicate a join.

@@ -56,10 +56,10 @@ self: `heading`, `restriction`, and `support`.

 The input object is treated as a subquery in the following cases:

-1. A restriction is applied that uses alias attributes in the heading
-1. A projection uses an alias attribute to create a new alias attribute.
-1. A join is performed on an alias attribute.
-1. An Aggregation is used a restriction.
+1. A restriction is applied that uses alias attributes in the heading.
+2. A projection uses an alias attribute to create a new alias attribute.
+3. A join is performed on an alias attribute.
+4. An aggregation is used as a restriction.

 An error arises if

@@ -117,8 +117,8 @@ input — the *aggregated* query expression.
 The SQL equivalent of aggregation is

 1. the NATURAL LEFT JOIN of the two inputs.
-1. followed by a GROUP BY on the primary key arguments of the first input
-1. followed by a projection.
+2. followed by a GROUP BY on the primary key arguments of the first input
+3. followed by a projection.

 The projection works the same as `.proj` with respect to the first input.
 With respect to the second input, the projection part of aggregation allows only
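The three-step SQL equivalence in this hunk (NATURAL LEFT JOIN, then GROUP BY on the first input's primary key, then a projection) can be spelled out as literal SQL. The sketch below uses Python's stdlib sqlite3 with a toy `subject`/`session` schema invented for illustration; it is not SQL emitted by DataJoint's transpiler.

```python
import sqlite3

# The aggregation steps rendered as literal SQL: NATURAL LEFT JOIN the
# two inputs, GROUP BY the first input's primary key, then project.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE subject (subject_id INTEGER PRIMARY KEY);
    CREATE TABLE session (
        subject_id INTEGER,
        session_id INTEGER,
        PRIMARY KEY (subject_id, session_id)
    );
    INSERT INTO subject VALUES (1), (2);
    INSERT INTO session VALUES (1, 1), (1, 2);
""")
rows = db.execute("""
    SELECT subject.subject_id, COUNT(session.session_id) AS n_sessions
    FROM subject NATURAL LEFT JOIN session
    GROUP BY subject.subject_id
    ORDER BY subject.subject_id
""").fetchall()
print(rows)  # [(1, 2), (2, 0)]
```

The LEFT join is what keeps subject 2, which has no sessions, in the result with a count of zero; an inner join would silently drop it.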

docs/src/manipulation/transactions.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ interrupting the sequence of such operations halfway would leave the data in an
 state.
 While the sequence is in progress, other processes accessing the database will not see
 the partial results until the transaction is complete.
-The sequence make include [data queries](../query/principles.md) and
+The sequence may include [data queries](../query/principles.md) and
 [manipulations](index.md).

 In such cases, the sequence of operations may be enclosed in a transaction.
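The all-or-nothing behavior described here follows the standard database transaction pattern. The sketch below demonstrates it with Python's stdlib sqlite3 purely so the example is self-contained and the toy `session` table is invented for illustration; DataJoint itself wraps the same pattern around its MySQL connection object.

```python
import sqlite3

# All-or-nothing behavior: if any statement in the sequence fails, the
# whole sequence is rolled back and no partial results are committed.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE session (session_id INTEGER PRIMARY KEY, notes TEXT)")

def insert_sessions(rows):
    try:
        with db:  # opens a transaction: commit on success, rollback on error
            for session_id, notes in rows:
                db.execute("INSERT INTO session VALUES (?, ?)",
                           (session_id, notes))
    except sqlite3.IntegrityError:
        pass  # the failed sequence leaves no partial rows behind

# The second row violates the primary key, so the first row rolls back too.
insert_sessions([(1, "first"), (1, "duplicate key")])
count = db.execute("SELECT COUNT(*) FROM session").fetchone()[0]
print(count)  # 0
```

Other connections reading the table during the sequence would likewise never observe the intermediate state, which is exactly the isolation property the section relies on.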

docs/src/publish-data.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ The code and the data can be found at https://github.com/sinzlab/Sinz2018_NIPS

 ## Exporting into a collection of files

-Another option for publishing and archiving data is to export the data from the 
+Another option for publishing and archiving data is to export the data from the
 DataJoint pipeline into a collection of files.
 DataJoint provides features for exporting and importing sections of the pipeline.
 Several ongoing projects are implementing the capability to export from DataJoint
