
Commit 9888c14

Merge pull request #7 from dimitri-yatsenko/main
Harmonize sections
2 parents: f1daf82 + f1792e9

29 files changed: +2352 −3581 lines

SIMPLIFICATION_RECOMMENDATIONS.md

Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
# Recommendations for Simplifying Main Text Examples

This report identifies opportunities to simplify examples in the main text by referencing comprehensive examples in the `book/80-examples/` section.

## Executive Summary

After reviewing the main text chapters and the examples section, I identified several opportunities for simplification. However, many examples in the main text serve specific pedagogical purposes and are intentionally minimal to focus on particular concepts. The recommendations below balance simplification with pedagogical effectiveness.

## Examples Section Inventory

| Notebook | Domain | Key Features |
|----------|--------|--------------|
| `015-university.ipynb` | Academic administration | Complete schema with Students, Courses, Departments, Terms, Enrollments, Grades; synthetic data generation |
| `016-university-queries.ipynb` | Query patterns | Comprehensive query examples: restriction, joins, aggregation, universal sets |
| `010-classic-sales.ipynb` | E-commerce | MySQL sample database; workflow-centric business operations |
| `070-fractals.ipynb` | Computational pipeline | Table tiers (Manual, Lookup, Computed), populate mechanics, image processing |
| `075-blob-detection.ipynb` | Image analysis | Master-part relationships, parameter sweeps, computational workflows |

---

## Recommendation 1: Queries Chapter - Reference University Queries

**File**: `book/50-queries/020-restriction.ipynb`

**Current State**: Creates a standalone languages/fluency database example to demonstrate restriction patterns.

**Opportunity**: The restriction chapter could be simplified by:

1. Keeping the concise language/fluency example for basic concepts
2. Adding a cross-reference note at the end directing readers to `016-university-queries.ipynb` for more comprehensive query patterns

**Suggested Addition** (at end of chapter):

```markdown
## Further Practice

For comprehensive query examples covering all patterns discussed here,
see the [University Queries](../80-examples/016-university-queries.ipynb) example,
which demonstrates these concepts on a realistic academic database.
```

**Impact**: Low - additive, doesn't require removing existing content

---

## Recommendation 2: Relationships Chapter - Reference Classic Sales

**File**: `book/30-database-design/050-relationships.ipynb`

**Current State**: Creates 12 bank schemas (bank1-12) to demonstrate relationship patterns incrementally.

**Analysis**: The bank examples are intentionally minimal and incremental, which is pedagogically valuable. Each schema builds on the previous to illustrate specific cardinality concepts.

**Opportunity**: Add a cross-reference after the core patterns are established:

**Suggested Addition** (after the "Many-to-Many" section):

```markdown
:::{tip}
For a complete business database demonstrating these relationship patterns
in a realistic context, see the [Classic Sales](../80-examples/010-classic-sales.ipynb)
example, which models offices, employees, customers, orders, and products
as an integrated workflow.
:::
```

**Impact**: Low - additive only

---

## Recommendation 3: Master-Part Chapter - Reference Blob Detection

**File**: `book/30-database-design/053-master-part.ipynb`

**Current State**: Uses a polygon/vertex example for master-part relationships.

**Analysis**: The polygon/vertex example is appropriately minimal for introducing the concept. The chapter already mentions computational workflows.

**Opportunity**: Add a practical cross-reference:

**Suggested Addition** (in the "Master-Part in Computations" section):

```markdown
For a complete computational example demonstrating master-part relationships
in an image analysis pipeline, see the [Blob Detection](../80-examples/075-blob-detection.ipynb)
example, where `Detection` (master) and `Detection.Blob` (part) capture
aggregate results and per-feature details atomically.
```

**Impact**: Low - enhances existing content

---

## Recommendation 4: Computation Chapter - Already Well Cross-Referenced

**File**: `book/60-computation/010-computation.ipynb`

**Current State**: Already references `075-blob-detection.ipynb` extensively as a case study.

**Analysis**: This chapter demonstrates best practice - it explains concepts briefly and directs readers to the comprehensive example for implementation details.

**Recommendation**: No changes needed. This is a model for other chapters.

---

## Recommendation 5: Normalization Chapter - Potential for E-commerce Simplification

**File**: `book/30-database-design/055-normalization.ipynb`

**Current State**: Contains an extensive E-commerce Order Processing example (Order → Payment → Shipment → Delivery → DeliveryConfirmation) spanning ~100 lines.

**Analysis**: This example is integral to explaining workflow normalization principles. It demonstrates how traditional normalization approaches differ from workflow normalization.

**Opportunity**: Consider adding a reference to classic-sales after the e-commerce discussion:

**Suggested Addition**:

```markdown
:::{seealso}
The [Classic Sales](../80-examples/010-classic-sales.ipynb) example demonstrates
these workflow normalization principles in a complete business database with
offices, employees, customers, orders, and products.
:::
```

**Impact**: Low - additive only

---

## Recommendation 6: Concepts Chapter - Reference Fractals Example

**File**: `book/20-concepts/04-workflows.md`

**Current State**: Explains Relational Workflow Model concepts theoretically.

**Opportunity**: Add a reference to a practical implementation:

**Suggested Addition** (after the "Table Tiers: Workflow Roles" section):

```markdown
:::{tip}
For a hands-on demonstration of all table tiers working together in a
computational pipeline, see the [Julia Fractals](../80-examples/070-fractals.ipynb)
example, which shows Manual tables for experimental parameters, Lookup tables
for reference data, and Computed tables for derived results.
:::
```

**Impact**: Low - connects theory to practice

---

## Not Recommended for Simplification

### Bank Examples (050-relationships.ipynb)
The 12 bank schemas serve a clear pedagogical purpose: demonstrating relationship patterns incrementally. Replacing them with references would lose the step-by-step learning progression.

### Language/Fluency Examples (020-restriction.ipynb)
These are appropriately minimal for teaching restriction concepts. The university queries example is more complex and would overwhelm the focused explanation.

### Mouse/Cage Examples (055-normalization.ipynb)
These examples are tightly integrated with the normalization discussion and demonstrate the specific points about workflow normalization vs. entity normalization.

### Polygon/Vertex Example (053-master-part.ipynb)
This minimal example is ideal for introducing master-part concepts without distraction.

---

## Implementation Priority

| Priority | Recommendation | Effort | Impact |
|----------|---------------|--------|--------|
| 1 | Add blob-detection reference to master-part chapter | Low | High - connects concepts to practical example |
| 2 | Add fractals reference to concepts chapter | Low | Medium - connects theory to practice |
| 3 | Add university-queries reference to restriction chapter | Low | Medium - provides comprehensive practice |
| 4 | Add classic-sales reference to relationships chapter | Low | Low - supplementary |
| 5 | Add classic-sales reference to normalization chapter | Low | Low - supplementary |

---

## Conclusion

The main text examples are generally well-designed for their pedagogical purposes. The primary opportunity is to **add cross-references** to comprehensive examples rather than remove existing content. This approach:

1. Preserves the focused, incremental learning in main text chapters
2. Directs motivated readers to comprehensive examples for deeper exploration
3. Demonstrates how concepts apply in realistic, complete systems
4. Reduces duplication of effort for readers who explore multiple chapters

The computation chapter (`010-computation.ipynb`) already exemplifies best practice by referencing `075-blob-detection.ipynb` as a case study rather than duplicating the full implementation.

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
---
title: Executive Summary
subtitle: For Data Architects and Technical Leaders
---

## The Core Problem

Scientific and engineering organizations face a fundamental challenge: as data volumes grow and analyses become more complex, traditional approaches break down. File-based workflows become unmaintainable. Metadata gets separated from the data it describes. Computational provenance is lost. Teams duplicate effort because they cannot discover or trust each other's work. Reproducing results requires archaeological expeditions through old scripts and folder structures.

Standard database solutions address storage and querying but not computation. Data warehouses and lakes handle scale but not scientific workflows. Workflow engines (Airflow, Luigi, Snakemake) manage task orchestration but lack the data-model rigor needed for complex analytical dependencies. The result is a patchwork of tools that don't integrate cleanly, requiring custom glue code that itself becomes a maintenance burden.

## The DataJoint Solution

**DataJoint introduces the Relational Workflow Model**—an extension of classical relational theory that treats computational transformations as first-class citizens of the data model. The database schema becomes an executable specification: it defines not just what data exists, but how data flows through the pipeline and when computations should run.

This creates what we call a **Computational Database**: a system where inserting new raw data automatically triggers all downstream analyses in dependency order, maintaining computational validity throughout. Think of it as a spreadsheet that auto-recalculates, but with the rigor of a relational database and the scale of distributed computing.

### Key Differentiators

**Unified Design and Implementation**
Unlike Entity-Relationship modeling, which requires translation to SQL, DataJoint schemas are directly executable. The diagram *is* the implementation. Schema changes propagate immediately. Documentation cannot drift from reality because the schema is the documentation.

**Workflow-Aware Foreign Keys**
Foreign keys in DataJoint do more than enforce referential integrity—they encode computational dependencies. A computed result that references raw data will be automatically deleted if that raw data is removed, preventing stale or orphaned results. This maintains *computational validity*, not just *referential integrity*.
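
As a minimal sketch of what this looks like in practice (the schema name, tables, and attributes here are hypothetical, and a configured DataJoint connection is assumed):

```python
import datajoint as dj

schema = dj.Schema('demo_dependencies')  # hypothetical schema name

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int
    ---
    recorded_at : datetime
    """

@schema
class Segment(dj.Manual):
    definition = """
    -> Recording      # foreign key: referential integrity plus workflow dependency
    segment : int
    ---
    duration : float  # (s)
    """

# Deleting upstream data cascades to everything that depends on it,
# so no Segment row can outlive the Recording it came from:
(Recording & {'recording_id': 1}).delete()
```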

**Declarative Computation**
Computations are defined declaratively through `make()` methods attached to table definitions. The `populate()` operation identifies all missing results and executes computations in dependency order. Parallelization, error handling, and job distribution are handled automatically.
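
A minimal sketch of the pattern, with hypothetical table and attribute names:

```python
import datajoint as dj
import numpy as np

schema = dj.Schema('demo_compute')  # hypothetical schema name

@schema
class Trace(dj.Manual):
    definition = """
    trace_id : int
    ---
    trace : longblob   # raw signal
    """

@schema
class TraceStats(dj.Computed):
    definition = """
    -> Trace
    ---
    mean  : float
    stdev : float
    """

    def make(self, key):
        # Fetch the input identified by this key, compute, insert the result.
        trace = (Trace & key).fetch1('trace')
        self.insert1(dict(key, mean=float(np.mean(trace)), stdev=float(np.std(trace))))

# Compute every missing TraceStats entry in dependency order;
# reserve_jobs lets multiple workers safely share the work queue.
TraceStats.populate(reserve_jobs=True)
```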

**Immutability by Design**
Computed results are immutable. Correcting upstream data requires deleting dependent results and recomputing—ensuring the database always represents a consistent computational state. This naturally provides complete provenance: every result can be traced to its source data and the exact code that produced it.

**Hybrid Storage Model**
Structured metadata lives in the relational database (MySQL/PostgreSQL). Large binary objects (images, recordings, arrays) live in scalable object storage (S3, GCS, filesystem) with the database maintaining the mapping. Queries operate on metadata; computation accesses objects transparently.
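
As a rough sketch of how a store might be declared in the Python client (the store name, bucket, and credentials are placeholders):

```python
import datajoint as dj

# Register an object store: structured metadata stays in MySQL/PostgreSQL,
# while attributes declared as blob@shared are written to object storage.
dj.config['stores'] = {
    'shared': {                       # hypothetical store name
        'protocol': 's3',
        'endpoint': 's3.amazonaws.com',
        'bucket': 'my-lab-data',      # placeholder bucket
        'location': 'datajoint/blobs',
        'access_key': '...',          # fill in real credentials
        'secret_key': '...',
    }
}

schema = dj.Schema('demo_storage')    # hypothetical schema name

@schema
class Movie(dj.Manual):
    definition = """
    movie_id : int
    ---
    frames : blob@shared     # large array kept in the object store
    notes  : varchar(255)    # ordinary attribute kept in the database
    """
```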

## Architecture Overview

The **DataJoint Platform** implements this model through a layered architecture:

```{figure} ../images/Platform.png
:name: platform-architecture
:align: center
:width: 80%

The DataJoint Platform architecture: an open-source core (relational database, code repository, object store) surrounded by functional extensions for interactions, infrastructure, automation, and orchestration.
```

**Open-Source Core**
- Relational database (MySQL/PostgreSQL) as the system of record
- Code repository (Git) containing schema definitions and compute methods
- Object store for large data with structured key naming

**Functional Extensions**
- *Interactions*: Pipeline navigator, electronic lab notebook integration, visualization dashboards
- *Infrastructure*: Security, deployment automation, compute resource management
- *Automation*: Automated population, job orchestration, AI-assisted development
- *Orchestration*: Data ingest, cross-team collaboration, DOI-based publishing

The core is fully open source. Organizations can build DIY solutions or use managed platform services depending on their needs.

## What This Book Covers

This book provides comprehensive coverage of DataJoint from foundations through advanced applications:

**Part I: Concepts**
- Database fundamentals and why they matter for scientific work
- Data models: schema-on-write vs. schema-on-read, and why schemas enable mathematical guarantees
- Relational theory: the 150-year mathematical foundation from De Morgan through Codd
- The Relational Workflow Model: DataJoint's extension treating computation as first-class
- Scientific data pipelines: complete systems integrating database, compute, and collaboration

**Part II: Design**
- Schema design principles and table definitions
- Primary keys, foreign keys, and dependency structures
- Master-part relationships for hierarchical data
- Normalization through the lens of workflow entities
- Schema evolution and migration strategies

**Part III: Operations**
- Data insertion, deletion, and transaction handling
- Caching strategies for performance optimization

**Part IV: Queries**
- DataJoint's five-operator query algebra: restriction, projection, join, aggregation, union (see the sketch after this list)
- Comparison with SQL and when to use each
- Complex query patterns and optimization

**Part V: Computation**
- The `make()` method pattern for automated computation
- Parallel execution and distributed computing
- Error handling and resumable computation

**Part VI: Interfaces and Integration**
- Python and MATLAB APIs
- Web interfaces and visualization tools
- Integration with existing data systems

**Part VII: Examples and Exercises**
- Complete worked examples from neuroscience, imaging, and other domains
- Hands-on exercises for each major concept
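
To give a flavor of the query algebra named under Part IV, here is a sketch of the five operators applied to two hypothetical tables, `Student` and `Enrollment`:

```python
# Assumes Student and Enrollment are tables in a DataJoint schema.

texans   = Student & 'home_state = "TX"'            # restriction: rows matching a condition
names    = Student.proj('first_name', 'last_name')  # projection: keep or rename attributes
enrolled = Student * Enrollment                     # join: combine on shared attributes
counts   = Student.aggr(Enrollment, n='count(*)')   # aggregation: summarize matching rows
either   = texans + (Student & 'home_state = "CA"') # union: combine compatible results
```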

## Who Should Use DataJoint

DataJoint is designed for organizations where:

- **Data has structure**: Experiments, subjects, sessions, trials, measurements—your domain has natural entities and relationships
- **Analysis has dependencies**: Results depend on intermediate computations that depend on raw data
- **Reproducibility matters**: You need to trace any result back to its source data and methodology
- **Teams collaborate**: Multiple people work with shared data and build on each other's analyses
- **Scale is growing**: What worked for one researcher doesn't work for a team; what worked for one project doesn't work for ten

DataJoint is used in over a hundred neuroscience labs worldwide, supporting projects of varying sizes and complexity—from single-investigator studies to large multi-site collaborations. It handles multimodal data spanning neurophysiology, imaging, behavior, sequencing, and machine learning, scaling from gigabytes to petabytes while maintaining the same rigor.

## Getting Started

The **Concepts** section builds the theoretical foundation. If you prefer to learn by doing, the hands-on tutorial in **Relational Practice** provides immediate experience with a working database. The **Design** section then covers practical schema construction.

The [Blob Detection example](../80-examples/075-blob-detection.ipynb) demonstrates a complete image processing pipeline with all table tiers (Manual, Lookup, Imported, Computed) working together, providing a concrete reference implementation.

The [DataJoint Specs 2.0](../95-reference/SPECS_2_0.md) provides the formal specification for those requiring precise technical definitions.

To evaluate DataJoint for your organization, visit [datajoint.com](https://datajoint.com) to subscribe to a pilot project and experience the platform firsthand with guided support.

book/00-introduction/20-prerequisites.md

Lines changed: 6 additions & 4 deletions
@@ -1,8 +1,8 @@
-# Prerequisites and Essential Skills
+# Prerequisites
 
-This book teaches DataJoint and SQL for scientific data workflows. To get the most out of this course, you should be comfortable with a set of tools that form the bedrock of modern data science.
-While we will focus on database principles, we assume a working knowledge of the following.
-If you're new to these, we highly recommend exploring MIT's ["The Missing Semester of Your CS Education"](https://missing.csail.mit.edu/) to get up to speed.
+This book teaches the concept of relational data workflows in DataJoint.
+We provide some equivalent SQL for reference, but SQL knowledge is not required.
+To get the most out of this course, you should be comfortable with the following tools.
 
 ### Command-Line Proficiency
 
@@ -19,3 +19,5 @@ In collaborative science and software, version control is non-negotiable. We exp
 ### Jupyter Notebooks
 
 This textbook itself is built using Jupyter. You should know how to launch, navigate, and run code within Jupyter Notebooks or JupyterLab. The concept of "literate programming"—mixing executable code, text, and results—is central to reproducible science.
+
+(If you're new to these tools, MIT's ["The Missing Semester of Your CS Education"](https://missing.csail.mit.edu/) is an excellent resource.)
