Commit ba81cb8

Implement primary key rules for join operations
Add functional dependency-based PK determination for joins:
- A → B: PK = PK(A), A's attributes first
- B → A (not A → B): PK = PK(B), B's attributes first
- Neither: PK = union of both PKs

Key changes:
- Add Heading.determines() method to check A → B relationship
- Update Heading.join() to apply PK rules based on functional dependencies
- Add left join constraint requiring A → B (with allow_nullable_pk bypass)
- Update Aggregation.create() to validate group → groupby requirement
- Remove U.join() and rewrite U.aggr() to work without join
- Add pk-rules-spec.md with semantic matching integration

Tests: 509 passed (Python 3.12), 506 passed (Python 3.10)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
1 parent 19cde1c commit ba81cb8

5 files changed

+635
-47
lines changed

docs/src/design/pk-rules-spec.md

Lines changed: 276 additions & 0 deletions
@@ -0,0 +1,276 @@
1+
# Primary Key Rules in Relational Operators
2+
3+
In DataJoint, the result of each query operator produces a valid **entity set** with a well-defined **entity type** and **primary key**. This section specifies how the primary key is determined for each relational operator.
4+
5+
## General Principle
6+
7+
The primary key of a query result identifies unique entities in that result. For most operators, the primary key is preserved from the left operand. For joins, the primary key depends on the functional dependencies between the operands.
8+
9+
## Integration with Semantic Matching
10+
11+
Primary key determination is applied **after** semantic compatibility is verified. The evaluation order is:
12+
13+
1. **Semantic Check**: `assert_join_compatibility()` ensures all namesakes are homologous (same lineage)
14+
2. **PK Determination**: The "determines" relationship is computed using attribute names
15+
3. **Left Join Validation**: If `left=True`, verify A → B
16+
17+
This ordering is important because:
18+
- After semantic matching passes, namesakes represent semantically equivalent attributes
19+
- The name-based "determines" check is therefore semantically valid
20+
- Attribute names in the context of a semantically-valid join represent the same entity
21+
22+
The "determines" relationship uses attribute **names** (not lineages directly) because:
23+
- Lineage ensures namesakes are homologous
24+
- Once verified, checking by name is equivalent to checking by semantic identity
25+
- Aliased attributes (same lineage, different names) don't participate in natural joins anyway
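A minimal sketch of why this ordering matters, using plain dictionaries rather than DataJoint's actual classes. The lineage strings and both helper functions are hypothetical stand-ins (the first mimics what `assert_join_compatibility()` is described as verifying); only the idea of rejecting non-homologous namesakes before comparing by name comes from the text above.

```python
def check_namesakes_homologous(lineage_a: dict, lineage_b: dict) -> None:
    """Raise ValueError if a shared attribute name has different lineages."""
    clashes = sorted(
        name
        for name in lineage_a.keys() & lineage_b.keys()
        if lineage_a[name] != lineage_b[name]
    )
    if clashes:
        raise ValueError(f"Non-homologous namesakes: {clashes}")


def determines_by_name(attrs_a: set, pk_b: set) -> bool:
    """Name-based check; meaningful only after the semantic check has passed."""
    return pk_b <= attrs_a


# Hypothetical lineages, written as "table.attribute" strings.
session = {"session_id": "session.session_id", "date": "session.date"}
trial = {"session_id": "session.session_id", "trial_num": "trial.trial_num"}

check_namesakes_homologous(trial, session)  # step 1: passes silently
trial_determines_session = determines_by_name(set(trial), {"session_id"})  # step 2
```

If two operands shared the name `session_id` with different lineages, step 1 would raise before the name-based comparison could produce a misleading answer.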
26+
27+
## Notation
28+
29+
In the examples below, `*` marks primary key attributes:
30+
- `A(x*, y*, z)` means A has primary key `{x, y}` and secondary attribute `z`
31+
- `A → B` means "A determines B" (defined below)
32+
33+
### Rules by Operator
34+
35+
| Operator | Primary Key Rule |
36+
|----------|------------------|
37+
| `A & B` (restriction) | PK(A) — preserved from left operand |
38+
| `A - B` (anti-restriction) | PK(A) — preserved from left operand |
39+
| `A.proj(...)` (projection) | PK(A) — preserved from left operand |
40+
| `A.aggr(B, ...)` (aggregation) | PK(A) — preserved from left operand |
41+
| `A * B` (join) | Depends on functional dependencies (see below) |
42+
43+
### Join Primary Key Rule
44+
45+
The join operator requires special handling because it combines two entity sets. The primary key of `A * B` depends on the **functional dependency relationship** between the operands.
46+
47+
#### Definitions
48+
49+
**A determines B** (written `A → B`): Every attribute in PK(B) is in A.
50+
51+
```
52+
A → B iff ∀b ∈ PK(B): b ∈ A
53+
```
54+
55+
Since `PK(A) ∪ secondary(A) = all attributes in A`, this is equivalent to saying every attribute in B's primary key exists somewhere in A (as either a primary key or secondary attribute).
56+
57+
Intuitively, `A → B` means that knowing A's primary key is sufficient to determine B's primary key through the functional dependencies implied by A's structure.
58+
59+
**B determines A** (written `B → A`): Every attribute in PK(A) is in B.
60+
61+
```
62+
B → A iff ∀a ∈ PK(A): a ∈ B
63+
```
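Both definitions reduce to a subset test. The sketch below uses a hypothetical `Rel` dataclass (not DataJoint's `Heading`) that holds primary key and secondary attribute names, and a free function `determines` that is assumed for illustration rather than taken from the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:
    """Hypothetical stand-in for a heading: primary key plus secondary attributes."""

    pk: tuple
    secondary: tuple = ()

    @property
    def attributes(self) -> frozenset:
        return frozenset(self.pk) | frozenset(self.secondary)


def determines(a: Rel, b: Rel) -> bool:
    """A → B iff every attribute of PK(B) appears somewhere in A."""
    return set(b.pk) <= a.attributes


A = Rel(pk=("x", "y"))
B = Rel(pk=("x", "z"), secondary=("y",))

a_determines_b = determines(A, B)  # False: z is not an attribute of A
b_determines_a = determines(B, A)  # True: x and y are both attributes of B
```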
64+
65+
#### Join Primary Key Algorithm
66+
67+
For `A * B`:
68+
69+
| Condition | PK(A * B) | Attribute Order |
70+
|-----------|-----------|-----------------|
71+
| A → B | PK(A) | A's attributes first |
72+
| B → A (and not A → B) | PK(B) | B's attributes first |
73+
| Neither | PK(A) ∪ PK(B) | PK(A) first, then PK(B) − PK(A) |
74+
75+
When both `A → B` and `B → A` hold, the left operand takes precedence (use PK(A)).
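The table can be sketched as a small function over lists of attribute names. This is an illustration under the rules stated here, not the signature or body of `Heading.join()`; the name `join_heading` and its arguments are assumptions.

```python
def join_heading(pk_a, sec_a, pk_b, sec_b):
    """Return (primary_key, attribute_order) for A * B under the rules above.

    All arguments are lists of attribute names.
    """
    attrs_a, attrs_b = pk_a + sec_a, pk_b + sec_b
    a_det_b = set(pk_b) <= set(attrs_a)
    b_det_a = set(pk_a) <= set(attrs_b)

    if a_det_b:  # also covers the case where both directions hold
        pk, first, second = pk_a, attrs_a, attrs_b
    elif b_det_a:
        pk, first, second = pk_b, attrs_b, attrs_a
    else:
        pk = pk_a + [name for name in pk_b if name not in pk_a]
        first, second = attrs_a, attrs_b

    order = list(pk)
    for name in first + second:  # remaining attributes, each listed once
        if name not in order:
            order.append(name)
    return pk, order


# A(x*, y*) * B(x*, z*, y): only B → A holds, so B's key and attributes lead.
pk_ba, order_ba = join_heading(["x", "y"], [], ["x", "z"], ["y"])

# A(x*, y*) * B(z*, x): neither direction holds, so the keys are unioned.
pk_union, order_union = join_heading(["x", "y"], [], ["z"], ["x"])
```

Checking `a_det_b` first is what gives the left operand precedence when both directions hold.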
76+
77+
#### Examples
78+
79+
**Example 1: B → A**
80+
```
81+
A: x*, y*
82+
B: x*, z*, y (y is secondary in B, so {x, z} → y)
83+
```
84+
- A → B? PK(B) = {x, z}. Is z in PK(A) or secondary in A? No (z not in A). **No.**
85+
- B → A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? Yes (secondary). **Yes.**
86+
- Result: **PK(A * B) = {x, z}** with B's attributes first.
87+
88+
**Example 2: Both directions (bijection-like)**
89+
```
90+
A: x*, y*, z (z is secondary in A)
91+
B: y*, z*, x (x is secondary in B)
92+
```
93+
- A → B? PK(B) = {y, z}. Is z in PK(A) or secondary in A? Yes (secondary). **Yes.**
94+
- B → A? PK(A) = {x, y}. Is x in PK(B) or secondary in B? Yes (secondary). **Yes.**
95+
- Both hold, prefer left operand: **PK(A * B) = {x, y}** with A's attributes first.
96+
97+
**Example 3: Neither direction**
98+
```
99+
A: x*, y*
100+
B: z*, x (x is secondary in B)
101+
```
102+
- A → B? PK(B) = {z}. Is z in PK(A) or secondary in A? No. **No.**
103+
- B → A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? No (y not in B). **No.**
104+
- Result: **PK(A * B) = {x, y, z}** (union) with A's attributes first.
105+
106+
**Example 4: B → A (subordinate relationship)**
107+
```
108+
Session: session_id*
109+
Trial: session_id*, trial_num* (references Session)
110+
```
111+
- A → B? PK(Trial) = {session_id, trial_num}. Is trial_num in PK(Session) or secondary? No. **No.**
112+
- B → A? PK(Session) = {session_id}. Is session_id in PK(Trial)? Yes. **Yes.**
113+
- Result: **PK(Session * Trial) = {session_id, trial_num}** with Trial's attributes first.
114+
115+
**Join primary key determination** (cases to verify):
116+
- `A * B` where `A → B`: result has PK(A)
117+
- `A * B` where `B → A` (not `A → B`): result has PK(B), B's attributes first
118+
- `A * B` where both `A → B` and `B → A`: result has PK(A) (left preference)
119+
- `A * B` where neither direction: result has PK(A) ∪ PK(B)
120+
- Verify attribute ordering matches primary key source
121+
- Verify non-commutativity: `A * B` vs `B * A` may differ in PK and order
122+
123+
### Design Tradeoff: Predictability vs. Minimality
124+
125+
The join primary key rule prioritizes **predictability** over **minimality**. In some cases, the resulting primary key may not be minimal (i.e., it may contain functionally redundant attributes).
126+
127+
**Example of non-minimal result:**
128+
```
129+
A: x*, y*
130+
B: z*, x (x is secondary in B, so z → x)
131+
```
132+
133+
The mathematically minimal primary key for `A * B` would be `{y, z}` because:
134+
- `z → x` (from B's structure)
135+
- `{y, z} → {x, y, z}` (z gives us x, and we have y)
136+
137+
However, `{y, z}` is problematic:
138+
- It is **not the primary key of either operand** (A has `{x, y}`, B has `{z}`)
139+
- It is **not the union** of the primary keys
140+
- It represents a **novel entity type** that doesn't correspond to A, B, or their natural pairing
141+
142+
This creates confusion: what kind of entity does `{y, z}` identify?
143+
144+
**The simplified rule produces `{x, y, z}`** (the union), which:
145+
- Is immediately recognizable as "one A entity paired with one B entity"
146+
- Contains A's full primary key and B's full primary key
147+
- May have redundancy (`x` is determined by `z`) but is semantically clear
148+
149+
**Rationale:** Users can always project away redundant attributes if they need the minimal key. But starting with a predictable, interpretable primary key reduces confusion and errors.
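The redundancy claim can be checked with the standard attribute-closure computation from relational theory. The sketch below is illustrative only and not part of DataJoint; the single functional dependency `z → x` is read off B's structure in the example above, and the function name `closure` is an assumption.

```python
def closure(attrs, fds):
    """Return all attributes reachable from `attrs` under functional dependencies.

    `fds` is a list of (determinant, dependent) pairs of sets.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


# A(x*, y*) has no secondary attributes, so it adds no dependency;
# B(z*, x) contributes z → x.
fds = [({"z"}, {"x"})]
all_attrs = {"x", "y", "z"}

minimal_is_key = closure({"y", "z"}, fds) == all_attrs  # True: z supplies x
union_is_key = closure({"x", "y", "z"}, fds) == all_attrs  # True, with x redundant
y_alone_is_key = closure({"y"}, fds) == all_attrs  # False
```

Both `{y, z}` and `{x, y, z}` identify rows of the join; the rule picks the larger one because it maps directly onto the operands' own keys.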
150+
151+
### Attribute Ordering
152+
153+
The primary key attributes always appear **first** in the result's attribute list, followed by secondary attributes. When `B → A` (and not `A → B`), the join is conceptually reordered as `B * A` to maintain this invariant:
154+
155+
- If PK = PK(A): A's attributes appear first
156+
- If PK = PK(B): B's attributes appear first
157+
- If PK = PK(A) ∪ PK(B): PK(A) attributes first, then PK(B) − PK(A), then secondaries
158+
159+
### Non-Commutativity
160+
161+
With these rules, join is **not commutative** in terms of:
162+
1. **Primary key selection**: `A * B` may have a different PK than `B * A` when one direction determines but not the other
163+
2. **Attribute ordering**: The left operand's attributes appear first (unless B → A holds and A → B does not)
164+
165+
The **result set** (the actual rows returned) remains the same regardless of order, but the **schema** (primary key and attribute order) may differ.
166+
167+
### Left Join Constraint
168+
169+
For left joins (`A.join(B, left=True)`), the functional dependency **A → B is required**.
170+
171+
**Why this constraint exists:**
172+
173+
In a left join, all rows from A are retained even if there's no matching row in B. For unmatched rows, B's attributes are NULL. This creates a problem for primary key validity:
174+
175+
| Scenario | PK by inner join rule | Left join problem |
176+
|----------|----------------------|-------------------|
177+
| A → B | PK(A) | ✅ Safe — A's attrs always present |
178+
| B → A | PK(B) | ❌ B's PK attrs could be NULL |
179+
| Neither | PK(A) ∪ PK(B) | ❌ B's PK attrs could be NULL |
180+
181+
**Example of invalid left join:**
182+
```
183+
A: x*, y* PK(A) = {x, y}
184+
B: x*, z*, y PK(B) = {x, z}, y is secondary
185+
186+
Inner join: PK = {x, z} (B → A rule)
187+
Left join attempt: FAILS because z could be NULL for unmatched A rows
188+
```
189+
190+
**Valid left join example:**
191+
```
192+
Session: session_id*, date
193+
Trial: session_id*, trial_num*, stimulus (references Session)
194+
195+
Trial.join(Session, left=True) # OK: Trial → Session (session_id is in Trial)
196+
# PK = {session_id, trial_num}, all trials retained
197+
```
198+
199+
**Error message:**
200+
```
201+
DataJointError: Left join requires the left operand to determine the right operand (A → B).
202+
The following attributes from the right operand's primary key are not determined by
203+
the left operand: ['z']. Use an inner join or restructure the query.
204+
```
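A minimal sketch of the check behind this error, assuming attribute names are held in plain lists. `LeftJoinError` is a hypothetical stand-in for `DataJointError`, and `validate_left_join` is not a DataJoint function; the `allow_nullable_pk` flag mirrors the parameter named in this spec.

```python
class LeftJoinError(Exception):
    """Hypothetical stand-in for DataJointError in this sketch."""


def validate_left_join(attrs_a, pk_b, allow_nullable_pk=False):
    """Raise unless A → B, i.e. every attribute of PK(B) is present in A.

    Returns the list of PK(B) attributes that A does not determine.
    """
    missing = [name for name in pk_b if name not in attrs_a]
    if missing and not allow_nullable_pk:
        raise LeftJoinError(
            "Left join requires the left operand to determine the right operand "
            "(A → B). The following attributes from the right operand's primary "
            f"key are not determined by the left operand: {missing}. "
            "Use an inner join or restructure the query."
        )
    return missing


# A(x*, y*) left-joined with B(x*, z*, y): z is not in A, so the check fails.
try:
    validate_left_join(["x", "y"], ["x", "z"])
except LeftJoinError as err:
    message = str(err)
```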
205+
206+
### Bypassing the Left Join Constraint
207+
208+
For special cases where the user takes responsibility for handling the potentially nullable primary key, the constraint can be bypassed using `allow_nullable_pk=True`:
209+
210+
```python
211+
# Normally blocked - A does not determine B
212+
A.join(B, left=True) # Error: A → B not satisfied
213+
214+
# Bypass the constraint - user takes responsibility
215+
A.join(B, left=True, allow_nullable_pk=True) # Allowed, PK = PK(A) ∪ PK(B)
216+
```
217+
218+
When bypassed, the resulting primary key is the union of both operands' primary keys (PK(A) ∪ PK(B)). The user must ensure that subsequent operations (such as `GROUP BY` or projection) establish a valid primary key. The parameter name `allow_nullable_pk` reflects the specific issue: primary key attributes from the right operand could be NULL for unmatched rows.
219+
220+
This mechanism is used internally by aggregation (`aggr`) with `keep_all_rows=True`, which resets the primary key via the `GROUP BY` clause.
221+
222+
### Aggregation Exception
223+
224+
`A.aggr(B, keep_all_rows=True)` uses a left join internally but has the **opposite requirement**: **B → A** (the group expression B must have all of A's primary key attributes).
225+
226+
This apparent contradiction is resolved by the `GROUP BY` clause:
227+
228+
1. Aggregation requires B → A so that B can be grouped by A's primary key
229+
2. The intermediate left join `A LEFT JOIN B` would have an invalid PK under the normal left join rules
230+
3. Aggregation internally allows the invalid PK, producing PK(A) ∪ PK(B)
231+
4. The `GROUP BY PK(A)` clause then **resets** the primary key to PK(A)
232+
5. The final result has PK(A), which consists entirely of non-NULL values from A
233+
234+
Note: The semantic check (homologous namesake validation) is still performed for aggregation's internal join. Only the primary key validity constraint is bypassed.
235+
236+
**Example:**
237+
```
238+
Session: session_id*, date
239+
Trial: session_id*, trial_num*, response_time (references Session)
240+
241+
# Aggregation with keep_all_rows=True
242+
Session.aggr(Trial, keep_all_rows=True, avg_rt='avg(response_time)')
243+
244+
# Internally: Session LEFT JOIN Trial (with invalid PK allowed)
245+
# Intermediate PK would be {session_id} ∪ {session_id, trial_num} = {session_id, trial_num}
246+
# But GROUP BY session_id resets PK to {session_id}
247+
# Result: All sessions, with avg_rt=NULL for sessions without trials
248+
```
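The effect of the `GROUP BY` reset can be reproduced with Python's standard `sqlite3` module. The table definitions, sample rows, and SQL below are assumptions chosen to mirror the example; they are not the SQL that DataJoint generates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE session (session_id INTEGER PRIMARY KEY, date TEXT);
    CREATE TABLE trial (
        session_id INTEGER REFERENCES session(session_id),
        trial_num INTEGER,
        response_time REAL,
        PRIMARY KEY (session_id, trial_num)
    );
    INSERT INTO session VALUES (1, '2024-01-01'), (2, '2024-01-02');
    INSERT INTO trial VALUES (1, 1, 1.0), (1, 2, 3.0);
    """
)

# The intermediate left join has a NULL trial_num for session 2,
# so {session_id, trial_num} is not a valid primary key here.
joined = con.execute(
    "SELECT s.session_id, t.trial_num FROM session s "
    "LEFT JOIN trial t ON s.session_id = t.session_id "
    "ORDER BY s.session_id, t.trial_num"
).fetchall()

# Grouping by PK(Session) yields exactly one row per session.
rows = con.execute(
    "SELECT s.session_id, AVG(t.response_time) AS avg_rt FROM session s "
    "LEFT JOIN trial t ON s.session_id = t.session_id "
    "GROUP BY s.session_id ORDER BY s.session_id"
).fetchall()
con.close()
```

Here `joined` contains a row for session 2 with `trial_num` set to `None`, while `rows` has one row per session and `avg_rt` is `None` for the session without trials.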
249+
250+
## Universal Set `dj.U`
251+
252+
`dj.U()` or `dj.U('attr1', 'attr2', ...)` represents the universal set of all possible values and lineages.
253+
254+
### Homology with `dj.U`
255+
Since `dj.U` conceptually contains all possible lineages, its attributes are **homologous to any namesake attribute** in other expressions.
256+
257+
### Valid Operations
258+
259+
```python
260+
# Restriction: promotes a, b to PK; lineage transferred from A
261+
dj.U('a', 'b') & A
262+
263+
# Aggregation: groups by a, b
264+
dj.U('a', 'b').aggr(A, count='count(*)')
265+
```
266+
267+
### Invalid Operations
268+
269+
```python
270+
# Anti-restriction: produces infinite set
271+
dj.U('a', 'b') - A # DataJointError
272+
273+
# Join: deprecated, use & instead
274+
dj.U('a', 'b') * A # DataJointError with migration guidance
275+
```
276+
