Commit 9110515

Remove FK constraints from jobs tables for performance
- Jobs tables have matching primary key structure but no FK constraints
- Stale jobs (from deleted upstream records) handled by refresh()
- Added created_time field for stale detection
- refresh() now returns {added, removed} counts
- Updated rationale sections to reflect performance-focused design
1 parent df94fcc commit 9110515

1 file changed: +49 -22 lines changed


docs/src/design/autopopulate-2.0-spec.md

Lines changed: 49 additions & 22 deletions
@@ -30,9 +30,9 @@ The existing `~jobs` table has significant limitations:
3030
1. **Foreign-key-only primary keys**: Auto-populated tables cannot introduce new primary key attributes; their primary key must comprise only foreign key references
3131
2. **Per-table jobs**: Each computed table gets its own hidden jobs table
3232
3. **Native primary keys**: Jobs table uses the same primary key structure as its parent table (no hashes)
33-
4. **Referential integrity**: Jobs are foreign-key linked to parent tables with cascading deletes
33+
4. **No FK constraints on jobs**: Jobs tables omit foreign key constraints for performance; stale jobs are cleaned up by `refresh()`
3434
5. **Rich status tracking**: Extended status values for full lifecycle visibility
35-
6. **Automatic refresh**: `populate()` automatically refreshes the jobs queue
35+
6. **Automatic refresh**: `populate()` automatically refreshes the jobs queue (adding new jobs, removing stale ones)
3636

3737
### Primary Key Constraint
3838

@@ -84,12 +84,13 @@ Each `dj.Imported` or `dj.Computed` table `MyTable` will have an associated hidd
8484

8585
```
8686
# Job queue for MyTable
87-
-> ParentTable1
88-
-> ParentTable2
89-
... # Same primary key structure as MyTable
87+
subject_id : int
88+
session_id : int
89+
... # Same primary key attributes as MyTable (NO foreign key constraints)
9090
---
9191
status : enum('pending', 'reserved', 'success', 'error', 'ignore')
9292
priority : int # Higher priority = processed first (default: 0)
93+
created_time : datetime # When job was added to queue
9394
scheduled_time : datetime # Process on or after this time (default: now)
9495
reserved_time : datetime # When job was reserved (null if not reserved)
9596
completed_time : datetime # When job completed (null if not completed)
@@ -103,6 +104,11 @@ connection_id : bigint unsigned # MySQL connection ID
103104
version : varchar(255) # Code version (git hash, package version, etc.)
104105
```
105106

107+
**Important**: The jobs table has the same primary key *structure* as the target table but **no foreign key constraints**. This is intentional for performance:
108+
- Foreign key constraints add overhead on every insert/update/delete
109+
- Jobs tables are high-traffic (frequent reservations and completions)
110+
- Stale jobs (referencing deleted upstream records) are handled by `refresh()` instead
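
The trade-off in the list above can be illustrated with a small, self-contained sketch. It uses SQLite from the Python standard library as a stand-in for MySQL, and the table names (`image`, `filtered_image_jobs`) are hypothetical; it is not the DataJoint implementation. It only shows that, without a foreign key, deleting an upstream row leaves the job row behind until a batch cleanup removes it.

```python
import sqlite3

# In-memory database; SQLite is only a stand-in for the real MySQL backend.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE image (image_id INTEGER PRIMARY KEY);
    -- Hypothetical jobs table: same primary key attribute, no REFERENCES clause.
    CREATE TABLE filtered_image_jobs (
        image_id INTEGER PRIMARY KEY,
        status   TEXT NOT NULL DEFAULT 'pending'
    );
    INSERT INTO image VALUES (1), (2);
    INSERT INTO filtered_image_jobs (image_id) VALUES (1), (2);
""")

# Deleting an upstream record does not cascade, so job 2 becomes stale.
conn.execute("DELETE FROM image WHERE image_id = 2")
stale = conn.execute(
    "SELECT image_id FROM filtered_image_jobs "
    "WHERE image_id NOT IN (SELECT image_id FROM image)"
).fetchall()

# A batch cleanup, in the spirit of refresh(), removes the orphaned pending jobs.
removed = conn.execute(
    "DELETE FROM filtered_image_jobs "
    "WHERE status = 'pending' "
    "AND image_id NOT IN (SELECT image_id FROM image)"
).rowcount

remaining = [
    row[0]
    for row in conn.execute(
        "SELECT image_id FROM filtered_image_jobs ORDER BY image_id"
    )
]
```

After the cleanup, only the job whose upstream record still exists is left in the queue.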
111+
106112
### Access Pattern
107113

108114
Jobs are accessed as a property of the computed table:
@@ -166,15 +172,23 @@ class JobsTable(Table):
166172
"""Dynamically generated based on parent table's primary key."""
167173
...
168174

169-
def refresh(self, *restrictions) -> int:
175+
def refresh(self, *restrictions, stale_timeout: float | None = None) -> dict:
170176
"""
171-
Refresh the jobs queue by scanning for missing entries.
177+
Refresh the jobs queue: add new jobs and remove stale ones.
178+
179+
Operations performed:
180+
1. Add new jobs: (key_source & restrictions) - target - jobs → insert as 'pending'
181+
2. Remove stale jobs: pending jobs older than stale_timeout whose keys
182+
are no longer in key_source (upstream records were deleted)
172183
173-
Computes: (key_source & restrictions) - target - jobs
174-
Inserts new entries with status='pending'.
184+
Args:
185+
restrictions: Conditions to filter key_source
186+
stale_timeout: Seconds after which pending jobs are checked for staleness.
187+
Jobs older than this are removed if their key is no longer
188+
in key_source. Default from config: jobs.stale_timeout (3600s)
175189
176190
Returns:
177-
Number of new jobs added to queue.
191+
{'added': int, 'removed': int} - counts of jobs added and stale jobs removed
178192
"""
179193
...
180194

@@ -335,9 +349,9 @@ Jobs tables follow the existing hidden table naming pattern:
335349
- Table `FilteredImage` (stored as `__filtered_image`)
336350
- Jobs table: `~filtered_image__jobs` (stored as `_filtered_image__jobs`)
337351

338-
### Referential Integrity
352+
### Primary Key Matching (No Foreign Keys)
339353

340-
The jobs table references the same parent tables as the computed table:
354+
The jobs table has the same primary key *attributes* as the target table, but **without foreign key constraints**:
341355

342356
```python
343357
# If FilteredImage has definition:
@@ -349,18 +363,31 @@ class FilteredImage(dj.Computed):
349363
filtered_image : <djblob>
350364
"""
351365

352-
# The jobs table will have:
353-
# -> Image (same foreign key reference)
354-
# This ensures cascading deletes work correctly
366+
# The jobs table will have the same primary key (image_id),
367+
# but NO foreign key constraint to Image.
368+
# This is for performance - FK constraints add overhead.
355369
```
356370

357-
### Cascading Behavior
371+
### Stale Job Handling
358372

359-
When a parent record is deleted:
360-
1. The corresponding computed table record is deleted (existing behavior)
361-
2. The corresponding jobs table record is also deleted (new behavior)
373+
When upstream records are deleted, their corresponding jobs become "stale" (orphaned). Since there are no FK constraints, these jobs remain in the table until cleaned up:
374+
375+
```python
376+
# refresh() handles stale jobs automatically
377+
result = FilteredImage.jobs.refresh()
378+
# Returns: {'added': 10, 'removed': 3} # 3 stale jobs cleaned up
379+
380+
# Stale detection logic:
381+
# 1. Find pending jobs where created_time < (now - stale_timeout)
382+
# 2. Check if their keys still exist in key_source
383+
# 3. Remove jobs whose keys no longer exist
384+
```
362385

363-
This prevents orphaned job records.
386+
**Why not use foreign key cascading deletes?**
387+
- FK constraints add overhead on every insert/update/delete operation
388+
- Jobs tables are high-traffic (frequent reservations and status updates)
389+
- Stale jobs are harmless until the next refresh; they simply won't match `key_source`
390+
- The `refresh()` approach is more efficient for batch cleanup
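
The two operations that `refresh()` is described as performing could be sketched in plain Python as follows. This is a minimal model under stated assumptions, not the spec's implementation: `refresh_sketch` is a hypothetical name, keys are plain integers instead of DataJoint primary-key dicts, and `key_source`, `target`, and the jobs queue are ordinary sets and dicts instead of query expressions. Restrictions are assumed to have been applied to `key_source` already.

```python
from datetime import datetime, timedelta


def refresh_sketch(key_source, target, jobs, now, stale_timeout=3600.0):
    """Model of refresh(): add missing jobs, then drop stale pending ones.

    key_source, target: sets of keys (hypothetical simplification).
    jobs: dict mapping key -> {"status": str, "created_time": datetime};
          mutated in place.
    """
    # 1. Add new jobs: key_source - target - jobs, inserted as 'pending'.
    new_keys = key_source - target - set(jobs)
    for key in new_keys:
        jobs[key] = {"status": "pending", "created_time": now}

    # 2. Remove stale jobs: pending, older than stale_timeout,
    #    and no longer present in key_source.
    cutoff = now - timedelta(seconds=stale_timeout)
    stale_keys = [
        key
        for key, job in jobs.items()
        if job["status"] == "pending"
        and job["created_time"] < cutoff
        and key not in key_source
    ]
    for key in stale_keys:
        del jobs[key]

    return {"added": len(new_keys), "removed": len(stale_keys)}


now = datetime(2024, 1, 1, 12, 0, 0)
jobs = {
    2: {"status": "pending", "created_time": now},
    8: {"status": "pending", "created_time": now - timedelta(minutes=10)},
    9: {"status": "pending", "created_time": now - timedelta(hours=2)},
}
result = refresh_sketch(key_source={1, 2, 3}, target={1}, jobs=jobs, now=now)
```

In this run, key 3 is added, key 9 is removed as stale, and key 8 is kept even though it is missing from `key_source`, because it is still younger than `stale_timeout`.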
364391

365392
### Migration from Current System
366393

@@ -557,7 +584,7 @@ Per-table jobs tables provide:
557584
1. **Better isolation**: Jobs for one table don't affect others
558585
2. **Simpler queries**: No need to filter by table_name
559586
3. **Native keys**: Primary keys are readable, not hashed
560-
4. **Referential integrity**: Automatic cleanup via foreign keys
587+
4. **High performance**: No FK constraints means minimal overhead on job operations
561588
5. **Scalability**: Each table's jobs can be indexed independently
562589

563590
### Why Remove Key Hashing?
@@ -574,7 +601,7 @@ The current system hashes primary keys to support arbitrary key types. The new s
574601
Restricting auto-populated tables to foreign-key-only primary keys provides:
575602

576603
1. **1:1 job correspondence**: Each `key_source` entry maps to exactly one job, eliminating ambiguity about what constitutes a "job"
577-
2. **Proper referential integrity**: The jobs table can reference the same parent tables, enabling cascading deletes
604+
2. **Matching key structure**: The jobs table's primary key exactly matches the target table's primary key, enabling efficient stale detection via `key_source` comparison
578605
3. **Eliminates key_source complexity**: No need for custom `key_source` definitions to enumerate non-foreign-key combinations
579606
4. **Clearer data model**: The computation graph is fully determined by table dependencies
580607
5. **Simpler populate logic**: No need to handle partial key matching or key enumeration
