
Commit cb332a9

fix #1246, updated docs to explain three-part make pattern
and generator function implementation
1 parent d1202f6 commit cb332a9


docs/src/compute/populate.md

Lines changed: 187 additions & 0 deletions
@@ -65,6 +65,193 @@ The `make` callback does three things:
`make` may populate multiple entities in one call when `key` does not specify the
entire primary key of the populated table.

### Three-Part Make Pattern for Long Computations

For long-running computations, DataJoint provides an advanced pattern called the
**three-part make** that separates the `make` method into three distinct phases.
This pattern is essential for maintaining database performance and data integrity
during expensive computations.

#### The Problem: Long Transactions

Traditional `make` methods perform all operations within a single database transaction:

```python
def make(self, key):
    # All within one transaction
    data = (ParentTable & key).fetch1()     # Fetch
    result = expensive_computation(data)    # Compute (could take hours)
    self.insert1(dict(key, result=result))  # Insert
```

This approach has significant limitations:

- **Database locks**: Long transactions hold locks on tables, blocking other operations
- **Connection timeouts**: Database connections may time out during long computations
- **Memory pressure**: All fetched data must remain in memory throughout the computation
- **Failure recovery**: If the computation fails, the entire transaction is rolled back

#### The Solution: Three-Part Make Pattern

The three-part make pattern splits the `make` method into three distinct phases,
allowing the expensive computation to occur outside of database transactions:

```python
def make_fetch(self, key):
    """Phase 1: Fetch all required data from parent tables"""
    fetched_data = ((ParentTable & key).fetch1(),)
    return fetched_data  # must be a sequence, e.g. a tuple or list

def make_compute(self, key, *fetched_data):
    """Phase 2: Perform expensive computation (outside transaction)"""
    computed_result = expensive_computation(*fetched_data)
    return (computed_result,)  # must be a sequence, e.g. a tuple or list

def make_insert(self, key, *computed_result):
    """Phase 3: Insert results into the current table"""
    result, = computed_result  # unpack the single computed value
    self.insert1(dict(key, result=result))
```

#### Execution Flow

To achieve data integrity without long transactions, the three-part make pattern
follows this execution sequence:

```python
# Step 1: Fetch data and compute outside any transaction
fetched_data1 = self.make_fetch(key)
computed_result = self.make_compute(key, *fetched_data1)

# Step 2: Begin transaction and verify data consistency
begin transaction:
    fetched_data2 = self.make_fetch(key)
    if fetched_data1 != fetched_data2:  # deep comparison
        cancel transaction  # data changed during computation
    else:
        self.make_insert(key, *computed_result)
        commit transaction
```
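
The same flow can be written as runnable Python. Below is a minimal sketch, assuming a
table instance that defines the three methods; the `populate_one` name is hypothetical,
while `start_transaction`, `cancel_transaction`, and `commit_transaction` are methods of
the DataJoint connection object. In practice, the inherited `populate` method performs
this orchestration for you.

```python
def populate_one(table, key):
    """Illustrative driver for the three-part make pattern (not DataJoint's source)."""
    # fetch and compute outside any transaction
    fetched_data1 = table.make_fetch(key)
    computed_result = table.make_compute(key, *fetched_data1)

    # brief transaction: verify the inputs are unchanged, then insert
    table.connection.start_transaction()
    try:
        fetched_data2 = table.make_fetch(key)
        if fetched_data1 != fetched_data2:  # a deep comparison in practice
            raise RuntimeError("source data changed during computation")
        table.make_insert(key, *computed_result)
    except Exception:
        table.connection.cancel_transaction()
        raise
    else:
        table.connection.commit_transaction()
```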

#### Key Benefits

1. **Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration
2. **Connection Efficiency**: Database connections are only used briefly for data transfer
3. **Memory Management**: Fetched data can be processed and released during computation
4. **Fault Tolerance**: Computation failures don't affect database state
5. **Scalability**: Multiple computations can run concurrently without database contention

#### Referential Integrity Protection

The pattern includes a critical safety mechanism: **referential integrity verification**.
Before inserting results, the system:

1. Re-fetches the source data within the transaction
2. Compares it with the originally fetched data using deep hashing
3. Only proceeds with insertion if the data hasn't changed

This prevents the "phantom read" problem where source data changes during long computations,
ensuring that results remain consistent with their inputs.
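
For illustration, such a comparison can be implemented as a deep, content-based hash of
the fetched data. The following is a minimal sketch assuming the `deepdiff` package and
hypothetical `table` and `key` names; the actual mechanism inside DataJoint may differ:

```python
from deepdiff import DeepHash

def fetch_hash(fetched_data):
    """Return a deep, content-based hash of the fetched data."""
    return DeepHash(fetched_data)[fetched_data]

# outside the transaction
hash_before = fetch_hash(table.make_fetch(key))
# ... the long computation runs here ...
# inside the transaction, just before inserting
if fetch_hash(table.make_fetch(key)) != hash_before:
    raise RuntimeError("source data changed during computation; aborting insert")
```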

#### Implementation Details

The pattern is implemented using Python generators in the `AutoPopulate` class:

```python
def make(self, key):
    # Step 1: Fetch data from parent tables
    fetched_data = self.make_fetch(key)
    computed_result = yield fetched_data

    # Step 2: Compute if not provided
    if computed_result is None:
        computed_result = self.make_compute(key, *fetched_data)
        yield computed_result

    # Step 3: Insert the computed result
    self.make_insert(key, *computed_result)
    yield
```

Therefore, it is possible to implement the three-part make pattern by overriding the
`make` method as a generator function, using `yield` statements to return the fetched
data and the computed result as above.
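
For illustration, here is a simplified sketch (not DataJoint's exact source) of how the
populate logic can drive such a generator; the `drive_make` name is hypothetical, and
transaction handling is omitted for brevity:

```python
def drive_make(table, key):
    """Illustrative driver for a generator-style make (not DataJoint's source)."""
    gen = table.make(key)
    fetched_data = next(gen)     # runs make_fetch, pauses at the first yield
    computed_result = next(gen)  # resumes with None, so make_compute runs

    # within a short transaction: restart, re-fetch, and verify before inserting
    gen = table.make(key)
    if next(gen) != fetched_data:  # a deep comparison in practice
        raise RuntimeError("source data changed during computation")
    gen.send(computed_result)      # skips the compute branch and runs make_insert
```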

#### Use Cases

This pattern is particularly valuable for:

- **Machine learning model training**: Hours-long training sessions
- **Image processing pipelines**: Large-scale image analysis
- **Statistical computations**: Complex statistical analyses
- **Data transformations**: ETL processes with heavy computation
- **Simulation runs**: Time-consuming simulations

#### Example: Long-Running Image Analysis

Here's an example of how to implement the three-part make pattern for a
long-running image analysis task:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make_fetch(self, key):
        """Fetch the image data needed for analysis"""
        return ((Image & key).fetch1('image'),)

    def make_compute(self, key, image_data):
        """Perform expensive image analysis outside transaction"""
        import time
        start_time = time.time()

        # Expensive computation that could take hours
        result = complex_image_analysis(image_data)
        processing_time = time.time() - start_time
        return result, processing_time

    def make_insert(self, key, analysis_result, processing_time):
        """Insert the analysis results"""
        self.insert1(dict(key,
                          analysis_result=analysis_result,
                          processing_time=processing_time))
```
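
Assuming such a table has been declared, population proceeds as usual: the inherited
`populate` method drives the three phases for each missing key, keeping the expensive
computation outside the transaction.

```python
# Each key goes through fetch, compute (outside the transaction),
# and a brief verify-and-insert transaction.
ImageAnalysis.populate(display_progress=True)
```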

The exact same effect may be achieved by overriding the `make` method as a generator
function that uses `yield` statements to return the fetched data and the computed result:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make(self, key):
        fetched_data = ((Image & key).fetch1('image'),)
        computed_result = yield fetched_data

        if computed_result is None:
            # Expensive computation that could take hours
            import time
            image_data, = fetched_data  # unpack the fetched sequence
            start_time = time.time()
            result = complex_image_analysis(image_data)
            processing_time = time.time() - start_time
            computed_result = result, processing_time
            yield computed_result

        result, processing_time = computed_result
        self.insert1(dict(key,
                          analysis_result=result,
                          processing_time=processing_time))
        yield  # yield control back to the caller
```

We expect that most users will prefer the three-part implementation over the generator
function implementation because of the generator's added conceptual complexity.

## Populate

The inherited `populate` method of `dj.Imported` and `dj.Computed` automatically calls