`make` may populate multiple entities in one call when `key` does not specify the
entire primary key of the populated table.

### Three-Part Make Pattern for Long Computations

For long-running computations, DataJoint provides an advanced pattern called the
**three-part make** that separates the `make` method into three distinct phases.
This pattern is essential for maintaining database performance and data integrity
during expensive computations.

#### The Problem: Long Transactions

Traditional `make` methods perform all operations within a single database transaction:

```python
def make(self, key):
    # All within one transaction
    data = (ParentTable & key).fetch1()     # fetch
    result = expensive_computation(data)    # compute (could take hours)
    self.insert1(dict(key, result=result))  # insert
```

This approach has significant limitations:

- **Database locks**: Long transactions hold locks on tables, blocking other operations
- **Connection timeouts**: Database connections may time out during long computations
- **Memory pressure**: All fetched data must remain in memory throughout the computation
- **Failure recovery**: If the computation fails, the entire transaction is rolled back

#### The Solution: Three-Part Make Pattern

The three-part make pattern splits the `make` method into three distinct phases,
allowing the expensive computation to occur outside of database transactions:

```python
def make_fetch(self, key):
    """Phase 1: Fetch all required data from parent tables."""
    fetched_data = ((ParentTable & key).fetch1(),)
    return fetched_data  # must be a sequence, e.g., tuple or list

def make_compute(self, key, *fetched_data):
    """Phase 2: Perform the expensive computation (outside the transaction)."""
    result = expensive_computation(*fetched_data)
    return (result,)  # must be a sequence, e.g., tuple or list

def make_insert(self, key, *computed_result):
    """Phase 3: Insert the results into the current table."""
    result, = computed_result  # unpack the sequence returned by make_compute
    self.insert1(dict(key, result=result))
```
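
Called in sequence, the three phases reproduce the behavior of a conventional `make`. The following standalone sketch simulates this with stand-in data; `Demo`, its `parent` dictionary, and the arithmetic are hypothetical, and no database is involved:

```python
class Demo:
    """Stand-in for a computed table; parent rows and insert1 are simulated."""
    parent = {("a",): {"value": 10}}  # hypothetical parent-table contents

    def __init__(self):
        self.inserted = []  # stand-in for rows inserted into the database

    def make_fetch(self, key):
        return (self.parent[key],)  # must be a sequence

    def make_compute(self, key, *fetched_data):
        return (fetched_data[0]["value"] + 1,)  # must be a sequence

    def make_insert(self, key, *computed_result):
        self.inserted.append(dict(result=computed_result[0]))

demo = Demo()
key = ("a",)
fetched = demo.make_fetch(key)               # phase 1
computed = demo.make_compute(key, *fetched)  # phase 2 (no transaction open)
demo.make_insert(key, *computed)             # phase 3
assert demo.inserted == [{"result": 11}]
```

The packing convention (each phase returns a sequence, consumed with `*`) is what lets the framework pass results between phases generically.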

#### Execution Flow

To achieve data integrity without long transactions, the three-part make pattern follows this execution sequence:

```
# Step 1: fetch data and compute outside any transaction
fetched_data1 = self.make_fetch(key)
computed_result = self.make_compute(key, *fetched_data1)

# Step 2: begin a transaction and verify data consistency
begin transaction:
    fetched_data2 = self.make_fetch(key)
    if fetched_data1 != fetched_data2:  # deep comparison
        cancel transaction  # data changed during computation
    else:
        self.make_insert(key, *computed_result)
        commit transaction
```

#### Key Benefits

1. **Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration
2. **Connection Efficiency**: Database connections are only used briefly for data transfer
3. **Memory Management**: Fetched data can be processed and released during computation
4. **Fault Tolerance**: Computation failures don't affect database state
5. **Scalability**: Multiple computations can run concurrently without database contention

#### Referential Integrity Protection

The pattern includes a critical safety mechanism: **referential integrity verification**.
Before inserting results, the system:

1. Re-fetches the source data within the transaction
2. Compares it with the originally fetched data using deep hashing
3. Only proceeds with insertion if the data hasn't changed

This prevents the "phantom read" problem where source data changes during long computations,
ensuring that results remain consistent with their inputs.
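
A minimal sketch of this change-detection idea, using a hash of the serialized fetched data. This is pure Python for illustration; `content_hash` and the sample rows are hypothetical, not DataJoint's actual implementation:

```python
import hashlib
import pickle

def content_hash(fetched_data):
    # Hash the serialized fetched rows; equal hashes imply unchanged inputs.
    return hashlib.sha256(pickle.dumps(fetched_data)).hexdigest()

# Hypothetical fetched rows: unchanged vs. modified during the computation
row_v1 = ({"image_id": 1, "pixels": [0, 1, 2]},)
row_v1_refetched = ({"image_id": 1, "pixels": [0, 1, 2]},)
row_v2 = ({"image_id": 1, "pixels": [9, 9, 9]},)

assert content_hash(row_v1) == content_hash(row_v1_refetched)  # safe to insert
assert content_hash(row_v1) != content_hash(row_v2)            # cancel transaction
```

Hashing the fetched data, rather than holding it for a direct comparison, means the verification step only needs to keep a small digest in memory during the long computation.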

#### Implementation Details

The pattern is implemented using Python generators in the `AutoPopulate` class:

```python
def make(self, key):
    # Step 1: Fetch data from parent tables
    fetched_data = self.make_fetch(key)
    computed_result = yield fetched_data

    # Step 2: Compute if the result was not provided by the caller
    if computed_result is None:
        computed_result = self.make_compute(key, *fetched_data)
        yield computed_result

    # Step 3: Insert the computed result
    self.make_insert(key, *computed_result)
    yield
```
You can therefore implement the three-part pattern directly by overriding the `make` method as a generator that uses `yield` to hand back the fetched data and the computed result, as shown above.
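
To see how a driver such as `populate` can consume this generator, here is a standalone sketch; the fetch, compute, and insert steps are simulated with plain Python (no database), and the doubling arithmetic is hypothetical:

```python
results = []  # stand-in for the table's storage

def make(key):
    fetched_data = (key["x"],)                 # simulated make_fetch
    computed_result = yield fetched_data
    if computed_result is None:
        computed_result = (fetched_data[0] * 2,)  # simulated make_compute
        yield computed_result
    results.append(dict(key, result=computed_result[0]))  # simulated make_insert
    yield

gen = make({"x": 21})
fetched = next(gen)        # phase 1: generator yields the fetched data
computed = gen.send(None)  # phase 2: sending None makes the generator compute
next(gen)                  # phase 3: insert, then yield control back
assert results == [{"x": 21, "result": 42}]
```

A driver could instead `send` a previously computed result at the second step, skipping the computation; that is what the `if computed_result is None` branch allows.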

#### Use Cases

This pattern is particularly valuable for:

- **Machine learning model training**: Hours-long training sessions
- **Image processing pipelines**: Large-scale image analysis
- **Statistical computations**: Complex statistical analyses
- **Data transformations**: ETL processes with heavy computation
- **Simulation runs**: Time-consuming simulations

#### Example: Long-Running Image Analysis

Here's an example of how to implement the three-part make pattern for a
long-running image analysis task:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make_fetch(self, key):
        """Fetch the image data needed for analysis."""
        return ((Image & key).fetch1('image'),)

    def make_compute(self, key, image_data):
        """Perform the expensive image analysis outside the transaction."""
        import time
        start_time = time.time()

        # Expensive computation that could take hours
        result = complex_image_analysis(image_data)
        processing_time = time.time() - start_time
        return result, processing_time

    def make_insert(self, key, analysis_result, processing_time):
        """Insert the analysis results."""
        self.insert1(dict(key,
                          analysis_result=analysis_result,
                          processing_time=processing_time))
```

The exact same effect may be achieved by overriding the `make` method as a generator function that uses `yield` to hand back the fetched data and the computed result:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make(self, key):
        image_data = (Image & key).fetch1('image')
        computed_result = yield (image_data,)  # pack fetched_data

        if computed_result is None:
            # Expensive computation that could take hours
            import time
            start_time = time.time()
            result = complex_image_analysis(image_data)
            processing_time = time.time() - start_time
            computed_result = result, processing_time  # pack
            yield computed_result

        result, processing_time = computed_result  # unpack
        self.insert1(dict(key,
                          analysis_result=result,
                          processing_time=processing_time))
        yield  # yield control back to the caller
```
We expect most users to prefer the three-part implementation over the generator implementation, since the generator form is conceptually more complex.

## Populate

The inherited `populate` method of `dj.Imported` and `dj.Computed` automatically calls