Merged (changes from 12 commits)
5 changes: 2 additions & 3 deletions .devcontainer/docker-compose.yml
@@ -1,4 +1,3 @@
-version: '2.4'
 services:
   # Update this to the name of the service you want to work with in your docker-compose.yml file
   app:
@@ -7,13 +6,13 @@ services:
     # docker-compose.yml file (the first in the devcontainer.json "dockerComposeFile"
     # array). The sample below assumes your primary file is in the root of your project.
     container_name: datajoint-python-devcontainer
-    image: datajoint/datajoint-python-devcontainer:${PY_VER:-3.11}-${DISTRO:-buster}
+    image: datajoint/datajoint-python-devcontainer:${PY_VER:-3.11}-${DISTRO:-bookworm}
     build:
       context: .
       dockerfile: .devcontainer/Dockerfile
       args:
         - PY_VER=${PY_VER:-3.11}
-        - DISTRO=${DISTRO:-buster}
+        - DISTRO=${DISTRO:-bookworm}
 
     volumes:
       # Update this to wherever you want VS Code to mount the folder of your project
6 changes: 3 additions & 3 deletions .pre-commit-config.yaml
@@ -21,18 +21,18 @@ repos:
     hooks:
       - id: codespell
   - repo: https://github.com/pycqa/isort
-    rev: 5.12.0 # Use the latest stable version
+    rev: 6.0.1 # Use the latest stable version
     hooks:
       - id: isort
        args:
          - --profile=black # Optional, makes isort compatible with Black
   - repo: https://github.com/psf/black
-    rev: 24.2.0 # matching versions in pyproject.toml and github actions
+    rev: 25.1.0 # matching versions in pyproject.toml and github actions
     hooks:
       - id: black
        args: ["--check", "-v", "datajoint", "tests", "--diff"] # --required-version is conflicting with pre-commit
   - repo: https://github.com/PyCQA/flake8
-    rev: 7.1.2
+    rev: 7.3.0
     hooks:
       # syntax tests
       - id: flake8
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,8 @@
 ## Release notes
 
+**Note:** This file is no longer updated. See the GitHub change log page for the
+latest release notes: <https://github.com/datajoint/datajoint-python/releases>.
+
 ### 0.14.3 -- Sep 23, 2024
 - Added - `dj.Top` restriction - PR [#1024](https://github.com/datajoint/datajoint-python/issues/1024) PR [#1084](https://github.com/datajoint/datajoint-python/pull/1084)
 - Fixed - Added encapsulating double quotes to comply with [DOT language](https://graphviz.org/doc/info/lang.html) - PR [#1177](https://github.com/datajoint/datajoint-python/pull/1177)
2 changes: 1 addition & 1 deletion Dockerfile
@@ -2,7 +2,7 @@ ARG IMAGE=mambaorg/micromamba:1.5-bookworm-slim
 FROM ${IMAGE}
 
 ARG CONDA_BIN=micromamba
-ARG PY_VER=3.9
+ARG PY_VER=3.11
 ARG HOST_UID=1000
 
 RUN ${CONDA_BIN} install --no-pin -qq -y -n base -c conda-forge \
187 changes: 187 additions & 0 deletions docs/src/compute/populate.md
@@ -65,6 +65,193 @@ The `make` callback does three things:
`make` may populate multiple entities in one call when `key` does not specify the
entire primary key of the populated table.

### Three-Part Make Pattern for Long Computations

For long-running computations, DataJoint provides an advanced pattern called the
**three-part make** that separates the `make` method into three distinct phases.
This pattern is essential for maintaining database performance and data integrity
during expensive computations.

#### The Problem: Long Transactions

Traditional `make` methods perform all operations within a single database transaction:

```python
def make(self, key):
    # all within one transaction
    data = (ParentTable & key).fetch1()     # fetch
    result = expensive_computation(data)    # compute (could take hours)
    self.insert1(dict(key, result=result))  # insert
```

This approach has significant limitations:
- **Database locks**: Long transactions hold locks on tables, blocking other operations
- **Connection timeouts**: Database connections may timeout during long computations
- **Memory pressure**: All fetched data must remain in memory throughout the computation
- **Failure recovery**: If the computation fails, the entire transaction is rolled back and all work is lost

#### The Solution: Three-Part Make Pattern

The three-part make pattern splits the `make` method into three distinct phases,
allowing the expensive computation to occur outside of database transactions:

```python
def make_fetch(self, key):
    """Phase 1: Fetch all required data from parent tables"""
    fetched_data = ((ParentTable & key).fetch1(),)
    return fetched_data  # must be a sequence, e.g., a tuple or list

def make_compute(self, key, *fetched_data):
    """Phase 2: Perform expensive computation (outside transaction)"""
    computed_result = expensive_computation(*fetched_data)
    return computed_result  # must be a sequence, e.g., a tuple or list

def make_insert(self, key, *computed_result):
    """Phase 3: Insert results into the current table"""
    self.insert1(dict(key, result=computed_result))
```

#### Execution Flow

To achieve data integrity without long transactions, the three-part make pattern follows this execution sequence:

```python
# Step 1: fetch the data and compute outside any transaction
fetched_data1 = self.make_fetch(key)
computed_result = self.make_compute(key, *fetched_data1)

# Step 2: begin a transaction and verify data consistency (pseudocode)
begin transaction:
    fetched_data2 = self.make_fetch(key)
    if fetched_data1 != fetched_data2:  # deep comparison
        cancel transaction  # the source data changed during the computation
    else:
        self.make_insert(key, *computed_result)
        commit transaction
```

#### Key Benefits

1. **Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration
2. **Connection Efficiency**: Database connections are only used briefly for data transfer
3. **Memory Management**: Fetched data can be processed and released during computation
4. **Fault Tolerance**: Computation failures don't affect database state
5. **Scalability**: Multiple computations can run concurrently without database contention

#### Referential Integrity Protection

The pattern includes a critical safety mechanism: **referential integrity verification**.
Before inserting results, the system:

1. Re-fetches the source data within the transaction
2. Compares it with the originally fetched data using deep hashing
3. Only proceeds with insertion if the data hasn't changed

This prevents the "phantom read" problem where source data changes during long computations,
ensuring that results remain consistent with their inputs.
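
As a rough sketch of this verification step: hash the fetched data before the computation, re-fetch inside the transaction, and compare. The example below assumes the third-party `deepdiff` package and an illustrative `content_hash` helper with sample data; it is one possible implementation, not a verbatim excerpt of DataJoint's internals:

```python
from deepdiff import DeepHash  # third-party package: pip install deepdiff

def content_hash(obj):
    """Return a stable, content-based hash of a nested Python object."""
    return DeepHash(obj)[obj]

# Phase 1 result, hashed before the long computation begins (sample data)
fetched_data = ({"image_id": 1, "image": [0.1, 0.2, 0.3]},)
hash_before = content_hash(fetched_data)

# ... long computation runs here ...

# Inside the insert transaction: re-fetch the source data and compare
refetched_data = ({"image_id": 1, "image": [0.1, 0.2, 0.3]},)
if content_hash(refetched_data) != hash_before:
    raise RuntimeError("Source data changed during computation; aborting insert.")
```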

#### Implementation Details

The pattern is implemented using Python generators in the `AutoPopulate` class:

```python
def make(self, key):
    # Step 1: Fetch data from parent tables
    fetched_data = self.make_fetch(key)
    computed_result = yield fetched_data

    # Step 2: Compute if not provided
    if computed_result is None:
        computed_result = self.make_compute(key, *fetched_data)
        yield computed_result

    # Step 3: Insert the computed result
    self.make_insert(key, *computed_result)
    yield
```
Thus, you can implement the three-part make pattern yourself by overriding `make` as a generator that uses `yield` to hand back the fetched data and the computed result, as shown above.
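
For intuition, here is a hedged sketch of a driver loop that consumes such a generator. The `deep_hash` argument and the exact transaction calls are illustrative assumptions, not a verbatim excerpt from `AutoPopulate`:

```python
def run_three_part_make(table, key, deep_hash):
    """Drive a generator-style make(): compute outside a transaction,
    then verify and insert inside one. `deep_hash` is a caller-supplied
    content-hash function (illustrative)."""
    gen = table.make(key)
    fetched_data = next(gen)       # phase 1: fetch (no transaction held)
    computed_result = next(gen)    # phase 2: compute (may take hours)

    table.connection.start_transaction()
    gen = table.make(key)          # restart the generator
    if deep_hash(next(gen)) != deep_hash(fetched_data):
        table.connection.cancel_transaction()
        raise RuntimeError("Source data changed during computation.")
    gen.send(computed_result)      # phase 3: skip recompute, insert
    table.connection.commit_transaction()
```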

#### Use Cases

This pattern is particularly valuable for:

- **Machine learning model training**: Hours-long training sessions
- **Image processing pipelines**: Large-scale image analysis
- **Statistical computations**: Complex statistical analyses
- **Data transformations**: ETL processes with heavy computation
- **Simulation runs**: Time-consuming simulations

#### Example: Long-Running Image Analysis

Here's an example of how to implement the three-part make pattern for a
long-running image analysis task:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make_fetch(self, key):
        """Fetch the image data needed for analysis"""
        return ((Image & key).fetch1('image'),)  # a one-element tuple

    def make_compute(self, key, image_data):
        """Perform expensive image analysis outside transaction"""
        import time
        start_time = time.time()

        # expensive computation that could take hours
        result = complex_image_analysis(image_data)
        processing_time = time.time() - start_time
        return result, processing_time

    def make_insert(self, key, analysis_result, processing_time):
        """Insert the analysis results"""
        self.insert1(dict(key,
                          analysis_result=analysis_result,
                          processing_time=processing_time))
```

The same effect can be achieved by overriding the `make` method as a generator function that uses the `yield` statement to hand back the fetched data and the computed result:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make(self, key):
        fetched_data = ((Image & key).fetch1('image'),)
        computed_result = yield fetched_data

        if computed_result is None:
            # expensive computation that could take hours
            import time
            start_time = time.time()
            image_data, = fetched_data  # unpack the one-element tuple
            result = complex_image_analysis(image_data)
            processing_time = time.time() - start_time
            computed_result = result, processing_time
            yield computed_result

        result, processing_time = computed_result
        self.insert1(dict(key,
                          analysis_result=result,
                          processing_time=processing_time))
        yield  # yield control back to the caller
```
We expect most users to prefer the three-part implementation over the generator form, whose control flow is conceptually more complex.
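
Either version is run like any other auto-populated table. For example (the `image_id` restriction below is illustrative):

```python
# compute all missing entries; each key runs fetch -> compute -> verified insert
ImageAnalysis.populate(display_progress=True)

# or restrict the work and reserve jobs for distributed workers
ImageAnalysis.populate(Image & "image_id < 100", reserve_jobs=True)
```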

## Populate

The inherited `populate` method of `dj.Imported` and `dj.Computed` automatically calls
1 change: 1 addition & 0 deletions pyproject.toml
@@ -27,6 +27,7 @@ dependencies = [
 requires-python = ">=3.9,<4.0"
 authors = [
     {name = "Dimitri Yatsenko", email = "[email protected]"},
+    {name = "Thinh Nguyen", email = "[email protected]"},
     {name = "Raphael Guzman"},
     {name = "Edgar Walker"},
     {name = "DataJoint Contributors", email = "[email protected]"},