diff --git a/.devcontainer/docker-compose.yml b/.devcontainer/docker-compose.yml
index 71d74e46f..5c22aaf14 100644
--- a/.devcontainer/docker-compose.yml
+++ b/.devcontainer/docker-compose.yml
@@ -1,4 +1,3 @@
-version: '2.4'
 services:
   # Update this to the name of the service you want to work with in your docker-compose.yml file
   app:
@@ -7,13 +6,13 @@ services:
     # docker-compose.yml file (the first in the devcontainer.json "dockerComposeFile"
     # array). The sample below assumes your primary file is in the root of your project.
     container_name: datajoint-python-devcontainer
-    image: datajoint/datajoint-python-devcontainer:${PY_VER:-3.11}-${DISTRO:-buster}
+    image: datajoint/datajoint-python-devcontainer:${PY_VER:-3.11}-${DISTRO:-bookworm}
     build:
       context: .
       dockerfile: .devcontainer/Dockerfile
       args:
         - PY_VER=${PY_VER:-3.11}
-        - DISTRO=${DISTRO:-buster}
+        - DISTRO=${DISTRO:-bookworm}
 
     volumes:
       # Update this to wherever you want VS Code to mount the folder of your project
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index b8992481a..4a58e0483 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -21,18 +21,18 @@ repos:
     hooks:
       - id: codespell
   - repo: https://github.com/pycqa/isort
-    rev: 5.12.0 # Use the latest stable version
+    rev: 6.0.1 # Use the latest stable version
    hooks:
      - id: isort
        args:
          - --profile=black # Optional, makes isort compatible with Black
  - repo: https://github.com/psf/black
-    rev: 24.2.0 # matching versions in pyproject.toml and github actions
+    rev: 25.1.0 # matching versions in pyproject.toml and github actions
    hooks:
      - id: black
        args: ["--check", "-v", "datajoint", "tests", "--diff"] # --required-version is conflicting with pre-commit
  - repo: https://github.com/PyCQA/flake8
-    rev: 7.1.2
+    rev: 7.3.0
    hooks: # syntax tests
      - id: flake8
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1a7b86032..4bf094509 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,8 @@
 ## Release notes
 
+**Note:** This file is no longer updated. See the GitHub change log page for the
+latest release notes: .
+
 ### 0.14.3 -- Sep 23, 2024
 - Added - `dj.Top` restriction - PR [#1024](https://github.com/datajoint/datajoint-python/issues/1024)) PR [#1084](https://github.com/datajoint/datajoint-python/pull/1084)
 - Fixed - Added encapsulating double quotes to comply with [DOT language](https://graphviz.org/doc/info/lang.html) - PR [#1177](https://github.com/datajoint/datajoint-python/pull/1177)
diff --git a/Dockerfile b/Dockerfile
index dce8a6438..0d727f6b4 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -2,7 +2,7 @@ ARG IMAGE=mambaorg/micromamba:1.5-bookworm-slim
 FROM ${IMAGE}
 
 ARG CONDA_BIN=micromamba
-ARG PY_VER=3.9
+ARG PY_VER=3.11
 ARG HOST_UID=1000
 
 RUN ${CONDA_BIN} install --no-pin -qq -y -n base -c conda-forge \
diff --git a/docs/src/compute/make.md b/docs/src/compute/make.md
index c67711079..1b5569b65 100644
--- a/docs/src/compute/make.md
+++ b/docs/src/compute/make.md
@@ -23,3 +23,193 @@ The `make` call of a master table first inserts the master entity and then inserts
 all the matching part entities in the part tables. None of the entities become visible
 to other processes until the entire `make` call completes, at which point they all
 become visible.
+
+### Three-Part Make Pattern for Long Computations
+
+For long-running computations, DataJoint provides an advanced pattern called the
+**three-part make** that separates the `make` method into three distinct phases.
+This pattern is essential for maintaining database performance and data integrity
+during expensive computations.
+
+#### The Problem: Long Transactions
+
+Traditional `make` methods perform all operations within a single database transaction:
+
+```python
+def make(self, key):
+    # All within one transaction
+    data = (ParentTable & key).fetch1()     # Fetch
+    result = expensive_computation(data)    # Compute (could take hours)
+    self.insert1(dict(key, result=result))  # Insert
+```
+
+This approach has significant limitations:
+
+- **Database locks**: Long transactions hold locks on tables, blocking other operations
+- **Connection timeouts**: Database connections may time out during long computations
+- **Memory pressure**: All fetched data must remain in memory throughout the computation
+- **Failure recovery**: If the computation fails, the entire transaction is rolled back
+
+#### The Solution: Three-Part Make Pattern
+
+The three-part make pattern splits the `make` method into three distinct phases,
+allowing the expensive computation to occur outside of database transactions:
+
+```python
+def make_fetch(self, key):
+    """Phase 1: Fetch all required data from parent tables"""
+    fetched_data = ((ParentTable1 & key).fetch1(), (ParentTable2 & key).fetch1())
+    return fetched_data  # must be a sequence, e.g. a tuple or list
+
+def make_compute(self, key, *fetched_data):
+    """Phase 2: Perform the expensive computation (outside any transaction)"""
+    computed_result = expensive_computation(*fetched_data)
+    return computed_result  # must be a sequence, e.g. a tuple or list
+
+def make_insert(self, key, *computed_result):
+    """Phase 3: Insert the results into the current table"""
+    result, = computed_result  # unpack the sequence returned by make_compute
+    self.insert1(dict(key, result=result))
+```
+
+#### Execution Flow
+
+To preserve data integrity without long transactions, the three-part make pattern follows this execution sequence:
+
+```python
+# Step 1: Fetch the data and run the computation outside any transaction
+fetched_data1 = self.make_fetch(key)
+computed_result = self.make_compute(key, *fetched_data1)
+
+# Step 2: Begin a transaction and verify data consistency
+begin transaction:
+    fetched_data2 = self.make_fetch(key)
+    if fetched_data1 != fetched_data2:  # deep comparison
+        cancel transaction  # data changed during the computation
+    else:
+        self.make_insert(key, *computed_result)
+        commit transaction
+```
+
+#### Key Benefits
+
+1. **Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration
+2. **Connection Efficiency**: Database connections are only used briefly, for data transfer
+3. **Memory Management**: Fetched data can be processed and released during the computation
+4. **Fault Tolerance**: Computation failures don't affect database state
+5. **Scalability**: Multiple computations can run concurrently without database contention
+
+#### Referential Integrity Protection
+
+The pattern includes a critical safety mechanism: **referential integrity verification**.
+Before inserting results, the system:
+
+1. Re-fetches the source data within the transaction
+2. Compares it with the originally fetched data using deep hashing
+3. Proceeds with the insertion only if the data hasn't changed
+
+This prevents the "phantom read" problem, where source data changes during a long
+computation, ensuring that results remain consistent with their inputs.
+
+#### Implementation Details
+
+The pattern is implemented using Python generators in the `AutoPopulate` class:
+
+```python
+def make(self, key):
+    # Step 1: Fetch data from parent tables
+    fetched_data = self.make_fetch(key)
+    computed_result = yield fetched_data
+
+    # Step 2: Compute only if the result was not sent in by the caller
+    if computed_result is None:
+        computed_result = self.make_compute(key, *fetched_data)
+        yield computed_result
+
+    # Step 3: Insert the computed result
+    self.make_insert(key, *computed_result)
+    yield
+```
+
+You can therefore implement the three-part make pattern yourself by overriding the `make` method as a generator that uses `yield` to hand back the fetched data and the computed result, as shown above.
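To make the generator hand-off concrete, here is a minimal, self-contained sketch of how a driver (playing the role of the populate logic) might advance such a generator in two rounds, re-fetching inside the notional transaction to verify consistency. `DummyTable`, `drive_make`, and everything in them are hypothetical illustrations for this sketch, not DataJoint API:

```python
class DummyTable:
    """Stand-in table implementing the three-part make as a generator."""

    def __init__(self):
        self.source = {1: 10}   # pretend parent data
        self.rows = []          # pretend table storage

    def make_fetch(self, key):
        return (self.source[key],)

    def make_compute(self, key, value):
        return (value * 2,)     # the "expensive" computation

    def make_insert(self, key, result):
        self.rows.append({"key": key, "result": result})

    def make(self, key):
        fetched = self.make_fetch(key)
        computed = yield fetched
        if computed is None:
            computed = self.make_compute(key, *fetched)
            yield computed
        self.make_insert(key, *computed)
        yield


def drive_make(table, key):
    # Round 1: fetch and compute outside any transaction
    gen = table.make(key)
    fetched1 = next(gen)        # runs make_fetch, yields the fetched data
    computed = gen.send(None)   # None triggers make_compute, yields the result
    gen.close()

    # Round 2: inside a (notional) transaction, re-fetch and verify
    gen = table.make(key)
    fetched2 = next(gen)
    if fetched1 != fetched2:    # deep comparison
        gen.close()             # data changed: cancel the transaction
        return False
    gen.send(computed)          # skips compute, performs only the insert
    return True
```

The second round sends the previously computed result into a fresh generator, so the compute step is skipped and only the verification fetch and the insert happen inside the transaction.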
+
+#### Use Cases
+
+This pattern is particularly valuable for:
+
+- **Machine learning model training**: Hours-long training sessions
+- **Image processing pipelines**: Large-scale image analysis
+- **Statistical computations**: Complex statistical analyses
+- **Data transformations**: ETL processes with heavy computation
+- **Simulation runs**: Time-consuming simulations
+
+#### Example: Long-Running Image Analysis
+
+Here's an example of how to implement the three-part make pattern for a
+long-running image analysis task:
+
+```python
+@schema
+class ImageAnalysis(dj.Computed):
+    definition = """
+    # Complex image analysis results
+    -> Image
+    ---
+    analysis_result : longblob
+    processing_time : float
+    """
+
+    def make_fetch(self, key):
+        """Fetch the image data needed for analysis"""
+        image_data = (Image & key).fetch1('image')
+        params = (Params & key).fetch1('params')
+        return (image_data, params)  # pack fetched_data
+
+    def make_compute(self, key, image_data, params):
+        """Perform the expensive image analysis outside any transaction"""
+        import time
+        start_time = time.time()
+
+        # Expensive computation that could take hours
+        result = complex_image_analysis(image_data, params)
+        processing_time = time.time() - start_time
+        return result, processing_time
+
+    def make_insert(self, key, analysis_result, processing_time):
+        """Insert the analysis results"""
+        self.insert1(dict(key,
+                          analysis_result=analysis_result,
+                          processing_time=processing_time))
+```
+
+The same effect may be achieved by overriding the `make` method as a generator function that uses the `yield` statement to return the fetched data and the computed result, as above:
+
+```python
+@schema
+class ImageAnalysis(dj.Computed):
+    definition = """
+    # Complex image analysis results
+    -> Image
+    ---
+    analysis_result : longblob
+    processing_time : float
+    """
+
+    def make(self, key):
+        image_data = (Image & key).fetch1('image')
+        params = (Params & key).fetch1('params')
+        computed_result = yield (image_data, params)  # pack fetched_data
+
+        if computed_result is None:
+            # Expensive computation that could take hours
+            import time
+            start_time = time.time()
+            result = complex_image_analysis(image_data, params)
+            processing_time = time.time() - start_time
+            computed_result = result, processing_time  # pack
+            yield computed_result
+
+        result, processing_time = computed_result  # unpack
+        self.insert1(dict(key,
+                          analysis_result=result,
+                          processing_time=processing_time))
+        yield  # yield control back to the caller
+```
+
+We expect that most users will prefer the three-part implementation over the generator-function implementation, since the latter is conceptually more complex.
\ No newline at end of file
diff --git a/docs/src/compute/populate.md b/docs/src/compute/populate.md
index 76fc62aee..eb7ae5f0b 100644
--- a/docs/src/compute/populate.md
+++ b/docs/src/compute/populate.md
@@ -62,8 +62,7 @@ The `make` callback does three things:
 2. Computes and adds any missing attributes to the fields already in `key`.
 3. Inserts the entire entity into `self`.
 
-`make` may populate multiple entities in one call when `key` does not specify the
-entire primary key of the populated table.
+`make` may populate multiple entities in one call when `key` does not specify the entire primary key of the populated table.
 
 ## Populate
diff --git a/pyproject.toml b/pyproject.toml
index 075bb92b7..c787cc11d 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -27,6 +27,7 @@ dependencies = [
 requires-python = ">=3.9,<4.0"
 authors = [
     {name = "Dimitri Yatsenko", email = "dimitri@datajoint.com"},
+    {name = "Thinh Nguyen", email = "thinh@datajoint.com"},
     {name = "Raphael Guzman"},
     {name = "Edgar Walker"},
     {name = "DataJoint Contributors", email = "support@datajoint.com"},