Move a lot of checkpointing code into Memoizer, from DFK #3986

benclifford · 2025-10-10T11:17:47Z

This should not change behaviour and only move code around and add in more wiring.

This is working towards all of checkpointing/memoizer code being plugable, with the DFK having no knowledge of the specifics - for example, the DFK no longer imports pickle, because that is now the business of the memoizer.

Because of that goal, the parameters for Memoizer are arranged in two ways:

Constructor parameters for Memoizer are parameters that the future user will supply at configuration when Memoizer is exposed as a user plugin, like other plugins.

DFK internals that should be injected into the (eventually pluggable) memoizer (small m) are set as startup-time attributes, as for other plugins.

Right now these two things happen right next to each other, but the Memoizer constructor call will move into user configuration space in a subsequent PR.

As before this PR, there is still a separation between "checkpointing" and "memoization" that is slightly artificial. Subsequent PRs will merge these actions together more in this Memoizer implementation.

This PR preserves a user-facing dfk.checkpoint() call, that becomes a passthrough to Memoizer.checkpoint(). I think that is likely to disappear in a future PR: manual checkpoint triggering will become a feature (or non-feature) of a specific memoizer implementation and so a method directly on that memoizer (or not).

To go alongside the existing update_memo call, called by the DFK when a task is ready for memoization, this PR adds update_checkpoint, which captures the slightly different notion of a task being ready for checkpointing -- the current implementation can memoize a task that is not yet complete because it memoizes the future, not the result, while a checkpoint needs the actual result to be ready and so must be called later. This latter call happens in the race-condition-prone handle_app_update. A later PR will remove this distinction and move everything to around the same place as update_memo. See PR #3979 for description of fixing a related race condition related to update_memo.

See PR #3535 for broader context.

Changed Behaviour

none

Type of change

New feature

This should not change behaviour and only move code around and add in more wiring. This is working towards all of checkpointing/memoizer code being plugable, with the DFK having no knowledge of the specifics - for example, the DFK no longer imports `pickle`, because that is now the business of the memoizer. Because of that goal, the parameters for Memoizer are arranged in two ways: Constructor parameters for Memoizer are parameters that the future user will supply at configuration when Memoizer is exposed as a user plugin, like other plugins. DFK internals that should be injected into the (eventually pluggable) memoizer (small m) are set as startup-time attributes, as for other plugins. Right now these two things happen right next to each other, but the Memoizer constructor call will move into user configuration space in a subsequent PR. As before this PR, there is still a separation between "checkpointing" and "memoization" that is slightly artificial. Subsequent PRs will merge these actions together more in this Memoizer implementation. This PR preserves a user-facing dfk.checkpoint() call, that becomes a passthrough to Memoizer.checkpoint(). I think that is likely to disappear in a future PR: manual checkpoint triggering will become a feature (or non-feature) of a specific memoizer implementation and so a method directly on that memoizer (or not). To go alongside the existing update_memo call, called by the DFK when a task is ready for memoization, this PR adds update_checkpoint, which captures the slightly different notion of a task being ready for checkpointing -- the current implementation can memoize a task that is not yet complete because it memoizes the future, not the result, while a checkpoint needs the actual result to be ready and so must be called later. This latter call happens in the race-condition-prone handle_app_update. A later PR will remove this distinction and move everything to around the same place as update_memo. See PR #3979 for description of fixing a related race condition related to update_memo. See PR #3535 for broader context.

khk-globus

Looks like, as you say, this is basically just moving code. One question about the run_dir, but the others look to effectively be comments on years-old code.

khk-globus · 2025-10-13T14:30:14Z

parsl/dataflow/memoization.py

+    run_dir: str
+


Why make run_dir an optional instance member? For example, if checkpoint_files and checkpoint_mode are the right state and the consumer forgets to externally set it before calling .start() ... seems like a potential traceback?

As opposed to, say, choosing a default value, or ensuring that the mistake is caught at instantiation.

there's two initializations in the parsl style of doing configuration/plugins: the user creates the object, and so user-specified parameters go into __init__. Secondly, there's the framework that is the DFK that invents runtime stuff like the rundir. By the time this happens, __init__ is long in the past - perhaps in a different source file and well before the DFK even begins to exist.

So it's probably right to make this be mandatory parameter of start().

The plugin style in general doesn't do well in differentiating between "uninitialised, configured" (pre-start) plugins and "intialised" (post-start) objects. That differentiation was something that was forced with monitoring plugin radios, because the "configured, uninitialized" state needs to be moved around over the network. And if I was going to re-implement this object model, I'd probably try to make that separation happen everywhere. But that's not how Parsl is right now.

khk-globus · 2025-10-13T15:02:50Z

parsl/dataflow/memoization.py

+            if self.checkpoint_period is None:
+                raise ConfigurationError("Checkpoint period must be specified with periodic checkpoint mode")
+            else:


I guess the checkpoint_period attribute can be changed later? In some sense "too bad," as this check is nominally not needed after instantiation or comes much later than would be ideal (i.e., "immediately").

Meanwhile, stylistically, given the [no-return] semantic of the raise, could remove the else: and unindent the block.

It's probably weird to change it at all, and the validation can go into the __init__ of Memoizer I think. But this PR was trying to move blocks as big as possible and the subsequent timer start can't happen until start time.

khk-globus · 2025-10-13T15:08:11Z

parsl/dataflow/memoization.py

+                except Exception:
+                    raise ConfigurationError("invalid checkpoint_period provided: {0} expected HH:MM:SS".format(self.checkpoint_period))


Any sense in raise ... from ...? Given the not None check and typing, .split() is guaranteed to work, and the only failures would be an invalid int or not enough items for the destructure. (If interested, could also imit the number of splits to 2.)

It would be nice to ensure that the this field is valid much earlier. Perhaps in a setter-style setup?

khk-globus force-pushed the benc-checkpoint-rearrange branch from 42e7c51 to e81d668 Compare October 13, 2025 14:23

khk-globus approved these changes Oct 13, 2025

View reviewed changes

benclifford added this pull request to the merge queue Oct 15, 2025

Merged via the queue into master with commit 1aff212 Oct 15, 2025
11 checks passed

benclifford deleted the benc-checkpoint-rearrange branch October 15, 2025 14:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Move a lot of checkpointing code into Memoizer, from DFK #3986

Move a lot of checkpointing code into Memoizer, from DFK #3986

Uh oh!

benclifford commented Oct 10, 2025

Uh oh!

khk-globus left a comment

Uh oh!

khk-globus Oct 13, 2025

Uh oh!

benclifford Oct 15, 2025

Uh oh!

khk-globus Oct 13, 2025

Uh oh!

benclifford Oct 15, 2025

Uh oh!

khk-globus Oct 13, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		except Exception:
		raise ConfigurationError("invalid checkpoint_period provided: {0} expected HH:MM:SS".format(self.checkpoint_period))

Move a lot of checkpointing code into Memoizer, from DFK #3986

Move a lot of checkpointing code into Memoizer, from DFK #3986

Uh oh!

Conversation

benclifford commented Oct 10, 2025

Changed Behaviour

Type of change

Uh oh!

khk-globus left a comment

Choose a reason for hiding this comment

Uh oh!

khk-globus Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

benclifford Oct 15, 2025

Choose a reason for hiding this comment

Uh oh!

khk-globus Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

benclifford Oct 15, 2025

Choose a reason for hiding this comment

Uh oh!

khk-globus Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants