- 
                Notifications
    
You must be signed in to change notification settings  - Fork 211
 
Move a lot of checkpointing code into Memoizer, from DFK #3986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This should not change behaviour and only move code around and add in more wiring. This is working towards all of checkpointing/memoizer code being plugable, with the DFK having no knowledge of the specifics - for example, the DFK no longer imports `pickle`, because that is now the business of the memoizer. Because of that goal, the parameters for Memoizer are arranged in two ways: Constructor parameters for Memoizer are parameters that the future user will supply at configuration when Memoizer is exposed as a user plugin, like other plugins. DFK internals that should be injected into the (eventually pluggable) memoizer (small m) are set as startup-time attributes, as for other plugins. Right now these two things happen right next to each other, but the Memoizer constructor call will move into user configuration space in a subsequent PR. As before this PR, there is still a separation between "checkpointing" and "memoization" that is slightly artificial. Subsequent PRs will merge these actions together more in this Memoizer implementation. This PR preserves a user-facing dfk.checkpoint() call, that becomes a passthrough to Memoizer.checkpoint(). I think that is likely to disappear in a future PR: manual checkpoint triggering will become a feature (or non-feature) of a specific memoizer implementation and so a method directly on that memoizer (or not). To go alongside the existing update_memo call, called by the DFK when a task is ready for memoization, this PR adds update_checkpoint, which captures the slightly different notion of a task being ready for checkpointing -- the current implementation can memoize a task that is not yet complete because it memoizes the future, not the result, while a checkpoint needs the actual result to be ready and so must be called later. This latter call happens in the race-condition-prone handle_app_update. A later PR will remove this distinction and move everything to around the same place as update_memo. See PR #3979 for description of fixing a related race condition related to update_memo. See PR #3535 for broader context.
42e7c51    to
    e81d668      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like, as you say, this is basically just moving code.  One question about the run_dir, but the others look to effectively be comments on years-old code.
| run_dir: str | ||
| 
               | 
          
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make run_dir an optional instance member?  For example, if checkpoint_files and checkpoint_mode are the right state and the consumer forgets to externally set it before calling .start() ... seems like a potential traceback?
As opposed to, say, choosing a default value, or ensuring that the mistake is caught at instantiation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's two initializations in the parsl style of doing configuration/plugins: the user creates the object, and so user-specified parameters go into __init__. Secondly, there's the framework that is the DFK that invents runtime stuff like the rundir. By the time this happens, __init__ is long in the past - perhaps in a different source file and well before the DFK even begins to exist.
So it's probably right to make this be mandatory parameter of start().
The plugin style in general doesn't do well in differentiating between "uninitialised, configured" (pre-start) plugins and "intialised" (post-start) objects.  That differentiation was something that was forced with monitoring plugin radios, because the "configured, uninitialized" state needs to be moved around over the network. And if I was going to re-implement this object model, I'd probably try to make that separation happen everywhere. But that's not how Parsl is right now.
| if self.checkpoint_period is None: | ||
| raise ConfigurationError("Checkpoint period must be specified with periodic checkpoint mode") | ||
| else: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the checkpoint_period attribute can be changed later?  In some sense "too bad," as this check is nominally not needed after instantiation or comes much later than would be ideal (i.e., "immediately").
Meanwhile, stylistically, given the [no-return] semantic of the raise, could remove the else: and unindent the block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably weird to change it at all, and the validation can go into the __init__ of Memoizer I think. But this PR was trying to move blocks as big as possible and the subsequent timer start can't happen until start time.
| except Exception: | ||
| raise ConfigurationError("invalid checkpoint_period provided: {0} expected HH:MM:SS".format(self.checkpoint_period)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any sense in raise ... from ...?  Given the not None check and typing, .split() is guaranteed to work, and the only failures would be an invalid int or not enough items for the destructure.  (If interested, could also imit the number of splits to 2.)
It would be nice to ensure that the this field is valid much earlier. Perhaps in a setter-style setup?
This should not change behaviour and only move code around and add in more wiring.
This is working towards all of checkpointing/memoizer code being plugable, with the DFK having no knowledge of the specifics - for example, the DFK no longer imports
pickle, because that is now the business of the memoizer.Because of that goal, the parameters for Memoizer are arranged in two ways:
Constructor parameters for Memoizer are parameters that the future user will supply at configuration when Memoizer is exposed as a user plugin, like other plugins.
DFK internals that should be injected into the (eventually pluggable) memoizer (small m) are set as startup-time attributes, as for other plugins.
Right now these two things happen right next to each other, but the Memoizer constructor call will move into user configuration space in a subsequent PR.
As before this PR, there is still a separation between "checkpointing" and "memoization" that is slightly artificial. Subsequent PRs will merge these actions together more in this Memoizer implementation.
This PR preserves a user-facing dfk.checkpoint() call, that becomes a passthrough to Memoizer.checkpoint(). I think that is likely to disappear in a future PR: manual checkpoint triggering will become a feature (or non-feature) of a specific memoizer implementation and so a method directly on that memoizer (or not).
To go alongside the existing update_memo call, called by the DFK when a task is ready for memoization, this PR adds update_checkpoint, which captures the slightly different notion of a task being ready for checkpointing -- the current implementation can memoize a task that is not yet complete because it memoizes the future, not the result, while a checkpoint needs the actual result to be ready and so must be called later. This latter call happens in the race-condition-prone handle_app_update. A later PR will remove this distinction and move everything to around the same place as update_memo. See PR #3979 for description of fixing a related race condition related to update_memo.
See PR #3535 for broader context.
Changed Behaviour
none
Type of change