Checkpointing simulations #4892
Draft
This PR refactors how the `Checkpointer` works: it now checkpoints simulations rather than just models. This is needed because the simulation (plus its output writers, callbacks, etc.) contains crucial information needed to properly restore (pick up) a simulation and continue time stepping.

Basic design idea:

- `prognostic_state(obj)`, which returns a named tuple corresponding to the prognostic state of `obj`, and `restore_prognostic_state!(obj, state)`, which restores `obj` based on the information contained in `state` (a named tuple read from a checkpoint file).
- … `prognostic_state` and `restore_prognostic_state!`.

Right now I've only implemented proper checkpointing for the non-hydrostatic model, but it looks like it'll be straightforward to do the same for the hydrostatic and shallow water models. I'm working on adding comprehensive testing too.
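To make the `prognostic_state` / `restore_prognostic_state!` pair concrete, here is a minimal, self-contained Julia sketch. It does not use Oceananigans: `ToyClock`, `ToyModel`, `ToyTimeInterval`, `ToySimulation` and `new_simulation` are hypothetical stand-ins, and the `Serialization` standard library is only a placeholder for whatever file format the real `Checkpointer` writes. It illustrates two ideas: each object returns only the state that changes during a run as a named tuple, and a simulation's state nests the states of its components, so restoring walks the same tree into an already-constructed object.

```julia
using Serialization  # stdlib, used here as a stand-in for the real checkpoint file format

# Hypothetical toy types for illustration only; these are not Oceananigans types.
mutable struct ToyClock
    time::Float64
    iteration::Int
end

struct ToyModel
    clock::ToyClock
    u::Vector{Float64}  # stands in for field data (changes during a run)
    Δt::Float64         # setup parameter (not checkpointed)
end

mutable struct ToyTimeInterval
    interval::Float64   # setup parameter (not checkpointed)
    actuations::Int     # changes during a run (checkpointed)
end

struct ToySimulation
    model::ToyModel
    schedule::ToyTimeInterval  # e.g. the schedule of an output writer or callback
end

# prognostic_state: return only the state that changes during a run, as a NamedTuple.
prognostic_state(c::ToyClock) = (time = c.time, iteration = c.iteration)
prognostic_state(m::ToyModel) = (clock = prognostic_state(m.clock), u = copy(m.u))
prognostic_state(s::ToyTimeInterval) = (actuations = s.actuations,)
prognostic_state(sim::ToySimulation) =
    (model = prognostic_state(sim.model), schedule = prognostic_state(sim.schedule))

# restore_prognostic_state!: overwrite an already-constructed object in place.
function restore_prognostic_state!(c::ToyClock, state)
    c.time = state.time
    c.iteration = state.iteration
    return c
end

function restore_prognostic_state!(m::ToyModel, state)
    restore_prognostic_state!(m.clock, state.clock)
    m.u .= state.u
    return m
end

function restore_prognostic_state!(s::ToyTimeInterval, state)
    s.actuations = state.actuations
    return s
end

# A simulation restores by recursing into each component's part of the NamedTuple.
function restore_prognostic_state!(sim::ToySimulation, state)
    restore_prognostic_state!(sim.model, state.model)
    restore_prognostic_state!(sim.schedule, state.schedule)
    return sim
end

# The user-facing setup; calling it twice builds two identical simulations.
new_simulation() = ToySimulation(ToyModel(ToyClock(0.0, 0), zeros(4), 0.1),
                                 ToyTimeInterval(1.0, 0))

sim = new_simulation()
sim.model.clock.time = 2.5            # pretend we time stepped for a while
sim.model.clock.iteration = 25
sim.model.u .= [1.0, 2.0, 3.0, 4.0]
sim.schedule.actuations = 2

io = IOBuffer()
serialize(io, prognostic_state(sim))  # "write the checkpoint"

restored = new_simulation()           # rebuild the same setup...
seekstart(io)
restore_prognostic_state!(restored, deserialize(io))  # ...then pick up
```

The setup parameters (`Δt`, `interval`) never enter the checkpoint; they come from calling `new_simulation()` again, so the sketch relies on the user rebuilding the same setup before restoring.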
Will continue working on this PR, but any feedback is very welcome!
Resolves #1249
Resolves #3670
Resolves #3845
Resolves #4857
Rhetorical aside
In general, the checkpointer assumes that the simulation setup is the same when a checkpoint is restored. So only prognostic state information that changes during a run is checkpointed (e.g. field data, `TimeInterval.actuations`, etc.). The approach I have been taking (based on #4857) is to checkpoint only the prognostic state.

Should we operate under this assumption? I think so, because not doing so can lead to a lot of undefined behavior. The checkpointer should not be responsible for checking that you set up the same simulation as the one that was checkpointed.
For example, take the `SpecifiedTimes` schedule. It has two properties, `times` and `previous_actuation`. Since only `previous_actuation` changes as the simulation runs, only `previous_actuation` needs to be checkpointed.

This leaves open the possibility that the user changes `times` and then picks up `previous_actuation`, which can lead to undefined behavior. I think this is fine, because the checkpointer only works under the assumption that you set up the same simulation as the one that was checkpointed.

Checkpointing both `times` and `previous_actuation` would allow us to check that `times` is the same when restoring, but I don't think that is the checkpointer's responsibility.
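The trade-off can be illustrated with a hypothetical `ToySpecifiedTimes` schedule. This is a sketch assuming a simple "fire once the next listed time is reached" rule, not the actual Oceananigans `SpecifiedTimes` implementation, and `prognostic_state` / `restore_prognostic_state!` here are toy methods for this type only. Only `previous_actuation` is checkpointed: restoring into the same setup resumes correctly, while restoring into a schedule with different `times` succeeds silently but produces a meaningless result.

```julia
# Hypothetical stand-in for a SpecifiedTimes-style schedule; not the Oceananigans type.
mutable struct ToySpecifiedTimes
    times::Vector{Float64}   # setup: chosen by the user, not checkpointed
    previous_actuation::Int  # index of the last time that fired; changes during a run
end

ToySpecifiedTimes(times::AbstractVector) = ToySpecifiedTimes(sort(Float64.(times)), 0)

# Assumed rule: fire once the next listed time has been reached.
function (s::ToySpecifiedTimes)(t)
    n = s.previous_actuation + 1
    if n <= length(s.times) && t >= s.times[n]
        s.previous_actuation = n
        return true
    end
    return false
end

prognostic_state(s::ToySpecifiedTimes) = (previous_actuation = s.previous_actuation,)

function restore_prognostic_state!(s::ToySpecifiedTimes, state)
    # No check that s.times matches the checkpointed run: we assume the user
    # rebuilt the same schedule. A stricter variant would also checkpoint `times`
    # and compare it here.
    s.previous_actuation = state.previous_actuation
    return s
end

output_times = ToySpecifiedTimes([1.0, 2.0, 3.0])
fired = [output_times(t) for t in (0.5, 1.0, 1.5, 2.0)]  # [false, true, false, true]
state = prognostic_state(output_times)                  # (previous_actuation = 2,)

# Same setup: picks up where the run left off.
same = restore_prognostic_state!(ToySpecifiedTimes([1.0, 2.0, 3.0]), state)
a = same(2.5)  # false: the next time is 3.0
b = same(3.0)  # true

# Different setup: restoring "succeeds", but index 2 now refers to different times.
changed = restore_prognostic_state!(ToySpecifiedTimes([10.0, 20.0]), state)
late = changed(25.0)                           # false: both new times count as already fired
fresh = ToySpecifiedTimes([10.0, 20.0])(25.0)  # true: a fresh run would fire at t = 10
```

The sketch deliberately omits the stricter check, matching the position argued above; adding it would only require storing `times` in the named tuple and comparing it on restore.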