Skip to content

Conversation

@ali-ramadhan
Copy link
Member

@ali-ramadhan ali-ramadhan commented Oct 30, 2025

This PR refactors how the Checkpointer works by now checkpointing simulations, rather than just models. This is needed as the simulations (+ output writers, callbacks, etc.) all contain crucial information needed to properly restore/pickup a simulation and continue time stepping.

Basic design idea:

  • We now have two new functions: prognostic_state(obj) which returns a named tuple corresponding to the prognostic state of obj and restore_prognostic_state!(obj, state) which restores obj based on information contained in state (which is a named tuple and is read from a checkpoint file).
  • Objects are checkpointed recursively by serializing prognostic information to the JLD2 checkpoint file.
  • The goal is for checkpointing to be flexible enough that we can very easily use it for different types of simulations, e.g. coupled simulations in ClimaOcean.jl by just defining prognostic_state and restore_prognostic_state!.

Right now I've only implemented proper checkpointing for non-hydrostatic model but it looks like it'll be straightforward to do it for hydrostatic and shallow water models. I'm working on adding comprehensive testing too.

Will continue working on this PR, but any feedback is very welcome!

Resolves #1249
Resolves #3670
Resolves #3845
Resolves #4857


Rhetorical aside

In general, the checkpointer is assuming that the simulation setup is the same. So only prognostic state information that changes will be checkpointed (e.g. field data, TimeInterval.actuations, etc.). The approach I have been taking (based on #4857) is to only checkpoint the prognostic state.

Should we operate under this assumption? I think so because not doing so can lead to a lot of undefined behavior. The checkpointer should not be responsible for checking that you set up the same simulation as the one that was checkpointed.

For example, take the SpecifiedTimes schedule. It has two properties times and previous_actuation. Since previous_actuation changes as the simulation runs, only previous_actuation needs to be checkpointed.

This leads to the possibility of the user changing times then picking up previous_actuation which can lead to undefined behavior. I think this is fine, because the checkpointer only works assuming you set up the same simulation as the one that was checkpointed.

Checkpointing both times and previous_actuation allows us to check that times is the same when restoring. But I don't think this is the checkpointer's responsibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

3 participants