|
1 | 1 | # Checkpointer
|
2 | 2 |
|
3 |
| -This module contains general functions for logging the model states and restarting simulations. The `Checkpointer` uses `ClimaCore.InputOutput` infrastructure, which allows it to handle arbitrarily distributed logging and restart setups. |
| 3 | +## How to save and restart from checkpoints |
| 4 | + |
| 5 | +`ClimaCoupler` supports saving and reading simulation checkpoints. This is |
| 6 | +useful to split a long simulation into smaller, more manageable chunks. |
| 7 | + |
| 8 | +Checkpoints are a mix of HDF5 and JLD2 files and are typically saved in a |
| 9 | +`checkpoints` folder in the simulation output. See |
| 10 | +[`Utilities.setup_output_dirs`](@ref) for more information. |
| 11 | + |
| 12 | +!!! note "Known limitations" |
| 13 | + |
| 14 | + - The number of MPI processes has to remain the same across checkpoints |
| 15 | + - Restart files are generally not portable across machines, Julia versions, and package versions |
| 16 | + - Adding or changing component models will probably require adding or changing checkpointing code |
| 17 | + |
| 18 | +### Saving checkpoints |
| 19 | + |
| 20 | +If you are running a model (such as AMIP), chances are that you can enable |
| 21 | +checkpointing just by setting a command-line argument: the `checkpoint_dt` |
| 22 | +option controls how frequently a checkpoint should be produced. |
| 23 | + |
| 24 | +If your model does not come with this option already, you can checkpoint the |
| 25 | +simulation by adding a callback that calls the |
| 26 | +[`Checkpointer.checkpoint_sims`](@ref) function. |
| 27 | + |
| 28 | +For example, to add a callback to checkpoint every hour of simulated time, |
| 29 | +assuming you have a `start_date`: |
| 30 | +```julia |
| 31 | +import Dates |
| 32 | + |
| 33 | +import ClimaCoupler: Checkpointer, TimeManager |
| 34 | +import ClimaDiagnostics.Schedules: EveryCalendarDtSchedule |
| 35 | + |
| 36 | +schedule_checkpoint = EveryCalendarDtSchedule(Dates.Hour(1); start_date) |
| 37 | +checkpoint_callback = TimeManager.Callback(schedule_checkpoint, Checkpointer.checkpoint_sims) |
| 38 | + |
| 39 | +# In the coupling loop: |
| 40 | +TimeManager.maybe_trigger_callback(checkpoint_callback, coupled_simulation, time) |
| 41 | +``` |
| 42 | + |
| 43 | +### Reading checkpoints |
| 44 | + |
| 45 | +There are two ways to restart a simulation from checkpoints. By default, |
| 46 | +`ClimaCoupler` tries to find suitable checkpoints and automatically uses them. |
| 47 | +Alternatively, you can specify a directory `restart_dir` and a simulation time |
| 48 | +`restart_t` and restart from files saved in the given directory at the given |
| 49 | +time. If the model you are running supports writing checkpoints via command-line |
| 50 | +argument, it will probably also support reading them. In this case, the |
| 51 | +arguments `restart_dir` and `restart_t` identify the path of the top-level |
| 52 | +directory containing all the checkpoint files and the simulated time in seconds. |
| 53 | + |
| 54 | +If the model does not support directly reading a checkpoint, the `Checkpointer` |
| 55 | +module provides a straightforward way to add this feature. |
| 56 | +[`Checkpointer.restart!`](@ref) takes a coupled simulation, a `restart_dir`, and |
| 57 | +a `restart_t` and overwrites the content of the coupled simulation with what is |
| 58 | +in the checkpoint. |
| 59 | + |
| 60 | +## Developer notes |
| 61 | + |
| 62 | +In theory, the state of the component models should fully determine the state of |
| 63 | +the coupled simulation and one should be able to restart a coupled simulation |
| 64 | +just by using the states of the component models. Unfortunately, this is |
| 65 | +currently not the case in `ClimaCoupler`. The main reason for this is the |
| 66 | +complex interdependencies between component models and within `ClimaAtmos` which |
| 67 | +make the initialization step inconsistent. For example, in a coupled simulation, |
| 68 | +the surface albedo should be determined by the surface models and used by the |
| 69 | +atmospheric model for radiation transfer, but `ClimaAtmos` also tries to set the |
| 70 | +surface albedo (since it has to do so when run in standalone mode). In addition |
| 71 | +to this, `ClimaAtmos` has a large cache that has internal interdependencies that |
| 72 | +are hard to disentangle, and changing a field might require changing some other |
| 73 | +field in a different part of the cache. As a result, it is not easy for |
| 74 | +`ClimaCoupler` to consistently do initialization from a cold state. To conclude, |
| 75 | +restarting a simulation exclusively using the states of the component models is |
| 76 | +currently impossible. |
| 77 | + |
| 78 | +Given that restarting a simulation from the state is impossible, `ClimaCoupler` |
| 79 | +needs to save the states and the caches. Let us review how we use |
| 80 | +`ClimaCore.InputOutput` and the `JLD2` package to accomplish this. |
| 81 | + |
| 82 | +`ClimaCore.InputOutput` provides a lossless way to save the content of certain |
| 83 | +`ClimaCore` objects to HDF5 files. Objects saved in this way are not tied to a |
| 84 | +particular computing device or configuration. When running with MPI, |
| 85 | +`ClimaCore.InputOutput` files are also written efficiently in parallel. |
| 86 | + |
| 87 | +Unfortunately, `ClimaCore.InputOutput` only supports certain objects, such as |
| 88 | +`Field`s and `Space`s, but the cache in component models is more complex than |
| 89 | +this and contains complex objects with highly stateful quantities (e.g., C |
| 90 | +pointers). Because of this, model states are saved to HDF5 but caches must be |
| 91 | +saved to JLD2 files. |
| 92 | + |
| 93 | +`JLD2` allows us to save more complex objects without writing specific |
| 94 | +serialization methods for every struct. `JLD2` allows us to take a big step |
| 95 | +forward, but there are still several challenges that need to be solved: |
| 96 | +1. `JLD2` does not support CUDA natively. To work around this, we have to move |
| 97 | + everything onto the CPU first. Then, when the data is read back, we have to |
| 98 | + move it back to the GPU. |
| 99 | +2. `JLD2` does not support MPI natively. To work around this, each process writes |
| 100 | + its own `jld2` checkpoint and reads it back. This introduces the constraint that |
| 101 | + the number of MPI processes cannot change across restarts. |
| 102 | +3. Some quantities are best not saved and read (for example, anything with |
| 103 | + pointers). For this, we write a recursive function that traverses the cache |
| 104 | + and only restores quantities of a certain type (typically, `ClimaCore` |
| 105 | + objects). |
| 106 | + |
| 107 | +Point 3 adds a significant amount of code and requires component models to |
| 108 | +specify how their cache has to be restored. |
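The selective restore described in point 3 can be illustrated with a minimal, self-contained sketch. It is not the actual `Checkpointer.restore_cache!` implementation: `ToyCache`, `FileHandle`, and `restore_arrays!` are hypothetical names, and plain numeric arrays stand in for `ClimaCore` objects. The idea is to walk the live cache and the saved cache in lockstep, copy only the data of a chosen type, and leave everything else (here, a pretend file handle) untouched.

```julia
# Hypothetical stand-in for an object that should NOT be restored
# (for example, something wrapping a C pointer or an open file).
struct FileHandle
    path::String
end

# Hypothetical toy cache mixing restorable data with a stateful handle.
struct ToyCache
    albedo::Vector{Float64}
    fluxes::NamedTuple
    handle::FileHandle
end

# Recursively copy numeric array data from `saved` into `live`.
# Anything that is not a numeric array is either traversed field by field
# or, if it has no fields (strings, numbers), left as it is.
function restore_arrays!(live, saved)
    if live isa AbstractArray
        # Only numeric arrays are restored; other arrays are skipped
        eltype(live) <: Number && (live .= saved)
    else
        for name in fieldnames(typeof(live))
            restore_arrays!(getfield(live, name), getfield(saved, name))
        end
    end
    return nothing
end

live = ToyCache(zeros(3), (; sensible = zeros(2)), FileHandle("live.nc"))
saved = ToyCache([0.1, 0.2, 0.3], (; sensible = [5.0, 6.0]), FileHandle("stale.nc"))

restore_arrays!(live, saved)
```

After the call, `live.albedo` and `live.fluxes.sensible` hold the saved values, while `live.handle` still points at `"live.nc"`. A real implementation would dispatch on `ClimaCore` types instead of `AbstractArray` and would need per-model rules for the fields that must be skipped.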
| 109 | + |
| 110 | +If you are adding a component model, you have to extend the |
| 111 | +``` |
| 112 | +Checkpointer.get_model_prog_state |
| 113 | +Checkpointer.get_model_cache |
| 114 | +Checkpointer.restore_cache! |
| 115 | +``` |
| 116 | +methods. |
| 117 | + |
| 118 | +`ClimaCoupler` moves objects to the CPU with `Adapt.adapt(Array, x)`. `Adapt` |
| 119 | +traverses the object recursively, and proper `Adapt` methods have to be defined |
| 120 | +for every object involved in the chain. The easiest way to do this is using the |
| 121 | +`Adapt.@adapt_structure` macro, which defines a recursive Adapt for the given |
| 122 | +object. |
| 123 | + |
| 124 | +Types to watch for: |
| 125 | +- `MPI`-related objects (e.g., `MPICommsContext`) |
| 126 | +- `TimeVaryingInputs` (because they contain `NCDatasets`, which contain pointers |
| 127 | + to files) |
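As a rough sketch of the `Adapt` mechanism described above, the following example defines a recursive adapt rule for a toy struct and converts it with `Adapt.adapt(Array, x)`. `ToyAtmosCache` is a hypothetical type used only for illustration, and the fields are already CPU arrays so that the example runs without a GPU; in a GPU run they would be device arrays and the same call would return CPU copies.

```julia
import Adapt

# Hypothetical cache type; the type parameters let the adapted copy
# hold a different array type than the original.
struct ToyAtmosCache{A, B}
    temperature::A
    surface::B
end

# Generates Adapt.adapt_structure(to, ::ToyAtmosCache), which adapts each
# field in turn and rebuilds the struct from the results.
Adapt.@adapt_structure ToyAtmosCache

cache = ToyAtmosCache([280.0, 285.0, 290.0], (; albedo = [0.3, 0.4]))

# Tuples and NamedTuples are traversed by Adapt itself, so the nested
# `surface.albedo` array is reached without any extra method.
cpu_cache = Adapt.adapt(Array, cache)
```

Without the `@adapt_structure` line, `Adapt.adapt` would return `cache` unchanged, because `Adapt` does not know how to look inside a user-defined struct. This is why every type in the chain needs its own rule.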
4 | 128 |
|
5 | 129 | ## Checkpointer API
|
6 | 130 |
|
7 | 131 | ```@docs
|
8 | 132 | ClimaCoupler.Checkpointer.get_model_prog_state
|
9 |
| - ClimaCoupler.Checkpointer.restart_model_state! |
10 |
| - ClimaCoupler.Checkpointer.checkpoint_model_state |
| 133 | + ClimaCoupler.Checkpointer.get_model_cache |
| 134 | + ClimaCoupler.Checkpointer.restart! |
11 | 135 | ClimaCoupler.Checkpointer.checkpoint_sims
|
| 136 | + ClimaCoupler.Checkpointer.t_start_from_checkpoint |
12 | 137 | ```
|