Commit d2f3956

Merge pull request #1179 from CliMA/gb/checkpoint3
Support identical restarts with JLD2 files
2 parents a62d04e + 528d5bc commit d2f3956

27 files changed: 899 additions and 199 deletions

.buildkite/pipeline.yml

Lines changed: 23 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -67,16 +67,6 @@ steps:
6767
- group: "Unit Tests"
6868
steps:
6969

70-
- label: "MPI Checkpointer unit tests"
71-
key: "checkpointer_mpi_tests"
72-
command: "srun julia --color=yes --project=test/ test/mpi_tests/checkpointer_mpi_tests.jl"
73-
timeout_in_minutes: 20
74-
env:
75-
CLIMACOMMS_CONTEXT: "MPI"
76-
agents:
77-
slurm_ntasks: 2
78-
slurm_mem: 16GB
79-
8070
- label: "MPI Utilities unit tests"
8171
key: "utilities_mpi_tests"
8272
command: "srun julia --color=yes --project=test/ test/utilities_tests.jl"
@@ -97,6 +87,7 @@ steps:
9787
agents:
9888
slurm_ntasks: 1
9989
slurm_gres: "gpu:1"
90+
slurm_mem: 24GB
10091

10192
- group: "GPU: experiments/ClimaEarth/ unit tests and global bucket"
10293
steps:
@@ -109,6 +100,27 @@ steps:
109100
slurm_gres: "gpu:1"
110101
slurm_mem: 20GB
111102

103+
- group: "ClimaEarth test"
104+
steps:
105+
- label: "ClimaEarth test"
106+
key: "restarts"
107+
command: "julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/runtests.jl"
108+
agents:
109+
slurm_mem: 16GB
110+
111+
- label: "MPI restarts"
112+
key: "mpi_restarts"
113+
command: "srun julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/restart.jl"
114+
env:
115+
CLIMACOMMS_CONTEXT: "MPI"
116+
timeout_in_minutes: 40
117+
soft_fail:
118+
- exit_status: -1
119+
- exit_status: 255
120+
agents:
121+
slurm_ntasks: 2
122+
slurm_mem: 32G
123+
112124
- group: "Integration Tests"
113125
steps:
114126
# SLABPLANET EXPERIMENTS
@@ -218,7 +230,7 @@ steps:
218230
CLIMACOMMS_CONTEXT: "MPI"
219231
agents:
220232
slurm_ntasks: 4
221-
slurm_mem_per_cpu: 8GB
233+
slurm_mem_per_cpu: 12GB
222234

223235
# short high-res performance test
224236
- label: "Unthreaded AMIP FINE" # also reported by longruns with a flame graph

NEWS.md

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,20 @@ TOA radiation and net precipitation are added only if conservation is enabled.
2121
The coupler fields are also now stored as a ClimaCore Field of NamedTuples,
2222
rather than as a NamedTuple of ClimaCore Fields.
2323

24+
#### Restart simulations with JLD2 files PR[#1179](https://github.com/CliMA/ClimaCoupler.jl/pull/1179)
25+
26+
`ClimaCoupler` can now use `JLD2` files to save state and cache for its model
27+
components, allowing it to restart from saved checkpoints. Some restrictions
28+
apply:
29+
30+
- The number of MPI processes has to remain the same across checkpoints
31+
- Restart files are generally not portable across machines, Julia versions, and package versions
32+
- Adding/changing new component models will probably require adding/changing code
33+
34+
Please refer to the
35+
[documentation](https://clima.github.io/ClimaCoupler.jl/dev/checkpointer/) for
36+
more information.
37+
2438
#### Remove extra `get_field` functions PR[#1203](https://github.com/CliMA/ClimaCoupler.jl/pull/1203)
2539
Removes the `get_field` functions for `air_density` for all models, which
2640
were unused except for the `BucketSimulation` method, which is replaced by a

Project.toml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ ClimaComms = "3a4d1b5c-c61d-41fd-a00a-5873ba7a1b0d"
88
ClimaCore = "d414da3d-4745-48bb-8d80-42e94e092884"
99
ClimaUtilities = "b3f4f4ca-9299-4f7f-bd9b-81e1242a7513"
1010
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
11+
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
1112
Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
1213
SciMLBase = "0bca4576-84f4-4d90-8ffe-ffa030f20462"
1314
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
@@ -16,9 +17,10 @@ Thermodynamics = "b60c26fb-14c3-4610-9d3e-2d17fe7ff00c"
1617

1718
[compat]
1819
ClimaComms = "0.6.2"
19-
ClimaCore = "0.14.23"
20+
ClimaCore = "0.14.25"
2021
ClimaUtilities = "0.1.22"
2122
Dates = "1"
23+
JLD2 = "0.5.11"
2224
Logging = "1"
2325
SciMLBase = "2.11"
2426
StaticArrays = "1.6"

docs/src/checkpointer.md

Lines changed: 128 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,137 @@
11
# Checkpointer
22

3-
This module contains general functions for logging the model states and restarting simulations. The `Checkpointer` uses `ClimaCore.InputOutput` infrastructure, which allows it to handle arbitrarily distributed logging and restart setups.
3+
## How to save and restart from checkpoints
4+
5+
`ClimaCoupler` supports saving and reading simulation checkpoints. This is
6+
useful to split a long simulation into smaller, more manageable chunks.
7+
8+
Checkpoints are a mix of HDF5 and JLD2 files and are typically saved in a
9+
`checkpoints` folder in the simulation output. See
10+
[`Utilities.setup_output_dirs`](@ref) for more information.
11+
12+
!!! note "Known limitations"
13+
14+
- The number of MPI processes has to remain the same across checkpoints
15+
- Restart files are generally not portable across machines, Julia versions, and package versions
16+
- Adding/changing new component models will probably require adding/changing code
17+
18+
### Saving checkpoints
19+
20+
If you are running a model (such as AMIP), chances are that you can enable
21+
checkpointing just by setting a command-line argument: the `checkpoint_dt`
22+
option controls how frequently a checkpoint should be produced.
23+
24+
If your model does not come with this option already, you can checkpoint the
25+
simulation by adding a callback that calls the
26+
[`Checkpointer.checkpoint_sims`](@ref) function.
27+
28+
For example, to add a callback to checkpoint every hour of simulated time,
29+
assuming you have a `start_date`:
30+
```julia
31+
import Dates
32+
33+
import ClimaCoupler: Checkpointer, TimeManager
34+
import ClimaDiagnostics.Schedules: EveryCalendarDtSchedule
35+
36+
schedule = EveryCalendarDtSchedule(Dates.Hour(1); start_date)
37+
checkpoint_callback = TimeManager.Callback(schedule, Checkpointer.checkpoint_sims)
38+
39+
# In the coupling loop:
40+
TimeManager.maybe_trigger_callback(checkpoint_callback, coupled_simulation, time)
41+
```
42+
43+
### Reading checkpoints
44+
45+
There are two ways to restart a simulation from checkpoints. By default,
46+
`ClimaCoupler` tries to find suitable checkpoints and automatically uses them.
47+
Alternatively, you can specify a directory `restart_dir` and a simulation time
48+
`restart_t` and restart from files saved in the given directory at the given
49+
time. If the model you are running supports writing checkpoints via command-line
50+
argument, it will probably also support reading them. In this case, the
51+
arguments `restart_dir` and `restart_t` identify the path of the top level
52+
directory containing all the checkpoint files and the simulated time in seconds.
53+
54+
If the model does not support directly reading a checkpoint, the `Checkpointer`
55+
module provides a straightforward way to add this feature.
56+
[`Checkpointer.restart!`](@ref) takes a coupled simulation, a `restart_dir`, and
57+
a `restart_t` and overwrites the content of the coupled simulation with what is
58+
in the checkpoint.
59+
60+
## Developer notes
61+
62+
In theory, the state of the component models should fully determine the state of
63+
the coupled simulation and one should be able to restart a coupled simulation
64+
just by using the states of the component models. Unfortunately, this is
65+
currently not the case in `ClimaCoupler`. The main reason for this is the
66+
complex interdependencies between component models and within `ClimaAtmos` which
67+
make the initialization step inconsistent. For example, in a coupled simulation,
68+
the surface albedo should be determined by the surface models and used by the
69+
atmospheric model for radiative transfer, but `ClimaAtmos` also tries to set the
70+
surface albedo (since it has to do so when run in standalone mode). In addition
71+
to this, `ClimaAtmos` has a large cache that has internal interdependencies that
72+
are hard to disentangle, and changing a field might require changing some other
73+
field in a different part of the cache. As a result, it is not easy for
74+
`ClimaCoupler` to consistently do initialization from a cold state. To conclude,
75+
restarting a simulation exclusively using the states of the component models is
76+
currently impossible.
77+
78+
Given that restarting a simulation from the state is impossible, `ClimaCoupler`
79+
needs to save the states and the caches. Let us review how we use
80+
`ClimaCore.InputOutput` and the `JLD2` package to accomplish this.
81+
82+
`ClimaCore.InputOutput` provides a lossless way to save the content of certain
83+
`ClimaCore` objects to HDF5 files. Objects saved in this way are not tied to a
84+
particular computing device or configuration. When running with MPI,
85+
`ClimaCore.InputOutput` also writes them efficiently in parallel.
86+
87+
Unfortunately, `ClimaCore.InputOutput` only supports certain objects, such as
88+
`Field`s and `Space`s, but the cache in component models is more complex than
89+
this and contains complex objects with highly stateful quantities (e.g., C
90+
pointers). Because of this, model states are saved to HDF5 but caches must be
91+
saved to JLD2 files.
92+
93+
`JLD2` allows us to save more complex objects without writing specific
94+
serialization methods for every struct. `JLD2` allows us to take a big step
95+
forward, but there are still several challenges that need to be solved:
96+
1. `JLD2` does not support CUDA natively. To work around this, we have to move
97+
everything onto the CPU first. Then, when the data is read back, we have to
98+
move it back to the GPU.
99+
2. `JLD2` does not support MPI natively. To work around this, each process writes
100+
its `jld2` checkpoint and reads it back. This introduces the constraint that
101+
the number of MPI processes cannot change across restarts.
102+
3. Some quantities are best not saved and read (for example, anything with
103+
pointers). For this, we write a recursive function that traverses the cache
104+
and only restores quantities of a certain type (typically, `ClimaCore`
105+
objects)
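
As a rough illustration of point 2, the sketch below writes and reads one `JLD2` file per process. The helper names, the file naming scheme, and the integer `pid` are assumptions made for this example and are not taken from `ClimaCoupler`. Only `JLD2.jldsave` and `JLD2.jldopen` are real `JLD2` calls.

```julia
import JLD2

# Hypothetical helper: one JLD2 file per MPI process, keyed by a process id.
checkpoint_path(dir, pid) = joinpath(dir, "cache_pid$(pid).jld2")

# Save the (CPU-resident) cache of this process under the key "cache".
function save_cache_sketch(dir, pid, cache)
    JLD2.jldsave(checkpoint_path(dir, pid); cache)
    return nothing
end

# Read back the cache written by the process with the same id.
function load_cache_sketch(dir, pid)
    return JLD2.jldopen(checkpoint_path(dir, pid), "r") do file
        file["cache"]
    end
end

dir = mktempdir()
pid = 1  # assumption: in an MPI run this would come from the communication context
save_cache_sketch(dir, pid, (; temperature = [280.0, 285.0]))
restored = load_cache_sketch(dir, pid)
```

Because each file is tied to a process id, a restart with a different number of processes would find either missing or unused files, which is the constraint described in point 2.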
106+
107+
Point 3 adds a significant amount of code and requires component models to
108+
specify how their cache has to be restored.
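
The recursive traversal mentioned in point 3 can be sketched in plain Julia. This is a minimal illustration under stated assumptions, not the `Checkpointer` implementation: `restore_sketch!` and `ToyCache` are hypothetical names, and numeric arrays stand in for the `ClimaCore` objects that would be restored in practice.

```julia
# Hypothetical sketch: walk two objects with the same structure and copy data
# only for "restorable" leaves (here, numeric arrays). Everything else, such as
# handles or pointers, is left untouched in `current`.
function restore_sketch!(current, saved)
    if current isa AbstractArray{<:Number} && saved isa AbstractArray{<:Number}
        copyto!(current, saved)  # restorable leaf: overwrite in place
    elseif typeof(current) == typeof(saved) && isstructtype(typeof(current))
        # Composite value: recurse into every field.
        for name in fieldnames(typeof(current))
            restore_sketch!(getfield(current, name), getfield(saved, name))
        end
    end
    return nothing
end

struct ToyCache
    temperature::Vector{Float64}
    label::String  # stand-in for state that must not be restored
end

current = ToyCache(zeros(3), "live")
saved = ToyCache([1.0, 2.0, 3.0], "stale")
restore_sketch!(current, saved)
```

After the call, `current.temperature` holds the saved values while `current.label` is unchanged. The sketch does not guard against cyclic references, which a real cache traversal would have to handle.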
109+
110+
If you are adding a component model, you have to extend the
111+
```
112+
Checkpointer.get_model_prog_state
113+
Checkpointer.get_model_cache
114+
Checkpointer.restore_cache!
115+
```
116+
methods.
117+
118+
`ClimaCoupler` moves objects to the CPU with `Adapt.adapt(Array, x)`. `Adapt`
119+
traverses the object recursively, and proper `Adapt` methods have to be defined
120+
for every object involved in the chain. The easiest way to do this is using the
121+
`Adapt.@adapt_structure` macro, which defines a recursive `Adapt` method for the given
122+
object.
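
A minimal sketch of this pattern, assuming a hypothetical parametric struct `AdaptableCache` that is not part of `ClimaCoupler`:

```julia
import Adapt

# Hypothetical cache type; the array type is a parameter so that the same
# struct can hold either a GPU array or a CPU `Array`.
struct AdaptableCache{A}
    temperature::A
end

# Define how `Adapt` rebuilds the struct after adapting each field.
Adapt.@adapt_structure AdaptableCache

cache = AdaptableCache(Float32[1, 2, 3])  # on a GPU run this could hold a GPU array
cpu_cache = Adapt.adapt(Array, cache)     # every field is converted to a CPU `Array`
```

On a CPU-only machine the call returns an equivalent object backed by `Array`. A struct whose fields are not parametric may need a hand-written `Adapt.adapt_structure` method instead of the macro.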
123+
124+
Types to watch for:
125+
- `MPI` related objects (e.g., `MPICommsContext`)
126+
- `TimeVaryingInputs` (because they contain `NCDatasets`, which contain pointers
127+
to files)
4128

5129
## Checkpointer API
6130

7131
```@docs
8132
ClimaCoupler.Checkpointer.get_model_prog_state
9-
ClimaCoupler.Checkpointer.restart_model_state!
10-
ClimaCoupler.Checkpointer.checkpoint_model_state
133+
ClimaCoupler.Checkpointer.get_model_cache
134+
ClimaCoupler.Checkpointer.restart!
11135
ClimaCoupler.Checkpointer.checkpoint_sims
136+
ClimaCoupler.Checkpointer.t_start_from_checkpoint
12137
```
