Skip to content

Commit be873ce

Browse files
committed
Add OnlineLogging
`OnlineLogging` is a new module (mostly based on ClimaAtmos `WallTimeEstimate`) that contains tools to provide informative messages while a simulation is running. At the moment, `OnlineLogging` only contains `WallTimeInfo`, to report on timing information. This new module will allow ClimaLand and ClimaCoupler to add some progress information, and remove some entries from the atmos cache.
1 parent 4229def commit be873ce

File tree

8 files changed

+435
-0
lines changed

8 files changed

+435
-0
lines changed

NEWS.md

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,30 @@ main
77
v0.1.20
88
------
99

10+
#### New module `OnlineLogging`. PR [#137](https://github.com/CliMA/ClimaUtilities.jl/pull/137)
11+
12+
A new module, `OnlineLogging`, provides tools to report messages while a
13+
simulation is running. Currently, the only feature implemented is
14+
`WallTimeInfo`, to report on timing information.
15+
16+
With `WallTimeInfo`, one can add reports like the following to their simulations
17+
```
18+
┌ Info: Progress
19+
│ simulation_time = "2 seconds"
20+
│ n_steps_completed = 20
21+
│ wall_time_per_step = "10 milliseconds, 100 microseconds"
22+
│ wall_time_total = "1 second, 10 milliseconds"
23+
│ wall_time_remaining = "808 milliseconds, 35 microseconds"
24+
│ wall_time_spent = "202 milliseconds, 8 microseconds"
25+
│ percent_complete = "20.0%"
26+
│ estimated_sypd = "0.027"
27+
│ date_now = 2024-12-03T16:01:13.660
28+
└ estimated_finish_date = 2024-12-03T16:01:14.660
29+
```
30+
31+
Find more information in the
32+
[documentation](https://clima.github.io/ClimaUtilities.jl/dev/onlinelogging.html).
33+
1034
#### Support reading time data across multiple files. PRs [#127](https://github.com/CliMA/ClimaUtilities.jl/pull/127), [#132](https://github.com/CliMA/ClimaUtilities.jl/pull/132)
1135

1236
`NCFileReader`s can now read multiple files at the same time. The files have to

README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,8 @@ ClimaUtilities.jl
2424
[`DataHandling`](https://clima.github.io/ClimaUtilities.jl/dev/datahandling/),
2525
and [`Regridders`](https://clima.github.io/ClimaUtilities.jl/dev/regridders/)
2626
to process input data and remap it onto the simulation grid.
27+
- [`OnlineLogging`](https://clima.github.io/ClimaUtilities.jl/dev/onlinelogging/)
28+
to add output information while a simulation is running.
2729
- [`OutputPathGenerator`](https://clima.github.io/ClimaUtilities.jl/dev/outputpathgenerator/)
2830
to prepare the output directory structure of a simulation.
2931
- [`TimeManager`](https://clima.github.io/ClimaUtilities.jl/dev/timemanager/) to

docs/make.jl

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@ pages = [
2323
"FileReaders" => "filereaders.md",
2424
"DataHandling" => "datahandling.md",
2525
"Regridders" => "regridders.md",
26+
"OnlineLogging" => "onlinelogging.md",
2627
"OutputPathGenerator" => "outputpathgenerator.md",
2728
"TimeManager" => "timemanager.md",
2829
"Frequently Asked Questions" => "faqs.md",

docs/src/index.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -60,6 +60,11 @@ conservative interpolation onto lat-long grids.
6060
At the moment, `TempestRegridder` *does not* support MPI/GPUs and can
6161
only perform interpolation onto lat-long grids (not on z).
6262

63+
### `OnlineLogging`
64+
65+
`OnlineLogging` provides tools to produce informative messages while a
66+
simulation is running (e.g., current/average timing information).
67+
6368
### `OutputPathGenerator`
6469

6570
`OutputPathGenerator` handles the directory structure for the output of a

docs/src/onlinelogging.md

Lines changed: 89 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,89 @@
1+
# Online Simulation Progress Reporting
2+
3+
The `OnlineLogging` module provides tools to monitor and report the simulation
4+
progress and other information of simulations.
5+
6+
Currently, the only feature implemented is timing report, to print information
7+
about current step, simulation time, and average performance. With
8+
`report_walltime`, you will see progress information like the following
9+
```
10+
┌ Info: Progress
11+
│ simulation_time = "2 seconds"
12+
│ n_steps_completed = 20
13+
│ wall_time_per_step = "10 milliseconds, 100 microseconds"
14+
│ wall_time_total = "1 second, 10 milliseconds"
15+
│ wall_time_remaining = "808 milliseconds, 35 microseconds"
16+
│ wall_time_spent = "202 milliseconds, 8 microseconds"
17+
│ percent_complete = "20.0%"
18+
│ estimated_sypd = "0.027"
19+
│ date_now = 2024-12-03T16:01:13.660
20+
└ estimated_finish_date = 2024-12-03T16:01:14.660
21+
```
22+
23+
## `WallTimeInfo` and `report_walltime`
24+
25+
`WallTimeInfo` is a struct that holds information about wall time (the time you
26+
see on your watch, not the simulation time) and that can be used to report
27+
timing information with [`report_walltime`](@ref).
28+
29+
`WallTimeInfo` keeps track and accumulates how much time has elapsed since the
30+
last time it was updated. In this, `WallTimeInfo` tries to automatically remove
31+
the compilation overhead that your simulation might run into in the first step
32+
(this is accomplished by ignoring the first step and doubling the cost of the
33+
second step to compensate).
34+
35+
The simplest way to use `WallTimeInfo` is to make `report_walltime` a callback.
36+
Here is an example using a `SciML` integrator
37+
```julia
38+
import SciMLBase
39+
import ClimaUtilities.OnlineLogging: WallTimeInfo, report_walltime
40+
# Create the WallTimeInfo
41+
wt = WallTimeInfo()
42+
43+
# Define a schedule that defines how often to report. Here, we follow the signature
44+
# required by SciML. This function is triggered every 10 steps
45+
function every10steps(u, t, integrator)
46+
return mod(integrator.step, 10) == 0
47+
end
48+
49+
# Next, define the callback
50+
report_callback = SciMLBase.DiscreteCallback(every10steps,
51+
let wt = wt
52+
integrator -> report_walltime(wt, integrator)
53+
end)
54+
55+
# The let wt = wt is not strickly required, but it can improve type-stability and performance
56+
57+
# Now that we have the callback, we can pass it to the SciML constructor for the integrator
58+
```
59+
60+
!!! todo
61+
62+
Describe schedules when we add them to `ClimaUtilities`
63+
64+
The `report_walltime` function prints various timing statistics:
65+
66+
* **`simulation_time`:** The elapsed time within the simulation.
67+
* **`n_steps_completed`:** The number of simulation steps completed.
68+
* **`wall_time_per_step`:** The average wall clock time per step.
69+
* **`wall_time_total`:** Estimated total wall clock time for the simulation.
70+
* **`wall_time_remaining`:** Estimated remaining wall clock time.
71+
* **`wall_time_spent`:** Total wall clock time already spent.
72+
* **`percent_complete`:** Percentage of the simulation completed.
73+
* **`estimated_sypd`:** Estimated simulated years per day (or days per day if
74+
the rate is slow).
75+
* **`date_now`:** The current date and time.
76+
* **`estimated_finish_date`:** The estimated date and time of completion.
77+
78+
!!! note
79+
80+
The estimated values (like `wall_time_remaining` and `estimated_sypd`) are
81+
based on the average wall time per step and can fluctuate, especially early
82+
in the simulation. They become more reliable as the simulation progresses.
83+
84+
## API
85+
86+
```@docs
87+
ClimaUtilities.OnlineLogging.WallTimeInfo
88+
ClimaUtilities.OnlineLogging.report_walltime
89+
```

src/ClimaUtilities.jl

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ include("TimeVaryingInputs.jl")
1212

1313
include("ClimaArtifacts.jl")
1414

15+
include("OnlineLogging.jl")
1516
include("OutputPathGenerator.jl")
1617

1718
end # module ClimaUtilities

src/OnlineLogging.jl

Lines changed: 200 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,200 @@
1+
module OnlineLogging
2+
3+
import Dates
4+
5+
"""
6+
WallTimeInfo
7+
8+
Hold information about wall time.
9+
"""
10+
struct WallTimeInfo
11+
"""How many times the `update!` function has been called on this object. `update!` is
12+
intended to be call at the end of every timestep with a callback. This field is
13+
primarily used to exclude compilation from the timings."""
14+
n_calls::Base.RefValue{Int}
15+
16+
"""Wall time of the previous call to update `WallTimeInfo`."""
17+
t_wall_last::Base.RefValue{Float64}
18+
19+
# We don't save t_wall_init, but ∑Δt_wall because we want to avoid including the
20+
# compilation time in here. update_progress_reporter! will skip the first couple of
21+
# steps
22+
"""Sum of elapsed walltime all the calls to `_update!`"""
23+
∑Δt_wall::Base.RefValue{Float64}
24+
function WallTimeInfo()
25+
n_calls = Ref(0)
26+
t_wall_last = Ref(-1.0)
27+
∑Δt_wall = Ref(0.0)
28+
return new(n_calls, t_wall_last, ∑Δt_wall)
29+
end
30+
end
31+
32+
"""
33+
_update!(wt::WallTimeInfo)
34+
35+
Update the timing information stored in the `WallTimeInfo` struct `wt`.
36+
37+
This function tracks the wall time elapsed since the last call to `update!`. It handles the
38+
initial calls specially to exclude compilation time from the overall timing measurements.
39+
"""
40+
function _update!(wt::WallTimeInfo)
41+
# - The very first call (when `n_calls == 0`), there's no elapsed times to report (and
42+
# this is called during initialization, before `step!` has been called).
43+
# - The second call (`n_calls == 1`) is after `step!` is called for the first time, but
44+
# we don't want to include this since it includes compilation time.
45+
# - In the third call (`n_calls == 2`), we account for 2Δt_wall to compensate for the first
46+
# call that we didn't include (to exclude compilation time).
47+
# - All the other calls are included without any special operation.
48+
49+
if wt.n_calls[] == 0 || wt.n_calls[] == 1
50+
Δt_wall = 0.0
51+
else
52+
# How much walltime elapsed since list time we called `update!`?
53+
Δt_wall = time() - wt.t_wall_last[]
54+
55+
# If this is our third call, we need to compensate for the one call we didn't
56+
# include to exclude compilation time
57+
wt.n_calls[] == 2 && (Δt_wall = 2Δt_wall)
58+
end
59+
60+
wt.n_calls[] += 1
61+
wt.t_wall_last[] = time()
62+
wt.∑Δt_wall[] += Δt_wall
63+
return nothing
64+
end
65+
66+
"""
67+
report_walltime(wt::WallTimeInfo, integrator)
68+
69+
Report the current progress and estimated completion time of a simulation.
70+
71+
This function calculates and displays various timing statistics based on the provided
72+
`WallTimeInfo` (`wt`) and the `integrator` state. It estimates the remaining wall time,
73+
total simulation time, and simulated time per real-time unit.
74+
75+
Prints a summary of the simulation progress to the console, including:
76+
77+
- `simulation_time`: The current simulated time.
78+
- `n_steps_completed`: The number of completed steps.
79+
- `wall_time_per_step`: Average wall time per simulation step. You should expect this to be
80+
unreliable until the number of completed steps is large.
81+
- `wall_time_total`: Estimated total wall time for the simulation. You should expect this to be
82+
unreliable until the number of completed steps is large.
83+
- `wall_time_remaining`: Estimated remaining wall time. You should expect this to be
84+
unreliable until the number of completed steps is large.
85+
- `wall_time_spent`: Total wall time spent so far.
86+
- `percent_complete`: Percentage of the simulation completed.
87+
- `estimated_sypd`: Estimated simulated years per day (or simulated days per day if sypd is
88+
very small). You should expect this to be unreliable until the number of
89+
completed steps is large.
90+
- `date_now`: The current date and time.
91+
- `estimated_finish_date`: The estimated date and time of simulation completion. You should
92+
expect this to be unreliable until the number of completed steps
93+
is large.
94+
95+
!!! note
96+
97+
Average quantities and simulated-years-per-day are computed by taking the total time
98+
elapsed (minus initial compilation) and dividing by the number of steps completed. You
99+
should expect them to fluctuate heavily and to be unreliable until the number of steps
100+
become large. "Large" is defined by your problem: for example, the code has to go
101+
through all the callbacks and diagnostics a few times before stabilizing (and this is
102+
different for different simulations).
103+
104+
## Arguments:
105+
106+
* `wt::WallTimeInfo`: A struct containing wall time information.
107+
* `integrator`: The integrator object containing simulation state information, including the
108+
current time `t`, timestep `dt`. It also have to have time span `tspan`
109+
in `integrator.sol.prob.tspan`.
110+
111+
## How to use `report_walltime`
112+
113+
`report_walltime` is intended to be used as a callback executed at the end of a simulation
114+
step. The callback can be called with an arbitrary schedule, so that reporting can be
115+
customized.
116+
117+
### Example
118+
119+
Suppose we want to report progress every 10 steps in a `SciMLBase`-type of integrator.
120+
```julia
121+
import ClimaUtilities.OnlineLogging: WallTimeInfo, report_progress
122+
import SciMLBase
123+
124+
# Prepare the WallTimeInfo
125+
walltime_info = WallTimeInfo()
126+
127+
# Define schedule, a boolean function that takes the integrator
128+
every10steps(u, t, integrator) = mod(integrator.step, 10) == 0
129+
130+
# Define the callback, we use `let` to make this a little faster
131+
report = let wt = walltime_info
132+
(integrator) -> report_progress(wt, integrator)
133+
end
134+
report_callback = SciMLBase.DiscreteCallback(every10steps, report)
135+
136+
# Then, we can attach this callback to the integrator
137+
```
138+
TODO: Discuss/link `Schedules` when we move them to `ClimaUtilities`.
139+
"""
140+
function report_walltime(wt, integrator)
141+
_update!(wt)
142+
143+
t_start, t_end = integrator.sol.prob.tspan
144+
dt = integrator.dt
145+
t = integrator.t
146+
147+
n_steps_total = ceil(Int, (t_end - t_start) / dt)
148+
n_steps = ceil(Int, (t - t_start) / dt)
149+
150+
wall_time_ave_per_step = wt.∑Δt_wall[] / n_steps
151+
wall_time_ave_per_step_str = _time_and_units_str(wall_time_ave_per_step)
152+
percent_complete = round((t - t_start) / t_end * 100; digits = 1)
153+
n_steps_remaining = n_steps_total - n_steps
154+
wall_time_remaining = wall_time_ave_per_step * n_steps_remaining
155+
wall_time_remaining_str = _time_and_units_str(wall_time_remaining)
156+
wall_time_total =
157+
_time_and_units_str(wall_time_ave_per_step * n_steps_total)
158+
wall_time_spent = _time_and_units_str(wt.∑Δt_wall[])
159+
simulation_time = _time_and_units_str(Float64(t))
160+
161+
simulated_seconds_per_second = (t - t_start) / wt.∑Δt_wall[]
162+
simulated_seconds_per_day = simulated_seconds_per_second * 86400
163+
simulated_days_per_day = simulated_seconds_per_day / 86400
164+
simulated_years_per_day = simulated_days_per_day / 365.25
165+
166+
sypd_estimate = string(round(simulated_years_per_day; digits = 3))
167+
# When simulated_years_per_day is too small, also report the simulated_days_per_day
168+
if simulated_years_per_day < 0.01
169+
sdpd_estimate = round(simulated_days_per_day, digits = 3)
170+
sypd_estimate *= " (sdpd_estimate = $sdpd_estimate)"
171+
end
172+
173+
estimated_finish_date =
174+
Dates.now() + Dates.Second(ceil(wall_time_remaining))
175+
176+
@info "Progress" simulation_time = simulation_time n_steps_completed =
177+
n_steps wall_time_per_step = wall_time_ave_per_step_str wall_time_total =
178+
wall_time_total wall_time_remaining = wall_time_remaining_str wall_time_spent =
179+
wall_time_spent percent_complete = "$percent_complete%" estimated_sypd =
180+
sypd_estimate date_now = Dates.now() estimated_finish_date =
181+
estimated_finish_date
182+
183+
return nothing
184+
end
185+
186+
# TODO: Consider moving this to TimeManager
187+
"""
188+
_time_and_units_str(seconds::Real)
189+
190+
Returns a truncated string of time and units, given a time `x` in Seconds.
191+
"""
192+
function _time_and_units_str(seconds)
193+
isapprox(seconds, 0) && return "0 seconds"
194+
nanoseconds = Dates.Nanosecond(ceil(1_000_000_000seconds))
195+
compound_period = Dates.canonicalize(Dates.CompoundPeriod(nanoseconds))
196+
return _trunc_time(string(compound_period))
197+
end
198+
_trunc_time(s::String) = count(',', s) > 1 ? join(split(s, ",")[1:2], ",") : s
199+
200+
end

0 commit comments

Comments
 (0)