Checkpointer
How to save and restart from checkpoints
ClimaCoupler supports saving and reading simulation checkpoints. This is useful to split a long simulation into smaller, more manageable chunks.
Checkpoints are a mix of HDF5 and JLD2 files and are typically saved in a checkpoints folder in the simulation output. See Utilities.setup_output_dirs for more information.
- The number of MPI processes has to remain the same across checkpoints
- Restart files are generally not portable across machines, Julia versions, and package versions
- Adding/changing new component models will probably require adding/changing code
Saving checkpoints
If you are running a model (such as AMIP), chances are that you can enable checkpointing just by setting a command-line argument: the checkpoint_dt option controls how frequently a checkpoint is produced.
If your model does not come with this option already, you can checkpoint the simulation by adding a callback that calls the Checkpointer.checkpoint_sims function.
For example, to add a callback that checkpoints every hour of simulated time, assuming you have a start_date:
import Dates
import ClimaCoupler: Checkpointer, TimeManager
import ClimaDiagnostics.Schedules: EveryCalendarDtSchedule
schedule = EveryCalendarDtSchedule(Dates.Hour(1); start_date)
# Pair the schedule defined above with the checkpointing function
checkpoint_callback = TimeManager.Callback(schedule, Checkpointer.checkpoint_sims)
# In the coupling loop:
TimeManager.maybe_trigger_callback(checkpoint_callback, coupled_simulation, time)
Reading checkpoints
There are two ways to restart a simulation from checkpoints. By default, ClimaCoupler tries to find suitable checkpoints and uses them automatically. Alternatively, you can specify a directory restart_dir and a simulation time restart_t to restart from the files saved in that directory at that time. If the model you are running supports writing checkpoints via a command-line argument, it will probably also support reading them. In this case, the arguments restart_dir and restart_t identify the path of the top-level directory containing all the checkpoint files and the simulated time in seconds.
If the model does not support directly reading a checkpoint, the Checkpointer module provides a straightforward way to add this feature. Checkpointer.restart! takes a coupled simulation, a restart_dir, and a restart_t and overwrites the content of the coupled simulation with what is in the checkpoint.
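As a rough sketch of how the pieces documented in the API section could fit together, the helper below looks for the latest checkpoint and restarts from it if one exists. The helper name restart_from_latest! is hypothetical, cs is assumed to be an already constructed CoupledSimulation, and the sketch assumes that the time returned by t_start_from_checkpoint can be passed directly to restart!. It uses the four-argument form of restart! listed in the API section.

```julia
import ClimaCoupler: Checkpointer

# Hypothetical helper, not part of ClimaCoupler.
# `cs` is assumed to be an existing CoupledSimulation and `restart_dir`
# the directory that contains the checkpoint files.
function restart_from_latest!(cs, restart_dir)
    # Time of the most recent checkpoint, or `nothing` if none is found
    restart_t = Checkpointer.t_start_from_checkpoint(restart_dir)
    isnothing(restart_t) && return false  # nothing to restart from

    # The last argument asks restart! to also restore the caches
    return Checkpointer.restart!(cs, restart_dir, restart_t, true)
end
```

The return value follows the documented behavior of restart!, so the caller can tell whether the simulation was actually restarted.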
Developer notes
In theory, the state of the component models should fully determine the state of the coupled simulation and one should be able to restart a coupled simulation just by using the states of the component models. Unfortunately, this is currently not the case in ClimaCoupler. The main reason for this is the complex interdependencies between component models and within ClimaAtmos which make the initialization step inconsistent. For example, in a coupled simulation, the surface albedo should be determined by the surface models and used by the atmospheric model for radiation transfer, but ClimaAtmos also tries to set the surface albedo (since it has to do so when run in standalone mode). In addition to this, ClimaAtmos has a large cache that has internal interdependencies that are hard to disentangle, and changing a field might require changing some other field in a different part of the cache. As a result, it is not easy for ClimaCoupler to consistently do initialization from a cold state. To conclude, restarting a simulation exclusively using the states of the component models is currently impossible.
Given that restarting a simulation from the state alone is impossible, ClimaCoupler needs to save both the states and the caches. Let us review how we use ClimaCore.InputOutput and the JLD2 package to accomplish this.
ClimaCore.InputOutput provides a lossless way to save the content of certain ClimaCore objects to HDF5 files. Objects saved in this way are not tied to a particular computing device or configuration. When running with MPI, these files are also written efficiently in parallel.
Unfortunately, ClimaCore.InputOutput only supports certain objects, such as Fields and Spaces, but the cache in component models is more complex than this and contains complex objects with highly stateful quantities (e.g., C pointers). Because of this, model states are saved to HDF5 but caches must be saved to JLD2 files.
JLD2 allows us to save more complex objects without writing specific serialization methods for every struct. JLD2 allows us to take a big step forward, but there are still several challenges that need to be solved:
1. JLD2 does not support CUDA natively. To work around this, we have to move everything onto the CPU first. Then, when the data is read back, we have to move it back to the GPU.
2. JLD2 does not support MPI natively. To work around this, each process writes its jld2 checkpoint and reads it back. This introduces the constraint that the number of MPI processes cannot change across restarts.
3. Some quantities are best not saved and read (for example, anything with pointers). For this, we write a recursive function that traverses the cache and only restores quantities of a certain type (typically, ClimaCore objects).
Point 3 adds a significant amount of code and requires component models to specify how their cache has to be restored.
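To make the recursive traversal in point 3 more concrete, here is a deliberately simplified, self-contained sketch in plain Julia. It is not ClimaCoupler's implementation: the types ToyHandle and ToyCache and the function toy_restore! are hypothetical stand-ins, and the real restore! also handles devices, MPI contexts, and many more types. The sketch only shows the dispatch idea: arrays are copied in place, immutable values are checked for consistency, and everything else is traversed field by field, skipping ignored names.

```julia
# Hypothetical stand-in for a stateful object (e.g., something holding a pointer)
mutable struct ToyHandle
    ptr::Ptr{Cvoid}
end

# Hypothetical stand-in for a component-model cache
struct ToyCache
    temperature::Vector{Float64}
    albedo::Vector{Float64}
    file::ToyHandle
end

# Array-like leaves: copy the checkpointed values into the live array in place
function toy_restore!(live::AbstractArray, saved::AbstractArray; name = "", ignore = Set{Symbol}())
    size(live) == size(saved) || error("size mismatch while restoring $name")
    copyto!(live, saved)
    return nothing
end

# Immutable leaves cannot be overwritten, so only check that they agree
function toy_restore!(live::Number, saved::Number; name = "", ignore = Set{Symbol}())
    live == saved || error("$name differs between live object and checkpoint")
    return nothing
end

# Everything else: recurse into each field whose name is not in `ignore`
function toy_restore!(live, saved; name = "", ignore = Set{Symbol}())
    for field in fieldnames(typeof(live))
        field in ignore && continue
        toy_restore!(getfield(live, field), getfield(saved, field);
                     name = "$name.$field", ignore)
    end
    return nothing
end

# `live` plays the role of a freshly initialized cache, `saved` of one read from disk
live = ToyCache(zeros(3), zeros(3), ToyHandle(Ptr{Cvoid}(UInt(1))))
saved = ToyCache([280.0, 285.0, 290.0], [0.1, 0.2, 0.3], ToyHandle(C_NULL))

# Restore the arrays but leave the stateful handle of the live object untouched
toy_restore!(live, saved; name = "cache", ignore = Set([:file]))
```

After the call, the arrays in live hold the checkpointed values, while live.file still points to the handle created at initialization. This is the reason an ignore set is useful for properties such as live pointers.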
If you are adding a component model, you have to extend the following methods:
- Checkpointer.get_model_prog_state
- Checkpointer.get_model_cache
- Checkpointer.restore_cache!
ClimaCoupler moves objects to the CPU with Adapt.adapt(Array, x). Adapt traverses the object recursively, and proper Adapt methods have to be defined for every object involved in the chain. The easiest way to do this is to use the Adapt.@adapt_structure macro, which defines a recursive Adapt method for the given object.
Types to watch for:
- MPI related objects (e.g., MPICommsContext)
- TimeVaryingInputs (because they contain NCDatasets, which contain pointers to files)
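As a small illustration of the Adapt mechanism described above, the sketch below defines a toy parametric struct and lets Adapt.@adapt_structure generate its recursive rule. The type ToySurfaceCache is hypothetical and not part of ClimaCoupler. The example runs on CPU arrays, where adapting to Array leaves the values unchanged; with GPU arrays, the same call is what would bring the data back to host memory.

```julia
import Adapt

# Hypothetical cache type. The fields are parametric so that the same struct
# can hold either CPU or GPU arrays.
struct ToySurfaceCache{A, B}
    temperature::A
    flux::B
end

# Generate the rule that rebuilds ToySurfaceCache from its adapted fields
Adapt.@adapt_structure ToySurfaceCache

cache = ToySurfaceCache([280.0, 285.0], [1.0 2.0; 3.0 4.0])

# Recursively convert every array stored in `cache` to a plain Array
cpu_cache = Adapt.adapt(Array, cache)
```

Without the @adapt_structure line, Adapt.adapt would return the struct unchanged, because Adapt does not know how to look inside it. This is why every object in the chain needs its own rule.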
Checkpointer API
ClimaCoupler.Checkpointer.get_model_prog_state — Function
get_model_prog_state(sim::Interfacer.ComponentModelSimulation)
Returns the model state of a simulation as a ClimaCore.FieldVector. This is a template function that should be implemented for each component model.
ClimaCoupler.Checkpointer.get_model_cache — Function
get_model_cache(sim::Interfacer.ComponentModelSimulation)
Returns the model cache of a simulation. This is a template function that should be implemented for each component model.
ClimaCoupler.Checkpointer.restart! — Function
restart!(cs::CoupledSimulation, checkpoint_dir, checkpoint_t, restart_cache)
Overwrite the content of cs with checkpoints in checkpoint_dir at time checkpoint_t.
If restart_cache is true, the cache will be read from the restart file using restore_cache!. Otherwise, the cache will be left unchanged.
Return true if the simulation was restarted.
ClimaCoupler.Checkpointer.checkpoint_sims — Function
checkpoint_sims(cs::CoupledSimulation)
This is a callback function that checkpoints all simulations defined in the current coupled simulation.
ClimaCoupler.Checkpointer.t_start_from_checkpoint — Function
t_start_from_checkpoint(checkpoint_dir)
Look for restart files in checkpoint_dir. If any are found, return the time of the latest one; otherwise, return nothing.
ClimaCoupler.Checkpointer.restore! — Function
restore!(v1, v2, comms_ctx; name = "", ignore = Set())
Recursively traverse v1 and v2, setting each field of v1 to the corresponding field of v2. Properties whose name is in the ignore iterable are skipped.
This is intended to be used when restarting a simulation's cache object from a checkpoint.
ignore is useful when there are stateful properties, such as live pointers.
restore!(
v1::Union{
AbstractTimeVaryingInput,
ClimaComms.AbstractCommsContext,
ClimaComms.AbstractDevice,
UnionAll,
DataType,
},
v2::Union{
AbstractTimeVaryingInput,
ClimaComms.AbstractCommsContext,
ClimaComms.AbstractDevice,
UnionAll,
DataType,
},
_comms_ctx;
name = "",
ignore = Set(),
)
Ignore certain types that don't need to be restored. UnionAll and DataType are infinitely recursive, so we also ignore those.
restore!(
v1::Union{CC.DataLayouts.AbstractData, AbstractArray},
v2::Union{CC.DataLayouts.AbstractData, AbstractArray},
comms_ctx;
name = "",
ignore = Set(),
)
For array-like objects, we move the original data (v2) to the device of the new data (v1). Then we copy the original data to the new object.
restore!(
v1::Union{StaticArrays.StaticArray, Number, UnitRange, LinRange, Symbol},
v2::Union{StaticArrays.StaticArray, Number, UnitRange, LinRange, Symbol},
comms_ctx;
name = "",
ignore = Set(),
)
Ensure that immutable objects have been initialized correctly, as they cannot be restored from a checkpoint.
restore!(v1::Dict, v2::Dict, comms_ctx; name = "", ignore = Set())
RRTMGP has some internal dictionaries, which we check for consistency.
restore!(
v1::T1,
v2::T2,
comms_ctx;
name = "",
ignore = Set(),
) where {
T1 <: Union{Dates.DateTime, Dates.UTInstant, Dates.Millisecond},
T2 <: Union{Dates.DateTime, Dates.UTInstant, Dates.Millisecond},
}
Special case to compare time-related types to allow different timestamps during restore.