Checkpointer

How to save and restart from checkpoints

ClimaCoupler supports saving and reading simulation checkpoints. This is useful to split a long simulation into smaller, more manageable chunks.

Checkpoints are a mix of HDF5 and JLD2 files and are typically saved in a checkpoints folder in the simulation output. See Utilities.setup_output_dirs for more information.

Known limitations

- The number of MPI processes has to remain the same across checkpoints
- Restart files are generally not portable across machines, Julia versions, and package versions
- Adding new component models, or changing existing ones, will probably require code changes

Saving checkpoints

If you are running a model (such as AMIP), chances are that you can enable checkpointing just by setting a command-line argument. The checkpoint_dt option controls how frequently a checkpoint is produced.

If your model does not come with this option already, you can checkpoint the simulation by adding a callback that calls the Checkpointer.checkpoint_sims function.

For example, to add a callback that checkpoints every hour of simulated time, assuming you have a start_date:

import Dates

import ClimaCoupler: Checkpointer, TimeManager
import ClimaDiagnostics.Schedules: EveryCalendarDtSchedule

# Trigger once per simulated hour, counted from start_date
schedule = EveryCalendarDtSchedule(Dates.Hour(1); start_date)
checkpoint_callback = TimeManager.Callback(schedule, Checkpointer.checkpoint_sims)

# In the coupling loop:
TimeManager.maybe_trigger_callback(checkpoint_callback, coupled_simulation, time)

Reading checkpoints

There are two ways to restart a simulation from checkpoints. By default, ClimaCoupler tries to find suitable checkpoints and uses them automatically. Alternatively, you can specify a directory restart_dir and a simulation time restart_t to restart from the files saved in that directory at that time.

If the model you are running supports writing checkpoints via a command-line argument, it will probably also support reading them. In this case, the argument restart_dir is the path of the top-level directory containing all the checkpoint files, and restart_t is the simulated time, in seconds, of the checkpoint to restart from.

If the model does not support directly reading a checkpoint, the Checkpointer module provides a straightforward way to add this feature. Checkpointer.restart! takes a coupled simulation, a restart_dir, and a restart_t and overwrites the content of the coupled simulation with what is in the checkpoint.
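As an illustration, the two documented functions can be combined to restart from the most recent checkpoint in a directory. This is only a sketch: restart_from_latest! is a hypothetical helper, not part of ClimaCoupler, and it assumes that the value returned by Checkpointer.t_start_from_checkpoint can be passed directly as the time argument of Checkpointer.restart!.

```julia
import ClimaCoupler: Checkpointer

# Hypothetical helper (not part of ClimaCoupler): restart `cs` from the latest
# checkpoint found in `checkpoint_dir`, if there is one.
function restart_from_latest!(cs, checkpoint_dir)
    # Returns `nothing` when no restart files are found
    restart_t = Checkpointer.t_start_from_checkpoint(checkpoint_dir)
    isnothing(restart_t) && return false
    # Assumption: the returned time is directly usable as the restart time
    return Checkpointer.restart!(cs, checkpoint_dir, restart_t)
end
```

The helper returns false when no checkpoint is available, so the caller can fall back to a regular initialization.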

Developer notes

In theory, the state of the component models should fully determine the state of the coupled simulation, and one should be able to restart a coupled simulation using only the states of the component models. Unfortunately, this is currently not the case in ClimaCoupler. The main reason is the complex interdependencies between component models and within ClimaAtmos, which make the initialization step inconsistent. For example, in a coupled simulation, the surface albedo should be determined by the surface models and used by the atmospheric model for radiative transfer, but ClimaAtmos also tries to set the surface albedo (since it has to do so when run in standalone mode). In addition, ClimaAtmos has a large cache with internal interdependencies that are hard to disentangle, and changing a field might require changing some other field in a different part of the cache. As a result, it is not easy for ClimaCoupler to initialize consistently from a cold state, and restarting a simulation exclusively from the states of the component models is currently impossible.

Given that restarting a simulation from the states alone is impossible, ClimaCoupler needs to save both the states and the caches. Let us review how we use ClimaCore.InputOutput and the JLD2 package to accomplish this.

ClimaCore.InputOutput provides a lossless way to save the content of certain ClimaCore objects to HDF5 files. Objects saved in this way are not tied to a particular computing device or configuration. When running with MPI, ClimaCore.InputOutput files are also written efficiently in parallel.

Unfortunately, ClimaCore.InputOutput only supports certain objects, such as Fields and Spaces, but the cache in component models is more complex than this and contains complex objects with highly stateful quantities (e.g., C pointers). Because of this, model states are saved to HDF5 but caches must be saved to JLD2 files.

JLD2 allows us to save more complex objects without writing specific serialization methods for every struct. This is a big step forward, but there are still several challenges to solve:

1. JLD2 does not support CUDA natively. To work around this, we have to move everything onto the CPU first. Then, when the data is read back, we have to move it back to the GPU.

2. JLD2 does not support MPI natively. To work around this, each process writes its own JLD2 checkpoint and reads it back. This introduces the constraint that the number of MPI processes cannot change across restarts.

3. Some quantities are best not saved and read (for example, anything with pointers). For this, we write a recursive function that traverses the cache and only restores quantities of a certain type (typically, ClimaCore objects).
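The constraint in the second point can be illustrated with a small sketch. It is not ClimaCoupler's implementation: it uses Julia's standard Serialization library as a stand-in for JLD2, a loop as a stand-in for MPI ranks, and a hypothetical file naming scheme (rank_file). The point is only that each process reads back the file it wrote, so a rank that did not exist when the checkpoint was saved has nothing to read.

```julia
import Serialization

# Hypothetical naming scheme: one file per rank and per checkpoint time
rank_file(dir, t, rank) = joinpath(dir, "cache_rank$(rank)_t$(t).bin")

# Each process writes its own piece of the cache
save_cache(dir, t, rank, cache) = Serialization.serialize(rank_file(dir, t, rank), cache)

# Each process reads back only the file written by the same rank
function load_cache(dir, t, rank)
    path = rank_file(dir, t, rank)
    isfile(path) || error("no checkpoint for rank $rank at t = $t")
    return Serialization.deserialize(path)
end

dir = mktempdir()

# Pretend that the run was checkpointed with two processes (ranks 0 and 1)
for rank in 0:1
    save_cache(dir, 3600, rank, (; rank, data = fill(Float64(rank), 2)))
end

# Restarting with the same number of processes works
restored = load_cache(dir, 3600, 1)

# A restart with three processes would fail on rank 2:
# load_cache(dir, 3600, 2)  # throws an error
```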

Point 3 adds a significant amount of code and requires component models to specify how their cache has to be restored.
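The selective restore described in the third point could look roughly like the following sketch. It is not the code used by ClimaCoupler: the function name restore! and the Restorable alias are hypothetical, numeric arrays stand in for ClimaCore objects, and the cache is modeled as nested NamedTuples.

```julia
# Stand-in for "ClimaCore objects": in this sketch only numeric arrays are restored
const Restorable = AbstractArray{<:Number}

# Leaf of a restorable type: overwrite the current values with the saved ones
restore!(current::Restorable, saved::Restorable) = (copyto!(current, saved); nothing)

# Containers: recurse into each pair of corresponding entries
function restore!(current::Union{NamedTuple, Tuple}, saved::Union{NamedTuple, Tuple})
    for (c, s) in zip(current, saved)
        restore!(c, s)
    end
    return nothing
end

# Anything else (handles, pointers, ...) keeps its freshly initialized value
restore!(current, saved) = nothing

# Freshly initialized cache, with a valid handle but default field values
fresh = (; T = zeros(3), handle = "live-handle", nested = (; q = zeros(2)))

# Cache read from a checkpoint, with good field values but a stale handle
saved = (; T = [1.0, 2.0, 3.0], handle = "stale-handle", nested = (; q = [4.0, 5.0]))

restore!(fresh, saved)
```

After the call, fresh.T and fresh.nested.q hold the checkpointed values, while fresh.handle is left as it was initialized.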

If you are adding a component model, you have to extend the following methods:

- Checkpointer.get_model_prog_state
- Checkpointer.get_model_cache
- Checkpointer.restore_cache!

ClimaCoupler moves objects to the CPU with Adapt.adapt(Array, x). Adapt traverses the object recursively, so proper Adapt methods have to be defined for every type involved in the chain. The easiest way to do this is with the Adapt.@adapt_structure macro, which defines a recursive adapt method for the given type.
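A minimal sketch of this pattern, assuming the Adapt package is installed. SurfaceCache is a hypothetical struct used only for illustration. Its fields are type-parametric so that they can hold either CPU or GPU arrays.

```julia
import Adapt

# Hypothetical cache struct; parametric fields can hold CPU or GPU arrays
struct SurfaceCache{A, B}
    albedo::A
    flux::B
end

# Define how Adapt rebuilds a SurfaceCache from its adapted fields
Adapt.@adapt_structure SurfaceCache

cache = SurfaceCache(Float32[0.1, 0.2], Float32[1.0, 2.0])

# Convert every array reachable from `cache` to a CPU `Array`.
# Here the fields are already on the CPU, so the values are unchanged.
cpu_cache = Adapt.adapt(Array, cache)
```

If a type in the chain has no Adapt method, adapt returns that object unchanged instead of recursing into it, so arrays nested inside it would not be moved.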

Types to watch for:

- MPI-related objects (e.g., MPICommsContext)
- TimeVaryingInputs (because they contain NCDatasets, which contain pointers to files)

Checkpointer API

ClimaCoupler.Checkpointer.get_model_prog_state — Function

get_model_prog_state(sim::Interfacer.ComponentModelSimulation)

Returns the model state of a simulation as a ClimaCore.FieldVector. This is a template function that should be implemented for each component model.


ClimaCoupler.Checkpointer.get_model_cache — Function

get_model_cache(sim::Interfacer.ComponentModelSimulation)

Returns the model cache of a simulation. This is a template function that should be implemented for each component model.


ClimaCoupler.Checkpointer.restart! — Function

restart!(cs::CoupledSimulation, checkpoint_dir, checkpoint_t)

Overwrite the content of cs with checkpoints in checkpoint_dir at time checkpoint_t.

Return true if the simulation was restarted.


ClimaCoupler.Checkpointer.checkpoint_sims — Function

checkpoint_sims(cs::CoupledSimulation)

This is a callback function that checkpoints all simulations defined in the current coupled simulation.


ClimaCoupler.Checkpointer.t_start_from_checkpoint — Function

t_start_from_checkpoint(checkpoint_dir)

Look for restart files in checkpoint_dir. If any are found, return the time of the latest one. Otherwise, return nothing.
