Checkpointing

A Checkpointer can be used to serialize the simulation state to a file from which the simulation can be restored at any time. This is useful if you'd like to periodically checkpoint when running long simulations in case of crashes or hitting cluster time limits, but also if you'd like to restore from a checkpoint and try out multiple scenarios.

Here's an example where we checkpoint every 5 iterations to files of the form model_checkpoint_iteration5.jld2 (where the iteration number is automatically included in the filename). This is far more often than appropriate for typical applications: we only do it here for illustration purposes.

julia> using Oceananigans
julia> model = NonhydrostaticModel(RectilinearGrid(size=(8, 8, 8), extent=(1, 1, 1)))
NonhydrostaticModel{CPU, RectilinearGrid}(time = 0 seconds, iteration = 0)
├── grid: 8×8×8 RectilinearGrid{Float64, Periodic, Periodic, Bounded} on CPU with 3×3×3 halo
├── timestepper: RungeKutta3TimeStepper
├── advection scheme: Centered(order=2)
├── tracers: ()
├── closure: Nothing
├── buoyancy: Nothing
└── coriolis: Nothing

julia> simulation = Simulation(model, Δt=1, stop_iteration=8)
Simulation of NonhydrostaticModel{CPU, RectilinearGrid}(time = 0 seconds, iteration = 0)
├── Next time step: 1 second
├── run_wall_time: 0 seconds
├── run_wall_time / iteration: NaN days
├── stop_time: Inf days
├── stop_iteration: 8.0
├── wall_time_limit: Inf
├── minimum_relative_step: 0.0
├── callbacks: OrderedDict with 4 entries:
│   ├── stop_time_exceeded => Callback of stop_time_exceeded on IterationInterval(1)
│   ├── stop_iteration_exceeded => Callback of stop_iteration_exceeded on IterationInterval(1)
│   ├── wall_time_limit_exceeded => Callback of wall_time_limit_exceeded on IterationInterval(1)
│   └── nan_checker => Callback of NaNChecker for u on IterationInterval(100)
└── output_writers: OrderedDict with no entries

julia> simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=IterationInterval(5), prefix="model_checkpoint")
Checkpointer{IterationInterval}(IterationInterval(5, 0), ".", "model_checkpoint", false, false, false)

Use cleanup=true to automatically delete old checkpoint files when a new one is written, keeping only the latest checkpoint:

Checkpointer(model, schedule=IterationInterval(1000), prefix="checkpoint", cleanup=true)

Again, for illustration purposes, we also add a callback so we can see the simulation progress:

julia> show_iteration(sim) = @info "iteration: $(iteration(sim)), time: $(prettytime(sim.model.clock.time))"
show_iteration (generic function with 1 method)

julia> add_callback!(simulation, show_iteration, IterationInterval(1), name=:info)

Now let's run the simulation:

julia> run!(simulation)
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (999.083 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (2.930 seconds).
[ Info: Simulation is stopping after running for 4.170 seconds.
[ Info: Model iteration 8 equals or exceeds stop iteration 8.

The default options should provide checkpoint files that are easy to restore from (in most cases). For more advanced options and features, see Checkpointer.

Checkpointing is supported for: ShallowWaterModel, NonhydrostaticModel, and HydrostaticFreeSurfaceModel (including split-explicit, implicit, and explicit free surfaces, as well as z-star vertical coordinates).

Picking up a simulation from a checkpoint file

Picking up a simulation from a checkpoint requires recreating the simulation identically to how it was originally configured. This means using the same grid, model type, boundary conditions, forcing, closures, and output writers. Only the prognostic state (the data that evolves during the simulation) is restored from the checkpoint; the simulation configuration is not.

When pickup=true is provided to run!, it finds the latest checkpoint file in the Checkpointer's directory, restores the simulation state (including model fields, clock, and timestepper state), and then continues the time-stepping loop. In this simple example, although the simulation ran up to iteration 8, the latest checkpoint is associated with iteration 5.

julia> simulation.stop_iteration = 12
12

julia> run!(simulation, pickup=true)
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (265.950 μs)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (1.236 ms).
[ Info: Simulation is stopping after running for 9.903 ms.
[ Info: Model iteration 12 equals or exceeds stop iteration 12.

Use pickup=iteration, where iteration is an Integer, to pick up from a specific iteration. Or use pickup=filepath, where filepath is a string, to pick up from a specific file located at filepath.
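To make the "latest checkpoint" rule concrete, here is a minimal sketch of how the most recent file could be located from the filename convention described above (prefix_iterationN.jld2). The helper latest_checkpoint is hypothetical and is not part of Oceananigans, whose own lookup may differ; the sketch also assumes the prefix contains no regular-expression special characters. The key point is that iteration numbers are compared numerically, because a plain alphabetical sort would place iteration10 before iteration5.

```julia
# Hypothetical helper, NOT part of Oceananigans. It only mirrors the
# `<prefix>_iteration<N>.jld2` naming convention described in this section.
function latest_checkpoint(dir, prefix)
    # Assumes `prefix` has no regex special characters.
    pattern = Regex("^" * prefix * raw"_iteration(\d+)\.jld2$")

    best_file, best_iteration = nothing, -1

    for name in readdir(dir)
        m = match(pattern, name)
        m === nothing && continue          # skip files that are not checkpoints

        iter = parse(Int, m.captures[1])   # compare iterations as integers
        if iter > best_iteration
            best_file, best_iteration = joinpath(dir, name), iter
        end
    end

    return best_file  # `nothing` when no matching checkpoint exists
end

# Usage: look for checkpoints written with prefix="model_checkpoint"
# in the current directory.
latest_checkpoint(".", "model_checkpoint")
```

In practice, pickup=true does this bookkeeping for you; the sketch is only meant to show why the checkpoint at iteration 5 is selected in the example above.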

The set! function can also be used to restore from a checkpoint without immediately running the simulation:

set!(simulation; checkpoint="path/to/file.jld2")  # restore from specific file
set!(simulation; checkpoint=:latest)              # restore from latest checkpoint (requires Checkpointer)
set!(simulation; iteration=12345)                 # restore from specific iteration (requires Checkpointer)

Checkpointing on wall-clock time

For cluster jobs with time limits, use WallTimeInterval to checkpoint based on elapsed wall-clock time rather than simulation time or iterations:

using Oceananigans.Units  # provides `minute`

# Checkpoint every 30 minutes of wall-clock time
Checkpointer(model, schedule=WallTimeInterval(30minute), prefix="checkpoint")

This ensures checkpoints are saved at regular wall-clock intervals even if the cost of individual time steps varies significantly.
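The idea behind a wall-clock schedule can be sketched in a few lines of plain Julia. The type WallClockTrigger and the function should_checkpoint! below are hypothetical names used for illustration only; Oceananigans' WallTimeInterval may be implemented differently. The clock value is passed in as a keyword argument so that the behavior can be checked without waiting for real time to pass.

```julia
# Hypothetical stand-in for a wall-clock schedule (illustration only).
mutable struct WallClockTrigger
    interval::Float64      # seconds of wall-clock time between triggers
    last_trigger::Float64  # wall-clock time of the previous trigger
end

# Start counting from `start`, which defaults to the current wall-clock time.
WallClockTrigger(interval; start=time()) = WallClockTrigger(interval, start)

# Return true (and reset the reference time) once `interval` seconds of
# wall-clock time have elapsed since the previous trigger.
function should_checkpoint!(trigger; now=time())
    if now - trigger.last_trigger >= trigger.interval
        trigger.last_trigger = now
        return true
    end
    return false
end

# Usage with a fake clock: a 30-minute (1800 s) interval starting at t = 0.
trigger = WallClockTrigger(1800; start=0.0)
should_checkpoint!(trigger; now=600.0)    # false: only 10 minutes elapsed
should_checkpoint!(trigger; now=1800.0)   # true: 30 minutes elapsed
should_checkpoint!(trigger; now=2000.0)   # false: interval restarted at 1800 s
```

Because the trigger depends only on elapsed wall-clock time, the number of iterations between checkpoints adapts automatically when time steps become cheaper or more expensive.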

Manual checkpointing

Use checkpoint to manually save the simulation state at any point:

checkpoint(simulation)                            # uses Checkpointer settings if available
checkpoint(simulation, filepath="my_state.jld2")  # write to specific file

If a Checkpointer is configured in simulation.output_writers, it will be used (respecting its dir, prefix, and other settings). Otherwise, the checkpoint is written to the specified filepath, or to checkpoint_iteration{N}.jld2 in the current directory.

Automatic checkpointing at end

Use checkpoint_at_end=true to automatically checkpoint the simulation when it finishes:

run!(simulation, checkpoint_at_end=true)  # Checkpoints when done

This ensures the final simulation state is saved, even if the simulation stops due to wall time limits or other callbacks.

If a Checkpointer is configured, it will be used. Otherwise, a file named checkpoint_iteration{N}.jld2 is created in the current directory.

What gets checkpointed

Checkpointing saves the prognostic state, that is, the data that evolves during the simulation. This includes prognostic model fields (velocities, tracers, diffusivities, etc.), the clock, the state of the time stepper, output writer state, turbulence closure state, free surface state, and Lagrangian particle properties.

Static configuration is not checkpointed. This includes the grid, boundary conditions, forcing functions, closure parameters, model options, and callbacks.

This means your script must recreate the simulation with identical configuration before restoring from a checkpoint.
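One way to satisfy this requirement is to build the simulation inside a single function and call it from both the original run and any restart, so the configuration cannot drift between the two. The following is a minimal sketch that reuses the setup from the example at the top of this page. The function name build_simulation and the has_checkpoint check are illustrative choices, not part of Oceananigans; the check is included because this section does not state how pickup=true behaves when no checkpoint file exists yet.

```julia
using Oceananigans

# Illustrative helper (not an Oceananigans function): every run, fresh or
# restarted, constructs the simulation through this one code path.
function build_simulation(; stop_iteration)
    grid = RectilinearGrid(size=(8, 8, 8), extent=(1, 1, 1))
    model = NonhydrostaticModel(grid)
    simulation = Simulation(model, Δt=1, stop_iteration=stop_iteration)

    # Same Checkpointer settings as the example above.
    simulation.output_writers[:checkpointer] =
        Checkpointer(model, schedule=IterationInterval(5), prefix="model_checkpoint")

    return simulation
end

simulation = build_simulation(stop_iteration=12)

# Assumption: checkpoints live in the current directory (the Checkpointer's
# default `dir` shown above is "."). Only request a pickup if one exists.
has_checkpoint = any(startswith("model_checkpoint_iteration"), readdir("."))

run!(simulation, pickup=has_checkpoint)
```

Running the same script a second time with a larger stop_iteration would then resume from the latest checkpoint instead of starting over.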