Checkpointing

A Checkpointer serializes the simulation state to a file from which the simulation can be restored at any time. Periodic checkpointing protects long simulations against crashes and cluster time limits, and it also lets you restore from a checkpoint to try out multiple scenarios.

Here's an example where we checkpoint every 5 iterations to files of the form model_checkpoint_iteration5.jld2 (where the iteration number is automatically included in the filename). This is far more often than appropriate for typical applications: we only do it here for illustration purposes.

julia

julia> using Oceananigans

julia> model = NonhydrostaticModel(RectilinearGrid(size=(8, 8, 8), extent=(1, 1, 1)))
NonhydrostaticModel{CPU, RectilinearGrid}(time = 0 seconds, iteration = 0)
├── grid: 8×8×8 RectilinearGrid{Float64, Periodic, Periodic, Bounded} on CPU with 3×3×3 halo
├── timestepper: RungeKutta3TimeStepper
├── advection scheme: Centered(order=2)
├── tracers: ()
├── closure: Nothing
├── buoyancy: Nothing
└── coriolis: Nothing

julia> simulation = Simulation(model, Δt=1, stop_iteration=8)
Simulation of NonhydrostaticModel{CPU, RectilinearGrid}(time = 0 seconds, iteration = 0)
├── Next time step: 1 second
├── run_wall_time: 0 seconds
├── run_wall_time / iteration: NaN days
├── stop_time: Inf days
├── stop_iteration: 8.0
├── wall_time_limit: Inf
├── minimum_relative_step: 0.0
├── callbacks: OrderedDict with 4 entries:
│   ├── stop_time_exceeded => Callback of stop_time_exceeded on IterationInterval(1)
│   ├── stop_iteration_exceeded => Callback of stop_iteration_exceeded on IterationInterval(1)
│   ├── wall_time_limit_exceeded => Callback of wall_time_limit_exceeded on IterationInterval(1)
│   └── nan_checker => Callback of NaNChecker for u on IterationInterval(100)
└── output_writers: OrderedDict with no entries

julia> simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=IterationInterval(5), prefix="model_checkpoint")
Checkpointer{IterationInterval}(IterationInterval(5, 0), ".", "model_checkpoint", false, false, false)

Use cleanup=true to automatically delete old checkpoint files when a new one is written, keeping only the latest checkpoint:

julia

julia> Checkpointer(model, schedule=IterationInterval(1000), prefix="checkpoint", cleanup=true)
Checkpointer{IterationInterval}(IterationInterval(1000, 0), ".", "checkpoint", false, false, true)

Again, for illustration purposes, we also add a callback so we can see the simulation progress:

julia

julia> show_iteration(sim) = @info "iteration: $(iteration(sim)), time: $(prettytime(sim.model.clock.time))"
show_iteration (generic function with 1 method)

julia> add_callback!(simulation, show_iteration, IterationInterval(1), name=:info)

julia> simulation.callbacks[:info]
Callback of show_iteration on IterationInterval(1)

Now let's run the simulation:

julia

julia> run!(simulation)
[ Info: Initializing simulation...
[ Info: iteration: 0, time: 0 seconds
[ Info:     ... simulation initialization complete (1.084 seconds)
[ Info: Executing initial time step...
[ Info: iteration: 1, time: 1 second
[ Info:     ... initial time step complete (2.724 seconds).
[ Info: iteration: 2, time: 2 seconds
[ Info: iteration: 3, time: 3 seconds
[ Info: iteration: 4, time: 4 seconds
[ Info: iteration: 5, time: 5 seconds
[ Info: iteration: 6, time: 6 seconds
[ Info: iteration: 7, time: 7 seconds
[ Info: Simulation is stopping after running for 4.054 seconds.
[ Info: Model iteration 8 equals or exceeds stop iteration 8.
[ Info: iteration: 8, time: 8 seconds

The default options should provide checkpoint files that are easy to restore from (in most cases). For more advanced options and features, see Checkpointer.

Checkpointing is supported for: ShallowWaterModel, NonhydrostaticModel, and HydrostaticFreeSurfaceModel (including split-explicit, implicit, and explicit free surfaces, as well as z-star vertical coordinates).

Picking up a simulation from a checkpoint file

Picking up a simulation from a checkpoint requires recreating the simulation identically to how it was originally configured. This means using the same grid, model type, boundary conditions, forcing, closures, and output writers. Only the prognostic state (data that evolves during the simulation) is restored from the checkpoint; the simulation configuration is not.

When pickup=true is provided, run! finds the latest checkpoint file in the Checkpointer's directory, restores the simulation state (including model fields, clock, and timestepper state), and then continues the time-stepping loop. In this simple example, although the simulation ran up to iteration 8, the latest checkpoint is associated with iteration 5, so the run resumes from iteration 5.

julia

julia> simulation.stop_iteration = 12
12

julia> run!(simulation, pickup=true)
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (223.306 μs)
[ Info: Executing initial time step...
[ Info: iteration: 6, time: 6 seconds
[ Info:     ... initial time step complete (895.286 μs).
[ Info: iteration: 7, time: 7 seconds
[ Info: iteration: 8, time: 8 seconds
[ Info: iteration: 9, time: 9 seconds
[ Info: iteration: 10, time: 10 seconds
[ Info: iteration: 11, time: 11 seconds
[ Info: Simulation is stopping after running for 15.510 ms.
[ Info: Model iteration 12 equals or exceeds stop iteration 12.
[ Info: iteration: 12, time: 12 seconds

Use pickup=iteration, where iteration is an Integer, to pick up from a specific iteration. Alternatively, use pickup=filepath, where filepath is a string, to pick up from a specific file located at filepath.
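
To make the notion of the "latest" checkpoint concrete, here is a minimal sketch in plain Julia that locates the most recent checkpoint by parsing the iteration number out of filenames of the form prefix_iterationN.jld2. The helper names checkpoint_iteration and latest_checkpoint are hypothetical, and this is not Oceananigans' internal implementation; it only assumes the filename convention described above and Julia 1.8 or later (for chopprefix and chopsuffix).

```julia
# Hypothetical helpers (not part of Oceananigans): find the newest checkpoint
# by comparing the iteration numbers embedded in the filenames.

"""Return the iteration encoded in `filename`, or `nothing` if it does not match."""
function checkpoint_iteration(filename, prefix)
    head = prefix * "_iteration"
    tail = ".jld2"
    (startswith(filename, head) && endswith(filename, tail)) || return nothing
    number = chopsuffix(chopprefix(filename, head), tail)
    all(isdigit, number) || return nothing
    return tryparse(Int, number)  # `nothing` if `number` is empty
end

"""Return the path of the checkpoint with the largest iteration, or `nothing`."""
function latest_checkpoint(dir, prefix)
    best_file, best_iteration = nothing, -1
    for filename in readdir(dir)
        n = checkpoint_iteration(filename, prefix)
        if n !== nothing && n > best_iteration
            best_file, best_iteration = joinpath(dir, filename), n
        end
    end
    return best_file
end

# Demonstration with empty placeholder files in a temporary directory.
demo_dir = mktempdir()
for n in (0, 5, 10)
    touch(joinpath(demo_dir, "model_checkpoint_iteration$n.jld2"))
end
touch(joinpath(demo_dir, "notes.txt"))  # ignored: does not match the pattern

latest = latest_checkpoint(demo_dir, "model_checkpoint")
println(basename(latest))  # prints model_checkpoint_iteration10.jld2
```

The iteration is compared as an integer rather than as text because a plain alphabetical sort would place iteration5 after iteration10.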

The set! function can also be used to restore from a checkpoint without immediately running the simulation:

julia

set!(simulation; checkpoint="path/to/file.jld2")  # restore from specific file
set!(simulation; checkpoint=:latest)              # restore from latest checkpoint (requires Checkpointer)
set!(simulation; iteration=12345)                 # restore from specific iteration (requires Checkpointer)

Checkpointing on wall-clock time

For cluster jobs with time limits, use WallTimeInterval to checkpoint based on elapsed wall-clock time rather than simulation time or iterations:

julia

using Oceananigans.Units  # provides `minute`

# Checkpoint every 30 minutes of wall-clock time
Checkpointer(model, schedule=WallTimeInterval(30minute), prefix="checkpoint")

This ensures checkpoints are saved regularly even if individual time steps vary significantly.
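
The idea behind a wall-clock schedule can be sketched in a few lines of plain Julia. The WallClockTrigger type and the should_fire! and simulate functions below are hypothetical names used for illustration only, not Oceananigans' WallTimeInterval implementation. The current time is passed in as an argument so that the behavior can be checked without actually waiting.

```julia
# Hypothetical illustration (not Oceananigans code): fire once the elapsed
# wall-clock time since the previous firing reaches `interval` seconds,
# regardless of how many iterations that takes.
mutable struct WallClockTrigger
    interval :: Float64    # seconds of wall-clock time between firings
    last_fired :: Float64  # wall-clock time of the previous firing
end

WallClockTrigger(interval; now=time()) = WallClockTrigger(interval, now)

function should_fire!(trigger, now=time())
    if now - trigger.last_fired >= trigger.interval
        trigger.last_fired = now
        return true
    end
    return false
end

# Simulated run: each entry of `step_costs` is the wall-clock cost (in seconds)
# of one time step. Returns the steps after which the trigger fired.
function simulate(step_costs, interval)
    trigger = WallClockTrigger(interval; now=0.0)
    wall_time = 0.0
    fired = Int[]
    for (step, cost) in enumerate(step_costs)
        wall_time += cost
        should_fire!(trigger, wall_time) && push!(fired, step)
    end
    return fired
end

fired = simulate([600, 600, 700, 50, 2000, 100], 1800)  # 1800 s = 30 minutes
println(fired)  # prints [3, 5]
```

In this simulated run the trigger fires after steps 3 and 5, that is, whenever at least 30 minutes have accumulated since the previous firing, even though the cost of individual steps ranges from 50 to 2000 seconds.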

Manual checkpointing

Use checkpoint to manually save the simulation state at any point:

julia

checkpoint(simulation)                            # uses Checkpointer settings if available
checkpoint(simulation, filepath="my_state.jld2")  # write to specific file

If a Checkpointer is configured in simulation.output_writers, it will be used (respecting its dir, prefix, and other settings). Otherwise, the checkpoint is written to the specified filepath or, if no filepath is given, to checkpoint_iteration{N}.jld2 in the current directory.

Automatic checkpointing at end

Use checkpoint_at_end=true to automatically checkpoint the simulation when it finishes:

julia

run!(simulation, checkpoint_at_end=true)  # Checkpoints when done

This ensures the final simulation state is saved, even if the simulation stops due to wall time limits or other callbacks.

If a Checkpointer is configured, it will be used. Otherwise, a file named checkpoint_iteration{N}.jld2 is created in the current directory.

What gets checkpointed

Checkpointing saves the prognostic state, that is, the data that evolves during the simulation. This includes prognostic model fields (velocities, tracers, diffusivities, etc.), the clock, the state of the time stepper, output writer state, turbulence closure state, free surface state, and Lagrangian particle properties.

Static configuration is not checkpointed. This includes the grid, boundary conditions, forcing functions, closure parameters, model options, and callbacks.

This means your script must recreate the simulation with identical configuration before restoring from a checkpoint.
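
One way to meet this requirement is to keep all configuration in a single function that both the original run and every restart call. The following is a sketch that uses only the calls shown on this page. The function name build_simulation and the choice to detect existing checkpoint files with readdir are assumptions made for illustration.

```julia
using Oceananigans

# Hypothetical helper: every run, fresh or restarted, builds the simulation
# through this one function so that the configuration is always identical.
function build_simulation(; stop_iteration)
    grid = RectilinearGrid(size=(8, 8, 8), extent=(1, 1, 1))
    model = NonhydrostaticModel(grid)
    simulation = Simulation(model, Δt=1, stop_iteration=stop_iteration)

    simulation.output_writers[:checkpointer] = Checkpointer(model,
                                                            schedule = IterationInterval(5),
                                                            prefix = "model_checkpoint")
    return simulation
end

simulation = build_simulation(stop_iteration=12)

# Assumption: checkpoints live in the current directory (the Checkpointer
# default shown earlier). Request a pickup only if a checkpoint file exists.
has_checkpoint = any(startswith("model_checkpoint_iteration"), readdir("."))

run!(simulation, pickup=has_checkpoint)
```

Running the same script a second time then resumes from the latest checkpoint instead of starting from iteration 0.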
