Checkpointing

A Checkpointer can be used to serialize the entire model state to a file from which the model can be restored at any time. This is useful if you'd like to periodically checkpoint when running long simulations in case of crashes or hitting cluster time limits, but also if you'd like to restore from a checkpoint and try out multiple scenarios.

For example, you can periodically checkpoint the model state to disk every 1,000,000 seconds of simulation time, writing files of the form model_checkpoint_iteration12500.jld2, where 12500 is the iteration number (filled in automatically).
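A minimal sketch of such a setup follows. It assumes that TimeInterval, the time-based counterpart of the IterationInterval schedule, is available from Oceananigans. The grid size, time step, and stop iteration are placeholder values chosen only to keep the snippet self-contained.

```julia
using Oceananigans

# Placeholder model and simulation, kept small so the sketch stands alone.
model = NonhydrostaticModel(grid=RectilinearGrid(size=(8, 8, 8), extent=(1, 1, 1)))
simulation = Simulation(model, Δt=1, stop_iteration=8)

# Assumption: TimeInterval(1e6) schedules output every 1,000,000 seconds
# of simulation time (not wall-clock time).
simulation.output_writers[:checkpointer] = Checkpointer(model,
                                                        schedule = TimeInterval(1e6),
                                                        prefix = "model_checkpoint")
```

With a setup like this, the checkpointer is only attached to the simulation; files are written once run! is called and the schedule is reached.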

Here's an example where we checkpoint every 5 iterations. This is far more often than appropriate for typical applications: we only do it here for illustration purposes.

julia> using Oceananigans
julia> model = NonhydrostaticModel(grid=RectilinearGrid(size=(8, 8, 8), extent=(1, 1, 1)))
NonhydrostaticModel{CPU, RectilinearGrid}(time = 0 seconds, iteration = 0)
├── grid: 8×8×8 RectilinearGrid{Float64, Periodic, Periodic, Bounded} on CPU with 3×3×3 halo
├── timestepper: RungeKutta3TimeStepper
├── advection scheme: Centered(order=2)
├── tracers: ()
├── closure: Nothing
├── buoyancy: Nothing
└── coriolis: Nothing
julia> simulation = Simulation(model, Δt=1, stop_iteration=8)
Simulation of NonhydrostaticModel{CPU, RectilinearGrid}(time = 0 seconds, iteration = 0)
├── Next time step: 1 second
├── Elapsed wall time: 0 seconds
├── Wall time per iteration: NaN days
├── Stop time: Inf days
├── Stop iteration: 8.0
├── Wall time limit: Inf
├── Minimum relative step: 0.0
├── Callbacks: OrderedDict with 4 entries:
│   ├── stop_time_exceeded => 4
│   ├── stop_iteration_exceeded => -
│   ├── wall_time_limit_exceeded => e
│   └── nan_checker => }
├── Output writers: OrderedDict with no entries
└── Diagnostics: OrderedDict with no entries
julia> simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=IterationInterval(5), prefix="model_checkpoint")
Checkpointer{IterationInterval, Vector{Symbol}}(IterationInterval(5, 0), ".", "model_checkpoint", [:grid, :clock, :timestepper], false, false, false)

Again, purely for illustration, we also add a callback that prints the current iteration and time of the simulation:

julia> show_iteration(sim) = @info "iteration: $(iteration(sim)); time: $(prettytime(sim.model.clock.time))"
show_iteration (generic function with 1 method)
julia> add_callback!(simulation, show_iteration, name=:info, IterationInterval(1))

Now let's run the simulation:

julia> run!(simulation)
[ Info: Initializing simulation...
[ Info: iteration: 0; time: 0 seconds
[ Info:     ... simulation initialization complete (2.724 seconds)
[ Info: Executing initial time step...
[ Info: iteration: 1; time: 1 second
[ Info:     ... initial time step complete (7.506 seconds).
[ Info: iteration: 2; time: 2.000 seconds
[ Info: iteration: 3; time: 3 seconds
[ Info: iteration: 4; time: 4 seconds
[ Info: iteration: 5; time: 5 seconds
[ Info: iteration: 6; time: 6 seconds
[ Info: iteration: 7; time: 7 seconds
[ Info: Simulation is stopping after running for 10.669 seconds.
[ Info: Model iteration 8 equals or exceeds stop iteration 8.
[ Info: iteration: 8; time: 8 seconds

The default options should provide checkpoint files that are easy to restore from (in most cases). For more advanced options and features, see Checkpointer.

Picking up a simulation from a checkpoint file

Picking up a simulation from a checkpoint requires the original script that was used to generate the checkpoint data. Change the first instance of run! in the script to take pickup=true.

When pickup=true is passed to run!, it finds the latest checkpoint file in the current working directory, loads the prognostic fields and their tendencies from the file, resets the model clock and iteration to the time and iteration that the checkpoint corresponds to, and updates the model's auxiliary state. After that, the time-stepping loop begins. In this simple example, although the simulation ran up to iteration 8, the latest checkpoint is associated with iteration 5.

julia> simulation.stop_iteration = 12
12
julia> run!(simulation, pickup=true)
┌ Warning: Particles do not exist in checkpoint and could not be restored.
└ @ Oceananigans.Models ~/Oceananigans.jl-26302/src/Models/set_model.jl:46
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (242.664 μs)
[ Info: Executing initial time step...
[ Info: iteration: 6; time: 6 seconds
[ Info:     ... initial time step complete (836.244 μs).
[ Info: iteration: 7; time: 7 seconds
[ Info: iteration: 8; time: 8 seconds
[ Info: iteration: 9; time: 9 seconds
[ Info: iteration: 10; time: 10 seconds
[ Info: iteration: 11; time: 11 seconds
[ Info: Simulation is stopping after running for 8.332 ms.
[ Info: Model iteration 12 equals or exceeds stop iteration 12.
[ Info: iteration: 12; time: 12 seconds
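To make the notion of the "latest" checkpoint concrete, here is a small sketch that uses only Julia's Base library. It is not Oceananigans' internal implementation; latest_checkpoint is a hypothetical helper written for illustration, and it assumes files follow the prefix_iterationN.jld2 naming pattern described earlier. It compares iteration numbers numerically, because sorting by file name would place iteration 100 before iteration 5.

```julia
# Hypothetical helper (not part of Oceananigans): return the path of the file
# named "<prefix>_iteration<N>.jld2" in `dir` with the largest N,
# or `nothing` if no such file exists.
# Assumes `prefix` contains no regex metacharacters.
function latest_checkpoint(dir, prefix)
    pattern = Regex("^" * prefix * "_iteration(\\d+)\\.jld2\$")
    best_file, best_iteration = nothing, -1
    for name in readdir(dir)
        m = match(pattern, name)
        m === nothing && continue              # skip unrelated files
        n = parse(Int, m.captures[1])          # compare numerically, not as strings
        if n > best_iteration
            best_file, best_iteration = joinpath(dir, name), n
        end
    end
    return best_file
end

# Usage: create empty stand-in files in a temporary directory.
dir = mktempdir()
for n in (5, 10, 100)
    touch(joinpath(dir, "model_checkpoint_iteration$n.jld2"))
end
latest = latest_checkpoint(dir, "model_checkpoint")
```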

Use pickup=iteration, where iteration is an Integer, to pick up from a specific iteration. Alternatively, use pickup=filepath, where filepath is a string, to pick up from the specific file located at filepath.
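As a sketch, the following script repeats the example above but picks up explicitly from iteration 5 rather than from the latest checkpoint. The file name in the commented-out alternative is an assumption based on the naming pattern described earlier (prefix, then _iteration, then the iteration number); check the files actually present in your working directory.

```julia
using Oceananigans

model = NonhydrostaticModel(grid=RectilinearGrid(size=(8, 8, 8), extent=(1, 1, 1)))
simulation = Simulation(model, Δt=1, stop_iteration=8)
simulation.output_writers[:checkpointer] = Checkpointer(model,
                                                        schedule = IterationInterval(5),
                                                        prefix = "model_checkpoint")

# First run: produces a checkpoint at iteration 5.
run!(simulation)

# Extend the run and restart from the checkpoint written at iteration 5.
simulation.stop_iteration = 12
run!(simulation, pickup=5)

# Alternative: restart from an explicit file path (assumed file name).
# run!(simulation, pickup="model_checkpoint_iteration5.jld2")
```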