Checkpointing

A checkpointer can be used to serialize the entire model state to a file from which the model can be restored at any time. This is useful if you'd like to periodically checkpoint when running long simulations in case of crashes or cluster time limits, but also if you'd like to restore from a checkpoint and try out multiple scenarios.

For example, to checkpoint the model state to disk every 5 years of simulation time, writing files of the form model_checkpoint_iteration12500.jld2, where 12500 is the iteration number (filled in automatically):

julia> using Oceananigans, Oceananigans.Units
julia> model = NonhydrostaticModel(grid=RectilinearGrid(size=(16, 16, 16), extent=(1, 1, 1)))
NonhydrostaticModel{CPU, RectilinearGrid}(time = 0 seconds, iteration = 0)
├── grid: 16×16×16 RectilinearGrid{Float64, Periodic, Periodic, Bounded} on CPU with 3×3×3 halo
├── timestepper: QuasiAdamsBashforth2TimeStepper
├── tracers: ()
├── closure: Nothing
├── buoyancy: Nothing
└── coriolis: Nothing

julia> simulation = Simulation(model, Δt=1, stop_iteration=1)
Simulation of NonhydrostaticModel{CPU, RectilinearGrid}(time = 0 seconds, iteration = 0)
├── Next time step: 1 second
├── Elapsed wall time: 0 seconds
├── Wall time per iteration: NaN years
├── Stop time: Inf years
├── Stop iteration : 1.0
├── Wall time limit: Inf
├── Callbacks: OrderedDict with 4 entries:
│   ├── stop_time_exceeded => Callback of stop_time_exceeded on IterationInterval(1)
│   ├── stop_iteration_exceeded => Callback of stop_iteration_exceeded on IterationInterval(1)
│   ├── wall_time_limit_exceeded => Callback of wall_time_limit_exceeded on IterationInterval(1)
│   └── nan_checker => Callback of NaNChecker for u on IterationInterval(100)
├── Output writers: OrderedDict with no entries
└── Diagnostics: OrderedDict with no entries

julia> simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=TimeInterval(5years), prefix="model_checkpoint")
Checkpointer{TimeInterval, Vector{Symbol}}(TimeInterval(1.5768e8, 0.0), ".", "model_checkpoint", [:architecture, :grid, :clock, :coriolis, :buoyancy, :closure, :timestepper, :particles], false, false, false)

julia> run!(simulation)
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (4.886 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (21.648 seconds).
[ Info: Simulation is stopping. Model iteration 1 has hit or exceeded simulation stop iteration 1.

The default options should provide checkpoint files that are easy to restore from in most cases. For more advanced options and features, see Checkpointer.
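As a rough illustration of what adjusting those options might look like, the sketch below writes checkpoints into a separate directory and asks the checkpointer to delete older files as new ones are written. It assumes the `dir` and `cleanup` keyword arguments behave as described in the Checkpointer docstring for your installed version; the directory name `checkpoints` is only a placeholder.

```julia
using Oceananigans

grid = RectilinearGrid(size=(16, 16, 16), extent=(1, 1, 1))
model = NonhydrostaticModel(grid=grid)
simulation = Simulation(model, Δt=1, stop_iteration=10)

# "checkpoints" is a placeholder directory name; create it up front so the
# example does not depend on the checkpointer creating it.
mkpath("checkpoints")

# Assumed keyword arguments (check the Checkpointer docstring):
#   dir     -- directory that checkpoint files are written to
#   cleanup -- remove older checkpoint files once a newer one is written
simulation.output_writers[:checkpointer] = Checkpointer(model,
                                                        schedule=IterationInterval(5),
                                                        dir="checkpoints",
                                                        prefix="model_checkpoint",
                                                        cleanup=true)

run!(simulation)
```

Keeping checkpoints in their own directory makes it easier to clear them out between experiments, and removing older files limits disk usage on long runs.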

Picking up a simulation from a checkpoint file

Picking up a simulation from a checkpoint requires the original script that was used to generate the checkpoint data. Change the first instance of run! in the script to take pickup=true:

julia> simulation.stop_iteration = 2
2

julia> run!(simulation, pickup=true)
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (79.214 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (2.090 ms).
[ Info: Simulation is stopping. Model iteration 2 has hit or exceeded simulation stop iteration 2.

With pickup=true, run! finds the latest checkpoint file in the current working directory (in this trivial case, the checkpoint associated with iteration 0), loads the prognostic fields and their tendencies from the file, resets the model clock and iteration, and updates the model auxiliary state before starting the time-stepping loop.
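To make "latest checkpoint file" concrete, here is a minimal sketch of how one could select it by hand from file names of the form shown earlier. The helper `latest_checkpoint` is hypothetical and is not part of Oceananigans; it only illustrates the idea and is not a description of how the library does it internally.

```julia
# Hypothetical helper (not an Oceananigans function): return the path of the
# checkpoint with the largest iteration number in `dir`, or `nothing` if no
# file matches `<prefix>_iteration<N>.jld2`.
# Assumes `prefix` contains no regex metacharacters.
function latest_checkpoint(dir, prefix)
    pattern = Regex("^" * prefix * raw"_iteration(\d+)\.jld2$")
    best_file, best_iter = nothing, -1
    for name in readdir(dir)
        m = match(pattern, name)
        m === nothing && continue          # skip unrelated files
        iter = parse(Int, m.captures[1])   # compare numerically, not as strings
        if iter > best_iter
            best_file, best_iter = joinpath(dir, name), iter
        end
    end
    return best_file
end
```

The iteration numbers are compared as integers because a plain alphabetical sort would place iteration 9 after iteration 12500.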

Use pickup=iteration, where iteration is an Integer, to pick up from a specific iteration. Alternatively, use pickup=filepath, where filepath is a String, to pick up from a specific file located at filepath.
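The two forms might be used as in the following sketch. It assumes a checkpoint schedule of every 5 iterations (chosen here only so that several files exist quickly) and uses `overwrite_existing=true` so that checkpoint files left over from an earlier run, or rewritten after a pickup, do not cause an error. The file name follows the prefix and iteration pattern described above.

```julia
using Oceananigans

grid = RectilinearGrid(size=(16, 16, 16), extent=(1, 1, 1))
model = NonhydrostaticModel(grid=grid)
simulation = Simulation(model, Δt=1, stop_iteration=10)

# Checkpoint every 5 iterations; allow existing files to be overwritten.
simulation.output_writers[:checkpointer] = Checkpointer(model,
                                                        schedule=IterationInterval(5),
                                                        prefix="model_checkpoint",
                                                        overwrite_existing=true)

# Expected to leave checkpoints for iterations 0, 5, and 10 on disk.
run!(simulation)

# Pick up from a specific iteration...
simulation.stop_iteration = 15
run!(simulation, pickup=5)

# ...or from an explicit file path.
simulation.stop_iteration = 20
run!(simulation, pickup="model_checkpoint_iteration10.jld2")
```

Picking up by iteration is convenient when the files sit in the checkpointer's own directory, while the file path form is useful when a checkpoint has been moved or copied elsewhere.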