Checkpointing

A checkpointer can be used to serialize the entire model state to a file from which the model can be restored at any time. This is useful if you'd like to periodically checkpoint when running long simulations in case of crashes or cluster time limits, but also if you'd like to restore from a checkpoint and try out multiple scenarios.

For example, to checkpoint the model state to disk every 5 years of simulation time, writing files of the form model_checkpoint_iteration12500.jld2, where 12500 is the iteration number (filled in automatically):

julia> using Oceananigans, Oceananigans.Units
julia> model = NonhydrostaticModel(grid=RegularRectilinearGrid(size=(16, 16, 16), extent=(1, 1, 1)))
NonhydrostaticModel{CPU, Float64}(time = 0 seconds, iteration = 0)
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=16, Ny=16, Nz=16)
├── tracers: (:T, :S)
├── closure: Nothing
├── buoyancy: SeawaterBuoyancy{Float64, LinearEquationOfState{Float64}, Nothing, Nothing}
└── coriolis: Nothing

julia> simulation = Simulation(model, Δt=1, stop_iteration=1)
Simulation{typename(NonhydrostaticModel){typename(CPU), Float64}}
├── Model clock: time = 0 seconds, iteration = 0
├── Next time step (Int64): 1 second
├── Iteration interval: 1
├── Stop criteria: Any[Oceananigans.Simulations.iteration_limit_exceeded, Oceananigans.Simulations.stop_time_exceeded, Oceananigans.Simulations.wall_time_limit_exceeded]
├── Run time: 0 seconds, wall time limit: Inf
├── Stop time: Inf years, stop iteration: 1
├── Diagnostics: typename(OrderedCollections.OrderedDict) with 1 entry:
│   └── nan_checker => typename(NaNChecker)
└── Output writers: typename(OrderedCollections.OrderedDict) with no entries

julia> simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=TimeInterval(5years), prefix="model_checkpoint")
Checkpointer{TimeInterval, Vector{Symbol}}(TimeInterval(1.5768e8, 0.0), ".", "model_checkpoint", [:architecture, :grid, :clock, :coriolis, :buoyancy, :closure, :velocities, :tracers, :timestepper, :particles], false, false, false)

julia> run!(simulation)
[ Info: Updating model auxiliary state before the first time step...
[ Info: ... updated in 1.428 ms.
[ Info: Executing first time step...
[ Info: Simulation is stopping. Model iteration 1 has hit or exceeded simulation stop iteration 1.

The default options should provide checkpoint files that are easy to restore from in most cases. For more advanced options and features, see Checkpointer.

Picking up a simulation from a checkpoint file

Picking up a simulation from a checkpoint requires the original script that was used to generate the checkpoint data. Change the first instance of run! in the script to take pickup=true:

julia> simulation.stop_iteration = 2

julia> run!(simulation, pickup=true)
[ Info: Updating model auxiliary state before the first time step...
[ Info: ... updated in 670.209 μs.
[ Info: Executing first time step...
[ Info: Simulation is stopping. Model iteration 2 has hit or exceeded simulation stop iteration 2.

Passing pickup=true finds the latest checkpoint file in the current working directory (in this trivial case, the checkpoint associated with iteration 0), loads the prognostic fields and their tendencies from that file, resets the model clock and iteration, and updates the model auxiliary state before starting the time-stepping loop.
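To make "the latest checkpoint file" concrete, here is a minimal sketch of one way such a lookup could work. It assumes the prefix_iterationN.jld2 naming shown earlier. The helper latest_checkpoint is hypothetical and is not claimed to be Oceananigans' own implementation; it uses only Julia's standard library.

```julia
# Hypothetical helper, NOT part of Oceananigans: return the path of the
# checkpoint in `dir` with the largest iteration number, or `nothing`
# if no file matches "<prefix>_iteration<N>.jld2".
function latest_checkpoint(dir::AbstractString, prefix::AbstractString)
    # \Q...\E quotes the prefix so regex metacharacters in it are literal.
    pattern = Regex("^\\Q" * prefix * "\\E_iteration(\\d+)\\.jld2\$")
    best_iteration, best_path = -1, nothing
    for name in readdir(dir)
        m = match(pattern, name)
        m === nothing && continue
        iteration = tryparse(Int, m.captures[1])
        iteration === nothing && continue   # digits too long for an Int
        if iteration > best_iteration
            best_iteration, best_path = iteration, joinpath(dir, name)
        end
    end
    return best_path
end

# Usage: create empty stand-in files and look up the newest one.
mktempdir() do dir
    for i in (0, 900, 12500)
        touch(joinpath(dir, "model_checkpoint_iteration$(i).jld2"))
    end
    println(basename(latest_checkpoint(dir, "model_checkpoint")))
    # prints "model_checkpoint_iteration12500.jld2"
end
```

The iteration numbers are compared as integers rather than as strings, because a lexicographic sort of the file names would rank iteration 900 above iteration 12500.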

Use pickup=iteration, where iteration is an Integer, to pick up from a specific iteration. Alternatively, use pickup=filepath, where filepath is a string, to pick up from a specific file located at filepath.
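The sketch below illustrates how the type of the pickup value can select the file to load, using multiple dispatch. It assumes the prefix_iterationN.jld2 naming shown earlier. The function checkpoint_path and its dir and prefix keywords are hypothetical names chosen for illustration, not the library's internals.

```julia
# Hypothetical helper, NOT part of Oceananigans: map a `pickup` value to
# the checkpoint file it designates.

# An integer is read as an iteration number and expanded into the file name.
checkpoint_path(pickup::Integer; dir=".", prefix="model_checkpoint") =
    joinpath(dir, string(prefix, "_iteration", pickup, ".jld2"))

# A string is taken to be the file path itself.
checkpoint_path(pickup::AbstractString; kwargs...) = String(pickup)

# Bool is a subtype of Integer in Julia, so it needs its own method;
# `true` and `false` name no explicit file, hence `nothing`.
checkpoint_path(pickup::Bool; kwargs...) = nothing

println(checkpoint_path(12500; dir="output"))   # path ending in model_checkpoint_iteration12500.jld2
println(checkpoint_path("runs/spinup.jld2"))    # prints "runs/spinup.jld2"
```

In the simulation script itself, the two forms described above would read run!(simulation, pickup=12500) and run!(simulation, pickup="model_checkpoint_iteration12500.jld2").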