Checkpointing
A Checkpointer can be used to serialize the entire model state to a file from which the model can be restored at any time. This is useful if you'd like to periodically checkpoint when running long simulations in case of crashes or hitting cluster time limits, but also if you'd like to restore from a checkpoint and try out multiple scenarios.
For example, a checkpointer can write the model state to disk every 1,000,000 seconds of simulation time, to files of the form model_checkpoint_iteration12500.jld2, where 12500 is the iteration number (filled in automatically).
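A minimal sketch of such a setup follows. It assumes Oceananigans' TimeInterval schedule, which takes an interval in seconds of simulation time. The grid, time step, and stop iteration are placeholders that mirror the example on this page, not recommended values.

```julia
using Oceananigans

# Placeholder model and simulation, mirroring the small example on this page.
grid = RectilinearGrid(size=(8, 8, 8), extent=(1, 1, 1))
model = NonhydrostaticModel(grid=grid)
simulation = Simulation(model, Δt=1, stop_iteration=8)

# Assumption: TimeInterval(1e6) schedules output every 1,000,000 seconds
# of simulation time. Files are named "model_checkpoint_iteration<N>.jld2".
simulation.output_writers[:checkpointer] = Checkpointer(model,
                                                        schedule = TimeInterval(1e6),
                                                        prefix = "model_checkpoint")
```

With a stop iteration this small the simulation never reaches 1,000,000 seconds, so the sketch only shows how the checkpointer is attached.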
Here's an example where we checkpoint every 5 iterations. This is far more often than appropriate for typical applications: we only do it here for illustration purposes.
julia> using Oceananigans
julia> model = NonhydrostaticModel(grid=RectilinearGrid(size=(8, 8, 8), extent=(1, 1, 1)))
NonhydrostaticModel{CPU, RectilinearGrid}(time = 0 seconds, iteration = 0)
├── grid: 8×8×8 RectilinearGrid{Float64, Periodic, Periodic, Bounded} on CPU with 3×3×3 halo
├── timestepper: RungeKutta3TimeStepper
├── advection scheme: Centered(order=2)
├── tracers: ()
├── closure: Nothing
├── buoyancy: Nothing
└── coriolis: Nothing
julia> simulation = Simulation(model, Δt=1, stop_iteration=8)
Simulation of NonhydrostaticModel{CPU, RectilinearGrid}(time = 0 seconds, iteration = 0)
├── Next time step: 1 second
├── Elapsed wall time: 0 seconds
├── Wall time per iteration: NaN days
├── Stop time: Inf days
├── Stop iteration : 8.0
├── Wall time limit: Inf
├── Callbacks: OrderedDict with 4 entries:
│   ├── stop_time_exceeded => Callback of stop_time_exceeded on IterationInterval(1)
│   ├── stop_iteration_exceeded => Callback of stop_iteration_exceeded on IterationInterval(1)
│   ├── wall_time_limit_exceeded => Callback of wall_time_limit_exceeded on IterationInterval(1)
│   └── nan_checker => Callback of NaNChecker for u on IterationInterval(100)
├── Output writers: OrderedDict with no entries
└── Diagnostics: OrderedDict with no entries
julia> simulation.output_writers[:checkpointer] = Checkpointer(model, schedule=IterationInterval(5), prefix="model_checkpoint")
Checkpointer{IterationInterval, Vector{Symbol}}(IterationInterval(5, 0), ".", "model_checkpoint", [:grid, :timestepper, :particles, :clock, :coriolis, :buoyancy, :closure], false, false, false)
Again for illustration purposes, we also add a callback that prints the current iteration and time of the simulation:
julia> show_iteration(sim) = @info "iteration: $(iteration(sim)); time: $(prettytime(sim.model.clock.time))"
show_iteration (generic function with 1 method)
julia> add_callback!(simulation, show_iteration, name=:info, IterationInterval(1))
Now let's run the simulation:
julia> run!(simulation)
[ Info: Initializing simulation...
[ Info: iteration: 0; time: 0 seconds
[ Info: ... simulation initialization complete (4.652 seconds)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (4.192 seconds).
[ Info: iteration: 1; time: 1 second
[ Info: iteration: 2; time: 2.000 seconds
[ Info: iteration: 3; time: 3 seconds
[ Info: iteration: 4; time: 4 seconds
[ Info: iteration: 5; time: 5 seconds
[ Info: iteration: 6; time: 6 seconds
[ Info: iteration: 7; time: 7 seconds
[ Info: Simulation is stopping after running for 8.883 seconds.
[ Info: Model iteration 8 equals or exceeds stop iteration 8.
[ Info: iteration: 8; time: 8 seconds
The default options should provide checkpoint files that are easy to restore from (in most cases). For more advanced options and features, see Checkpointer.
Picking up a simulation from a checkpoint file
Picking up a simulation from a checkpoint requires the original script that was used to generate the checkpoint data. Change the first instance of run! in the script to take pickup=true.
When pickup=true is provided to run!, it finds the latest checkpoint file in the current working directory, loads the prognostic fields and their tendencies from the file, resets the model clock and iteration to the time and iteration that the checkpoint corresponds to, and updates the model auxiliary state. After that, the time-stepping loop begins. In this simple example, although the simulation ran up to iteration 8, the latest checkpoint is associated with iteration 5.
julia> simulation.stop_iteration = 12
12
julia> run!(simulation, pickup=true)
[ Info: Initializing simulation...
[ Info: ... simulation initialization complete (1.121 ms)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (4.752 ms).
[ Info: iteration: 6; time: 6 seconds
[ Info: iteration: 7; time: 7 seconds
[ Info: iteration: 8; time: 8 seconds
[ Info: iteration: 9; time: 9 seconds
[ Info: iteration: 10; time: 10 seconds
[ Info: iteration: 11; time: 11 seconds
[ Info: Simulation is stopping after running for 38.144 ms.
[ Info: Model iteration 12 equals or exceeds stop iteration 12.
[ Info: iteration: 12; time: 12 seconds
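To make "latest checkpoint" concrete, here is a sketch of one way to pick the checkpoint file with the highest iteration number from a directory. It is an illustration in plain Julia, not Oceananigans' own implementation: latest_checkpoint is a hypothetical helper, and it assumes files follow the prefix_iterationN.jld2 naming pattern described earlier.

```julia
# Hypothetical helper, not part of Oceananigans.
# Returns the path of the file named `<prefix>_iteration<N>.jld2` in `dir`
# with the largest N, or `nothing` if no file matches.
# Assumes `prefix` contains no regex metacharacters.
function latest_checkpoint(dir, prefix)
    pattern = Regex("^" * prefix * "_iteration(\\d+)\\.jld2\$")
    best_iteration = -1
    best_file = nothing
    for name in readdir(dir)
        m = match(pattern, name)
        m === nothing && continue
        # Compare iterations as integers, so that 10 ranks above 5.
        iteration = parse(Int, m.captures[1])
        if iteration > best_iteration
            best_iteration = iteration
            best_file = joinpath(dir, name)
        end
    end
    return best_file
end

# Usage with empty placeholder files in a temporary directory.
dir = mktempdir()
for i in (0, 5, 10)
    touch(joinpath(dir, "model_checkpoint_iteration$i.jld2"))
end
latest = latest_checkpoint(dir, "model_checkpoint")
println(basename(latest))  # prints model_checkpoint_iteration10.jld2
```

Parsing the iteration as an integer matters here: a plain alphabetical sort of the file names would place iteration 5 after iteration 10.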
Use pickup=iteration, where iteration is an Integer, to pick up from a specific iteration. Or, use pickup=filepath, where filepath is a string, to pick up from a specific file located at filepath.
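As a sketch of these two forms, the script below reruns the small example from this page and then picks up from iteration 5. The file name in the commented-out alternative is an assumption based on the naming pattern described earlier (prefix, then _iteration, then the iteration number) and on checkpoints being written to the current working directory.

```julia
using Oceananigans

# Same setup as the example on this page.
model = NonhydrostaticModel(grid=RectilinearGrid(size=(8, 8, 8), extent=(1, 1, 1)))
simulation = Simulation(model, Δt=1, stop_iteration=8)
simulation.output_writers[:checkpointer] = Checkpointer(model,
                                                        schedule = IterationInterval(5),
                                                        prefix = "model_checkpoint")

# First run: as described above, the latest checkpoint afterwards is at iteration 5.
run!(simulation)

# Extend the run and pick up from the checkpoint at iteration 5 explicitly.
simulation.stop_iteration = 12
run!(simulation, pickup=5)

# Alternative: pick up from an explicit file path (assumed file name).
# run!(simulation, pickup="model_checkpoint_iteration5.jld2")
```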