Debugging
ClimaCore
comes with a set of debugging tools for a variety of situations. These tools are collected in the ClimaCore.DebugOnly
modules and are presented in this page.
Finding where NaN
arise
One of the most challenging tasks that users have is: debug a large simulation that is breaking, e.g., yielding NaN
s somewhere. This is especially complex for large models with many terms and implicit time-stepping with all the bells and whistles that the CliMA ecosystem offers.
Because so much data (for example, the solution state, and many cached fields) is typically contained in ClimaCore
data structures, we offer a hook to inspect this data after any operation that ClimaCore
performs.
Example
Print NaNs
when they are found
In this example, we add a callback that simply prints NaNs found
every instance when they are detected in a ClimaCore
operation.
To do this, we need two ingredients:
First, we need to enable the callback system:
import ClimaCore
ClimaCore.DebugOnly.call_post_op_callback() = true
The line ClimaCore.DebugOnly.call_post_op_callback() = true
means that at the end of every ClimaCore
operation, the function ClimaCore.DebugOnly.post_op_callback
is called. By default, this function does nothing. So, the second ingredient is to define a method:
function ClimaCore.DebugOnly.post_op_callback(result, args...; kwargs...)
has_nans = result isa Number ? isnan(result) : any(isnan, parent(result))
has_inf = result isa Number ? isinf(result) : any(isinf, parent(result))
if has_nans || has_inf
has_nans && println("NaNs found!")
has_inf && println("Infs found!")
end
end
If needed, multiple methods of post_op_callback
can be defined, but here, we define a general method that checks if NaN
s are in the given object.
Note that we need post_op_callback
to be called for a wide variety of inputs because it is called by many different functions with many different objects. Therefore, we recommend that you define post_op_callback
with a very general method signature, like the one above and perhaps use Infiltrator
to inspect the arguments.
Now, let us put everything together and demonstrate a complete example:
import ClimaCore
ClimaCore.DebugOnly.call_post_op_callback() = true
function ClimaCore.DebugOnly.post_op_callback(result, args...; kwargs...)
has_nans = result isa Number ? isnan(result) : any(isnan, parent(result))
has_inf = result isa Number ? isinf(result) : any(isinf, parent(result))
if has_nans || has_inf
has_nans && println("NaNs found!")
has_inf && println("Infs found!")
end
end
FT = Float64
data = ClimaCore.DataLayouts.VIJFH{FT}(Array{FT}, zeros; Nv=5, Nij=2, Nh=2)
@. data = NaN
NaNs found!
NaNs found!
NaNs found!
This example should print NaN
on your standard output.
As you see, this only tells us that a NaN
was found, but not which function triggered the NaN
. A simple way to find that is to extend this example with Infiltrator
.
Infiltrating and Exfiltrating
Infiltrator.jl is a simple debugging tool for Julia packages.
Here is an example, where we can use Infiltrator.jl to find where NaN
s is coming from interactively.
Infiltrating
Suppose you have a NaN
in your simulation and what to look at the variables right before the expression caused the NaN
. One of the challenges in this is that you probably have a tower of function being called, many of which belong to other packages.
Let us simulate this case (note this is a toy example optimized for clarity and not for performance: the functions below have unnecessary allocations)
import ClimaCore
using ClimaCore.CommonSpaces
space = CubedSphereSpace(; radius = 10, n_quad_points = 4, h_elem = 10)
myrho = ones(space)
myP = ones(space)
myu = ones(space)
kinetic_energy(field) = field .* field ./ 2
other_energy(field) = field ./ sum(field)
offset_by_one(field) = field .- 1
function specific_energy(rho, P, u)
density_without_restmass = offset_by_one(rho)
return (kinetic_energy(u) .+ other_energy(P)) ./ density_without_restmass
end
function renormalized_energy(rho, P, u)
energy = specific_energy(rho, P, u)
return energy ./ sum(energy)
end
any(isnan, renormalized_energy(myrho, myP, myu)) # true
To debug this, we first need to identify where the first NaN
is produced. We use DebugOnly.call_post_op_callback
and infiltrate.
import Infiltrator # must be in your default environment
ClimaCore.DebugOnly.call_post_op_callback() = true
function ClimaCore.DebugOnly.post_op_callback(result, args...; kwargs...)
has_nans = result isa Number ? isnan(result) : any(isnan, parent(result))
has_inf = result isa Number ? isinf(result) : any(isinf, parent(result))
@infiltrate has_nans || has_inf
end
"Infiltrating" means being dropped into a new REPL where in the scope of the @infiltrate
macro. @infiltrate condition
means that we want to infiltrate only when the condition
is true (in this case has_nans || has_inf
).
Now, when we run our example, we will see
julia> renormalized_energy(myrho, myP, myu)
Infiltrating post_op_callback(::ClimaCore.DataLayouts.IJFH{Float64, 4, Array{Float64, 4}}, ::ClimaCore.DataLayouts.IJFH{Float64, 4, Array{Float64, 4}}, ::Vararg{Any}; kwargs::@Kwargs{})
at REPL[40]:4
infil>
Here, we are dropped into a new REPL with full access to the variables in the scope where the NaN
occurred. However, because of how post_op_callback
, this is at a low level within ClimaCore
, which is typically not useful. Hence, the next step is to type @trace
, which prints out
[1] post_op_callback(::ClimaCore.DataLayouts.IJFH{…}, ::ClimaCore.DataLayouts.IJFH{…}, ::Vararg{…}; kwargs::@Kwargs{})
at REPL[40]:4
[2] post_op_callback
at REPL[40]:1
[3] copyto!
at ClimaCore.jl/src/DataLayouts/copyto.jl:18
[4] copyto!
at ClimaCore.jl/src/Fields/broadcast.jl:190
[5] copy
at ClimaCore.jl/src/Fields/broadcast.jl:97
[6] materialize
at .julia/juliaup/julia-1.11.4+0.x64.linux.gnu/share/julia/base/broadcast.jl:872
[7] specific_energy(rho::ClimaCore.Fields.Field{…}, P::ClimaCore.Fields.Field{…}, u::ClimaCore.Fields.Field{…})
at REPL[31]:2
[8] renormalized_energy(rho::ClimaCore.Fields.Field{…}, P::ClimaCore.Fields.Field{…}, u::ClimaCore.Fields.Field{…})
at REPL[36]:2
[9] top-level scope
@trace
returns a type-limited stacktrace that we can read backwards until we see our functions. In this case, we see that the first NaN
is in specific_energy
, so we will investigate that function. We leave the Infiltrator REPL with @exit
, disable the call_post_op_callback
, and move our infiltrate
call within the target function:
ClimaCore.DebugOnly.call_post_op_callback() = false
function specific_energy(rho, P, u)
@infiltrate
density_without_restmass = offset_by_one(rho)
return (kinetic_energy(u) .+ other_energy(P)) ./ density_without_restmass
end
Now, when we evaluate our problematic expression (the one at the top level, in this case renormalized_energy(myrho, myP, myu)
), we will be dropped in a REPL inside specific_energy
. Here, we have access to density_without_restmass
, and we notice that it can be zero, leading to the NaN
.
The infiltrator REPL is different from the normal Julia repl. Type ?
for some useful commands. You can fetch objects defined in the main REPL by prepending their name with Main
. Similarly, if you want to infiltrate inside a module, prepend @infiltrate
with Main
(Main.@infiltrate
).
Looking at this code, we could have probably guess that the NaN
comes from a division from 0, and the real question is how do we get that?. In more complex cases, the functions are spread across different packages and involve several different variables. This approach allows one to systematically identify where things go wrong.
Exfiltrating and StructuredPrinting
Let's now see a different way to use Infiltrator, where we move the variables in specific scope to the Main scope in the REPL and do some analysis on it.
import ClimaCore
import Infiltrator # must be in your default environment
ClimaCore.DebugOnly.call_post_op_callback() = true
function ClimaCore.DebugOnly.post_op_callback(result, args...; kwargs...)
has_nans = result isa Number ? isnan(result) : any(isnan, parent(result))
has_inf = result isa Number ? isinf(result) : any(isinf, parent(result))
if has_nans || has_inf
has_nans && println("NaNs found!")
has_inf && println("Infs found!")
# Let's define the stack trace so that we know where this came from
st = stacktrace()
# Let's use Infiltrator.jl to exfiltrate to drop into the REPL.
# Now, `Infiltrator.safehouse` will be a NamedTuple
# containing `result`, `args`, `kwargs` and `st`.
Infiltrator.@exfiltrate
# sometimes code execution doesn't stop, I'm not sure why. Let's
# make sure we exfiltrate immediately with the data we want.
error("Exfiltrating.")
end
end
FT = Float64
data = ClimaCore.DataLayouts.VIJFH{FT}(Array{FT}, zeros; Nv=5, Nij=2, Nh=2)
x = ClimaCore.DataLayouts.VIJFH{FT}(Array{FT}, zeros; Nv=5, Nij=2, Nh=2)
parent(x)[1] = NaN # emulate incorrect initialization
@. data = x + 1
# Let's see what happened
(;result, args, kwargs, st) = Infiltrator.safehouse;
# You can print the stack trace, to see where the NaNs were found:
ClimaCore.DebugOnly.print_depth_limited_stack_trace(st; maxtypedepth=1)
# Once there, you can see that the call lead you to `copyto!`,
# Inspecting `args` shows that the `Broadcasted` object used to populate the
# result was:
julia> args[2]
Base.Broadcast.Broadcasted{ClimaCore.DataLayouts.VIJFHStyle{5, 2, Array{Float64}}}(+, (ClimaCore.DataLayouts.VIJFH{Float64, 5, 2, Array{Float64, 5}}
[NaN, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 1))
# And there's your problem, NaNs is on the right-hand-side of that assignment.
If your broadcasted object is very long, it can be a bit overwhelming to figure out which part of the expression contains NaNs (if any). To make this process more manageable, we can use StructuredPrinting.jl
to highlight which parts of the broadcasted object contains NaNs:
using StructuredPrinting
import ClimaCore: DataLayouts
highlight_nans(x::DataLayouts.AbstractData) = any(y->isnan(y), parent(x));
highlight_nans(_) = false;
bc = Infiltrator.safehouse.args[2]; # we know that argument 2 is the broadcasted object
(; result) = Infiltrator.safehouse; # get the result
@structured_print bc Options(; highlight = x->highlight_nans(x))
This last line results in:
julia> @structured_print bc Options(; highlight = x->highlight_nans(x))
bc
bc.style::ClimaCore.DataLayouts.VIJFHStyle{5, 2, Array{Float64}}
bc.f::typeof(+)
bc.args::Tuple{ClimaCore.DataLayouts.VIJFH{Float64, 5, 2, Array{…}}, Int64}
bc.args.1::ClimaCore.DataLayouts.VIJFH{Float64, 5, 2, Array{Float64, 5}} # highlighted in RED
bc.args.2::Int64
bc.axes::NTuple{5, Base.OneTo{Int64}}
bc.axes.1::Base.OneTo{Int64}
bc.axes.1.stop::Int64
bc.axes.2::Base.OneTo{Int64}
bc.axes.2.stop::Int64
bc.axes.3::Base.OneTo{Int64}
bc.axes.3.stop::Int64
bc.axes.4::Base.OneTo{Int64}
bc.axes.4.stop::Int64
bc.axes.5::Base.OneTo{Int64}
bc.axes.5.stop::Int64
Caveats
While post_op_callback
may be helpful, it's not bullet proof. NaNs can infiltrate user data any time internals are used. For example parent (data) .= NaN
will not be caught by ClimaCore.DebugOnly, and errors can be observed later than expected.
post_op_callback
is called in many places, so this is a performance-critical code path and expensive operations performed in post_op_callback
may significantly slow down your code.
It is highly recommended to use post_op_callback
without @testset
, as Test.jl may continue running through code execution, until all of the tests in a given @testset
are complete, and the result will be that you will get the last observed instance of NaN
or Inf
.
Faster explorations when the initialization is expensive
Sometimes, we want to start from a given simulation state and explore different ideas. For example, we want to run a simulation for 10 days, and then test how different approaches affect its stability. Sometimes, checkpoints offer a way to do this, but not everything can be checkpointed.
A simple way to "checkpoint" a simulation is to deepcopy
its state. This allows one to step the copy instead of the original one, which can be re-used to make new copies, allowing for various explorations. ClimaCore
does not support this workflow out-of-the-box. The reason for this is that ClimaCore
uses pointers to perform certain safety checks, and deepcopies return new pointers (by definition). To enable this, override the DebugOnly.allow_mismatched_spaces_unsafe
function so that it returns true. When DebugOnly.allow_mismatched_spaces_unsafe
returns true, ClimaCore
can mix fields defined on space that are not identically the same.
Let us look at an example of this.
import ClimaCore
using ClimaCore.CommonSpaces
other_space = deepcopy(space)
one = ones(space)
other_one = ones(other_space)
one .+ other_one # This throws an error
ClimaCore.DebugOnly.allow_mismatched_spaces_unsafe() = true
one .+ other_one # Now it's fine!