Simulation tips

In Oceananigans we try to do most of the optimizing behind the scenes, so that the average user doesn't have to worry about details when setting up a simulation. However, there is only so much optimization that can be done in the source code. Because of Oceananigans' script-based interface, the user needs to be aware of a few things when writing the simulation script in order to take full advantage of Julia's speed. Furthermore, in the case of more complex GPU runs, some details can prevent your simulation from running altogether. While Julia knowledge is obviously desirable here, a user who is unfamiliar with Julia can still write efficient simulations by learning a few rules of thumb. It is nonetheless recommended that users go through Julia's performance tips, which contain more in-depth explanations of some of the aspects discussed here.

General (CPU/GPU) simulation tips

Avoid global variables whenever possible

In general, using a global variable (loosely defined as a variable defined in the main script) inside a function slows down the code. One way to circumvent this is to always use local variables, or to pass values into functions as arguments. This helps the compiler optimize the code.
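
For example, here is a minimal sketch (plain Julia, not Oceananigans-specific; the names are made up for illustration) comparing a function that reads a global with one that receives the same value as an argument:

Δt = 10.0                       # a (non-constant) global variable

step_global(u) = u .+ Δt        # reads the global Δt: hard for the compiler to optimize

step_argument(u, Δt) = u .+ Δt  # receives Δt as an argument: type-stable and fast

step_argument(zeros(4), Δt)     # pass the global in explicitly at the call site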

Another way around this is to define global variables as constants whenever possible. One thing to keep in mind is that once a const is defined, its value can't be changed until you restart the Julia session. So this latter approach is good for production-ready code, but may be undesirable in the early stages of development, while you are still exploring different simulation parameters.
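
For instance (again a generic sketch with made-up names), a parameter declared const can be read inside a function without the usual global-variable penalty; just remember it cannot be redefined later in the same session:

const g = 9.81                       # constant global: its type and value are known to the compiler

buoyancy(ρ, ρ₀) = g * (ρ₀ - ρ) / ρ₀  # reading g here is now as fast as using a local variable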

It is especially important to avoid global variables in functions that are meant to be executed in GPU kernels (such as functions defining boundary conditions and forcings). Otherwise the Julia GPU compiler can fail with obscure errors. This is explained in more detail in the GPU simulation tips section below.

Consider inlining small functions

Inlining is when the compiler replaces a function call with the body of the function being called before compiling. The advantage of inlining (which in Julia can be requested with the @inline macro) is that it eliminates the overhead of the function call. The Julia compiler automatically decides which functions it should or shouldn't inline, but you can encourage a function to be inlined by including the macro @inline before its definition. This is best suited for small functions that are called often. Here's an example of an implementation of the Heaviside function that is marked for inlining:

@inline heaviside(X) = ifelse(X < 0, zero(X), one(X))

In practice it's hard to say with certainty whether inlining a function will bring runtime benefits, since Julia and KernelAbstractions.jl (needed for GPU runs) already inline some functions automatically. However, it is generally a good idea to at least investigate this aspect of your code, as the benefits can potentially be significant.
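
One simple way to investigate is to benchmark the two versions of a function side by side. Here is a hedged sketch using the BenchmarkTools.jl package (the function names are illustrative); for a function this small the two timings may well be identical, which is itself useful information:

using BenchmarkTools

        heaviside_auto(X)   = ifelse(X < 0, zero(X), one(X))  # compiler decides whether to inline
@inline heaviside_inline(X) = ifelse(X < 0, zero(X), one(X))  # inlining is requested explicitly

x = randn(10_000)
@btime sum(heaviside_auto,   $x)  # compare the two timings on your machine
@btime sum(heaviside_inline, $x)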

GPU simulation tips

Running on GPUs can be very different from running on CPUs. Oceananigans makes most of the necessary changes in the background, so that for very simple simulations changing between CPUs and GPUs is just a matter of changing the architecture argument in the model from CPU() to GPU(). However, for more complex simulations some care needs to be taken on the part of the user. While knowledge of GPU computing (and Julia) is again desirable, an inexperienced user can also achieve high efficiency in GPU simulations by following a few simple principles.

Variables that need to be used in GPU computations need to be defined as constants

Any global variable that needs to be accessed by the GPU must be a constant, or the simulation will crash. This includes any variable used in forcing functions and boundary conditions. For example, if you define a boundary condition like the one below and run your simulation on a GPU, you'll get an error.

dTdz = 0.01 # K m⁻¹
T_bcs = TracerBoundaryConditions(grid,
                                 bottom = GradientBoundaryCondition(dTdz))

However, if you define dTdz as a constant by replacing the first line with const dTdz = 0.01, then (provided everything else is done properly) your run will be successful.
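
That is, the working version of the snippet above is simply:

const dTdz = 0.01 # K m⁻¹
T_bcs = TracerBoundaryConditions(grid,
                                 bottom = GradientBoundaryCondition(dTdz))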

Complex diagnostics using ComputedFields may not work on GPUs

ComputedFields are the most convenient way to calculate diagnostics for your simulation. They will always work on CPUs, but when their complexity is high (in terms of the number of abstract operations) the compiler can't translate them into GPU code and they fail for GPU runs. (This limitation is discussed in a GitHub issue, and contributors are welcome.) For example, in the code below, calculating u² works on both CPUs and GPUs, but calculating ε will not compile on GPUs when compute! is called:

u, v, w = model.velocities
ν = model.closure.ν
u² = ComputedField(u^2)
ε = ComputedField(ν*(∂x(u)^2 + ∂x(v)^2 + ∂x(w)^2 + ∂y(u)^2 + ∂y(v)^2 + ∂y(w)^2 + ∂z(u)^2 + ∂z(v)^2 + ∂z(w)^2))
compute!(u²)
compute!(ε)

There are two approaches to bypass this issue. The first is to nest ComputedFields. For example, we can make ε compute successfully on GPUs by defining it as

ddx² = ComputedField(∂x(u)^2 + ∂x(v)^2 + ∂x(w)^2)
ddy² = ComputedField(∂y(u)^2 + ∂y(v)^2 + ∂y(w)^2)
ddz² = ComputedField(∂z(u)^2 + ∂z(v)^2 + ∂z(w)^2)
ε = ComputedField(ν*(ddx² + ddy² + ddz²))
compute!(ε)

This is a simple workaround that is especially suited for the development stage of a simulation. However, when running this, the code will iterate over the whole domain four times to calculate ε (once for each ComputedField defined), which is not very efficient.

A different way to calculate ε is by using KernelComputedFields, where the user manually specifies the computation kernel to the compiler. The advantage of this method is that it's more efficient (the code will only iterate over the domain once to calculate ε), but the disadvantage is that it requires the user to have some knowledge of Oceananigans operations and how they should be performed on a C-grid. For example, calculating ε with this approach looks like this:

using Oceananigans.Operators
using KernelAbstractions: @index, @kernel
using Oceananigans.Grids: Center, Face
using Oceananigans.Fields: KernelComputedField

# Squares the sum of two operator evaluations; used below to build the off-diagonal strain terms
@inline fψ_plus_gφ²(i, j, k, grid, f, ψ, g, φ) = @inbounds (f(i, j, k, grid, ψ) + g(i, j, k, grid, φ))^2

@kernel function isotropic_viscous_dissipation_rate_ccc!(ϵ, grid, u, v, w, ν)
    i, j, k = @index(Global, NTuple)

    # Diagonal strain-rate components, evaluated directly at cell centers
    Σˣˣ² = ∂xᶜᵃᵃ(i, j, k, grid, u)^2
    Σʸʸ² = ∂yᵃᶜᵃ(i, j, k, grid, v)^2
    Σᶻᶻ² = ∂zᵃᵃᶜ(i, j, k, grid, w)^2

    # Off-diagonal strain-rate components, interpolated back to cell centers
    Σˣʸ² = ℑxyᶜᶜᵃ(i, j, k, grid, fψ_plus_gφ², ∂yᵃᶠᵃ, u, ∂xᶠᵃᵃ, v) / 4
    Σˣᶻ² = ℑxzᶜᵃᶜ(i, j, k, grid, fψ_plus_gφ², ∂zᵃᵃᶠ, u, ∂xᶠᵃᵃ, w) / 4
    Σʸᶻ² = ℑyzᵃᶜᶜ(i, j, k, grid, fψ_plus_gφ², ∂zᵃᵃᶠ, v, ∂yᵃᶠᵃ, w) / 4

    @inbounds ϵ[i, j, k] = ν[i, j, k] * 2 * (Σˣˣ² + Σʸʸ² + Σᶻᶻ² + 2 * (Σˣʸ² + Σˣᶻ² + Σʸᶻ²))
end
ε = KernelComputedField(Center, Center, Center, isotropic_viscous_dissipation_rate_ccc!, model;
                         computed_dependencies=(u, v, w, ν))
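
Once defined this way, ε can be evaluated just like the ComputedFields above, using the same compute! function:

compute!(ε)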

It may be useful to know that kernels for some commonly-used diagnostics are already defined in companion packages to Oceananigans, for example Oceanostics.jl and LESbrary.jl. Users should look there before writing a kernel by hand, and are always welcome to open an issue on GitHub if they need help writing a different kernel.

Try to decrease the memory use of your runs

GPU runs are generally memory-limited. As an example, a state-of-the-art Tesla V100 GPU has 32GB of memory, which is enough to fit, on average, a simulation with about 100 million points, a bit smaller than a 512-cubed simulation. (The precise number depends on many other things, such as the number of tracers simulated and the diagnostics that are calculated.) This means that it is especially important to be mindful of the size of your runs when running Oceananigans on GPUs, and it is generally good practice to decrease the memory required by your runs. Below are some useful tips to achieve this:

  • Use the nvidia-smi command line utility to monitor the memory usage of the GPU. It should tell you how much memory there is on your GPU and how much of it you're using.
  • Try to use higher-order advection schemes. In general, a higher-order scheme needs fewer grid points to achieve the same accuracy as a lower-order one. Oceananigans provides two high-order advection schemes: the 5th-order WENO method (WENO5) and a 3rd-order upwind scheme (see the sketch after this list).
  • Manually define scratch space to be reused in diagnostics. By default, every time a user-defined diagnostic is calculated a new chunk of memory is reserved for that calculation, usually called scratch space. In general, the more diagnostics, the more scratch space needed and the bigger the memory requirements. However, if you explicitly create a scratch space and pass that same scratch space to as many diagnostics as you can, you minimize the memory requirements of your calculations by reusing the same memory chunk. As an example, this pattern (create one scratch space, then reuse it across several diagnostics) appears in companion packages such as LESbrary.jl.
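
As an illustration of the advection point above, switching to the 5th-order WENO scheme is a one-keyword change when building the model. This is a sketch only; the grid size and the rest of the configuration are placeholders:

grid = RegularRectilinearGrid(size=(64, 64, 64), extent=(1, 1, 1))
model = IncompressibleModel(grid=grid, architecture=GPU(), advection=WENO5())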

Arrays in GPUs are usually different from arrays in CPUs

On the CPU Oceananigans.jl uses regular Arrays, but on the GPU it has to use CuArrays from the CUDA.jl package. While deep down both are arrays, their implementations differ and they can behave very differently. One thing to keep in mind when working with CuArrays is that you do not want to access individual elements of a CuArray outside of a kernel. Doing so invokes scalar operations, in which individual elements are copied to or from the GPU for processing. This is very slow and can drastically slow down your simulation. For this reason, Oceananigans.jl disables CUDA scalar operations by default. See the scalar indexing section of the CUDA.jl documentation for more information on scalar indexing.

For example, it can be difficult to simply view a CuArray, since Julia needs to access its elements to do that. Consider the example below:

julia> using Oceananigans; using Adapt

julia> grid = RegularRectilinearGrid(size=(1,1,1), extent=(1,1,1))
RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}
                   domain: x ∈ [0.0, 1.0], y ∈ [0.0, 1.0], z ∈ [-1.0, 0.0]
                 topology: (Periodic, Periodic, Bounded)
  resolution (Nx, Ny, Nz): (1, 1, 1)
   halo size (Hx, Hy, Hz): (1, 1, 1)
grid spacing (Δx, Δy, Δz): (1.0, 1.0, 1.0)

julia> model = IncompressibleModel(grid=grid, architecture=GPU())
IncompressibleModel{GPU, Float64}(time = 0 seconds, iteration = 0) 
├── grid: RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded}(Nx=1, Ny=1, Nz=1)
├── tracers: (:T, :S)
├── closure: IsotropicDiffusivity{Float64,NamedTuple{(:T, :S),Tuple{Float64,Float64}}}
├── buoyancy: SeawaterBuoyancy{Float64,LinearEquationOfState{Float64},Nothing,Nothing}
└── coriolis: Nothing

julia> typeof(model.velocities.u.data)
OffsetArrays.OffsetArray{Float64,3,CUDA.CuArray{Float64,3}}

julia> adapt(Array, model.velocities.u.data)
3×3×3 OffsetArray(::Array{Float64,3}, 0:2, 0:2, 0:2) with eltype Float64 with indices 0:2×0:2×0:2:
[:, :, 0] =
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

[:, :, 1] =
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

[:, :, 2] =
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

Notice that in order to view the CuArray that stores values for u we needed to transform it into a regular Array first using Adapt.adapt. If we naively try to view the CuArray without that step we get an error:

julia> model.velocities.u.data
3×3×3 OffsetArray(::CUDA.CuArray{Float64,3}, 0:2, 0:2, 0:2) with eltype Float64 with indices 0:2×0:2×0:2:
[:, :, 0] =
Error showing value of type OffsetArrays.OffsetArray{Float64,3,CUDA.CuArray{Float64,3}}:
ERROR: scalar getindex is disallowed

Here CUDA.jl throws an error because scalar getindex is not allowed. Another way around this limitation is to allow scalar operations on CuArrays. We can temporarily do that with the CUDA.@allowscalar macro or by calling CUDA.allowscalar(true).

julia> using CUDA; CUDA.allowscalar(true)

julia> model.velocities.u.data
3×3×3 OffsetArray(::CuArray{Float64,3}, 0:2, 0:2, 0:2) with eltype Float64 with indices 0:2×0:2×0:2:
[:, :, 0] =
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/WV76E/src/host/indexing.jl:43
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

[:, :, 1] =
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

[:, :, 2] =
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

Notice the warning we get when we do this. Scalar operations on GPUs can be very slow, so it is advised to only use this last method in the REPL or when prototyping, and never in production-ready scripts.

You might also need to keep these differences in mind when using arrays to define initial conditions, boundary conditions or forcing functions on a GPU. To learn more about working with CuArrays, see the array programming section of the CUDA.jl documentation.
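
For example, if you precompute an array on the CPU (say, a noise field for an initial condition), you may need to move it to the GPU explicitly before using it there. A minimal sketch using CUDA.jl directly:

using CUDA

cpu_noise = rand(Float64, 16, 16, 16)  # computed on the CPU as a regular Array
gpu_noise = CuArray(cpu_noise)         # copy it to GPU memory as a CuArray
cpu_copy  = Array(gpu_noise)           # copy it back to a regular Array for inspection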