Simulation tips

In Oceananigans we try to do most of the optimizing behind the scenes, so that the average user doesn't have to worry about details when setting up a simulation. However, there's only so much optimization that can be done in the source code. Because of Oceananigans' script-based interface, the user has to be aware of a few things when writing the simulation script in order to take full advantage of Julia's speed. Furthermore, in more complex GPU runs some details can prevent your simulation from running altogether. While Julia knowledge is obviously desirable here, a user who is unfamiliar with Julia can still write efficient simulations by learning a few rules of thumb. It is nonetheless recommended that users go through Julia's performance tips, which contain more in-depth explanations of some of the aspects discussed here.

General (CPU/GPU) simulation tips

Avoid global variables whenever possible

In general, using a global variable (loosely defined as a variable declared in the main script) inside a function slows down the code. One way to circumvent this is to always use local variables, or to pass values as arguments to the functions that need them. This helps the compiler optimize the code.

Another way around this is to define global variables as constants (with the const keyword) whenever possible. One thing to keep in mind is that once a const is defined, its value can't be changed until you restart the Julia session. This latter approach is therefore good for production-ready code, but may be undesirable in the early stages of development, when you still need to change the simulation parameters frequently for exploration.
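As a minimal sketch of the difference (the function and variable names below are hypothetical, not part of Oceananigans), compare a forcing-style function that reads a non-constant global with one that uses a const global or receives the value as an argument:

amplitude = 1e-4                                # non-constant global: its type may change at any time,
slow_forcing(x, y, z, t) = amplitude * exp(z)   # so this function can't be fully optimized

const amplitude_const = 1e-4                          # constant global: its type is known at compile time
fast_forcing(x, y, z, t) = amplitude_const * exp(z)

# Alternatively, pass the value explicitly as an argument:
parameterized_forcing(x, y, z, t, a) = a * exp(z)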

It is especially important to avoid global variables in functions that are meant to be executed in GPU kernels (such as functions defining boundary conditions and forcings). Otherwise the Julia GPU compiler can fail with obscure errors. This is explained in more detail in the GPU simulation tips section below.

Consider inlining small functions

Inlining is when the compiler replaces a function call with the body of the function being called before compiling. The advantage of inlining (which in Julia can be requested with the @inline macro) is that it gets rid of the overhead of calling the function. The Julia compiler makes its own decisions about which functions it should or shouldn't inline, but you can encourage a function to be inlined by including the macro @inline before its definition. This is best suited for small functions that are called often. Here's an example of an implementation of the Heaviside function that is marked for inlining:

@inline heaviside(X) = ifelse(X < 0, zero(X), one(X))

In practice it's hard to say with certainty whether inlining a function will bring runtime benefits, since Julia and KernelAbstractions.jl (needed for GPU runs) already inline some functions automatically. However, it is generally a good idea to at least investigate this aspect of your code, as the benefits can potentially be significant.

GPU simulation tips

Running on GPUs can be very different from running on CPUs. Oceananigans makes most of the necessary changes in the background, so that for very simple simulations changing between CPUs and GPUs is just a matter of changing the architecture argument in the model from CPU() to GPU(). However, for more complex simulations some care needs to be taken on the part of the user. While knowledge of GPU computing (and Julia) is again desirable, an inexperienced user can also achieve high efficiency in GPU simulations by following a few simple principles.

Variables that need to be used in GPU computations need to be defined as constants

Any global variable that needs to be accessed by the GPU needs to be a constant or the simulation will crash. This includes any variables used in forcing functions and boundary conditions. For example, if you define a boundary condition like the example below and run your simulation on a GPU you'll get an error.

dTdz = 0.01 # K m⁻¹
T_bcs = TracerBoundaryConditions(grid,
                                 bottom = GradientBoundaryCondition(dTdz))

However, if you define dTdz as a constant by replacing the first line with const dTdz = 0.01, then (provided everything else is done properly) your run will be successful.
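For reference, the working version of the snippet above looks like this:

const dTdz = 0.01 # K m⁻¹
T_bcs = TracerBoundaryConditions(grid,
                                 bottom = GradientBoundaryCondition(dTdz))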

Complex diagnostics using ComputedFields may not work on GPUs

ComputedFields are the most convenient way to calculate diagnostics for your simulation. They will always work on CPUs, but when their complexity is high (in terms of the number of abstract operations) the compiler can't translate them into GPU code and they fail for GPU runs. (This limitation is discussed in this Github issue and contributors are welcome.) For example, in the snippet below, calculating u² works on both CPUs and GPUs, but calculating KE only works on CPUs:

u, v, w = model.velocities
u² = ComputedField(u^2)
KE = ComputedField((u^2 + v^2 + w^2)/2)
compute!(u²)
compute!(KE)

There are two approaches to bypass this issue. The first is to nest ComputedFields. For example, we can make KE be successfully computed on GPUs by defining it as

u, v, w = model.velocities
u² = ComputedField(u^2)
v² = ComputedField(v^2)
w² = ComputedField(w^2)
u²plusv² = ComputedField(u² + v²)
KE = ComputedField((u²plusv² + w²)/2)
compute!(KE)

This is a simple workaround that is especially suited to the development stage of a simulation. However, when running this, the code will iterate over the whole domain 5 times to calculate KE (once for each ComputedField defined), which is not very efficient.

A different way to calculate KE is by using KernelComputedFields, where the user manually specifies the computation kernel. The advantage of this method is that it's more efficient (the code will only iterate over the domain once in order to calculate KE), but the disadvantage is that it requires the user to have some knowledge of Oceananigans operations and how they should be performed on a C-grid. For example, calculating KE with this approach would look like this:

using Oceananigans.Operators
using KernelAbstractions: @index, @kernel
using Oceananigans.Grids: Center, Face
using Oceananigans.Fields: KernelComputedField

@inline ψ²(i, j, k, grid, ψ) = @inbounds ψ[i, j, k]^2

@kernel function kinetic_energy_ccc!(tke, grid, u, v, w)
    i, j, k = @index(Global, NTuple)
    @inbounds tke[i, j, k] = (
                              ℑxᶜᵃᵃ(i, j, k, grid, ψ², u) + # Calculates u^2 using function ψ² and then interpolates in x to grid center
                              ℑyᵃᶜᵃ(i, j, k, grid, ψ², v) + # Calculates v^2 using function ψ² and then interpolates in y to grid center
                              ℑzᵃᵃᶜ(i, j, k, grid, ψ², w)   # Calculates w^2 using function ψ² and then interpolates in z to grid center
                             ) / 2
end

KE = KernelComputedField(Center, Center, Center, kinetic_energy_ccc!, model;
                         computed_dependencies=(u, v, w))
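As with a regular ComputedField, calling compute! fills KE with the result by launching the kernel over the grid:

compute!(KE)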

It may be useful to know that kernels for some commonly-used diagnostics are already defined in packages that are companions to Oceananigans, such as Oceanostics.jl and LESbrary.jl. Users should look there before writing any kernel by hand, and are always welcome to open an issue on Github if they need help writing a different kernel.

Try to decrease the memory-use of your runs

GPU runs are generally memory-limited. As an example, a state-of-the-art Tesla V100 GPU has 32GB of memory, which is enough to fit, on average, a simulation with about 100 million points, a bit smaller than a 512-cubed simulation. (The precise number depends on many other factors, such as the number of tracers simulated and the diagnostics that are calculated.) This means that it is especially important to be mindful of the size of your runs when running Oceananigans on GPUs, and it is generally good practice to decrease the memory footprint of your runs. Below are some useful tips to achieve this:

  • Use the nvidia-smi command line utility to monitor the memory usage of the GPU. It should tell you how much memory there is on your GPU and how much of it you're using. (A sketch of how to query this from within a Julia session is shown after this list.)
  • Try to use higher-order advection schemes. In general, a higher-order scheme needs fewer grid points to achieve the same accuracy as a lower-order one. Oceananigans provides two high-order advection schemes: the 5th-order WENO method (WENO5) and a 3rd-order upwind scheme.
  • Manually define scratch space to be reused in diagnostics. By default, every time a user-defined diagnostic is calculated a new chunk of memory is allocated for that calculation, usually called scratch space. In general, the more diagnostics, the more scratch space needed and the bigger the memory requirements. However, if you explicitly create a scratch space and pass that same scratch space to as many diagnostics as you can, you minimize the memory requirements of your calculations by reusing the same chunk of memory. As an example, you can see scratch space being created here and then being used in calculations here.
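If you prefer to check GPU memory usage from within a Julia session instead of with nvidia-smi, a minimal sketch (assuming CUDA.jl is installed and a GPU is visible) is:

using CUDA

# Print how much memory is used and how much is available on the active GPU.
CUDA.memory_status()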

Arrays in GPUs are usually different from arrays in CPUs

On the CPU Oceananigans.jl uses regular Arrays, but on the GPU it has to use CuArrays from the CUDA.jl package. You might need to keep this difference in mind when using arrays to set! initial conditions or when using arrays to provide boundary conditions and forcing functions.
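For example, here is a minimal sketch of setting an initial condition from an array for a model built with the GPU() architecture (the array name and the grid sizes Nx, Ny, Nz are hypothetical, and set! also accepts functions, which work on both architectures):

using CUDA

# Nx, Ny, Nz and `model` are assumed to be defined earlier in the script.
T₀ = rand(Nx, Ny, Nz)         # a regular Array, living in CPU memory
set!(model, T = CuArray(T₀))  # convert it to a CuArray before setting the GPU model's field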

To learn more about working with CuArrays, see the array programming section of the CUDA.jl documentation.

Something to keep in mind when working with CuArrays is that you do not want to set or get (access) individual elements of a CuArray outside of a kernel. Doing so invokes scalar operations, in which individual elements are copied to or from the GPU for processing. This is very slow and can degrade performance dramatically. For this reason, Oceananigans.jl disables CUDA scalar operations by default.

See the scalar indexing section of the CUDA.jl documentation for more information on scalar indexing.

Sometimes you need to perform scalar operations on CuArrays, in which case you may want to temporarily allow them with the CUDA.@allowscalar macro or by calling CUDA.allowscalar(true).
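Here's a minimal sketch of both options (the array below is purely illustrative):

using CUDA

a = CUDA.zeros(4, 4)                       # a CuArray living in GPU memory

# Option 1: allow scalar indexing for a single expression
first_element = CUDA.@allowscalar a[1, 1]

# Option 2: allow scalar indexing globally (not recommended for production runs)
CUDA.allowscalar(true)
a[1, 1] = 1.0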