Performance benchmarks

The performance benchmarking scripts in the benchmarks directory of the git repository can be run to benchmark Oceananigans.jl on your machine. They use TimerOutputs.jl to nicely format the benchmarks.

Static ocean

This is a benchmark of a simple "static ocean" configuration. The time stepping and Poisson solver still takes the same amount of time whether the ocean is static or active, so it should be indicative of actual performance. It tests the performance of a bare-bones model.

Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)
  GPU: Tesla V100-PCIE-32GB

 ──────────────────────────────────────────────────────────────────────────────────────
        Static ocean benchmarks                Time                   Allocations      
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                 153s / 77.9%           7.36GiB / 0.91%    

 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
  32× 32× 32  (CPU, Float32)       10   78.7ms  0.07%  7.87ms    768KiB  1.09%  76.8KiB
  32× 32× 32  (CPU, Float64)       10   79.0ms  0.07%  7.90ms    768KiB  1.09%  76.8KiB
  32× 32× 32  (GPU, Float32)       10   41.3ms  0.03%  4.13ms   7.83MiB  11.4%   802KiB
  32× 32× 32  (GPU, Float64)       10   42.6ms  0.04%  4.26ms   7.84MiB  11.4%   803KiB
  64× 64× 64  (CPU, Float32)       10    685ms  0.58%  68.5ms    768KiB  1.09%  76.8KiB
  64× 64× 64  (CPU, Float64)       10    674ms  0.57%  67.4ms    768KiB  1.09%  76.8KiB
  64× 64× 64  (GPU, Float32)       10   44.1ms  0.04%  4.41ms   7.84MiB  11.4%   802KiB
  64× 64× 64  (GPU, Float64)       10   43.4ms  0.04%  4.34ms   7.84MiB  11.4%   803KiB
 128×128×128  (CPU, Float32)       10    5.72s  4.82%   572ms    768KiB  1.09%  76.8KiB
 128×128×128  (CPU, Float64)       10    5.59s  4.70%   559ms    768KiB  1.09%  76.8KiB
 128×128×128  (GPU, Float32)       10   54.0ms  0.05%  5.40ms   7.84MiB  11.4%   802KiB
 128×128×128  (GPU, Float64)       10   54.6ms  0.05%  5.46ms   7.84MiB  11.4%   803KiB
 256×256×256  (CPU, Float32)       10    54.3s  45.7%   5.43s    768KiB  1.09%  76.8KiB
 256×256×256  (CPU, Float64)       10    50.8s  42.8%   5.08s    768KiB  1.09%  76.8KiB
 256×256×256  (GPU, Float32)       10    305ms  0.26%  30.5ms   7.84MiB  11.4%   802KiB
 256×256×256  (GPU, Float64)       10    303ms  0.26%  30.3ms   7.84MiB  11.4%   803KiB
 ──────────────────────────────────────────────────────────────────────────────────────

CPU Float64 -> Float32 speedup:
 32× 32× 32 : 1.004
 64× 64× 64 : 0.985
128×128×128 : 0.976
256×256×256 : 0.936

GPU Float64 -> Float32 speedup:
 32× 32× 32 : 1.031
 64× 64× 64 : 0.985
128×128×128 : 1.012
256×256×256 : 0.994

CPU -> GPU speedup:
 32× 32× 32  (Float32): 1.904
 32× 32× 32  (Float64): 1.853
 64× 64× 64  (Float32): 15.531
 64× 64× 64  (Float64): 15.527
128×128×128  (Float32): 106.054
128×128×128  (Float64): 102.323
256×256×256  (Float32): 177.938
256×256×256  (Float64): 167.630

Eddying channel

This benchmark tests the channel model configuration which can be slower due to the use of a more complicated algorithm for the pressure solver in the current version of Oceananigans.

Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)
  GPU: Tesla V100-PCIE-32GB

 ──────────────────────────────────────────────────────────────────────────────────────
       Eddying channel benchmarks              Time                   Allocations      
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                 112s / 61.5%           9.67GiB / 0.38%    

 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
  32× 32× 32  (CPU, Float32)        5   45.1ms  0.07%  9.03ms    389KiB  1.02%  77.8KiB
  32× 32× 32  (CPU, Float64)        5   48.4ms  0.07%  9.68ms    389KiB  1.02%  77.8KiB
  32× 32× 32  (GPU, Float32)        5   33.1ms  0.05%  6.62ms   4.07MiB  10.9%   834KiB
  32× 32× 32  (GPU, Float64)        5   32.1ms  0.05%  6.42ms   4.08MiB  10.9%   835KiB
  64× 64× 64  (CPU, Float32)        5    377ms  0.55%  75.5ms    389KiB  1.02%  77.8KiB
  64× 64× 64  (CPU, Float64)        5    379ms  0.55%  75.7ms    389KiB  1.02%  77.8KiB
  64× 64× 64  (GPU, Float32)        5   44.7ms  0.06%  8.93ms   4.15MiB  11.1%   850KiB
  64× 64× 64  (GPU, Float64)        5   44.1ms  0.06%  8.82ms   4.15MiB  11.1%   850KiB
 128×128×128  (CPU, Float32)        5    3.17s  4.60%   635ms    389KiB  1.02%  77.8KiB
 128×128×128  (CPU, Float64)        5    3.19s  4.62%   637ms    389KiB  1.02%  77.8KiB
 128×128×128  (GPU, Float32)        5   75.2ms  0.11%  15.0ms   4.29MiB  11.5%   880KiB
 128×128×128  (GPU, Float64)        5   75.1ms  0.11%  15.0ms   4.30MiB  11.5%   880KiB
 256×256×256  (CPU, Float32)        5    31.5s  45.7%   6.30s    389KiB  1.02%  77.8KiB
 256×256×256  (CPU, Float64)        5    29.2s  42.3%   5.83s    389KiB  1.02%  77.8KiB
 256×256×256  (GPU, Float32)        5    391ms  0.57%  78.1ms   4.59MiB  12.3%   940KiB
 256×256×256  (GPU, Float64)        5    368ms  0.53%  73.6ms   4.59MiB  12.3%   940KiB
 ──────────────────────────────────────────────────────────────────────────────────────

CPU Float64 -> Float32 speedup:
 32× 32× 32 : 1.072
 64× 64× 64 : 1.003
128×128×128 : 1.004
256×256×256 : 0.926

GPU Float64 -> Float32 speedup:
 32× 32× 32 : 0.970
 64× 64× 64 : 0.987
128×128×128 : 0.999
256×256×256 : 0.943

CPU -> GPU speedup:
 32× 32× 32  (Float32): 1.364
 32× 32× 32  (Float64): 1.508
 64× 64× 64  (Float32): 8.449
 64× 64× 64  (Float64): 8.588
128×128×128  (Float32): 42.209
128×128×128  (Float64): 42.411
256×256×256  (Float32): 80.638
256×256×256  (Float64): 79.211

Tracers

This benchmark tests the performance impacts of running with various amounts of active and passive tracers.

Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)
  GPU: Tesla V100-PCIE-32GB

 ───────────────────────────────────────────────────────────────────────────────────────────────────────────
                      Tracer benchmarks                             Time                   Allocations      
                                                            ──────────────────────   ───────────────────────
                      Tot / % measured:                          37.6s / 9.69%           7.64GiB / 1.12%    

 Section                                            ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────
  32× 32× 32 0 active +  0 passive (CPU, Float64)       10   60.0ms  1.65%  6.00ms    574KiB  0.64%  57.4KiB
  32× 32× 32 0 active +  1 passive (CPU, Float64)       10   68.4ms  1.88%  6.84ms    667KiB  0.74%  66.7KiB
  32× 32× 32 0 active +  2 passive (CPU, Float64)       10   76.8ms  2.11%  7.68ms    768KiB  0.85%  76.8KiB
  32× 32× 32 1 active +  0 passive (CPU, Float64)       10   69.2ms  1.90%  6.92ms    667KiB  0.74%  66.7KiB
  32× 32× 32 2 active +  0 passive (CPU, Float64)       10   78.7ms  2.16%  7.87ms    768KiB  0.85%  76.8KiB
  32× 32× 32 2 active +  3 passive (CPU, Float64)       10    104ms  2.86%  10.4ms   1.03MiB  1.17%   106KiB
  32× 32× 32 2 active +  5 passive (CPU, Float64)       10    123ms  3.38%  12.3ms   1.22MiB  1.39%   125KiB
  32× 32× 32 2 active + 10 passive (CPU, Float64)       10    177ms  4.86%  17.7ms   1.69MiB  1.92%   173KiB
 256×256×256 0 active +  0 passive (GPU, Float64)       10    237ms  6.50%  23.7ms   5.43MiB  6.17%   556KiB
 256×256×256 0 active +  1 passive (GPU, Float64)       10    266ms  7.29%  26.6ms   6.62MiB  7.52%   678KiB
 256×256×256 0 active +  2 passive (GPU, Float64)       10    297ms  8.16%  29.7ms   7.83MiB  8.89%   801KiB
 256×256×256 1 active +  0 passive (GPU, Float64)       10    268ms  7.35%  26.8ms   6.62MiB  7.52%   678KiB
 256×256×256 2 active +  0 passive (GPU, Float64)       10    303ms  8.32%  30.3ms   7.84MiB  8.91%   803KiB
 256×256×256 2 active +  3 passive (GPU, Float64)       10    403ms  11.1%  40.3ms   11.5MiB  13.1%  1.15MiB
 256×256×256 2 active +  5 passive (GPU, Float64)       10    472ms  13.0%  47.2ms   14.1MiB  16.0%  1.41MiB
 256×256×256 2 active + 10 passive (GPU, Float64)       10    641ms  17.6%  64.1ms   20.8MiB  23.6%  2.08MiB
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────

Turbulence closures

This benchmark tests the performance impacts of various turbulence closures and large eddy simulation (LES) models.

Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)
  GPU: Tesla V100-PCIE-32GB

 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                    Turbulence closure benchmarks                            Time                   Allocations      
                                                                     ──────────────────────   ───────────────────────
                          Tot / % measured:                               31.0s / 78.5%           1.31GiB / 3.92%    

 Section                                                     ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  32× 32× 32 ConstantAnisotropicDiffusivity (CPU, Float64)       10   78.1ms  0.32%  7.81ms    769KiB  1.42%  76.9KiB
  32× 32× 32 ConstantAnisotropicDiffusivity (GPU, Float64)       10   43.0ms  0.18%  4.30ms   7.86MiB  14.9%   805KiB
  32× 32× 32 ConstantIsotropicDiffusivity (CPU, Float64)         10   78.7ms  0.32%  7.87ms    768KiB  1.42%  76.8KiB
  32× 32× 32 ConstantIsotropicDiffusivity (GPU, Float64)         10   44.5ms  0.18%  4.45ms   7.84MiB  14.9%   803KiB
  32× 32× 32 SmagorinskyLilly (CPU, Float64)                     10    189ms  0.78%  18.9ms    778KiB  1.44%  77.8KiB
  32× 32× 32 SmagorinskyLilly (GPU, Float64)                     10   45.7ms  0.19%  4.57ms   8.43MiB  16.0%   863KiB
 128×128×128 ConstantAnisotropicDiffusivity (CPU, Float64)       10    5.54s  22.8%   554ms    769KiB  1.42%  76.9KiB
 128×128×128 ConstantAnisotropicDiffusivity (GPU, Float64)       10   53.5ms  0.22%  5.35ms   7.86MiB  14.9%   805KiB
 128×128×128 ConstantIsotropicDiffusivity (CPU, Float64)         10    5.53s  22.7%   553ms    768KiB  1.42%  76.8KiB
 128×128×128 ConstantIsotropicDiffusivity (GPU, Float64)         10   54.1ms  0.22%  5.41ms   7.84MiB  14.9%   803KiB
 128×128×128 SmagorinskyLilly (CPU, Float64)                     10    12.6s  51.8%   1.26s    778KiB  1.44%  77.8KiB
 128×128×128 SmagorinskyLilly (GPU, Float64)                     10   75.6ms  0.31%  7.56ms   8.43MiB  16.0%   863KiB
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────