Replies: 26 comments
-
That's neat! Certainly not necessary, but it might be useful to create some plots in the script too?
-
Plots are a good idea. My first thought, for the shallow water model for example, is to have a plot with markers for the CPU and GPU results. That is easy to do. Then we could plot the speed-up on the same figure, but with the y-axis on the right. We will talk about how to do this; a rough sketch is below.
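Something along these lines might work (a minimal Plots.jl sketch with placeholder numbers, not real timings):

```julia
using Plots

# Placeholder data: N is points per direction; the times are made up.
N         = [32, 64, 128, 256, 512, 1024]
cpu_times = [0.01, 0.04, 0.15, 0.6, 2.5, 10.0]   # seconds
gpu_times = [0.02, 0.02, 0.03, 0.05, 0.15, 0.5]

p = plot(N, cpu_times; xaxis=:log10, yaxis=:log10, marker=:circle,
         label="CPU", xlabel="Nx", ylabel="time (s)", legend=:topleft)
plot!(p, N, gpu_times; marker=:square, label="GPU")

# Speed-up on a second y-axis on the right, as described above.
plot!(twinx(p), N, cpu_times ./ gpu_times; xaxis=:log10, marker=:diamond,
      color=:green, label="speed-up", ylabel="speed-up", legend=:bottomright)
```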
-
I closed #1676 in favor of this.
-
This is a plot of the shallow water benchmark times, CPU vs. GPU. What do you think? [Figures: absolute times and speed-up.] In theory, it should be easy to include the code that creates this image in the benchmark script. However, because the garbage collector does not clear the memory, we actually have to run the script separately for the high-resolution runs. Any advice on how to resolve this issue?
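One thing we have not yet tried is forcing a full collection and reclaiming pooled GPU memory between cases (a sketch; `run_case` is a hypothetical stand-in for one benchmark case):

```julia
using CUDA

for N in (256, 512, 1024, 2048)
    run_case(N)     # hypothetical: build the model at resolution N and time it
    GC.gc(true)     # force a full garbage collection so buffers are actually freed
    CUDA.reclaim()  # return freed device memory to the CUDA memory pool
end
```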
-
That is an awesome plot. I would refer to using the GPU as "speed up" or "acceleration" (rather than a "slow down" incurred by using the CPU). Is the resolution the total resolution, or the resolution in a single direction in a two-dimensional grid? Quite enlightening to see how the GPU isn't fully utilized until the resolution is high enough.
-
Thanks @glwagner, and I agree, speed-up is better. Will fix that. The x-axis is the number of points in each direction. We could square it to get the total degrees of freedom, and that might be nice, but I keep thinking that the number of points in each direction on a square grid makes it easier for the user to learn what to expect. Both are easy enough to produce, though, and I will keep this in mind. I agree about this being enlightening. It seems to me that at high resolutions the CPU and GPU times have pretty much the same slopes; it just takes a lot for the GPU's time to start to increase. I didn't know this before and am happy to have learned it.
-
Happy with any measure of resolution --- just asking for clarification. Perhaps instead of "Resolution" the plot can be labeled "Nx" (or some other label that indicates the meaning of the axis clearly).
-
Thanks. We will change it to Nx.
-
Sure, I doubt that 512³ fits on a single GPU... |
-
When @hennyg888 did benchmarks for the shallow water model earlier, the times came out differently. Probably not a major problem, but any ideas what might have changed? Also, we can update as many of the benchmarks above as people like, and add in some pictures where we have them.
-
Here are some results for weak and strong scaling of the distributed shallow water model on one node with 32 cores. The efficiency for both goes down to 80% on 32 cores (using the standard definitions sketched below). This is comparable to what @ali-ramadhan found a while back, but I'm not sure if that made it into an issue or a PR. I'm now trying to go to 64 cores on 2 nodes, and hope to have some results to show soon, after I figure out some weird behaviour.
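For reference, a minimal sketch of the efficiency definitions used here (standard conventions, with `t₁` the one-core time and `t_R` the time on `R` cores; these names are mine, not the scripts'):

```julia
# Strong scaling: fixed problem size, so the ideal time on R cores is t₁ / R.
strong_efficiency(t₁, t_R, R) = t₁ / (R * t_R)

# Weak scaling: the problem grows with R, so the ideal time is constant.
weak_efficiency(t₁, t_R) = t₁ / t_R

# For example, 80% strong-scaling efficiency on 32 cores means the
# 32-core run takes t₁ / (0.8 * 32) ≈ t₁ / 25.6 rather than the ideal t₁ / 32.
```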
-
That is mysterious. Is this with the shallow water model or the incompressible model? The main thing that's changed is our update to Julia 1.6 and CUDA 3.0. But I can't recall if there have been changes to the way that the …
-
It's the shallow water model.
-
The advection equation for …
-
I believe we also observed bigger speed-ups for WENO when it was first implemented. This is plausible because WENO invokes the same memory access pattern as the UpwindBiasedFifthOrder scheme, but has much more compute, which gives the GPU more of an edge.
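To make the comparison concrete, the scheme is chosen with the `advection` keyword. This is a sketch from memory of the v0.58-era interface, so details may have drifted:

```julia
using Oceananigans
using Oceananigans.Advection: UpwindBiasedFifthOrder, WENO5

# Both schemes read stencils of the same width, so the memory traffic is
# similar; WENO5 adds smoothness indicators and nonlinear weights, i.e.
# many more flops per loaded value, which is what favors the GPU.
grid = RegularRectilinearGrid(size=(256, 256), extent=(1, 1),
                              topology=(Periodic, Periodic, Flat))

upwind_model = ShallowWaterModel(grid=grid, gravitational_acceleration=1,
                                 advection=UpwindBiasedFifthOrder())
weno_model   = ShallowWaterModel(grid=grid, gravitational_acceleration=1,
                                 advection=WENO5())
```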
-
@francispoulin and I ran some of the strong and weak scaling scripts recently, up to 128 CPU cores. An extra bit of code was added to the files to handle the plotting. Also added was a small but vital configuration adjustment for the @benchmark macro, which allowed more than 64 cores to be benchmarked without apparent deadlocking (sketched below, after the results). I will PR all my changes to the benchmarking scripts shortly. Here are the results: weak scaling shallow water model, with grid size 8192 x 512R, where R is the number of cores:
strong scaling shallow water model, with grid size 8192 x 8192:
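For anyone curious about the deadlocks: one plausible mechanism (not necessarily the exact fix that will be in the PR) is that BenchmarkTools' default time budget lets each MPI rank choose a different number of samples, so ranks block one another at collective operations. Pinning the sampling parameters keeps all ranks in lockstep; the values below are placeholders:

```julia
using BenchmarkTools

# Hypothetical adjustment: fix the sampling parameters globally so every
# rank executes the same number of timed runs.
BenchmarkTools.DEFAULT_PARAMETERS.samples = 10
BenchmarkTools.DEFAULT_PARAMETERS.evals   = 1
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 1e6  # disable the time-budget cutoff
```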
-
@francispoulin and I also tried increasing the grid size to see if that would saturate the CPUs more and thus improve efficiency. The grid size was doubled, and the strong scaling shallow water benchmarking script ran into some problems. However, the results from the weak scaling benchmark are sufficient to show that doubling the grid size did indeed improve the larger-rank efficiencies from around 75% to above 80%. Weak scaling shallow water model, with grid size 16384 x 1024R, where R is the number of cores:
We also ran the incompressible model's strong scaling script. No weak scaling script existed for this model. The grid size was 256 x 256 x 256.
The overall trend is that efficiency plateaus at around 75% when using 32 or more cores. We'll be trying to benchmark the GPUs' scaling performance next.
-
Ran the benchmarks with WENO5. However, it should be noted that while there is a notable increase in speed-up, it is actually caused by the CPU cases taking more time. The GPU cases take more time as well, but by a smaller percentage than the additional time incurred on the CPU. Please see #1722 (comment) for benchmark results without WENO5.
-
That is great @hennyg888, thank you for doing this. It occurs to me it would be of interest to see how …
-
Thanks to @glwagner's #1821 fix, …
-
Do these numbers get automatically ported into the Docs?
-
Thanks for the reminder @navidcy. I don't think the .md files in Docs are automatically updated. I will update https://github.com/CliMA/Oceananigans.jl/blob/master/docs/src/appendix/benchmarks.md with the latest benchmark results. On a second note, do we want to show the benchmark results with WENO5 or with no specified advection scheme?
-
Yeap. Perhaps we can point out which version of Oceananigans was used for these numbers (ideally a tagged version).
-
Oh, now I saw that. I think either is fine; just make sure you clarify how these results were made and on what machines, and point to the script that produced them.
-
In case people don't know, @hennyg888 ran all the benchmark scripts and I believe he has posted the results here. Thank you, Henry! I think the scripts have evolved, in that some of the outputs are formatted differently than what currently appears. I'm not sure whether people want to update everything to match the current benchmark scripts?
-
@francispoulin and I recently ran some of the benchmark scripts with Julia 1.6.0 and Oceananigans v0.58.1.
If these benchmarks differ enough from the ones currently shown in benchmarks.md, then I'll make a PR to update them. The hardware these new benchmarks were run on is mostly the same as for the old benchmarks, save for a few that were previously run on Titan V GPUs but are now run on Tesla V100 GPUs.
The shallow water model benchmarks ran without problems. On the CPU, when the grid size exceeded 2048 x 2048, only one sample could be benchmarked; trying to get more samples by increasing the sampling time limit resulted in out-of-memory exceptions (see the sketch below).
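For what it's worth, a sample cap can also be set per benchmark rather than globally; `build_model` and the timestep here are hypothetical stand-ins for the script's actual setup:

```julia
using BenchmarkTools
using Oceananigans

# Take a single sample for the big CPU cases so the benchmark loop
# doesn't accumulate enough garbage to exhaust memory.
trial = @benchmark time_step!(model, 1e-3) setup=(model = build_model(2048, 2048)) samples=1 evals=1
```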
Benchmarkable incompressible model:
Tracers, with grid size 256 x 256 x 128:
Some errors were encountered running the turbulence closure benchmark script with grid size 256 x 256 x 128.
There was an issue with the Nothing closure, which was avoided by removing that type of closure from the closure array.