No performance difference fp32 vs. fp64 #1829

Fabian188 · 2025-04-16T21:11:43Z

I'm really happy with the Ginkgo solver, thank you very much. I'm quite happy with the performance on CPU and GPU (RTX 5080).

I was curious about fp32 performance, so I ran Ginkgo with both double and float. I get slightly different results and iteration counts, so I assume my implementation is right.

Interestingly, I see almost no performance difference between the two on either CPU (Apple Silicon) or CUDA (RTX 5080). I'm happy with the performance I have; I'm just curious whether this behavior is expected or whether I am doing something wrong.

pratikvn · 2025-04-22T11:29:38Z

pratikvn
Apr 22, 2025
Maintainer

Which performance are you comparing? CPU FP32 vs. CPU FP64, or CPU FP32 vs. CUDA FP64? Or some other variant?

upsj Apr 22, 2025
Maintainer

If you are looking for performance data, you can attach a logger to the executor at the beginning of your program that will give you the details when it is destroyed:

exec->add_logger(gko::log::ProfilerHook::create_nested_summary());

That gives a breakdown into individual components, and allows us to compare the two. Value types/conversion overheads are indeed a concern, since the default behavior for mixed precision is to create a temporary copy of the entire vector and copy the values back afterwards.
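To make the conversion overhead concrete, here is a minimal sketch in plain C++ of what "create a temporary copy and copy the values back" means for a float matrix applied to double vectors. The types `CsrFloat` and `CopyCounter` and the functions `convert`, `spmv` and `apply_mixed` are hypothetical stand-ins for illustration, not Ginkgo's classes or its actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CSR matrix with float values (not Ginkgo's class).
struct CsrFloat {
    std::size_t rows;
    std::vector<int> row_ptrs;
    std::vector<int> col_idxs;
    std::vector<float> values;
};

// Counts whole-vector conversion copies so the overhead becomes visible.
struct CopyCounter {
    int copies = 0;
};

// Convert a whole vector to another precision: this is the temporary copy.
template <typename To, typename From>
std::vector<To> convert(const std::vector<From>& in, CopyCounter& counter)
{
    ++counter.copies;
    std::vector<To> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = static_cast<To>(in[i]);
    }
    return out;
}

// Native float SpMV: y = A * x, everything in one precision.
void spmv(const CsrFloat& A, const std::vector<float>& x,
          std::vector<float>& y)
{
    for (std::size_t row = 0; row < A.rows; ++row) {
        float sum = 0.0f;
        for (int k = A.row_ptrs[row]; k < A.row_ptrs[row + 1]; ++k) {
            const auto idx = static_cast<std::size_t>(k);
            sum += A.values[idx] *
                   x[static_cast<std::size_t>(A.col_idxs[idx])];
        }
        y[row] = sum;
    }
}

// Mixed-precision apply: the caller holds double vectors, the matrix is float.
void apply_mixed(const CsrFloat& A, const std::vector<double>& b,
                 std::vector<double>& x, CopyCounter& counter)
{
    auto b_tmp = convert<float>(b, counter);  // copy 1: input to float
    auto x_tmp = convert<float>(x, counter);  // copy 2: output buffer to float
    spmv(A, b_tmp, x_tmp);                    // the actual work, in float
    x = convert<double>(x_tmp, counter);      // copy 3: result back to double
}
```

In this sketch every mixed apply costs three whole-vector copies on top of the SpMV itself. A profile showing `dense::copy`, `allocate` and `free` nested under the matrix `apply` is the kind of signature such a mismatch would leave.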

Fabian188 Apr 22, 2025
Author

Thank you for the hint with the logger! Can it be configured not to use VTune on Linux? (It currently doesn't link for me there.) It does work on my Mac, and evidently more of the code runs in double than I expected.

I have a templated csr matrix and templated vectors, which are indeed float, and generate preconditioners like

      precond = gko::preconditioner::Jacobi<>::build().with_max_block_size(mbs).on(exec);
...
      factory = gko::solver::Cg<>::build().with_criteria(iter_crit.on(exec),norm_crit.on(exec)).with_preconditioner(precond).on(exec);

How do I make sure the preconditioner and solver are also fp32?

ginkgo_mac_fp64.txt
ginkgo_mac_fp32.txt

Fabian188 Apr 22, 2025
Author

This is the content of ginkgo_mac_fp32.txt

Runtime summary
Overhead estimate 6.1 ms
|                               name                                |  total   | fraction | count |   avg    |
|-------------------------------------------------------------------|---------:|---------:|------:|---------:|
| total                                                             |   7.5 s  |  100.0 % |     1 |   7.5 s  |
|   apply(gko::solver::Cg<double>)                                  |   5.6 s  |   74.3 % |     1 |   5.6 s  |
|     iteration                                                     |   5.6 s  |  100.0 % |   423 |  13.2 ms |
|       apply(gko::matrix::Csr<float, int>)                         |   4.9 s  |   87.5 % |   423 |  11.6 ms |
|         csr::spmv                                                 |   4.7 s  |   96.0 % |   423 |  11.1 ms |
|         dense::copy                                               | 115.0 ms |    2.4 % |  1269 |  90.6 us |
|         free                                                      |  67.6 ms |    1.4 % |   846 |  79.9 us |
|         (self)                                                    |  12.9 ms |    0.3 % |   423 |  30.5 us |
|         allocate                                                  | 648.8 us |    0.0 % |   846 | 766.0 ns |
|       dense::compute_conj_dot_dispatch                            | 220.9 ms |    4.0 % |   847 | 260.8 us |
|         (self)                                                    | 220.9 ms |  100.0 % |   847 | 260.8 us |
|         allocate                                                  | 375.0 ns |    0.0 % |     1 | 375.0 ns |
|       cg::step_2                                                  | 158.8 ms |    2.8 % |   423 | 375.3 us |
|       check(gko::stop::Combined)                                  | 143.6 ms |    2.6 % |   424 | 338.7 us |
|         check(gko::stop::ResidualNorm<double>)                    | 142.0 ms |   98.9 % |   424 | 334.9 us |
|           dense::compute_norm2_dispatch                           | 108.9 ms |   76.7 % |   424 | 256.9 us |
|             (self)                                                | 108.9 ms |  100.0 % |   424 | 256.9 us |
|             allocate                                              | 291.0 ns |    0.0 % |     1 | 291.0 ns |
|           residual_norm::residual_norm                            |  30.5 ms |   21.5 % |   424 |  71.9 us |
|           (self)                                                  |   2.6 ms |    1.8 % |   424 |   6.1 us |
|         (self)                                                    |   1.3 ms |    0.9 % |   424 |   3.2 us |
|         check(gko::stop::Iteration)                               | 242.1 us |    0.2 % |   424 | 570.0 ns |
|       cg::step_1                                                  | 110.9 ms |    2.0 % |   423 | 262.2 us |
|       apply(gko::preconditioner::Jacobi<double, int>)             |  42.0 ms |    0.8 % |   424 |  99.0 us |
|         jacobi::simple_scalar_apply                               |  39.1 ms |   93.1 % |   424 |  92.2 us |
|         (self)                                                    |   2.9 ms |    6.9 % |   424 |   6.8 us |
|       advanced_apply(gko::matrix::Csr<float, int>)                |  12.3 ms |    0.2 % |     1 |  12.3 ms |
|         csr::advanced_spmv                                        |  11.8 ms |   95.9 % |     1 |  11.8 ms |
|         dense::copy                                               | 390.3 us |    3.2 % |     5 |  78.1 us |
|         (self)                                                    | 106.5 us |    0.9 % |     1 | 106.5 us |
|         allocate                                                  |   3.1 us |    0.0 % |     4 | 781.0 ns |
|         free                                                      |   3.0 us |    0.0 % |     4 | 760.0 ns |
|       (self)                                                      |   8.1 ms |    0.1 % |   423 |  19.2 us |
|       cg::initialize                                              | 795.7 us |    0.0 % |     1 | 795.7 us |
|       dense::copy                                                 | 433.4 us |    0.0 % |     3 | 144.5 us |
|       copy(gko::matrix::Dense<double>,gko::matrix::Dense<double>) | 238.3 us |    0.0 % |     2 | 119.1 us |
|         dense::copy                                               | 221.3 us |   92.9 % |     2 | 110.7 us |
|         (self)                                                    |  15.4 us |    6.5 % |     2 |   7.7 us |
|         allocate                                                  |   1.5 us |    0.6 % |     2 | 771.0 ns |
|       dense::compute_norm2_dispatch                               | 229.8 us |    0.0 % |     1 | 229.8 us |
|         (self)                                                    | 229.7 us |  100.0 % |     1 | 229.7 us |
|         allocate                                                  |  83.0 ns |    0.0 % |     1 |  83.0 ns |
|       free                                                        |  72.4 us |    0.0 % |    10 |   7.2 us |
|       dense::fill                                                 |  19.4 us |    0.0 % |     3 |   6.5 us |
|       allocate                                                    |   4.7 us |    0.0 % |    19 | 245.0 ns |
|       copy                                                        |  84.0 ns |    0.0 % |     1 |  84.0 ns |
|     (self)                                                        | 181.0 us |    0.0 % |     1 | 181.0 us |
|   (self)                                                          |   1.9 s  |   25.5 % |     1 |   1.9 s  |
|   generate(gko::solver::Cg<double>::Factory)                      |  11.9 ms |    0.2 % |     1 |  11.9 ms |
|     generate(gko::preconditioner::Jacobi<double, int>::Factory)   |  11.9 ms |   99.9 % |     1 |  11.9 ms |
|       csr::extract_diagonal                                       |  11.0 ms |   92.6 % |     1 |  11.0 ms |
|       components::convert_precision                               | 290.6 us |    2.4 % |     2 | 145.3 us |
|       jacobi::invert_diagonal                                     | 215.0 us |    1.8 % |     1 | 215.0 us |
|       (self)                                                      | 141.9 us |    1.2 % |     1 | 141.9 us |
|       components::fill_array                                      | 141.8 us |    1.2 % |     1 | 141.8 us |
|       free                                                        |  86.9 us |    0.7 % |     2 |  43.4 us |
|       allocate                                                    |   4.9 us |    0.0 % |     3 |   1.6 us |
|     (self)                                                        |  16.7 us |    0.1 % |     1 |  16.7 us |
|   free                                                            | 653.6 us |    0.0 % |    18 |  36.3 us |
|   copy                                                            | 128.2 us |    0.0 % |     1 | 128.2 us |
|   copy(gko::matrix::Dense<double>,gko::matrix::Dense<double>)     |  41.6 us |    0.0 % |     1 |  41.6 us |
|     dense::copy                                                   |  31.5 us |   75.9 % |     1 |  31.5 us |
|     (self)                                                        |   9.7 us |   23.2 % |     1 |   9.7 us |
|     allocate                                                      | 375.0 ns |    0.9 % |     1 | 375.0 ns |
|   allocate                                                        | 542.0 ns |    0.0 % |     2 | 271.0 ns |

The code is https://gitlab.com/openCFS/cfs/-/blob/master/source/OLAS/external/ginkgo/GinkgoSolver.cc?ref_type=heads

yhmtsai Apr 22, 2025
Collaborator

To use float, you need to use the preconditioner gko::preconditioner::Jacobi<float>, the solver gko::solver::Cg<float>, and a float stopping criterion as well, for example ResidualNorm<float>.
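The reason `Jacobi<>` and `Cg<>` ended up as double in the profile above is the default template argument: empty angle brackets select the default value type, whatever the matrix uses. The following sketch shows that language rule with hypothetical stand-in templates (`Solver`, `Preconditioner`, `precision_name`, `precision_of` are made up for illustration and are not Ginkgo's classes).

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-ins with double as the default value type.
template <typename ValueType = double>
struct Solver {
    using value_type = ValueType;
};

template <typename ValueType = double, typename IndexType = int>
struct Preconditioner {
    using value_type = ValueType;
    using index_type = IndexType;
};

// Map a value type to a readable precision label.
template <typename T>
std::string precision_name()
{
    if (std::is_same<T, float>::value) return "fp32";
    if (std::is_same<T, double>::value) return "fp64";
    return "other";
}

// Report the precision an operator type was instantiated with.
template <typename Op>
std::string precision_of()
{
    return precision_name<typename Op::value_type>();
}

// Empty angle brackets do not mean "deduce from the matrix";
// they pick the default, which is double here.
static_assert(std::is_same<Solver<>::value_type, double>::value,
              "Solver<> is the double instantiation");
static_assert(std::is_same<Solver<float>::value_type, float>::value,
              "Solver<float> must be requested explicitly");
```

Applied to the snippet in the question, this means writing the value type out in every place where the empty `<>` appears, including the stopping criterion.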

Answer selected by Fabian188

Fabian188 Apr 22, 2025
Author

Thanks! It is now all float but fails with an error. Do you have a hint?

>> Error: /Users/fwein/code/cfs/release/cfsdeps/ginkgo/src/ginkgo/core/distributed/helpers.hpp:143: Operation vector_dispatch does not support parameters of type gko::matrix::Dense<float>
WARNING: Unfinished events remaining in summary gko::log::ProfilerHook.
This probably means the logger was created outside a Ginkgo operation but removed and destroyed inside.
To fix this, move the logger creation to the outermost scope where Ginkgo is used!
The profiler output will most likely be incorrect.
Popping unfinished event "iteration"
Popping unfinished event "apply(gko::solver::Cg<float>)"
Runtime summary
Overhead estimate 5.3 ms
|                              name                               |  total   | fraction | count |   avg    |
|-----------------------------------------------------------------|---------:|---------:|------:|---------:|
| total                                                           |   8.3 s  |  100.0 % |     1 |   8.3 s  |
|   apply(gko::solver::Cg<float>)                                 |   7.3 s  |   87.9 % |     1 |   7.3 s  |
|     iteration                                                   |   7.3 s  |  100.0 % |   587 |  12.4 ms |
|       apply(gko::matrix::Csr<float, int>)                       |   6.4 s  |   88.2 % |   587 |  10.9 ms |
|         csr::spmv                                               |   6.4 s  |   99.9 % |   587 |  10.9 ms |
|         (self)                                                  |   8.1 ms |    0.1 % |   587 |  13.8 us |
|       dense::compute_conj_dot_dispatch                          | 275.9 ms |    3.8 % |  1175 | 234.8 us |
|         (self)                                                  | 275.9 ms |  100.0 % |  1175 | 234.8 us |
|         allocate                                                | 500.0 ns |    0.0 % |     1 | 500.0 ns |
|       cg::step_2                                                | 191.6 ms |    2.6 % |   587 | 326.4 us |
|       check(gko::stop::Combined)                                | 136.6 ms |    1.9 % |   588 | 232.4 us |
|         check(gko::stop::ResidualNorm<float>)                   | 134.5 ms |   98.4 % |   588 | 228.7 us |
|           dense::compute_norm2_dispatch                         | 123.6 ms |   91.9 % |   588 | 210.2 us |
|             (self)                                              | 123.6 ms |  100.0 % |   588 | 210.2 us |
|             allocate                                            | 125.0 ns |    0.0 % |     1 | 125.0 ns |
|           residual_norm::residual_norm                          |   7.9 ms |    5.9 % |   588 |  13.5 us |
|           (self)                                                |   3.0 ms |    2.2 % |   588 |   5.0 us |
|         (self)                                                  |   1.8 ms |    1.3 % |   588 |   3.1 us |
|         check(gko::stop::Iteration)                             | 325.6 us |    0.2 % |   588 | 553.0 ns |
|       cg::step_1                                                | 132.6 ms |    1.8 % |   587 | 225.8 us |
|       (self)                                                    |  64.7 ms |    0.9 % |   587 | 110.2 us |
|       apply(gko::preconditioner::Jacobi<float, int>)            |  44.0 ms |    0.6 % |   588 |  74.8 us |
|         jacobi::simple_scalar_apply                             |  40.1 ms |   91.3 % |   588 |  68.3 us |
|         (self)                                                  |   3.8 ms |    8.7 % |   588 |   6.5 us |
|       advanced_apply(gko::matrix::Csr<float, int>)              |  11.7 ms |    0.2 % |     1 |  11.7 ms |
|         csr::advanced_spmv                                      |  11.7 ms |   99.4 % |     1 |  11.7 ms |
|         (self)                                                  |  75.1 us |    0.6 % |     1 |  75.1 us |
|       cg::initialize                                            | 488.4 us |    0.0 % |     1 | 488.4 us |
|       free                                                      | 239.9 us |    0.0 % |    23 |  10.4 us |
|       copy(gko::matrix::Dense<float>,gko::matrix::Dense<float>) | 169.7 us |    0.0 % |     2 |  84.9 us |
|         dense::copy                                             | 114.0 us |   67.2 % |     2 |  57.0 us |
|         (self)                                                  |  51.4 us |   30.3 % |     2 |  25.7 us |
|         allocate                                                |   4.3 us |    2.5 % |     2 |   2.1 us |
|       dense::fill                                               |  53.6 us |    0.0 % |     3 |  17.9 us |
|       allocate                                                  |   2.7 us |    0.0 % |    16 | 169.0 ns |
|       copy                                                      | 208.0 ns |    0.0 % |     1 | 208.0 ns |
|     (self)                                                      | 178.8 us |    0.0 % |     1 | 178.8 us |
|   (self)                                                        | 988.9 ms |   11.9 % |     1 | 988.9 ms |
|   generate(gko::solver::Cg<float>::Factory)                     |  11.2 ms |    0.1 % |     1 |  11.2 ms |
|     generate(gko::preconditioner::Jacobi<float, int>::Factory)  |  11.2 ms |   99.6 % |     1 |  11.2 ms |
|       csr::extract_diagonal                                     |  10.6 ms |   95.0 % |     1 |  10.6 ms |
|       (self)                                                    | 231.7 us |    2.1 % |     1 | 231.7 us |
|       components::fill_array                                    | 196.6 us |    1.8 % |     1 | 196.6 us |
|       jacobi::invert_diagonal                                   |  86.7 us |    0.8 % |     1 |  86.7 us |
|       free                                                      |  45.6 us |    0.4 % |     1 |  45.6 us |
|       allocate                                                  |   3.1 us |    0.0 % |     2 |   1.5 us |
|     (self)                                                      |  44.0 us |    0.4 % |     1 |  44.0 us |
|   copy                                                          | 121.7 us |    0.0 % |     1 | 121.7 us |
|   allocate                                                      | 708.0 ns |    0.0 % |     2 | 354.0 ns |

Fabian188 Apr 22, 2025
Author

BTW: the total time increased slightly.

yhmtsai Apr 22, 2025
Collaborator

You have a Convergence<double> logger; I assume you need to change its template parameter to float. Also change the type where you access the data from the logger.
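As a rough illustration of how a single leftover double component can produce an "does not support parameters of type ... Dense<float>" style error, here is a sketch of a type-dispatching logger. `LinOp`, `Dense` and `ConvergenceLogger` are hypothetical stand-ins, and this is only an assumption about the general mechanism, not a description of Ginkgo's internals.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical type-erased base class, as a solver would hand it to loggers.
struct LinOp {
    virtual ~LinOp() = default;
};

// Hypothetical dense vector reduced to a single value for brevity.
template <typename ValueType>
struct Dense : LinOp {
    ValueType value{};
};

// Hypothetical logger fixed to one value type at compile time.
template <typename ValueType>
struct ConvergenceLogger {
    ValueType last_residual{};

    // Receives the residual type-erased and must recover the concrete type.
    void on_iteration(const LinOp* residual)
    {
        auto typed = dynamic_cast<const Dense<ValueType>*>(residual);
        if (typed == nullptr) {
            // The run-time type does not match the logger's value type.
            throw std::runtime_error("unsupported vector type");
        }
        last_residual = typed->value;
    }
};
```

In this sketch a `ConvergenceLogger<double>` attached to a float solve throws on the first iteration, while `ConvergenceLogger<float>` works. That is consistent with the error above appearing inside an unfinished "iteration" event.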

Fabian188 Apr 22, 2025
Author

Thanks to all for your help!

I solve a linear elasticity FEM problem with 50x50x50 linear hexahedrons. With double I need fewer CG iterations (420) than with float (590).

On an Apple Silicon M1 with only two threads, this makes fp32 a little slower (7.1 s vs. 5.6 s for fp64).

On an RTX 5080 with 20 OMP threads, it is 503 (fp32) vs. 420 (fp64) iterations, but now fp32 is faster (0.21 s vs. 0.29 s).
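These numbers can be split into two effects: cost per iteration and number of iterations. Treating the reported totals as pure iteration time (an approximation), the M1 runs at about 13.3 ms per iteration in fp64 and 12.0 ms in fp32, a speedup of about 1.11x against 1.40x more iterations. The RTX 5080 runs at about 0.69 ms vs. 0.42 ms, a speedup of about 1.65x against 1.20x more iterations. The small sketch below does that arithmetic; `Run` and the helper functions are names made up for this example.

```cpp
#include <cassert>

// One measured solve: total wall time and CG iteration count.
struct Run {
    double total_seconds;
    int iterations;
};

// Average cost of a single iteration.
double seconds_per_iteration(const Run& r)
{
    return r.total_seconds / r.iterations;
}

// How much cheaper one fp32 iteration is than one fp64 iteration.
double per_iteration_speedup(const Run& fp64, const Run& fp32)
{
    return seconds_per_iteration(fp64) / seconds_per_iteration(fp32);
}

// How many more iterations fp32 needs to converge.
double iteration_growth(const Run& fp64, const Run& fp32)
{
    return static_cast<double>(fp32.iterations) / fp64.iterations;
}

// fp32 wins overall only if the per-iteration gain outweighs
// the extra iterations.
bool fp32_is_faster(const Run& fp64, const Run& fp32)
{
    return per_iteration_speedup(fp64, fp32) > iteration_growth(fp64, fp32);
}
```

Called with `Run{5.6, 420}` and `Run{7.1, 590}` for the M1, and `Run{0.29, 420}` and `Run{0.21, 503}` for the RTX 5080, this reproduces the observation above: fp32 loses on the CPU and wins on the GPU.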
