Replies: 3 comments 17 replies
Very interesting, thank you! Could you share the exact model that you used here, so that we don't compare apples to oranges? In the yagi thread we had a few versions, and you applied your change now, so we had better define the exact test case here.
This is yet another flawed benchmark. You see this behavior only because the simulation domain is small. I've posted my comment in another thread; please be sure to check it.
Just did a quick test today on different NUMA settings on E5 v4 and identified two key performance killers for big simulations: NUMA and extensions. Note that these are big simulations, and their characteristics are different from the small simulations tested by KJ7LNW.

Thus, it's apparent that NUMA non-awareness is the most important scaling problem, and it should be relatively easy to fix. I plan to include NUMA awareness in my own new engine, which is a complete rewrite. With my own engine, I'm seeing performance around 950 MC/s on this machine, even in Cluster-on-Die mode. It's a test kernel that operates outside openEMS, so the number is not final and the real ones should be lower, but I have high hopes ;-) If it doesn't work as expected, I also have a backup plan of adding NUMA support to the existing engine; that would be much easier to develop than my engine rewrite project, and would be the most worthwhile improvement.

Extensions are big performance killers, more so than I initially expected. With PML enabled, the same simulation no longer scales beyond 200 MC/s. With PML and Conducting Sheet enabled, it doesn't even scale beyond 100 MC/s. This completely explains the absence of any scaling on big workstations, like the Apple M1 Ultra. The culprits are probably poor data locality and too many locks. I already have some ideas in mind on how to fix them. Stay tuned.

For reference, here's an explanation of the different NUMA settings available on E5 v4 (Broadwell):
![Different NUMA settings](https://www.servethehome.com/wp-content/uploads/2016/03/Intel-Xeon-E5-2600-V4-Snoop-Mode.png)
When I first started running openEMS, I made the assumption that memory bandwidth was more important than CPU cache size. I also expected that inter-socket communication latency would slow down the computation. After running some tests on an old quad-socket E7-4890 v2 server, it turns out that I was completely wrong: much better performance is achieved by spreading the job across sockets to utilize as much cache as possible.
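The socket and memory bindings used in the tests below were done with `taskset` and `numactl` (`--membind`, `--interleave`, `--localalloc`). The same CPU pinning can be sketched from Python with `os.sched_setaffinity`; the core numbering here is a hypothetical contiguous layout (socket N owns cores N*15 .. N*15+14), so check `numactl --hardware` for a real machine's topology.

```python
import os

CORES_PER_SOCKET = 15  # E7-4890 v2 has 15 cores per socket

def socket_cores(socket_id: int, cores_per_socket: int = CORES_PER_SOCKET) -> set[int]:
    """Core IDs belonging to one socket, assuming contiguous numbering
    (hypothetical; real mappings are often interleaved across sockets)."""
    start = socket_id * cores_per_socket
    return set(range(start, start + cores_per_socket))

if __name__ == "__main__":
    # Pin this process (and any children, e.g. a spawned openEMS run) to
    # socket 3 -- roughly what `taskset -c 45-59 numactl --membind=3 ...` did.
    target = socket_cores(3) & set(range(os.cpu_count()))
    if target and hasattr(os, "sched_setaffinity"):  # only pin if those cores exist
        os.sched_setaffinity(0, target)
    print(sorted(socket_cores(3)))
```

Note that `sched_setaffinity` only controls CPU placement; the memory-policy half of the experiment (`--membind=3` vs. `--interleave` vs. `--localalloc`) still has to come from `numactl` or `libnuma`.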
Test 1, initial single-socket assumption
With the assumption above, I bound an openEMS job to a single 15-core socket (socket 3) using `taskset` and `numactl`:

Test 2, multi-socket with single-socket memory
On a whim during the simulation above, I used `taskset` to spread the cores across multiple sockets while it was running. The memory was still bound to the single socket 3, so I was surprised to see such a big difference. Here is an almost 50% increase in performance with that equivalent test:

Test 3, multi-socket with interleaved memory
I expected that since memory was bound to a single socket in test-2, performance would suffer compared to memory interleaving. I set up NUMA interleaving to spread memory across the sockets. This was better than test-1, but it was worse than test-2. Thus, CPU cache is more important than memory bandwidth:
Test 4, multi-socket with socket-local memory
The fourth test spread computation across multiple sockets, but let the process prefer memory allocated from the local node. Since interleaving turned out to be worse than binding memory to a single node, I was surprised to find that this worked only about as well as when memory was bound to a single socket:
Test 5, limit execution to 2 sockets
This is the fastest test, but read the analysis section below.
Test 6, limit execution to 3 sockets
Surprisingly slow:
Analysis
Since binding memory to a single node (`--membind=3`) actually worked about the same as memory allocation preferring the local node (`--localalloc`), this tells us that the computation is operating almost completely from CPU cache. The cache on these sockets is enormous (for 2014), as you can see here:

By spreading the computation across all 4 sockets, 150 megabytes of cache were available. When I ran `top` during the calculation, it appeared to use only ~250 megabytes of resident memory. Presumably the working set for the problem is less than that, so it is possible that the whole thing fits in cache.

There is one outlier: test 5, limited to only two sockets, performed the best. However, it was only better by ~3%, which could be within the margin of error. This indicates that there is more going on than just cache benefits. The system was not completely idle, though: there are other services running that I tried to work around, which may introduce jitter. If I get the opportunity to run these tests on a completely idle system, I will. I believe that these tests were performed on cores that were unallocated to any other process.
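A rough back-of-envelope check makes the "fits in cache" hypothesis plausible. The per-cell byte count below is an assumption (6 field components plus 6 update coefficients as 32-bit floats), not openEMS's actual memory layout; the 150 MB figure is four E7-4890 v2 sockets at 37.5 MB of L3 each.

```python
# Hypothetical FDTD storage: 6 field components (Ex..Hz) + 6 update
# coefficients per cell, each a 32-bit float -> 48 bytes per cell.
BYTES_PER_CELL = (6 + 6) * 4

def working_set_mb(cells: int) -> float:
    """Estimated working set in megabytes for a grid of `cells` cells."""
    return cells * BYTES_PER_CELL / 1e6

aggregate_cache_mb = 4 * 37.5  # four sockets x 37.5 MB L3 = 150 MB
# Under these assumptions, a ~3-million-cell grid needs ~144 MB --
# right at the edge of the aggregate L3:
print(working_set_mb(3_000_000), aggregate_cache_mb)
```

If the real per-cell footprint is larger (extensions, dumps, excitation data), the crossover point moves to proportionally fewer cells, which would be consistent with big simulations behaving so differently.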
These tests were performed using @LubomirJagos's optimized Yagi that he cleaned up in #158. The only difference is that I reduced the wire size from 5mm to 2mm for this simulation.
openEMS NUMA Awareness
While more investigation may be necessary to get a complete picture, I think the core scheduling algorithm for NUMA awareness in openEMS is pretty straightforward: allocate threads in cache-priority order. This is particularly important since, at least for small simulations like those I have been running, the benefit of an additional compute thread reaches a computational limit, and it does so before exhausting all available CPUs. This means that higher clock speeds with fewer cores and bigger caches should provide the best performance.
Here is the suggested core allocation order to use as much cache as possible:
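One way to sketch such a cache-priority order: hand out cores round-robin across sockets, so each new thread lands on the socket whose L3 is least shared. The 4-socket x 15-core contiguous numbering is a hypothetical layout, not a measured topology.

```python
SOCKETS = 4
CORES_PER_SOCKET = 15  # quad E7-4890 v2 (hypothetical contiguous numbering)

def cache_priority_order(sockets: int = SOCKETS,
                         cores: int = CORES_PER_SOCKET) -> list[int]:
    """Round-robin across sockets: core 0 of every socket first, then
    core 1 of every socket, and so on. Thread N gets order[N]."""
    return [s * cores + c for c in range(cores) for s in range(sockets)]

print(cache_priority_order()[:8])  # → [0, 15, 30, 45, 1, 16, 31, 46]
```

With this ordering, a 4-thread run gets one core per socket (the full 150 MB of aggregate L3), instead of four cores crowded onto one socket's 37.5 MB.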
Related/future work
MC/s/core and Optimization
You may have noticed a measurement I added to the output above which is
MC/s/core. The purposes of this to visualize the added benefit of going to an additional compute thread. Except that there is jitter in the measurements, the general trend is that each additional thread reduces the calculations per second per core. For example, from test-5:Optimization algorithms like Particle Swarm (in Python or in Perl) can benefit by calculating multiple particles simultaneously. Thus, an optimal solution may be found faster by running multiple simulations at the highest MC/s/core rate across multiple cores, even if that would be suboptimal for a single run.
For an optimization algorithm it is more important to calculate the many sample points provided by the optimizer than to complete any single simulation quickly. Of course, for a single simulation you just want to run as fast as you can, but for an optimization you want to run as many simulation points as possible, so that you saturate all the processor cores you have available.
While I have not looked into this yet, I plan to implement parallel Particle Swarm as multiple `fork()`ed openEMS simulation instances.
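A minimal sketch of that idea using `concurrent.futures.ProcessPoolExecutor` (which forks worker processes on Linux). The fitness function here is a toy stand-in; a real version would launch openEMS with each particle's geometry and parse the result:

```python
from concurrent.futures import ProcessPoolExecutor

def run_simulation(particle):
    """Stand-in for one openEMS run. A real implementation would fork/exec
    openEMS with this particle's parameters (e.g. wire diameter, element
    length -- hypothetical here) and return a figure of merit from S11."""
    wire_mm, length_mm = particle
    return -(wire_mm - 2.0) ** 2 - (length_mm - 500.0) ** 2  # toy fitness

if __name__ == "__main__":
    # One swarm iteration: evaluate every particle in its own forked worker.
    swarm = [(1.0, 480.0), (2.0, 500.0), (3.0, 520.0)]
    with ProcessPoolExecutor(max_workers=3) as pool:
        fitness = list(pool.map(run_simulation, swarm))
    print(max(fitness))
```

Sizing `max_workers` so that each simulation runs near the peak MC/s/core point, rather than giving all cores to one run, is exactly the tradeoff discussed above.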