Replies: 3 comments 17 replies
Very interesting, thank you! Could you share the exact model that you used here, so that we don't compare apples to oranges? In the yagi thread we had a few versions, and you applied your change now, so we had better define the exact test case here.
This is yet another flawed benchmark. You see this behavior only because the simulation domain is small. I've posted my comment in another thread; please be sure to check it.
Just did a quick test today on different NUMA settings on E5 v4 and identified two key performance killers for big simulations: NUMA and extensions. Note that these are big simulations, and their characteristics are different from the small simulations tested by KJ7LNW.

Thus, it's apparent that NUMA non-awareness is the most important scaling problem, and it should be relatively easy to fix. I plan to include NUMA awareness in my own new engine, which is a complete rewrite. With my own engine, I'm seeing performance around 950 MC/s on this machine, even in Cluster-on-Die mode. It's a test kernel that operates outside openEMS, so the number is not final and the real ones should be lower, but I have high hopes ;-) If it doesn't work as expected, I also have a backup plan of adding NUMA support to the existing engine; that would be much easier to develop than my engine rewrite project, and would be the most worthwhile improvement.

Extensions are big performance killers, more so than I initially expected. With PML enabled, the same simulation no longer scales beyond 200 MC/s. With PML and Conducting Sheet enabled, it doesn't even scale beyond 100 MC/s. This completely explains the absence of any scaling on big workstations, like the Apple M1 Ultra. The culprits are probably poor data locality and too many locks. I already have some ideas in mind on how to fix them. Stay tuned.

For reference, here's an explanation of the different NUMA settings available on E5 v4 (Broadwell):
![Different NUMA settings](https://www.servethehome.com/wp-content/uploads/2016/03/Intel-Xeon-E5-2600-V4-Snoop-Mode.png)
When I first started running openEMS, I made the assumption that memory bandwidth was more important than CPU cache size. I also expected that inter-socket communication latency would slow down the computation. After running some tests on an old quad-socket E7-4890 v2 server, it turns out that I was completely wrong: much better performance is achieved by spreading the job across sockets to utilize as much cache as possible.
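The socket and memory bindings used in the tests below were done with `taskset` and `numactl` (`--membind`, `--interleave`, `--localalloc`). The same CPU pinning can be sketched from Python with `os.sched_setaffinity`; the core numbering here is a hypothetical contiguous layout (socket N owns cores N*15 .. N*15+14), so check `numactl --hardware` for a real machine's topology.

```python
import os

CORES_PER_SOCKET = 15  # E7-4890 v2 has 15 cores per socket

def socket_cores(socket_id: int, cores_per_socket: int = CORES_PER_SOCKET) -> set[int]:
    """Core IDs belonging to one socket, assuming contiguous numbering
    (hypothetical; real mappings are often interleaved across sockets)."""
    start = socket_id * cores_per_socket
    return set(range(start, start + cores_per_socket))

if __name__ == "__main__":
    # Pin this process (and any children, e.g. a spawned openEMS run) to
    # socket 3 -- roughly what `taskset -c 45-59 numactl --membind=3 ...` did.
    target = socket_cores(3) & set(range(os.cpu_count()))
    if target and hasattr(os, "sched_setaffinity"):  # only pin if those cores exist
        os.sched_setaffinity(0, target)
    print(sorted(socket_cores(3)))
```

Note that `sched_setaffinity` only controls CPU placement; the memory-policy half of the experiment (`--membind=3` vs. `--interleave` vs. `--localalloc`) still has to come from `numactl` or `libnuma`.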
Test 1, initial single-socket assumption
With the assumption above, I bound an openEMS job to a single 15-core socket (socket 3) using `taskset` and `numactl`:

Test 2, multi-socket with single-socket memory
On a whim during the simulation above, I used `taskset` to spread the cores across multiple sockets while it was running. The memory was still bound to the single socket 3, so I was surprised to see such a big difference. Here is an almost 50% increase in performance with that equivalent test:

Test 3, multi-socket with interleaved memory
I expected that since memory was bound to a single socket in test-2, performance would suffer compared to memory interleaving. I set up NUMA interleaving to spread memory across the sockets. This was better than test-1, but it was worse than test-2. Thus, CPU cache is more important than memory bandwidth:
Test 4, multi-socket with socket-local memory
The fourth test spread computation across multiple sockets, but let the process prefer memory allocated from the local node. Since interleaving turned out to be worse than binding memory to a single node, I was surprised to find that this worked only about as well as when memory was bound to a single socket:
Test 5, limit execution to 2 sockets
This is the fastest test, but read the analysis section below.
Test 6, limit execution to 3 sockets
Surprisingly slow:
Analysis
Since binding memory to a single node (`--membind=3`) actually worked about the same as memory allocation preferring the local node (`--localalloc`), this tells us that the computation is operating almost completely from CPU cache. The cache on these sockets is enormous (for 2014), as you can see here:

By spreading the computation across all 4 sockets, 150 megabytes of cache were available. When I ran `top` during the calculation, it appeared to use only ~250 megabytes of resident memory. Presumably the working set for the problem is less than that, so it is possible that the whole thing fits in cache.

There is one outlier: test 5, limited to only two sockets, performed the best. However, it was only better by ~3%, which could be within the margin of error. This indicates that there is more going on than just cache benefits. The system was not completely idle, though: there are other services running that I tried to work around, which may introduce jitter. If I get the opportunity to run these tests on a completely idle system, I will. I believe that these tests were performed on cores that were unallocated to any other process.
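A rough back-of-envelope check makes the "fits in cache" hypothesis plausible. The per-cell byte count below is an assumption (6 field components plus 6 update coefficients as 32-bit floats), not openEMS's actual memory layout; the 150 MB figure is four E7-4890 v2 sockets at 37.5 MB of L3 each.

```python
# Hypothetical FDTD storage: 6 field components (Ex..Hz) + 6 update
# coefficients per cell, each a 32-bit float -> 48 bytes per cell.
BYTES_PER_CELL = (6 + 6) * 4

def working_set_mb(cells: int) -> float:
    """Estimated working set in megabytes for a grid of `cells` cells."""
    return cells * BYTES_PER_CELL / 1e6

aggregate_cache_mb = 4 * 37.5  # four sockets x 37.5 MB L3 = 150 MB
# Under these assumptions, a ~3-million-cell grid needs ~144 MB --
# right at the edge of the aggregate L3:
print(working_set_mb(3_000_000), aggregate_cache_mb)
```

If the real per-cell footprint is larger (extensions, dumps, excitation data), the crossover point moves to proportionally fewer cells, which would be consistent with big simulations behaving so differently.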
These tests were performed using @LubomirJagos's optimized Yagi that he cleaned up in #158. The only difference is that I reduced the wire size from 5mm to 2mm for this simulation.
openEMS NUMA Awareness
While more investigation may be necessary to get a complete picture, I think the core scheduling algorithm for NUMA awareness in openEMS is pretty straightforward: allocate threads in cache-priority order. This is particularly important since, at least for small simulations like those I have been running, the benefit of an additional compute thread reaches a computational limit, and it does so before exhausting all available CPUs. This means that higher clock speeds with fewer cores and bigger caches should provide the best performance.
Here is the suggested core allocation order to use as much cache as possible:
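One way to sketch such a cache-priority order: hand out cores round-robin across sockets, so each new thread lands on the socket whose L3 is least shared. The 4-socket x 15-core contiguous numbering is a hypothetical layout, not a measured topology.

```python
SOCKETS = 4
CORES_PER_SOCKET = 15  # quad E7-4890 v2 (hypothetical contiguous numbering)

def cache_priority_order(sockets: int = SOCKETS,
                         cores: int = CORES_PER_SOCKET) -> list[int]:
    """Round-robin across sockets: core 0 of every socket first, then
    core 1 of every socket, and so on. Thread N gets order[N]."""
    return [s * cores + c for c in range(cores) for s in range(sockets)]

print(cache_priority_order()[:8])  # → [0, 15, 30, 45, 1, 16, 31, 46]
```

With this ordering, a 4-thread run gets one core per socket (the full 150 MB of aggregate L3), instead of four cores crowded onto one socket's 37.5 MB.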
Related/future work
MC/s/core and Optimization
You may have noticed a measurement I added to the output above which is
MC/s/core. The purposes of this to visualize the added benefit of going to an additional compute thread. Except that there is jitter in the measurements, the general trend is that each additional thread reduces the calculations per second per core. For example, from test-5:Optimization algorithms like Particle Swarm (in Python or in Perl) can benefit by calculating multiple particles simultaneously. Thus, an optimal solution may be found faster by running multiple simulations at the highest MC/s/core rate across multiple cores, even if that would be suboptimal for a single run.
For an optimization algorithm it is more important to calculate the many sample points provided by the optimizer than to complete any single simulation quickly. Of course, for a single simulation you just want to run as fast as you can, but for an optimization you want to run as many simulation points as possible, so that you saturate all the processor cores you have available.
While I have not looked into this yet, I plan to implement parallel Particle Swarm as multiple `fork()`ed openEMS simulation instances.
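A minimal sketch of that idea using `concurrent.futures.ProcessPoolExecutor` (which forks worker processes on Linux). The fitness function here is a toy stand-in; a real version would launch openEMS with each particle's geometry and parse the result:

```python
from concurrent.futures import ProcessPoolExecutor

def run_simulation(particle):
    """Stand-in for one openEMS run. A real implementation would fork/exec
    openEMS with this particle's parameters (e.g. wire diameter, element
    length -- hypothetical here) and return a figure of merit from S11."""
    wire_mm, length_mm = particle
    return -(wire_mm - 2.0) ** 2 - (length_mm - 500.0) ** 2  # toy fitness

if __name__ == "__main__":
    # One swarm iteration: evaluate every particle in its own forked worker.
    swarm = [(1.0, 480.0), (2.0, 500.0), (3.0, 520.0)]
    with ProcessPoolExecutor(max_workers=3) as pool:
        fitness = list(pool.map(run_simulation, swarm))
    print(max(fitness))
```

Sizing `max_workers` so that each simulation runs near the peak MC/s/core point, rather than giving all cores to one run, is exactly the tradeoff discussed above.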