Maximizing number of CPUs #569
-
Hi WindNinja team, I have a question about how to maximize the number of CPUs when running WindNinja. Our current setup uses downloaded (local) HRRR files and loops over the data. We have set num_threads in our configuration; here is the current config file:

How could this be scaled to use more CPUs? Any recommendations on your end are appreciated. Thank you.
-
Not sure how this would change on a cluster; you may have to do additional configuration on your end as part of the WindNinja installation process. But on a desktop/laptop, you just set num_threads to the number of CPUs you want to use, up to the maximum on your machine. The mass-only solver uses OpenMP; the momentum solver, through OpenFOAM, uses OpenMPI.

I've noticed that on my smaller machines (2 cores, 4 logical processors), both the mass and momentum solvers let me run num_threads up to 4. On my larger machine (64 cores split across two CPU units), the mass solver still lets me run num_threads up to 64, but the momentum solver only up to 32. This could have more to do with how a domain is decomposed/split into sections for the momentum solver than with the available number of threads, though.

I need to double-check this with someone who is out till tomorrow, but I believe that on our current HPC cluster, using Slurm with Singularity containers created from WindNinja's Dockerfile, we see something similar to what I just described. When handing out resources to the Singularity containers and Slurm processes, we had to specify twice as many cores as the num_threads set in the configuration files for our momentum solver runs. I'm not sure about the mass-only solver runs; I suspect they didn't have this constraint, though.

As far as how effective/efficient the calculations get with an increasing number of CPUs, I'm not sure. It did seem to me like the mass solver starts to slow down more than the momentum solver once you start adding much larger numbers of elements to calculate over while using the same number of CPUs.
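The "twice as many cores" workaround described above might look something like this as a Slurm batch script. This is only a sketch based on this thread: the container name, config path, and the 2x rule are assumptions, not documented WindNinja requirements.

```shell
#!/bin/bash
#SBATCH --job-name=windninja-momentum
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32   # 2 x num_threads (assumption from this thread, not a documented rule)

# Hypothetical momentum-solver run inside a Singularity container built
# from WindNinja's Dockerfile. The .cfg referenced here is assumed to
# set num_threads = 16, i.e. half the CPUs requested above.
singularity exec windninja.sif \
    WindNinja_cli /path/to/momentum_run.cfg
```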
But I can't remember exactly what I did for those tests. I think I was using 24 threads with somewhere between 1 million and 5 million cells (might have actually been 5 to 10 million; I'm pretty sure I haven't gone too far past 10 million), and the mass-only solver was still pretty fast compared to the momentum solver.

Let's do some quick calculations. 1038 x 1346 cells at 100 m doesn't make sense unless you just mean a single vertical layer, so is this a domain size of 103.8 km x 134.6 km at 100 m resolution? That single layer is 1038 x 1346 = 1,397,148 cells, and assuming WindNinja makes ~10 vertical layers of it (momentum solver), that's roughly 13.97 million cells. Trying to calculate the exact number of cells always gets quirky, though, because WindNinja's momentum solver usually does two rounds of refinement, splitting the bottom-most layers of cells 4 ways, so it could easily be up to ~20 million cells. I believe the mass-only solver always uses a minimum of 20 vertical layers with no near-ground cell splitting? I need to double-check these numbers, but if I'm understanding the case you are trying to run correctly, it does seem to be quite on the beefy side: lots of elements to calculate over.

The best thing would be to try out the case, maybe at much lower resolution first, increasing the resolution until it starts to slow down or even break, to see whether you can run the desired simulations with the resources you have available and what they will take. The solutions will be less accurate at the lower resolutions, but you should at least be able to test whether your cases are going to run into problems at the higher resolutions and whether there are specific slowdowns for your case.

I hope I understood what you were asking well enough and answered at least some of your questions. I'm still a bit confused by what you meant by "Would we start the days we iterate over in parallel?".
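The mesh-size arithmetic above can be reproduced in a few lines of shell. Note that the vertical layer count and refinement behavior are the rough guesses from this reply, not values confirmed from WindNinja's source.

```shell
# Rough cell-count estimate for a 1038 x 1346 DEM at 100 m resolution.
# The 10-layer figure is an assumption from this thread, and near-ground
# refinement (splitting bottom layers 4 ways) would push the total higher.
nx=1038; ny=1346
per_layer=$((nx * ny))          # cells in one horizontal layer
layers=10                       # assumed momentum-solver vertical layers
base=$((per_layer * layers))
echo "cells per layer: $per_layer"   # 1,397,148
echo "base mesh size:  $base"        # ~13.97 million before refinement
```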
-
Thank you for the information and insights. To add some clarification: the domain size I mentioned was taken from the DEM as a GeoTIFF, so we are indeed talking about a 103.8 km x 134.6 km domain at 100 m pixel size.

Reading your response, I also realized I forgot to mention that I tried increasing num_threads, but it seems to hit an upper limit. I am trying to understand where this upper limit is coming from and whether it is a WindNinja configuration I missed or an installation/setup issue.

Going back to the "days in parallel" idea: we iterate over each day as a single call to WindNinja. Our steps are to download the HRRR data for 24 hours, run WindNinja (with the above configuration), and repeat. Hence we could parallelize by running each day as a separate SLURM job.
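The per-day parallelization described above could be sketched as a Slurm job array, one array task per simulation day. Everything here is hypothetical: the start date, the download helper script, and the per-day .cfg naming are made up for illustration.

```shell
#!/bin/bash
#SBATCH --job-name=windninja-days
#SBATCH --array=0-29           # e.g. 30 days, one array task each
#SBATCH --cpus-per-task=24     # should match num_threads in each .cfg

# Each array task handles one day: fetch that day's HRRR files,
# then run WindNinja on them. Both helper names are assumptions.
day=$(date -d "2024-01-01 + ${SLURM_ARRAY_TASK_ID} days" +%Y%m%d)
./download_hrrr.sh "$day"                     # assumed 24-hour HRRR download step
WindNinja_cli "configs/windninja_${day}.cfg"  # assumed per-day config file
```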
-
I'm still looking through the code to see if there is any kind of potential limiter. So far I'm not seeing anything, other than wondering whether there would be smarter ways to divvy up the calculations between threads; I don't think we want to mess with that unless we have to, and I would assume it is already pretty well optimized. I've dealt with CUDA programming, but OpenMP is still pretty new to me, so it might take me a bit to study up and get more particular about how the calculations are being done. Those who would know more without studying the code are out of town for about two more weeks as well.

How are you measuring the number of CPUs/threads that are actually being used? If you check status using Unix commands like top, it is possible that one set of processes wraps up as another set begins, in such a way as to make it seem like fewer CPUs/threads are being used. Are you seeing the runtime of the simulations plateau?

If you are doing more than one run at a time, you COULD experiment with running separate WindNinja runs with separate .cfg files in parallel, each with a given num_threads, rather than trying to improve the speed of a single run. Though it sounds like that might be what you are already attempting; num_threads is just seemingly tapping out. It could also be that you are hitting some kind of compute limit on your system and might need to change how you run things if you still want them to run faster.

Part of what I'm hoping to do to understand the OpenMP code is to print the number of threads OpenMP actually uses during a given step. For now, I saw that if you run "WindNinja --version", WindNinja prints the maximum number of threads it thinks are available on the system/instance. It might be good to double-check that the number printed for your cases isn't just 24 (though I would be surprised if WindNinja didn't throw an error when num_threads is set greater than the number of threads available on a given system).
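Beyond top, a couple of generic Linux checks might help confirm how many threads a run actually spawns. The process name below is a guess, not a confirmed WindNinja binary name.

```shell
# Logical CPUs visible to this shell/job (what OpenMP will typically see):
nproc

# Count the OS threads of a running WindNinja process by PID
# (process name is an assumption; uncomment to use):
# pid=$(pgrep -n WindNinja_cli)
# ps -o nlwp= -p "$pid"    # nlwp = number of lightweight processes (threads)
```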
-
My information is indeed coming from top. There I see a max CPU percentage of 2400 while WindNinja is running, and when you expand the per-CPU information, half of the CPUs idle around 0. Current total runtime for one day is around 4 minutes. Here is the requested output from "WindNinja --version":

It does seem to recognize the number of CPUs x threads. For a little more background: we use the WindNinja outputs to force a spatially distributed snow model, which uses the information to calculate turbulent fluxes. We want to scale this model to run over larger domains (i.e., the Upper Colorado River), and the wind simulations are one aspect where we are trying to find ways to better scale our approach. I mainly started this discussion to explore options for the near-term future.
-
OpenFOAM (which is what we use for the mass and momentum solver) does not use hyperthreading. So the max num_threads you will be able to use with the mass and momentum solver is the number of physical cores. Hope that helps.
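That would explain the numbers earlier in the thread: if the 2400% in top and the 64 reported "cores" refer to logical processors, then with hyperthreading (typically two hardware threads per physical core on x86, which is an assumption here) the momentum solver would cap out at half that count. A trivial arithmetic sketch:

```shell
# Hyperthreading arithmetic: logical CPUs -> max momentum-solver threads.
logical=64           # hypothetical logical-processor count from the thread
threads_per_core=2   # typical x86 hyperthreading factor (assumption)
echo "max momentum-solver num_threads: $((logical / threads_per_core))"
```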