Optimizing History with the PFIO server
Applications using the MAPL_Cap, such as the GEOSgcm, GEOSctm, GEOSldas, GCHP, etc., use MAPL_History for diagnostic output. All output is routed through the PFIO server. Extra MPI resources, beyond the needs of the model itself, can be allotted to the PFIO server to exploit asynchronous output and decrease your application's wall time. By default the user does not have to do this, and History will work just fine without it, although perhaps not optimally. This page explains how and when to configure the PFIO server to run on separate resources. While we can give general recommendations for configuring the PFIO server, we cannot emphasize enough that you should run your application multiple times and tune it to find the optimal configuration for your use case: what is appropriate in one use case may not be in another.
As stated in the introduction, the MAPL_History component always writes its output via the PFIO server. However, in the default case (i.e. the user does nothing other than start the application on the number of MPI tasks needed for the model) the PFIO server runs on the same MPI resources as the application. Each time the run method of History executes, it does not return until all the files to be written in that step have been completed. All data aggregation and writing is done on the same MPI tasks as the rest of the application, so the application cannot proceed until all output for that step is complete. There is no asynchronicity or overlap between compute and write in this case.
At low model resolutions, or in cases with little History output, this is sufficient. For example, if you are running GEOSgcm.x at c24/c48/c90 for development purposes with modest History output on 2 or 3 nodes, there is no sense in dedicating extra resources to the PFIO server. As a concrete example, by default the GEOSgcm uses NX = 4 and NY = 24 at c24 and c48, so you launch your application with 96 MPI tasks.
To exploit asynchronous output from History, we recommend the multigroup server option of the PFIO server. With the PFIO server, the model (or application) does not write data to disk directly. Instead, it forwards the data to the PFIO server and continues running without waiting for the write to finish. For the best performance, users should try different PFIO configurations for a specific run. In general, though, there is a "reasonable" estimated configuration to start with: if your model requires NUM_MODEL_PES cores, each node has NUM_CORES_PER_NODE cores, and the total number of History collections is NUM_HIST_COLLECTION, then
MODEL_NODE = NUM_MODEL_PES / NUM_CORES_PER_NODE
O_NODES = (NUM_HIST_COLLECTION + 0.1*NUM_MODEL_PES) / NUM_CORES_PER_NODE
NPES_BACKEND = NUM_HIST_COLLECTION / O_NODES
TOTAL_PES = (MODEL_NODE + O_NODES) * NUM_CORES_PER_NODE
where each of the above numbers should be rounded up to an integer.
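As a worked illustration of these formulas, the sketch below computes the sizing for a hypothetical run; all of the input numbers (1536 model cores, 128 cores per node, 20 History collections) are assumptions for the example, not recommendations:

```shell
# Hypothetical inputs -- substitute the numbers for your own run.
NUM_MODEL_PES=1536
NUM_CORES_PER_NODE=128
NUM_HIST_COLLECTION=20

# Round up with integer arithmetic: ceil(a/b) = (a + b - 1) / b.
# 0.1*NUM_MODEL_PES is approximated by integer division by 10.
MODEL_NODE=$(( (NUM_MODEL_PES + NUM_CORES_PER_NODE - 1) / NUM_CORES_PER_NODE ))
O_NODES=$(( (NUM_HIST_COLLECTION + NUM_MODEL_PES / 10 + NUM_CORES_PER_NODE - 1) / NUM_CORES_PER_NODE ))
NPES_BACKEND=$(( (NUM_HIST_COLLECTION + O_NODES - 1) / O_NODES ))
TOTAL_PES=$(( (MODEL_NODE + O_NODES) * NUM_CORES_PER_NODE ))

echo "MODEL_NODE=${MODEL_NODE} O_NODES=${O_NODES} NPES_BACKEND=${NPES_BACKEND} TOTAL_PES=${TOTAL_PES}"
```

With these inputs the estimate comes out to 12 model nodes plus 2 output-server nodes, i.e. 1792 total PEs, with 10 backend PEs per output node.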
The run command line would then look like:
mpirun -np TOTAL_PES ./GEOSgcm.x --npes_model NUM_MODEL_PES --nodes_output_server O_NODES --oserver_type multigroup --npes_backend_pernode NPES_BACKEND
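For illustration, the launch line can be assembled from concrete sizing values and echoed for inspection before actually submitting the job. All of the numbers below (1536 model PEs, 2 output-server nodes, 10 backend PEs per node, 1792 total PEs) are assumed values for the example:

```shell
# Hypothetical sizing values -- replace with your own tuned numbers.
TOTAL_PES=1792
NUM_MODEL_PES=1536
O_NODES=2
NPES_BACKEND=10

# Build the full command line as a single string.
CMD="mpirun -np ${TOTAL_PES} ./GEOSgcm.x \
--npes_model ${NUM_MODEL_PES} \
--nodes_output_server ${O_NODES} \
--oserver_type multigroup \
--npes_backend_pernode ${NPES_BACKEND}"

# Print the assembled command for inspection; drop the echo to launch.
echo "$CMD"
```

Previewing the assembled command this way makes it easy to sanity-check that TOTAL_PES matches the node allocation in your batch script before the job runs.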