@@ -93,6 +93,47 @@ it is also recommended to wrap any ``BL_PROFILE_TINY_FLUSH();`` calls in
9393informative ``amrex::Print()`` lines to ensure accurate identification of each
9494set of timers.
9595
96+ Hot Spots and Load Balance
97+ ~~~~~~~~~~~~~~~~~~~~~~~~~~
98+
99+ The output of TinyProfiler can help us to identify hot spots. For example,
100+ the following output shows the top three hot spots of a linear solver test
101+ running on 4 MPI processes.
102+
103+ .. highlight:: console
104+
105+ ::
106+
107+     --------------------------------------------------------------------------------------------
108+     Name                                    NCalls   Excl. Min   Excl. Avg   Excl. Max     Max %
109+     --------------------------------------------------------------------------------------------
110+     MLPoisson::Fsmooth()                       560      0.4775      0.4793      0.4815    34.97%
111+     MLPoisson::Fapply()                        114      0.1103       0.113      0.1167     8.48%
112+     FabArray::Xpay()                           109         0.1      0.1013      0.1038     7.54%
113+
114+ In this test, there are 16 boxes evenly distributed among 4 MPI processes. The
115+ output above shows a well-balanced load: Excl. Min and Excl. Max differ by only
116+ a few percent. With an unbalanced load, the results can be very different and
117+ sometimes misleading. For example, if we put 2, 2, 6 and 6 boxes on processes
118+ 0, 1, 2 and 3, respectively, the top three hot spots now include two MPI
119+ communication functions, ``FillBoundary`` and ``ParallelCopy``.
120+
121+ .. highlight:: console
122+
123+ ::
124+
125+     --------------------------------------------------------------------------------------------
126+     Name                                    NCalls   Excl. Min   Excl. Avg   Excl. Max     Max %
127+     --------------------------------------------------------------------------------------------
128+     FillBoundary_finish()                      607     0.01568      0.3367      0.6574    41.97%
129+     MLPoisson::Fsmooth()                       560      0.2133      0.4047      0.5973    38.13%
130+     FabArray::ParallelCopy_finish()            231    0.002977     0.09748      0.1895    12.10%
131+
132+ The MPI communication appears slow because the lightly loaded processes have
133+ to wait for messages sent by the heavily loaded processes. See also
134+ :ref:`sec:profopts` for a diagnostic option that may provide more insight into
135+ the load imbalance.
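
One rough way to spot this pattern in the tables above is to compare the spread
between Excl. Min and Excl. Max with Excl. Avg for each timer. The following is
a minimal sketch of that check, not part of TinyProfiler or AMReX. The helper
names ``parse_row``, ``spread`` and ``flag_imbalanced`` and the 0.25 threshold
are hypothetical, chosen only for illustration, and the rows are copied from
the two outputs shown in this section.

.. code-block:: python

    # Rows copied from the balanced and unbalanced outputs above.
    BALANCED = [
        "MLPoisson::Fsmooth() 560 0.4775 0.4793 0.4815 34.97%",
        "MLPoisson::Fapply() 114 0.1103 0.113 0.1167 8.48%",
        "FabArray::Xpay() 109 0.1 0.1013 0.1038 7.54%",
    ]
    UNBALANCED = [
        "FillBoundary_finish() 607 0.01568 0.3367 0.6574 41.97%",
        "MLPoisson::Fsmooth() 560 0.2133 0.4047 0.5973 38.13%",
        "FabArray::ParallelCopy_finish() 231 0.002977 0.09748 0.1895 12.10%",
    ]


    def parse_row(line):
        """Split one table row into its name and five numeric columns."""
        # Split from the right so a timer name containing spaces stays intact.
        name, ncalls, tmin, tavg, tmax, pct = line.rsplit(None, 5)
        return {
            "name": name,
            "ncalls": int(ncalls),
            "min": float(tmin),
            "avg": float(tavg),
            "max": float(tmax),
            "pct": float(pct.rstrip("%")),
        }


    def spread(row):
        """Return (Excl. Max - Excl. Min) / Excl. Avg for one parsed row."""
        return (row["max"] - row["min"]) / row["avg"]


    def flag_imbalanced(lines, threshold=0.25):
        """Return timer names whose relative spread exceeds the threshold.

        The default threshold is an arbitrary illustrative value.
        """
        rows = [parse_row(line) for line in lines]
        return [row["name"] for row in rows if spread(row) > threshold]


    if __name__ == "__main__":
        print("balanced:  ", flag_imbalanced(BALANCED))
        print("unbalanced:", flag_imbalanced(UNBALANCED))

With these inputs, none of the timers in the balanced run exceeds the threshold,
while all three timers in the unbalanced run do. A large spread on a
communication timer such as ``FillBoundary_finish()`` is a hint that processes
are waiting on each other, not that the communication itself is expensive.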
136+
96137.. _sec:full:profiling:
97138
98139Full Profiling