Bonus Task: Optimization of Nogtom Module
Changes
Brief Expression For Loops
For nested loops, we use a simpler form. For example:
We rewrite the loop block above with `do concurrent`, which gives a simple, standard-conforming form.
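The original snippets are not reproduced in this description. As a hedged illustration only, the sketch below shows the kind of rewrite meant here on a hypothetical two-dimensional array; the names `a`, `b`, `ni` and `nj` are assumptions and do not come from the module.

```fortran
program do_concurrent_sketch
  implicit none
  ! Hypothetical sizes and arrays, chosen only for illustration.
  integer, parameter :: ni = 4, nj = 3
  real :: a(nj,ni), b(nj,ni)
  integer :: i, j

  ! Classic nested loop form.
  do i = 1, ni
    do j = 1, nj
      a(j,i) = real(j) + 10.0 * real(i)
    end do
  end do

  ! The same iteration space written as one do concurrent header.
  ! This is only valid because no iteration reads or writes data
  ! that another iteration writes.
  do concurrent (i = 1:ni, j = 1:nj)
    b(j,i) = real(j) + 10.0 * real(i)
  end do

  if (any(a /= b)) error stop 'do concurrent result differs'
  print *, 'results match'
end program do_concurrent_sketch
```

Note that `do concurrent` is a promise to the compiler that the iterations are independent; it does not by itself guarantee parallel or faster execution.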
For elementwise operations with matching subscripts, such as an assignment or calculation at the same position in the left-hand and right-hand arrays, a more concise form is available. For instance:
We replace the loop block above with a modern Fortran array-section assignment, which is concise and standard-conforming.
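A minimal sketch of this kind of rewrite, assuming a hypothetical one-dimensional array `src` of length `nqx`; the arithmetic is invented for illustration and is not taken from the module.

```fortran
program array_assign_sketch
  implicit none
  integer, parameter :: nqx = 5
  ! Hypothetical arrays, not names from the module.
  real :: src(nqx), loop_out(nqx), slice_out(nqx)
  integer :: n

  src = [(real(n), n = 1, nqx)]

  ! Explicit loop: same subscript on both sides of the assignment.
  do n = 1, nqx
    loop_out(n) = 2.0 * src(n) + 1.0
  end do

  ! Equivalent array-section assignment.
  slice_out(:) = 2.0 * src(:) + 1.0

  if (any(loop_out /= slice_out)) error stop 'array assignment differs'
  print *, 'results match'
end program array_assign_sketch
```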
Similarly, for more complex cases of array slicing assignment, such as
We modify it to the following code:
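The original snippets are not reproduced here. As a hedged example of a more complex case, the sketch below uses the expression `aamax = maxval(abs(qlhs(:,n)))` that is named under Evaluation2 and compares it with an explicit loop. The loop form, the contents of `qlhs` and the chosen column `col` are assumptions made for illustration.

```fortran
program maxval_sketch
  implicit none
  integer, parameter :: nqx = 5
  real :: qlhs(nqx,nqx), aamax_loop, aamax_slice
  integer :: n, jn, col

  ! Fill the matrix with hypothetical values.
  do n = 1, nqx
    do jn = 1, nqx
      qlhs(jn,n) = real(jn - 2*n)
    end do
  end do

  col = 2   ! hypothetical column to inspect

  ! Assumed loop form: largest absolute value in one column.
  aamax_loop = 0.0
  do jn = 1, nqx
    if (abs(qlhs(jn,col)) > aamax_loop) aamax_loop = abs(qlhs(jn,col))
  end do

  ! Array-section form combined with intrinsic reductions.
  aamax_slice = maxval(abs(qlhs(:,col)))

  if (aamax_loop /= aamax_slice) error stop 'maxval result differs'
  print *, 'results match'
end program maxval_sketch
```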
We have verified this use of array-section assignment in `Test/Evaluation2`, which confirms that the rewrite is functionally correct.
Remove Argsort
In lines 1727 to 1790 of `mod_micro_nogtom_old.F90`, which are the target blocks, we find that the `argsort` function is redundant. This follows from the output of `Test/Evaluation1`, an experiment that verifies correctness without `argsort`.
From the same output we can also show that the result of `argsort`, i.e. `iorder`, has no effect within the target code blocks. It can also be removed step by step by logical argument. For example:
For the code above, the operation on `sinksum` recalculates `sinksum` in the order given by `iorder`. Notice the assignment in the loop:
where `lind2(jo,jn)` and `sinksum(jo)` do not depend on the order in which `jo` iterates. So we can drop `jo` and rewrite the block as the following code:
Since this change in how `sinksum` is calculated does not change its value, the recalculation is redundant. The same applies to `ratio`. The target blocks can therefore be simplified to `lind2 = qsexp < d_zero`, and the calculations of `sinksum` and `ratio` are removed.
For the loop:
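The original loop is not reproduced in this description. As a hedged stand-in, the sketch below builds a hypothetical loop of the shape discussed here: for each negative entry `q(jo,jc)` it scales the entry and writes the negated value to the mirrored position. The constant factor `zratio`, the matrix contents, the helper `scale_sinks` and the permutation used for `iorder` are all assumptions. Under the further assumption that the matrix is antisymmetric, each pair is touched exactly once, so the row order does not matter; the sketch checks this for one permutation and says nothing about the real module.

```fortran
program symmetric_order_sketch
  implicit none
  integer, parameter :: nqx = 5
  real, parameter :: zratio = 0.5        ! hypothetical scaling factor
  integer :: natural(nqx), iorder(nqx)
  real :: base(nqx,nqx), a(nqx,nqx), b(nqx,nqx)
  integer :: n, jn

  natural = [(n, n = 1, nqx)]
  iorder  = [3, 1, 5, 2, 4]              ! hypothetical stand-in for argsort output

  ! Antisymmetric test matrix: base(jn,n) = -base(n,jn).
  do n = 1, nqx
    do jn = 1, nqx
      base(n,jn) = real(n - jn)
    end do
  end do

  a = base
  call scale_sinks(a, iorder)            ! rows visited in permuted order
  b = base
  call scale_sinks(b, natural)           ! rows visited as 1..nqx

  if (any(a /= b)) error stop 'row order changed the result'
  print *, 'results match'

contains

  ! Scale every negative entry and mirror it with opposite sign.
  subroutine scale_sinks(q, order)
    real, intent(inout) :: q(:,:)
    integer, intent(in) :: order(:)
    integer :: m, jo, jc
    do m = 1, size(order)
      jo = order(m)
      do jc = 1, size(q, 2)
        if (q(jo,jc) < 0.0) then
          q(jo,jc) = zratio * q(jo,jc)
          q(jc,jo) = -q(jo,jc)
        end if
      end do
    end do
  end subroutine scale_sinks

end program symmetric_order_sketch
```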
It can also be shown that this symmetric assignment to `qsexp(jo,jn)` and `qsexp(jn,jo)` is independent of the row traversal order `jo = iorder(n)`. Therefore `jo` can be replaced directly by `n`, and the code is modified to:
Multiplication Optimization
As for the following block:
For the code above, we note that the accumulation of `rainh` contains the expression `dtgdp * pfplsx(n,j,i,k+1) * dp`. In addition, `dtgdp` equals `dt * egrav / dp`. By the commutative property of multiplication, `dtgdp * dp` equals `dt * egrav / dp * dp`, which simplifies to `dt * egrav`.
Thus, the calculation of `rainh` can be simplified to `rainh = rainh + wlhvocp * dt * egrav * pfplsx(n,j,i,k+1)` and `rainh = rainh + wlhsocp * dt * egrav * pfplsx(n,j,i,k+1)`.
Furthermore, we can remove the temporary variables `rain` and `rainh` by accumulating directly into the finally assigned variables `sumq1` and `sumh1`. Applying `do concurrent` as well gives the final version.
In `Test/Evaluation3`, we measured the performance, and this version is 60% faster than the original.
Results
For the given input file ISC24.in:
The time spent in mod_micro_nogtom.F90 drops from 384 s to 339 s, an improvement of about 12%.
The share of run time spent in mod_micro_nogtom.F90 decreases from 30.7% to 28.1%, as profiled with VTune.
Evaluation and Experiment
We ran the following evaluations and experiments to check correctness and to measure performance.
Evaluation1: In this optimization, we remove the function `argsort` and the code blocks that use its result, `iorder`. We show how this works in mod_micro_nogtom.F90. The same holds for random inputs, which shows that `argsort` can be deleted with no impact on correctness while saving 10% of the time needed to run this file.
Evaluation2: In this optimization, we verify the correctness of rewrites such as `aamax = maxval(abs(qlhs(:,n)))` and
Through the output, we show that the results are correct.
Evaluation3: In this experiment, we speed up the block by logical analysis. We show how it works in mod_micro_nogtom.F90 and measure the speedup. Logical optimization together with removing local variables gives a speedup of about 60% on the example blocks. We tried many approaches, including splitting independent results so that they are computed separately. In the end we applied the multiplication optimization and `do concurrent`.
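The multiplication optimization referred to in Evaluation3 can be sketched as follows. This is a hedged illustration, not the module's code: the numeric values of `egrav`, `dt`, `dp`, `wlhvocp` and the array `flux` (standing in for `pfplsx(n,j,i,k+1)`) are assumptions. Because floating-point division and multiplication round at each step, the two forms can differ in the last bits, so the check uses a relative tolerance.

```fortran
program multiplication_sketch
  implicit none
  integer, parameter :: rk = kind(1.0d0)
  ! All numeric values below are assumed for illustration.
  real(rk), parameter :: egrav   = 9.80665_rk
  real(rk), parameter :: dt      = 30.0_rk
  real(rk), parameter :: dp      = 1250.0_rk
  real(rk), parameter :: wlhvocp = 2.5e3_rk
  real(rk), parameter :: flux(3) = [1.0e-4_rk, 2.5e-4_rk, 0.5e-4_rk]
  real(rk) :: dtgdp, rainh_old, rainh_new
  integer :: n

  dtgdp = dt * egrav / dp

  rainh_old = 0.0_rk
  rainh_new = 0.0_rk
  do n = 1, size(flux)
    ! Form described as the original: divide by dp, then multiply by dp.
    rainh_old = rainh_old + wlhvocp * dtgdp * flux(n) * dp
    ! Simplified form: dtgdp * dp replaced by dt * egrav.
    rainh_new = rainh_new + wlhvocp * dt * egrav * flux(n)
  end do

  if (abs(rainh_old - rainh_new) > 1.0e-12_rk * abs(rainh_old)) then
    error stop 'simplified form differs beyond rounding'
  end if
  print *, 'results agree to rounding'
end program multiplication_sketch
```

The simplified form saves one division and one multiplication per accumulated term, at the cost of results that are not guaranteed to be bitwise identical to the original.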
Experiment1: In this experiment, we test replacing the Bubble Sort algorithm in the function `argsort` with Quick Sort. Quick Sort performs better in mod_micro_nogtom.F90 only when nqx > 100. For the current inputs, with nqx = 5, 7 or 10, `argsort` becomes much slower with Quick Sort because of its more complex computation, so Bubble Sort cannot be replaced by Quick Sort here. We still include this test in the hope that Quick Sort can be an option for a future version if `argsort` is reused and nqx grows to a larger scale.
Experiment2: In this experiment, we change the loop order from k, i, j, n to n, k, i, j and use `do concurrent` to speed up. Moving n to the outermost loop reduces the time spent on the if/else statement. However, this brings no gain for the block, even with `do concurrent`. We keep only the `do concurrent` replacement, for better readability.
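The loop reordering in Experiment2 can be sketched as below. The array `a`, its index order `(n,j,i,k)` (borrowed from the `pfplsx(n,j,i,k+1)` reference), the branch condition `n == 1` and the arithmetic are all assumptions made for illustration.

```fortran
program loop_order_sketch
  implicit none
  ! Hypothetical extents and arrays.
  integer, parameter :: nk = 3, ni = 4, nj = 2, nqx = 5
  real :: a(nqx,nj,ni,nk), b(nqx,nj,ni,nk)
  integer :: k, i, j, n

  ! Order k, i, j, n: the branch on n is evaluated in the innermost loop.
  do k = 1, nk
    do i = 1, ni
      do j = 1, nj
        do n = 1, nqx
          if (n == 1) then
            a(n,j,i,k) = 0.0
          else
            a(n,j,i,k) = real(n * (i + j + k))
          end if
        end do
      end do
    end do
  end do

  ! Order n, k, i, j: the branch is decided once per value of n.
  do n = 1, nqx
    if (n == 1) then
      do concurrent (k = 1:nk, i = 1:ni, j = 1:nj)
        b(n,j,i,k) = 0.0
      end do
    else
      do concurrent (k = 1:nk, i = 1:ni, j = 1:nj)
        b(n,j,i,k) = real(n * (i + j + k))
      end do
    end if
  end do

  if (any(a /= b)) error stop 'loop reordering changed the result'
  print *, 'results match'
end program loop_order_sketch
```

One possible reason such a reordering may not pay off, offered here only as a guess, is that with n as the first array index the reordered loops no longer walk memory contiguously.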
Experiment3: In this experiment, we apply loop unrolling to the mysolve subroutine. We replace nqx with its actual value (nqx = 5 for the input file ISC24.in) so that the compiler can optimize the loops. However, this does not help much at such a small nqx.
To compare what we changed in the target file, you can open mod_micro_nogtom.F90 and mod_micro_nogtom_old.F90 side by side.