Detailed Changes in the Pull Request
This pull request includes the following major changes to the `mod_micro_nogtom` module:

1. Added Compiler Directives

OpenMP directives were added to the code to exploit SIMD instructions, which improved its performance. The `!$omp simd` directives were placed on loops to enable vectorization, allowing the compiler to process multiple data elements in parallel. The compiler vectorization report guided this work: it pointed out candidate loops and informed where the directives should go.

The `!dir$ ivdep` directive was added to tell the compiler that the annotated loops carry no dependencies that would prevent vectorization. It is an assertion by the programmer: the compiler ignores assumed (unproven) dependencies rather than checking for them.

The `!dir$ vector always` directive was added above the initialization of arrays, such as `sumh1(:,:,:) = d_zero`, to ensure that the compiler always vectorizes them.

The `!dir$ novector` directive was added above loops that iterate from `1` to `nqx` to instruct the compiler not to vectorize them. This decision was based on the observation that `nqx` is small (found to be 5), so vectorizing these loops may incur an overhead that decreases performance.

We also added `!$omp parallel do` directives to check whether threading could bring any performance improvement, but the overhead of threading turned out to outweigh the gain. We did not remove these directives; instead, we run the application after exporting `OMP_NUM_THREADS=1`, which makes them redundant.

2. Performed Scalar Expansion
Scalar expansion has been performed on several scalar variables to allow better vectorization of the loops. The following expanded arrays were introduced:

- `tnew_expanded`
- `dp_expanded`
- `qe_expanded`
- `tmpl_expanded`
- `tmpi_expanded`
- `zdelta_expanded`
- `phases_expanded`

This technique made it possible to vectorize some loops that could not otherwise have been vectorized, because every iteration overwrote the same scalar variable.
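As an illustration of the technique only, here is a minimal sketch of scalar expansion under stated assumptions: the array `t`, the size `n`, and the formulas are hypothetical and are not taken from `mod_micro_nogtom`; only the `tnew` / `tnew_expanded` naming follows the list above. In a loop this simple a compiler may well privatize the scalar on its own, so the sketch shows the shape of the transformation rather than a case that strictly requires it.

```fortran
program scalar_expansion_sketch
  ! Hypothetical example: names, sizes and formulas are illustrative,
  ! not the actual code of mod_micro_nogtom.
  implicit none
  integer, parameter :: wp = kind(1.0d0)
  integer, parameter :: n = 1024
  real(wp) :: t(n), out_scalar(n), out_expanded(n)
  real(wp) :: tnew               ! scalar temporary, overwritten every iteration
  real(wp) :: tnew_expanded(n)   ! expanded temporary: one element per iteration
  integer :: i

  ! Fill the input with arbitrary test values.
  do i = 1, n
    t(i) = 250.0_wp + 0.05_wp*real(i, wp)
  end do

  ! Before: every iteration writes the same scalar.
  do i = 1, n
    tnew = t(i) - 273.15_wp
    out_scalar(i) = tnew*tnew + 2.0_wp*tnew
  end do

  ! After: each iteration writes its own array element, so no iteration
  ! overwrites a value that another iteration uses.
  !$omp simd
  do i = 1, n
    tnew_expanded(i) = t(i) - 273.15_wp
    out_expanded(i) = tnew_expanded(i)*tnew_expanded(i) + 2.0_wp*tnew_expanded(i)
  end do

  ! Both forms must agree (small tolerance allows for vectorized rounding).
  if (maxval(abs(out_scalar - out_expanded)) > 1.0e-9_wp) error stop 'mismatch'
  print '(a)', 'scalar expansion check passed'
end program scalar_expansion_sketch
```

The expanded array costs extra memory (one element per iteration instead of one scalar), which is the usual trade-off of this transformation. Without OpenMP enabled at compile time, the `!$omp simd` line is treated as an ordinary comment and the program still runs.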
Consider the following loop in the original code
All the scalars that were being assigned to, i.e., `tnew`, `dp`, `qe`, `tmpl` and `tmpi`, were replaced with their vector versions.

Similar changes have been performed for the variables `zdelta` and `phases`.

3. Restructured Loops for Efficiency
The structure of some loops has been modified to make the code more efficient. Consider the following loop in the original code
which was restructured in the following manner to avoid the extra computation taking place `kz` times for each combination of `(i, j)`.

The modified code stores the sum values in a temporary array `cloud_sum_calc` first, which is then used to modify the `cldtopdist` array.

Correctness Validation
The team has ensured the correctness of the changes by comparing the output file generated by the modified implementation with the output file generated by the original implementation. The experiments were conducted on the PARAMSANGANAK supercomputer at IIT Kanpur.
`lrcemip_perturb` was set to `false` to disable any randomization, so that the validity of our output could be checked.

Build Script
Run Script
Performance Improvements
We checked the performance of the application, specifically the `nogtom` module, by profiling it using `VTune` on PARAMSANGANAK. Since the code in the module was serial, we used 48 processes, all on one node, and measured the total compute time of the `nogtom` subroutine. The input files were altered to run for 1 day instead of the 10 days in the original input file.

For the smaller input file `isc24_small.in`, we observed a speedup of about 112.3% (ratio of original to optimized time), from about 300 seconds to 267 seconds. The time is the overall compute time of the `nogtom` subroutine for all 48 processes.

As we had expected from the vectorization of instructions, the improvement was greater on the larger input file, `isc24.in`: a speedup of about 123.1%, from 6331 seconds to 5143 seconds.

Submission for the Bonus Task
This pull request is the submission for the bonus task of RegCM in the Student Cluster Competition (SCC) at ISC'24 from Team ExaDecimals, IIT Kanpur.

The changes described above aim to improve the performance and efficiency of the `mod_micro_nogtom` module while maintaining the correctness of the implementation. The team has put significant effort into optimizing the code and is confident that these changes will contribute to the overall performance of the RegCM model.