Avoid spread intrinsic #1281
I know this is the annoying part, but GFN-FF would likely benefit the most from improved parallelization (i.e. in #1240), probably much more than from saving a few ms here and there.
Unfortunately, with GCC it is not just a few ms here and there.
For GFN-FF and |
Signed-off-by: Igor S. Gerasimov <[email protected]>
@thfroitzheim, ping :-)
thfroitzheim
left a comment
LGTM
With gfortran, the `spread` intrinsic is not optimized (more precisely, it is compiled to a libgfortran call), so it significantly hurts performance, especially for 3-body terms.

This patch speeds up the gradient code by a factor of 1.5-2 with gfortran, as well as other parts that I did not measure. For some reason it also affects ifx, but there it gives only a 10% speedup for the `deriv_atm_triple` subroutine.

I have used unrolled loops since they enable extra optimizations with ifx: the time improved from 680 to 625 ms for my input. Ordinary loops do not give this improvement; see the assembly difference here: https://godbolt.org/z/zs6e7vzT4. The first source is the original code, the second is the version presented in this PR, and the third uses loops.

I did not touch the initialization/IO parts where `spread` is used.
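
For readers unfamiliar with the pattern, here is a minimal sketch of the three variants compared above. It is not code from this PR: the program name and the arrays `dG`, `vec`, `g_spread`, `g_loop`, and `g_unrolled` are hypothetical, and the 3x3 shape is only an assumption chosen to resemble a Cartesian gradient contribution.

```fortran
program spread_vs_loop
   implicit none
   integer, parameter :: wp = kind(1.0d0)
   real(wp) :: dG(3), vec(3, 3)
   real(wp) :: g_spread(3, 3), g_loop(3, 3), g_unrolled(3, 3)
   integer :: i, j

   ! Hypothetical input data, for illustration only
   dG = [1.0_wp, 2.0_wp, 3.0_wp]
   vec = reshape([(real(i, wp), i = 1, 9)], [3, 3])

   ! Variant 1: spread builds a temporary with tmp(i, j) = dG(j);
   ! with gfortran this goes through a libgfortran call
   g_spread = spread(dG, 1, 3) * vec

   ! Variant 2: ordinary loop over the columns, no temporary needed
   do j = 1, 3
      g_loop(:, j) = dG(j) * vec(:, j)
   end do

   ! Variant 3: manually unrolled loop
   g_unrolled(:, 1) = dG(1) * vec(:, 1)
   g_unrolled(:, 2) = dG(2) * vec(:, 2)
   g_unrolled(:, 3) = dG(3) * vec(:, 3)

   ! All three variants must produce the same result
   if (any(abs(g_spread - g_loop) > epsilon(1.0_wp))) error stop "loop differs"
   if (any(abs(g_spread - g_unrolled) > epsilon(1.0_wp))) error stop "unrolled differs"
   print '(a)', "all variants agree"
end program spread_vs_loop
```

The sketch only demonstrates that the variants are numerically equivalent. Whether the unrolled form is actually faster depends on the compiler and the surrounding code, so any such change should be timed on a real input.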