- 
                Notifications
    
You must be signed in to change notification settings  - Fork 15.1k
 
[flang-rt] Optimise ShallowCopy and use it in CopyInAssign #140569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Using Descriptor.Element<>() when iterating through a rank-1 array is currently inefficient, because the generic implementation suitable for arrays of any rank makes the compiler unable to perform optimisations that would make the rank-1 case considerably faster. This is currently done inside ShallowCopy, as well as inside Assign where the implementation of elemental copies is equivalent to ShallowCopyDiscontiguousToDiscontiguous. To address that, add a DescriptorIterator abstraction specialised both for the optimised rank-1 case as well as for the generic case, and use that throughout ShallowCopy to iterate over the arrays. Furthermore, depending on the pointer type passed to memcpy, the optimiser can remove the memcpy calls from ShallowCopy altogether which can result in substantial performance improvements on its own. Check the element size throughout ShallowCopy and use the pointer type that matches it where applicable to make these optimisations possible. Finally, replace the implementation of elemental copies inside Assign to make use of the ShallowCopy* family of functions whenever possible. For the thornado-mini application, this reduces the runtime by 27.7%. Signed-off-by: Kajetan Puchalski <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this! The results are promising!
Please read my comments inlined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates. Is the speedup the same after the most recent changes?
Some of these changes look heavily dependent upon particular compiler optimizations. Which compiler are you using to build flang-rt? Is there a similar speedup for another compiler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update!
Please add unittests in flang-rt/unittests/Runtime.
| 
          
 ✅ With the latest revision this PR passed the C/C++ code formatter.  | 
    
bce4c63    to
    e3991f8      
    Compare
  
    
          
 It is, no substantial changes compared to the original. 
 I normally build with clang + lld, but I double checked with gcc + ld and performance looks the same. It looks like both compilers can handle it.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you for the updates!
Please give others some time to take another look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the patch. There are just a few minor comments.
Co-authored-by: Yusuke MINATO <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/205/builds/10090 Here is the relevant piece of the build log for the reference | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/204/builds/10112 Here is the relevant piece of the build log for the reference | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/203/builds/11299 Here is the relevant piece of the build log for the reference | 
    
| 
           kNoAsyncId changed to kNoAsyncObject 2-3 days ago hence the failures, I already merged a fix.  | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/140/builds/23618 Here is the relevant piece of the build log for the reference | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/153/builds/32558 Here is the relevant piece of the build log for the reference | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/157/builds/28751 Here is the relevant piece of the build log for the reference | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/80/builds/13606 Here is the relevant piece of the build log for the reference | 
    
          
 Always make sure that the pre commit ci pass before merging.  | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/53/builds/16175 Here is the relevant piece of the build log for the reference | 
    
          
 It did, the issue was that I did not freshly rebase the PR before merging and the variable changed after the last rebase.  | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/130/builds/13307 Here is the relevant piece of the build log for the reference | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/172/builds/12806 Here is the relevant piece of the build log for the reference | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/29/builds/14106 Here is the relevant piece of the build log for the reference | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/17/builds/8251 Here is the relevant piece of the build log for the reference | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/50/builds/13760 Here is the relevant piece of the build log for the reference | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/20/builds/11478 Here is the relevant piece of the build log for the reference | 
    
| 
           LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/201/builds/4736 Here is the relevant piece of the build log for the reference | 
    
| 
           This change seems to have broken building unittests for me: This is Gentoo Linux amd64. flang-rt is configured as:  | 
    
          
 I can't seem to reproduce it, including on Gentoo amd64. Do you maybe have more details on how it breaks?  | 
    
| 
           Not really — I've just tried building the latest versions (as of 8d374f1). The most recent snapshot (  | 
    
| 
           Ah right I see, it happens for me only on a gcc standalone flang-rt build. Running check-flang-rt within a full llvm build works fine regardless of the compiler. Standalone clang works fine too. That's pretty odd, I'll try to figure out where the problem is.  | 
    
| 
           The problem is that clang is more forgiving in that it will generate the default template instantiation even if you don't explicitly ask for it, whereas gcc does not and it then errors out when you try to use it. This fixes it for me:  | 
    
Using Descriptor.Element<>() when iterating through a rank-1 array is currently inefficient, because the generic implementation suitable for arrays of any rank makes the compiler unable to perform optimisations that would make the rank-1 case considerably faster.
This is currently done inside ShallowCopy, as well as by CopyInAssign, where the implementation of elemental copies (inside Assign) is equivalent to ShallowCopyDiscontiguousToDiscontiguous.
To address that, add a DescriptorIterator abstraction specialised for arrays of various ranks, and use that throughout ShallowCopy to iterate over the arrays.
Furthermore, depending on the pointer type passed to memcpy, the optimiser can remove the memcpy calls from ShallowCopy altogether which can result in substantial performance improvements on its own. Specialise ShallowCopy for various element pointer types to make these optimisations possible.
Finally, replace the call to Assign inside CopyInAssign with a call to newly optimised ShallowCopy.
For the thornado-mini application, this reduces the runtime by 27.7%.