| 1 | +================================ |
| 2 | +ClangIR Code Duplication Roadmap |
| 3 | +================================ |
| 4 | + |
| 5 | +.. contents:: |
| 6 | + :local: |
| 7 | + |
| 8 | +Introduction |
| 9 | +============ |
| 10 | + |
| 11 | +This document describes the general approach to code duplication in the ClangIR |
| 12 | +code generation implementation. It acknowledges specific problems with the |
| 13 | +current implementation, discusses strategies for mitigating the risk inherent in |
| 14 | +the current approach, and describes a general long-term plan for addressing the |
| 15 | +issue. |
| 16 | + |
| 17 | +Background |
| 18 | +========== |
| 19 | + |
| 20 | +The ClangIR code generation is very closely modeled after Clang's LLVM IR code |
| 21 | +generation, and we intend for the CIR produced to eventually be semantically |
| 22 | +equivalent to the LLVM IR produced when not going through ClangIR. However, we |
| 23 | +acknowledge that as the ClangIR implementation is under development, there will |
| 24 | +be differences in semantics, both because we have not yet implemented all |
| 25 | +features of the classic codegen and because the CIR dialect is still evolving |
| 26 | +and does not yet have a way to represent all of the necessary semantics. |
| 27 | + |
| 28 | +We have chosen to model the ClangIR code generation directly after the classic |
| 29 | +codegen, to the point of following identical code structure, using similar names |
| 30 | +and often duplicating the logic, because this seemed to be the most certain path |
| 31 | +to producing equivalent results. Having such nearly identical code allows for |
| 32 | +direct comparison between the CIR codegen and the LLVM IR codegen to find what |
| 33 | +is missing or incorrect in the CIR implementation. |
| 34 | + |
| 35 | +However, we recognize that this is not a sustainable permanent solution. As |
| 36 | +bugs are fixed and new features are added to the classic codegen, keeping the |
| 37 | +analogous CIR code up to date will be a purely manual process. |
| 38 | + |
| 39 | +Long term, we need a more sustainable approach. |
| 40 | + |
| 41 | +Current Strategy |
| 42 | +================ |
| 43 | + |
| 44 | +Practical considerations require that we make steady progress towards a working |
| 45 | +implementation of ClangIR. This necessity is directly opposed to the goal of |
| 46 | +minimizing code duplication. |
| 47 | + |
| 48 | +For this reason, we have decided to accept a large amount of code duplication |
| 49 | +in the short term, even with the explicit understanding that this is producing |
| 50 | +a significant amount of technical debt as the project progresses. |
| 51 | + |
| 52 | +As the CIR implementation is developed, we often note small pieces of code that |
| 53 | +could be shared with the classic codegen if they were moved to a different part |
| 54 | +of the source, such as a shared utility class in some directory available to |
| 55 | +both codegen implementations or by moving the function into a related AST class. |
| 56 | +It is left to the discretion of the developer and reviewers to decide whether |
| 57 | +such refactoring should be done during the CIR development, or if it is |
| 58 | +sufficient to leave a comment in the code indicating this as an opportunity for |
| 59 | +future improvement. Because much of the current code is likely to change when |
| 60 | +the long-term code sharing strategy is complete, we will lean towards only |
| 61 | +implementing refactorings that make sense independent of the code sharing |
| 62 | +problem. |
| 63 | + |
| 64 | +We have discussed various ways that major classes such as CGCXXABI/CIRGenCXXABI |
| 65 | +could be refactored to allow parts of their implementation to be shared today |
| 66 | +through inheritance and templated base classes. However, this may prove to be |
| 67 | +wasted effort when the permanent solution is developed, so we have decided that |
| 68 | +it is better to accept significant amounts of code duplication now, and defer |
| 69 | +this type of refactoring until it is clear what the permanent solution will be. |
| 70 | + |
| 71 | +Mitigation Through Testing |
| 72 | +========================== |
| 73 | + |
| 74 | +The most important tactic that we are using to mitigate the risk of CIR diverging |
| 75 | +from classic codegen is to incorporate two sets of LLVM IR checks in the CIR |
| 76 | +codegen LIT tests. One set checks the LLVM IR that is produced by first |
| 77 | +generating CIR and then lowering that to LLVM IR. Another set checks the LLVM IR |
| 78 | +that is produced directly by the classic codegen. |
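
As a minimal sketch of what such a test can look like, the file below runs the same source through the CIR path and the classic path and checks each output under its own prefix. The file contents, the prefix names (``CIR``, ``LLVM``, ``OGCG``), and the CHECK patterns are illustrative assumptions rather than an excerpt from an existing test, and the ``-fclangir`` and ``-emit-cir`` options are assumed to be available in the compiler under test.

```cpp
// Hypothetical LIT test; the file name and CHECK patterns are illustrative.
//
// Path 1: generate CIR and check the CIR itself.
// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -fclangir -emit-cir %s -o %t.cir
// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
//
// Path 2: generate CIR, lower it to LLVM IR, and check the result.
// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -fclangir -emit-llvm %s -o %t-cir.ll
// RUN: FileCheck --check-prefix=LLVM --input-file=%t-cir.ll %s
//
// Path 3: generate LLVM IR directly with the classic codegen.
// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -emit-llvm %s -o %t.ll
// RUN: FileCheck --check-prefix=OGCG --input-file=%t.ll %s

int add(int a, int b) { return a + b; }

// The patterns below are deliberately loose so that attribute differences
// between the two paths do not cause spurious failures.

// CIR: cir.func {{.*}}@_Z3addii
// CIR: cir.binop(add,

// LLVM: define {{.*}}i32 @_Z3addii(
// LLVM: add nsw i32

// OGCG: define {{.*}}i32 @_Z3addii(
// OGCG: add nsw i32
```

Keeping the ``LLVM`` and ``OGCG`` checks side by side in one file makes any intentional difference between the two paths visible to reviewers, and a change in classic codegen output shows up as an ``OGCG`` failure in the CIR test directory.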
| 79 | + |
| 80 | +At the time that tests are created, we compare the LLVM IR output from these two |
| 81 | +paths to verify (manually) that any meaningful differences between them are the |
| 82 | +result of known missing features in the current CIR implementation. Whenever |
| 83 | +possible, differences are corrected in the same PR that adds the test, |
| 84 | +updating the CIR implementation as it is being developed. |
| 85 | + |
| 86 | +These tests also serve a second purpose. They act as sentinels to |
| 87 | +alert us to changes in the classic codegen behavior that will need to be |
| 88 | +accounted for in the CIR implementation. While we appreciate any help from |
| 89 | +developers contributing to classic codegen, our current expectation is that it |
| 90 | +will be the responsibility of the ClangIR contributors to update the CIR |
| 91 | +implementation when these tests fail. |
| 92 | + |
| 93 | +As the CIR implementation gets closer to the goal of IR that is semantically |
| 94 | +equivalent to the LLVM IR produced by the classic codegen, we would like to |
| 95 | +enhance the CIR tests to perform some automatic verification of the equivalence |
| 96 | +of the generated LLVM IR, perhaps using a tool such as Alive2. |
| 97 | + |
| 98 | +Eventually, we would like to be able to run all existing classic codegen tests |
| 99 | +using the CIR path as well. |
| 100 | + |
| 101 | +Other Considerations |
| 102 | +==================== |
| 103 | + |
| 104 | +The close modeling of CIR after classic codegen has also meant that the CIR |
| 105 | +dialect often represents language details at a much lower level than it ideally |
| 106 | +should. |
| 107 | + |
| 108 | +In the interest of having a complete working implementation of ClangIR as soon |
| 109 | +as is practical, we have chosen to take the approach of following the classic |
| 110 | +codegen implementation closely in the initial implementation and only raising |
| 111 | +the representation in the CIR dialect to a higher level when there is a clear |
| 112 | +and immediate benefit to doing so. |
| 113 | + |
| 114 | +Over time, we expect to progressively raise the CIR representation to a higher |
| 115 | +level and remove low-level details, including ABI-specific handling, from the |
| 116 | +dialect. However, having a working implementation in place makes it easier to |
| 117 | +verify that the high-level representation and subsequent lowering are correct. |
| 118 | + |
| 119 | +Mixing With Other Dialects |
| 120 | +========================== |
| 121 | + |
| 122 | +Mixing of dialects is a central design feature of MLIR. The CIR dialect is |
| 123 | +currently more self-contained than most dialects, but even now we generate |
| 124 | +the ACC (OpenACC) dialect in combination with CIR, and when support for OpenMP |
| 125 | +and CUDA is added, similar mixing will occur. |
| 126 | + |
| 127 | +We also expect CIR to be at least partially lowered to other dialects during |
| 128 | +the optimization phase to enable features such as data dependence analysis, even |
| 129 | +if we will eventually be lowering it to LLVM IR. |
| 130 | + |
| 131 | +Therefore, any plan for generating LLVM IR from CIR must be integrated with the |
| 132 | +general MLIR lowering design, which typically involves lowering to the LLVM |
| 133 | +dialect, which is then transformed to LLVM IR. |
| 134 | + |
| 135 | +Other Consumers of CIR and MLIR |
| 136 | +=============================== |
| 137 | + |
| 138 | +We must also consider that we will not always be lowering CIR to LLVM IR. CIR, |
| 139 | +usually mixed with other dialects, will also be directed to offload targets |
| 140 | +and other code generators through interfaces that are opaque to Clang. We must |
| 141 | +still produce semantically correct CIR for these consumers. |
| 142 | + |
| 143 | +Long Term Vision |
| 144 | +================ |
| 145 | + |
| 146 | +As the CIR implementation matures, we will eliminate target-specific handling |
| 147 | +from the high-level CIR generated by Clang. The high-level CIR will then be |
| 148 | +progressively lowered to a form that is closer to LLVM IR, including a pass |
| 149 | +that inserts ABI-specific handling, potentially representing the target-specific |
| 150 | +details in another dialect. |
| 151 | + |
| 152 | +As we raise CIR to this higher level implementation, there will naturally be |
| 153 | +less code duplication, and less need to have the same logic repeated in the |
| 154 | +CIR generation. |
| 155 | + |
| 156 | +We will continue to use the same basic design and structure for CIR code |
| 157 | +generation, with classes like CIRGenModule and CIRGenFunction that serve the |
| 158 | +same purpose as their counterparts in classic codegen, but the handling there |
| 159 | +will be more closely tied to core semantics and therefore less likely to require |
| 160 | +frequent changes to stay in sync with classic codegen. |
| 161 | + |
| 162 | +As the handling of low-level details is moved to later lowering phases, we will |
| 163 | +need to move away from the current tight coupling of the CIR and classic codegen |
| 164 | +implementations. As this happens, we will look for ways that this handling can |
| 165 | +be moved to new classes that are specifically designed to be shared among |
| 166 | +clients that are targeting different IR substrates. That is, rather than trying |
| 167 | +to overlay reuse onto the existing implementations, we will replace relevant |
| 168 | +parts of the existing implementation, piece by piece, as appropriate, with new |
| 169 | +implementations that perform the same function but with a more general design. |
| 170 | + |
| 171 | +Example: C Calling Convention Handling |
| 172 | +====================================== |
| 173 | + |
| 174 | +C calling convention handling is an example of a general purpose redesign that |
| 175 | +is already underway. This was started independently of CIR, but it will be |
| 176 | +directly useful for lowering from high-level call representation in CIR to a |
| 177 | +representation that includes the target- and calling convention-specific details |
| 178 | +of function signatures, parameter type coercion, and so on. |
| 179 | + |
| 180 | +The current CIR implementation duplicates most of the classic codegen logic |
| 181 | +for function call handling, but it omits several pieces that handle type |
| 182 | +coercion. This leads to an implementation that has all of the complexity of the |
| 183 | +classic codegen without actually achieving the goals of that complexity. It will |
| 184 | +be a significant improvement to the CIR implementation to simplify the function |
| 185 | +call handling in such a way that it generates a high-level representation of the |
| 186 | +call, while preserving all information that will be needed to lower the call to |
| 187 | +an ABI-compliant representation in a later phase of compilation. |
| 188 | + |
| 189 | +This provides a clear example where trying to refactor the classic codegen in |
| 190 | +some way to be reused by CIR would have been counterproductive. The classic |
| 191 | +codegen implementation was tightly coupled with Clang's LLVM IR generation. The |
| 192 | +implementation is being completely redesigned to allow general reuse, not just by |
| 193 | +CIR, but also by other front ends. |
| 194 | + |
| 195 | +The CIR calling convention lowering will make use of the general-purpose C |
| 196 | +calling convention library that is being created, and it should be implemented |
| 197 | +as an MLIR transform pass on top of that library that is general enough to be |
| 198 | +used by other dialects, such as FIR, that also need the same calling convention handling. |
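
To make the layering concrete, the sketch below separates an IR-independent classification step from the consumers that act on its result. Every name in it (``TypeKind``, ``AbstractType``, ``ArgPassing``, ``classifyArgument``, ``lowerSignature``) is hypothetical, and the size thresholds are toy rules chosen for illustration. It does not reflect any real ABI or the API of the library under development.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical IR-independent description of a parameter type. A real
// library would carry far more detail (alignment, field layout, and so on).
enum class TypeKind { Integer, Float, Aggregate };

struct AbstractType {
  TypeKind kind;
  std::uint64_t sizeInBits;
};

// How an argument is passed, expressed without reference to LLVM IR or MLIR.
enum class ArgPassing { Direct, Extend, Indirect };

// Toy classification rules (assumptions, not a real target ABI):
// large aggregates go through memory, narrow integers are extended,
// and everything else is passed directly.
ArgPassing classifyArgument(const AbstractType &ty) {
  if (ty.kind == TypeKind::Aggregate && ty.sizeInBits > 128)
    return ArgPassing::Indirect;
  if (ty.kind == TypeKind::Integer && ty.sizeInBits < 32)
    return ArgPassing::Extend;
  return ArgPassing::Direct;
}

// Each consumer supplies its own callback that turns a classification into
// whatever its IR substrate needs; the classification logic is written once.
template <typename EmitFn>
std::vector<std::string>
lowerSignature(const std::vector<AbstractType> &params, EmitFn emit) {
  std::vector<std::string> lowered;
  lowered.reserve(params.size());
  for (const AbstractType &ty : params)
    lowered.push_back(emit(ty, classifyArgument(ty)));
  return lowered;
}
```

In this arrangement, an MLIR pass and a direct LLVM IR generator would each pass their own callback to ``lowerSignature``, so neither needs a private copy of the classification rules.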
| 199 | + |
| 200 | +Significant Areas For Improvement |
| 201 | +================================= |
| 202 | + |
| 203 | +The following list enumerates some of the areas where significant restructuring |
| 204 | +of the code is needed to enable better code sharing between CIR and classic |
| 205 | +codegen. Each of these areas is relatively self-contained in the codegen |
| 206 | +implementation, making the path to a shared implementation relatively clear. |
| 207 | + |
| 208 | +C++ ABI Handling |
| 209 | + VTable generation |
| 210 | + Virtual function calls |
| 211 | + Constructor and destructor arguments |
| 212 | + Dynamic casts |
| 213 | + Base class address calculation |
| 214 | + Type descriptors |
| 215 | + Array new and delete |
| 216 | +Constant expression evaluation |
| 217 | +Complex multiplication and division expansion |
| 218 | +Builtin function handling |
| 219 | +Exception Handling and C++ Cleanups |
| 220 | +Inline assembly handling |
| 221 | + |
| 222 | +Pervasive Low-Level Issues |
| 223 | +========================== |
| 224 | + |
| 225 | +This section lists some of the features where a non-trivial amount of code |
| 226 | +is shared between CIR and classic codegen, but the handling of the feature |
| 227 | +is distributed across the codegen implementation, making it more difficult |
| 228 | +to design an abstraction that can easily be shared. |
| 229 | + |
| 230 | +Global variable and function linkage |
| 231 | +Alignment management |
| 232 | +Debug information |
| 233 | +TBAA handling |
| 234 | +Sanitizer integration |
| 235 | +Lifetime markers |