
Commit 4b42338

[Clang][CIR][Doc] Document CIR code duplication plans
This adds a document describing known problems with code duplication in the CIR codegen implementation, strategies to mitigate the risks caused by that code duplication, and a general long-term plan for minimizing the problem.
1 parent 6de1c25 commit 4b42338

Lines changed: 235 additions & 0 deletions
@@ -0,0 +1,235 @@
================================
ClangIR Code Duplication Roadmap
================================

.. contents::
   :local:

Introduction
============

This document describes the general approach to code duplication in the ClangIR
code generation implementation. It acknowledges specific problems with the
current implementation, discusses strategies for mitigating the risk inherent in
the current approach, and describes a general long-term plan for addressing the
issue.

Background
==========

The ClangIR code generation is very closely modeled after Clang's LLVM IR code
generation (referred to below as the classic codegen), and we intend for the CIR
produced to eventually be semantically equivalent to the LLVM IR produced when
not going through ClangIR. However, we acknowledge that while the ClangIR
implementation is under development, there will be differences in semantics,
both because we have not yet implemented all features of the classic codegen and
because the CIR dialect is still evolving and does not yet have a way to
represent all of the necessary semantics.

We have chosen to model the ClangIR code generation directly after the classic
codegen, to the point of following identical code structure, using similar
names, and often duplicating the logic, because this seemed to be the most
certain path to producing equivalent results. Having such nearly identical code
allows for direct comparison between the CIR codegen and the LLVM IR codegen to
find what is missing or incorrect in the CIR implementation.

However, we recognize that this is not a sustainable permanent solution. As
bugs are fixed and new features are added to the classic codegen, keeping the
analogous CIR code up to date will be a purely manual process.

Long term, we need a more sustainable approach.

Current Strategy
================

Practical considerations require that we make steady progress towards a working
implementation of ClangIR. This necessity is directly opposed to the goal of
minimizing code duplication.

For this reason, we have decided to accept a large amount of code duplication
in the short term, with the explicit understanding that this produces a
significant amount of technical debt as the project progresses.

As the CIR implementation is developed, we often note small pieces of code that
could be shared with the classic codegen if they were moved to a different part
of the source, such as a shared utility class in a directory available to both
codegen implementations, or into a related AST class. It is left to the
discretion of the developer and reviewers to decide whether such refactoring
should be done during the CIR development, or if it is sufficient to leave a
comment in the code indicating this as an opportunity for future improvement.
Because much of the current code is likely to change when the long-term code
sharing strategy is complete, we will lean towards only implementing
refactorings that make sense independent of the code sharing problem.

We have discussed various ways that major classes such as CGCXXABI/CIRGenCXXABI
could be refactored to allow parts of their implementation to be shared today
through inheritance and templated base classes. However, this may prove to be
wasted effort when the permanent solution is developed, so we have decided that
it is better to accept significant amounts of code duplication now, and defer
this type of refactoring until it is clear what the permanent solution will be.

Mitigation Through Testing
==========================

The most important tactic that we are using to mitigate the risk of CIR diverging
from classic codegen is to incorporate two sets of LLVM IR checks in the CIR
codegen LIT tests. One set checks the LLVM IR that is produced by first
generating CIR and then lowering that to LLVM IR. Another set checks the LLVM IR
that is produced directly by the classic codegen.
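
As a rough illustration of the two check sets described above, a test might look
like the following sketch. The target triple, the output file names, and the
check prefixes (``CIR``, ``LLVM``, ``OGCG``) are assumptions chosen for
illustration, not a required convention, and the patterns are deliberately
loose.

.. code-block:: c++

   // Illustrative sketch only: triple, file names, and prefixes are assumptions.
   // RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -fclangir -emit-cir %s -o %t.cir
   // RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
   // RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -fclangir -emit-llvm %s -o %t-cir.ll
   // RUN: FileCheck --check-prefix=LLVM --input-file=%t-cir.ll %s
   // RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -emit-llvm %s -o %t.ll
   // RUN: FileCheck --check-prefix=OGCG --input-file=%t.ll %s

   int add(int a, int b) { return a + b; }

   // Checks against the CIR itself.
   // CIR: cir.func{{.*}} @_Z3addii

   // Checks against LLVM IR produced by lowering CIR.
   // LLVM: define{{.*}} i32 @_Z3addii(

   // Checks against LLVM IR produced directly by the classic codegen.
   // OGCG: define{{.*}} i32 @_Z3addii(

Keeping the ``LLVM`` and ``OGCG`` checks side by side in one file makes any
difference between the two paths visible in review, and a change in classic
codegen output will fail only the ``OGCG`` checks.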

At the time that tests are created, we compare the LLVM IR output from these two
paths to verify (manually) that any meaningful differences between them are the
result of known missing features in the current CIR implementation. Whenever
possible, differences are corrected in the same PR in which the test is added,
updating the CIR implementation as it is being developed.

These tests also serve a second purpose: they act as sentinels that alert us to
changes in the classic codegen behavior that will need to be accounted for in
the CIR implementation. While we appreciate any help from developers
contributing to classic codegen, our current expectation is that it will be the
responsibility of the ClangIR contributors to update the CIR implementation
when these tests fail.

As the CIR implementation gets closer to the goal of IR that is semantically
equivalent to the LLVM IR produced by the classic codegen, we would like to
enhance the CIR tests to perform some automatic verification of the equivalence
of the generated LLVM IR, perhaps using a tool such as Alive2.

Eventually, we would like to be able to run all existing classic codegen tests
using the CIR path as well.

Other Considerations
====================

The close modeling of CIR after classic codegen has also meant that the CIR
dialect often represents language details at a much lower level than it ideally
should.

In the interest of having a complete working implementation of ClangIR as soon
as is practical, we have chosen to follow the classic codegen implementation
closely in the initial implementation and to raise the representation in the CIR
dialect to a higher level only when there is a clear and immediate benefit to
doing so.

Over time, we expect to progressively raise the CIR representation to a higher
level and remove low-level details, including ABI-specific handling, from the
dialect. However, having a working implementation in place makes it easier to
verify that the high-level representation and subsequent lowering are correct.

Mixing With Other Dialects
==========================

Mixing of dialects is a central design feature of MLIR. The CIR dialect is
currently more self-contained than most dialects, but even now we generate
the ACC (OpenACC) dialect in combination with CIR, and when support for OpenMP
and CUDA is added, similar mixing will occur.

We also expect CIR to be at least partially lowered to other dialects during
the optimization phase to enable features such as data dependence analysis, even
if we will eventually be lowering it to LLVM IR.

Therefore, any plan for generating LLVM IR from CIR must be integrated with the
general MLIR lowering design, which typically involves lowering to the LLVM
dialect, which is then transformed to LLVM IR.

Other Consumers of CIR and MLIR
===============================

We must also consider that we will not always be lowering CIR to LLVM IR. CIR,
usually mixed with other dialects, will also be directed to offload targets
and other code generators through interfaces that are opaque to Clang. We must
still produce semantically correct CIR for these consumers.

Long Term Vision
================

As the CIR implementation matures, we will eliminate target-specific handling
from the high-level CIR generated by Clang. The high-level CIR will then be
progressively lowered to a form that is closer to LLVM IR, including a pass
that inserts ABI-specific handling, potentially representing the target-specific
details in another dialect.

As we raise CIR to this higher-level implementation, there will naturally be
less code duplication, and less need to have the same logic repeated in the
CIR generation.

We will continue to use the same basic design and structure for CIR code
generation, with classes like CIRGenModule and CIRGenFunction that serve the
same purpose as their counterparts in classic codegen, but the handling there
will be more closely tied to core semantics and therefore less likely to require
frequent changes to stay in sync with classic codegen.

As the handling of low-level details is moved to later lowering phases, we will
need to move away from the current tight coupling of the CIR and classic codegen
implementations. As this happens, we will look for ways that this handling can
be moved to new classes that are specifically designed to be shared among
clients that are targeting different IR substrates. That is, rather than trying
to overlay reuse onto the existing implementations, we will replace relevant
parts of the existing implementation, piece by piece, as appropriate, with new
implementations that perform the same function but with a more general design.
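
To make the idea of logic shared among clients targeting different IR substrates
more concrete, the following is a minimal sketch under stated assumptions. Every
name in it (``ComplexValue``, ``emitComplexMul``, ``TextEmitter``,
``FoldingEmitter``) is hypothetical and does not exist in Clang or MLIR, and the
special-case handling that a production complex multiplication needs is omitted.

.. code-block:: c++

   #include <string>

   // Hypothetical sketch: none of these names exist in Clang or MLIR.

   // A complex value whose parts use whatever value type the client's IR uses.
   template <typename Emitter> struct ComplexValue {
     typename Emitter::Value real;
     typename Emitter::Value imag;
   };

   // Shared logic, written once and independent of the IR being produced:
   // (a + bi) * (c + di) = (ac - bd) + (ad + bc)i.
   template <typename Emitter>
   ComplexValue<Emitter> emitComplexMul(Emitter &emitter,
                                        const ComplexValue<Emitter> &lhs,
                                        const ComplexValue<Emitter> &rhs) {
     auto ac = emitter.mul(lhs.real, rhs.real);
     auto bd = emitter.mul(lhs.imag, rhs.imag);
     auto ad = emitter.mul(lhs.real, rhs.imag);
     auto bc = emitter.mul(lhs.imag, rhs.real);
     return {emitter.sub(ac, bd), emitter.add(ad, bc)};
   }

   // Client 1: a toy substrate that records each operation as text.
   struct TextEmitter {
     using Value = std::string;
     Value add(const Value &a, const Value &b) { return "(" + a + " + " + b + ")"; }
     Value sub(const Value &a, const Value &b) { return "(" + a + " - " + b + ")"; }
     Value mul(const Value &a, const Value &b) { return "(" + a + " * " + b + ")"; }
   };

   // Client 2: a toy substrate that folds each operation to a constant.
   struct FoldingEmitter {
     using Value = double;
     Value add(Value a, Value b) { return a + b; }
     Value sub(Value a, Value b) { return a - b; }
     Value mul(Value a, Value b) { return a * b; }
   };

The expansion is written once in ``emitComplexMul``; each client supplies only
the primitive operations for its own representation. This is one possible shape
for such sharing, not a description of the design that will be adopted.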

Example: C Calling Convention Handling
======================================

C calling convention handling is an example of a general purpose redesign that
is already underway. This was started independently of CIR, but it will be
directly useful for lowering from a high-level call representation in CIR to a
representation that includes the target- and calling convention-specific details
of function signatures, parameter type coercion, and so on.

The current CIR implementation duplicates most of the classic codegen logic for
function calls, but it omits several pieces that handle type coercion. This
leads to an implementation that has all of the complexity of the classic codegen
without actually achieving the goals of that complexity. It will be a
significant improvement to the CIR implementation to simplify the function call
handling in such a way that it generates a high-level representation of the
call, while preserving all information that will be needed to lower the call to
an ABI-compliant representation in a later phase of compilation.
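
The split between a high-level call and a later ABI lowering step can be
sketched as follows. This is a toy model under loud assumptions: the types
(``ArgInfo``, ``HighLevelCall``, ``LoweredArg``), the function
``lowerCallArgs``, and the 16-byte classification rule are all invented for
illustration and do not describe the calling convention library or any real
target ABI.

.. code-block:: c++

   #include <string>
   #include <vector>

   // Hypothetical sketch: these types and the rule below are invented.

   // Source-level information kept on the high-level call so that a later
   // phase can still make ABI decisions.
   struct ArgInfo {
     std::string typeName;
     unsigned sizeInBytes;
     bool isAggregate;
   };

   // High-level call: no ABI decisions have been made yet.
   struct HighLevelCall {
     std::string callee;
     std::vector<ArgInfo> args;
   };

   enum class PassKind { Direct, Indirect };

   struct LoweredArg {
     std::string typeName;
     PassKind kind;
   };

   // Later lowering phase. Toy rule (assumption): aggregates larger than
   // 16 bytes are passed indirectly through a pointer; all else is direct.
   std::vector<LoweredArg> lowerCallArgs(const HighLevelCall &call) {
     std::vector<LoweredArg> lowered;
     for (const ArgInfo &arg : call.args) {
       bool indirect = arg.isAggregate && arg.sizeInBytes > 16;
       if (indirect)
         lowered.push_back({std::string("ptr"), PassKind::Indirect});
       else
         lowered.push_back({arg.typeName, PassKind::Direct});
     }
     return lowered;
   }

The point of the sketch is only that code generation emits ``HighLevelCall``
with enough type information attached, and the target-specific decision is made
in one later place instead of being interleaved with call emission.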

This provides a clear example where trying to refactor the classic codegen in
some way to be reused by CIR would have been counterproductive. The classic
codegen implementation was tightly coupled with Clang's LLVM IR generation. The
implementation is being completely redesigned to allow general reuse, not just by
CIR, but also by other front ends.

The CIR calling convention lowering will make use of the general purpose C
calling convention library that is being created, but it should be implemented
as an MLIR transform pass on top of that library that is general enough to be
used by other dialects, such as FIR, that also need the same calling convention
handling.

Significant Areas For Improvement
=================================

The following list enumerates some of the areas where significant restructuring
of the code is needed to enable better code sharing between CIR and classic
codegen. Each of these areas is relatively self-contained in the codegen
implementation, making the path to a shared implementation relatively clear.

- C++ ABI Handling
- VTable generation
- Virtual function calls
- Constructor and destructor arguments
- Dynamic casts
- Base class address calculation
- Type descriptors
- Array new and delete
- Constant expression evaluation
- Complex multiplication and division expansion
- Builtin function handling
- Exception Handling and C++ Cleanups
- Inline assembly handling

Pervasive Low-Level Issues
==========================

This section lists some of the features where a non-trivial amount of code
is shared between CIR and classic codegen, but the handling of the feature
is distributed across the codegen implementation, making it more difficult
to design an abstraction that can easily be shared.

- Global variable and function linkage
- Alignment management
- Debug information
- TBAA handling
- Sanitizer integration
- Lifetime markers
