Commit c4afdf5

Add Clad/CppInterOp project and fix openprojectlist.yml (#149)
1 parent 55fcf42 commit c4afdf5

1 file changed: _data/openprojectlist.yml (+133 −91 lines)
@@ -1,66 +1,141 @@
-- name: "Enable CUDA compilation on Cppyy-Numba generated IR"
+- name: "Implement CppInterOp API exposing memory, ownership and thread safety information"
   description: |
-    Cppyy is an automatic, run-time, Python-C++ bindings generator, for calling
-    C++ from Python and Python from C++. Initial support has been added that
-    allows Cppyy to hook into the high-performance Python compiler,
-    Numba which compiles looped code containing C++ objects/methods/functions
-    defined via Cppyy into fast machine code. Since Numba compiles the code in
-    loops into machine code it crosses the language barrier just once and avoids
-    large slowdowns accumulating from repeated calls between the two languages.
-    Numba uses its own lightweight version of the LLVM compiler toolkit (llvmlite)
-    that generates an intermediate code representation (LLVM IR) which is also
-    supported by the Clang compiler capable of compiling CUDA C++ code.
-
-    The project aims to demonstrate Cppyy's capability to provide CUDA paradigms to
-    Python users without any compromise in performance. Upon successful completion
-    a possible proof-of-concept can be expected in the below code snippet -
-
-    ```python
-    import cppyy
-    import cppyy.numba_ext
-
-    cppyy.cppdef('''
-    __global__ void MatrixMul(float* A, float* B, float* out) {
-        // kernel logic for matrix multiplication
-    }
-    ''')
-
-    @numba.njit
-    def run_cuda_mul(A, B, out):
-        # Allocate memory for input and output arrays on GPU
-        # Define grid and block dimensions
-        # Launch the kernel
-        MatrixMul[griddim, blockdim](d_A, d_B, d_out)
-    ```
+    Incremental compilation pipelines process code chunk-by-chunk by building
+    an ever-growing translation unit. Code is then lowered into the LLVM IR
+    and subsequently run by the LLVM JIT. Such a pipeline allows the creation
+    of efficient interpreters. The interpreter enables interactive exploration
+    and makes the C++ language more user friendly. The incremental compilation
+    mode is used by the interactive C++ interpreter, Cling, initially developed
+    to enable interactive high-energy physics analysis in a C++ environment.
+
+    Clang and LLVM provide access to C++ from other programming languages,
+    but currently only expose the declared public interfaces of such C++
+    code even when they have parsed implementation details directly. Both the
+    high-level and the low-level program representations have enough information
+    to capture and expose more of such details to improve language
+    interoperability. Examples include details of memory management, ownership
+    transfer, thread safety, externalized side effects, etc. For example, if
+    memory is allocated and returned, the caller needs to take ownership; if a
+    function is pure, it can be elided; if a call provides access to a data
+    member, it can be reduced to an address lookup.
+
+    The goal of this project is to develop APIs for CppInterOp that are capable
+    of extracting and exposing such information from the AST or from JIT-ed
+    code, and to use them in cppyy (Python-C++ language bindings) as an
+    exemplar. If time permits, extend the work to persist this information
+    across translation units and use it on code compiled with Clang.
+
+  tasks: |
+    * Collect and categorize possible exposed interop information kinds
+    * Write one or more facilities to extract necessary implementation details
+    * Design a language-independent interface to expose this information
+    * Integrate the work in clang-repl and Cling
+    * Implement and demonstrate its use in cppyy as an exemplar
+    * Present the work at the relevant meetings and conferences.
+
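A language-independent interface for such interop facts could take many shapes. The sketch below is a hypothetical Python model, not part of CppInterOp or cppyy: all names, including the mangled symbols, are invented here purely to illustrate how per-function ownership, purity, and thread-safety facts might be recorded and queried by a binding generator.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InteropInfo:
    """Hypothetical record of implementation details exposed for one C++ function."""
    returns_ownership: bool  # caller must adopt/free the returned object
    is_pure: bool            # no side effects: calls may be elided or cached
    is_thread_safe: bool     # safe to call concurrently without locking

# Illustrative registry mapping mangled function names to their interop facts,
# as a binding generator like cppyy might consume them (symbols are made up).
registry = {
    "_Z11make_widgetv": InteropInfo(returns_ownership=True, is_pure=False,
                                    is_thread_safe=False),
    "_Z6squarei": InteropInfo(returns_ownership=False, is_pure=True,
                              is_thread_safe=True),
}

def caller_must_own(mangled):
    """True if the calling language should take ownership of the result."""
    info = registry.get(mangled)
    return info is not None and info.returns_ownership
```

A binding layer could consult `caller_must_own` at wrapper-generation time to decide whether the Python proxy deletes the underlying C++ object when garbage-collected.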
+- name: "Implement and improve an efficient, layered tape with prefetching capabilities"
+  description: |
+    In mathematics and computer algebra, automatic differentiation (AD) is a set
+    of techniques to numerically evaluate the derivative of a function specified
+    by a computer program. Automatic differentiation is an alternative technique
+    to symbolic differentiation and numerical differentiation (the method of
+    finite differences). Clad is based on Clang, which provides the necessary
+    facilities for code transformation. The AD library can differentiate
+    non-trivial functions, find partial derivatives for trivial cases, and has
+    good unit test coverage.
+
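To make the reverse-mode setting concrete, here is a minimal, self-contained Python sketch (a toy model, not Clad's implementation): the forward pass records intermediate values on a tape, and the reverse pass pops them first-in last-out to accumulate the derivative of f(x) = (x + 2) * x.

```python
def f_and_grad(x):
    """Compute f(x) = (x + 2) * x and df/dx via a recording tape."""
    tape = []  # stack of (operation, operands) needed by the reverse pass

    # Forward pass: evaluate and record intermediates.
    a = x + 2
    tape.append(("add", x))     # record: a = x + 2
    y = a * x
    tape.append(("mul", a, x))  # record: y = a * x

    # Reverse pass: pop the tape first-in last-out, accumulating adjoints.
    grads = {"x": 0.0, "a": 0.0}
    dy = 1.0
    op = tape.pop()             # ("mul", a, x): dy/da = x, dy/dx = a
    grads["a"] += dy * op[2]
    grads["x"] += dy * op[1]
    op = tape.pop()             # ("add", x): da/dx = 1
    grads["x"] += grads["a"] * 1.0
    return y, grads["x"]
```

Since f(x) = x^2 + 2x, the derivative is 2x + 2; `f_and_grad(3.0)` returns `(15.0, 8.0)`.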
+    The most heavily used entity in AD is a stack-like data structure called a
+    tape. For example, the first-in last-out access pattern, which naturally
+    occurs in the storage of intermediate values for reverse-mode AD, lends
+    itself towards asynchronous storage. Asynchronous prefetching of values
+    during the reverse pass allows checkpoints deeper in the stack to be stored
+    furthest away in the memory hierarchy. Checkpointing provides a mechanism to
+    parallelize segments of a function that can be executed on independent cores.
+    Inserting checkpoints in these segments using separate tapes keeps the
+    memory local and avoids sharing memory between cores. We will research
+    techniques for local parallelization of the gradient reverse pass, and extend
+    them to achieve better scalability and/or lower constant overheads on CPUs
+    and potentially accelerators. We will evaluate techniques for efficient
+    memory use, such as multi-level checkpointing support. Combining already
+    developed techniques will allow executing gradient segments across different
+    cores or in heterogeneous computing systems. These techniques must be robust
+    and user-friendly, and minimize required application code and build system
+    changes.
+
+    This project aims to improve the efficiency of the clad tape and generalize
+    it into a tool-agnostic facility that could be used outside of clad as well.
+
+  tasks: |
+    * Optimize the current tape by avoiding re-allocation on resize in favor of
+      using connected slabs of memory
+    * Enhance existing benchmarks demonstrating the efficiency of the new tape
+    * Make the tape thread-safe
+    * Implement a multilayer tape stored in memory and on disk
+    * [Stretch goal] Support CPU-GPU transfer of the tape
+    * [Stretch goal] Add infrastructure to enable checkpointing offload to the new tape
+    * [Stretch goal] Add performance benchmarks
+
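The first task above, replacing a reallocate-on-resize buffer with connected slabs, can be sketched in Python (a toy model, not the actual clad tape): each slab is a fixed-size block, so growing the tape links a new slab instead of copying existing entries.

```python
class SlabTape:
    """Toy stack-like tape built from connected fixed-size slabs.

    Growing the tape allocates a new slab rather than reallocating and
    copying the whole buffer, so entries already on the tape never move.
    """

    def __init__(self, slab_size=1024):
        self.slab_size = slab_size
        self.slabs = [[None] * slab_size]  # connected slabs of storage
        self.top = 0                       # number of live entries

    def push(self, value):
        slab, offset = divmod(self.top, self.slab_size)
        if slab == len(self.slabs):        # current slabs full: link a new one
            self.slabs.append([None] * self.slab_size)
        self.slabs[slab][offset] = value
        self.top += 1

    def pop(self):
        if self.top == 0:
            raise IndexError("pop from empty tape")
        self.top -= 1
        slab, offset = divmod(self.top, self.slab_size)
        return self.slabs[slab][offset]
```

A reverse pass pops entries first-in last-out; a layered variant could spill cold slabs to disk and prefetch them back asynchronously, which is the kind of generalization this project targets.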
+- name: "Enabling CUDA compilation on Cppyy-Numba generated IR"
+  description: |
+    Cppyy is an automatic, run-time, Python-C++ bindings generator, for calling
+    C++ from Python and Python from C++. Initial support has been added that
+    allows Cppyy to hook into the high-performance Python compiler, Numba,
+    which compiles looped code containing C++ objects/methods/functions
+    defined via Cppyy into fast machine code. Since Numba compiles the code in
+    loops into machine code, it crosses the language barrier just once and avoids
+    large slowdowns accumulating from repeated calls between the two languages.
+    Numba uses its own lightweight version of the LLVM compiler toolkit (llvmlite)
+    that generates an intermediate code representation (LLVM IR), which is also
+    supported by the Clang compiler, capable of compiling CUDA C++ code.
+
+    The project aims to demonstrate Cppyy's capability to provide CUDA paradigms to
+    Python users without any compromise in performance. Upon successful completion,
+    a proof of concept along the lines of the snippet below can be expected:
+
+    ```python
+    import cppyy
+    import cppyy.numba_ext
+    import numba
+
+    cppyy.cppdef('''
+    __global__ void MatrixMul(float* A, float* B, float* out) {
+        // kernel logic for matrix multiplication
+    }
+    ''')
+
+    @numba.njit
+    def run_cuda_mul(A, B, out):
+        # Allocate memory for input and output arrays on GPU
+        # Define grid and block dimensions
+        # Launch the kernel
+        MatrixMul[griddim, blockdim](d_A, d_B, d_out)
+    ```
   tasks: |
-    * Add support for declaration and parsing of Cppyy-defined CUDA code on
-      the Numba extension.
-    * Design and develop a CUDA compilation and execution mechanism.
-    * Prepare proper tests and documentation.
+    * Add support for declaration and parsing of Cppyy-defined CUDA code in
+      the Numba extension.
+    * Design and develop a CUDA compilation and execution mechanism.
+    * Prepare proper tests and documentation.

 - name: "Cppyy STL/Eigen - Automatic conversion and plugins for Python based ML-backends"
   description: |
-    Cppyy is an automatic, run-time, Python-C++ bindings generator, for calling
-    C++ from Python and Python from C++. Cppyy uses pythonized wrappers of useful
-    classes from libraries like STL and Eigen that allow the user to utilize them
-    on the Python side. Current support follows container types in STL like
-    std::vector, std::map, and std::tuple and the Matrix-based classes in
-    Eigen/Dense. These cppyy objects can be plugged into idiomatic expressions
-    that expect Python builtin-types. This behaviour is achieved by growing
-    pythonistic methods like `__len__` while also retaining its C++ methods
-    like `size`.
-
-    Efficient and automatic conversion between C++ and Python is essential
-    towards high-performance cross-language support. This approach eliminates
-    overheads arising from iterative initialization such as comma insertion in
-    Eigen. This opens up new avenues for the utilization of Cppyy's bindings in
-    tools that perform numerical operations for transformations, or optimization.
-
-    The on-demand C++ infrastructure wrapped by idiomatic Python enables new
-    techniques in ML tools like JAX/CUTLASS. This project allows the C++
-    infrastructure to be plugged into at service to the users seeking
-    high-performance library primitives that are unavailable in Python.
-
+    Cppyy is an automatic, run-time, Python-C++ bindings generator, for calling
+    C++ from Python and Python from C++. Cppyy uses pythonized wrappers of useful
+    classes from libraries like STL and Eigen that allow the user to utilize them
+    on the Python side. Current support covers STL container types like
+    std::vector, std::map, and std::tuple, and the Matrix-based classes in
+    Eigen/Dense. These cppyy objects can be plugged into idiomatic expressions
+    that expect Python built-in types. This behaviour is achieved by growing
+    pythonic methods like `__len__` while also retaining the C++ methods
+    like `size`.
+
+    Efficient and automatic conversion between C++ and Python is essential
+    for high-performance cross-language support. This approach eliminates
+    overheads arising from iterative initialization, such as comma insertion in
+    Eigen. It opens up new avenues for the utilization of Cppyy's bindings in
+    tools that perform numerical operations for transformations or optimization.
+
+    The on-demand C++ infrastructure wrapped by idiomatic Python enables new
+    techniques in ML tools like JAX/CUTLASS. This project makes that C++
+    infrastructure available to users seeking high-performance library
+    primitives that are unavailable in Python.
+
   tasks: |
     * Extend STL support for std::vectors of arbitrary dimensions
     * Improve the initialization approach for Eigen classes
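The `__len__`-plus-`size` behaviour described above can be modeled in plain Python (a toy stand-in for illustration, not cppyy's actual machinery): a wrapper grows the pythonic protocol methods while keeping the C++-style names.

```python
class PyVec:
    """Toy stand-in for a pythonized C++ std::vector binding.

    It keeps the C++-style API (size, push_back) while also growing the
    Python protocols (__len__, __getitem__), so the object works in
    idiomatic expressions that expect a built-in sequence.
    """

    def __init__(self, values=()):
        self._data = list(values)

    # C++-style interface, as the wrapped class exposes it
    def push_back(self, value):
        self._data.append(value)

    def size(self):
        return len(self._data)

    # Grown pythonic protocol methods
    def __len__(self):
        return self.size()

    def __getitem__(self, index):
        return self._data[index]
```

With both surfaces present, `len(v)` and `v.size()` agree, and `sum(v)` works because `__getitem__` makes the object iterable.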
@@ -230,39 +305,6 @@
   status: completed
   responsible: Anubhab Ghosh

-- name: "Implement libInterOp API exposing memory, ownership and thread safety information"
-  description: |
-    Incremental compilation pipelines process code chunk-by-chunk by building an
-    ever-growing translation unit. Code is then lowered into the LLVM IR and
-    subsequently run by the LLVM JIT. Such a pipeline allows creation of
-    efficient interpreters. The interpreter enables interactive exploration and
-    makes the C++ language more user friendly. The incremental compilation mode
-    is used by the interactive C++ interpreter, Cling, initially developed to
-    enable interactive high-energy physics analysis in a C++ environment.
-
-    Clang and LLVM provide access to C++ from other programming languages, but
-    currently only exposes the declared public interfaces of such C++ code
-    even when it has parsed implementation details directly. Both the high-level
-    and the low-level program representation has enough information to capture
-    and expose more of such details to improve language interoperability.
-    Examples include details of memory management, ownership transfer, thread
-    safety, externalized side-effects, etc. For example, if memory is allocated
-    and returned, the caller needs to take ownership; if a function is pure, it
-    can be elided; if a call provides access to a data member, it can be reduced
-    to an address lookup. The goal of this project is to develop API for
-    libInterOp which are capable of extracting and exposing such information AST
-    or from JIT-ed code and use it in cppyy (Python-C++ language bindings) as an
-    exemplar. If time permits, extend the work to persistify this information
-    across translation units and use it on code compiled with Clang.
-  tasks: |
-    There are several foreseen tasks:
-    * Collect and categorize possible exposed interop information kinds
-    * Write one or more facilities to extract necessary implementation details
-    * Design a language-independent interface to expose this information
-    * Integrate the work in clang-repl and Cling
-    * Implement and demonstrate its use in cppyy as an exemplar
-    * Present the work at the relevant meetings and conferences.

- name: "Tutorial development with clang-repl"
267309
description: |
268310
Incremental compilation pipelines process code chunk-by-chunk by building an
