- name: "Implement CppInterOp API exposing memory, ownership and thread safety information"
  description: |
    Incremental compilation pipelines process code chunk-by-chunk by building
    an ever-growing translation unit. Code is then lowered into LLVM IR and
    subsequently run by the LLVM JIT. Such a pipeline allows the creation of
    efficient interpreters. The interpreter enables interactive exploration
    and makes the C++ language more user friendly. This incremental compilation
    mode is used by the interactive C++ interpreter Cling, initially developed
    to enable interactive high-energy physics analysis in a C++ environment.

    Clang and LLVM provide access to C++ from other programming languages, but
    they currently expose only the declared public interfaces of such C++ code,
    even when they have parsed implementation details directly. Both the
    high-level and the low-level program representations contain enough
    information to capture and expose more of these details and thereby improve
    language interoperability. Examples include details of memory management,
    ownership transfer, thread safety, externalized side effects, etc. For
    example, if memory is allocated and returned, the caller needs to take
    ownership; if a function is pure, its calls can be elided; if a call merely
    provides access to a data member, it can be reduced to an address lookup.

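    The ownership case illustrates the gap: cppyy users must state ownership by
    hand today. A minimal sketch (the `Widget`/`make_widget` names are invented
    for illustration, assuming cppyy is installed):

    ```python
    import cppyy

    # An invented C++ factory that returns heap memory the caller must free.
    cppyy.cppdef('''
    struct Widget { int id; };
    Widget* make_widget() { return new Widget{42}; }
    ''')

    w = cppyy.gbl.make_widget()
    # Today the user must declare ownership manually; an API exposing the
    # "caller takes ownership" detail would let cppyy set this automatically.
    w.__python_owns__ = True  # Python now deletes the Widget when w is collected
    assert w.id == 42
    ```
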
    The goal of this project is to develop APIs for CppInterOp that are capable
    of extracting and exposing such information from the AST or from JIT-ed
    code, and to use them in cppyy (Python-C++ language bindings) as an
    exemplar. If time permits, the work can be extended to persist this
    information across translation units and to use it on code compiled with
    Clang.
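    One purely hypothetical design sketch for such a language-independent
    interface (all names invented for illustration) models the exposed details
    as a per-function record that a binding layer could consult:

    ```python
    from dataclasses import dataclass
    from enum import Enum, auto

    class Ownership(Enum):
        CALLER = auto()   # caller must free the returned memory
        CALLEE = auto()   # the library retains ownership
        UNKNOWN = auto()

    @dataclass(frozen=True)
    class InteropInfo:
        qualified_name: str
        ownership: Ownership = Ownership.UNKNOWN
        is_pure: bool = False              # calls may safely be elided
        is_thread_safe: bool = False
        returns_data_member: bool = False  # call reduces to an address lookup

    # A binding layer such as cppyy could consult such records:
    info = InteropInfo("make_widget", ownership=Ownership.CALLER)
    assert info.ownership is Ownership.CALLER
    ```
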
|
  tasks: |
    * Collect and categorize the kinds of interop information that could be exposed
    * Write one or more facilities to extract the necessary implementation details
    * Design a language-independent interface to expose this information
    * Integrate the work in clang-repl and Cling
    * Implement and demonstrate its use in cppyy as an exemplar
    * Present the work at relevant meetings and conferences

- name: "Implement and improve an efficient, layered tape with prefetching capabilities"
  description: |
    In mathematics and computer algebra, automatic differentiation (AD) is a set
    of techniques to numerically evaluate the derivative of a function specified
    by a computer program. Automatic differentiation is an alternative to both
    symbolic differentiation and numerical differentiation (the method of
    finite differences). Clad is based on Clang, which provides the necessary
    facilities for code transformation. The AD library can differentiate
    non-trivial functions, find partial derivatives for trivial cases, and has
    good unit test coverage.

    The most heavily used entity in AD is a stack-like data structure called a
    tape. For example, the first-in last-out access pattern, which naturally
    occurs in the storage of intermediate values for reverse mode AD, lends
    itself towards asynchronous storage. Asynchronous prefetching of values
    during the reverse pass allows checkpoints deeper in the stack to be stored
    farther away in the memory hierarchy. Checkpointing provides a mechanism to
    parallelize segments of a function that can be executed on independent
    cores. Inserting checkpoints in these segments using separate tapes keeps
    memory local and avoids sharing memory between cores. We will research
    techniques for local parallelization of the gradient reverse pass, and
    extend them to achieve better scalability and/or lower constant overheads
    on CPUs and potentially accelerators. We will evaluate techniques for
    efficient memory use, such as multi-level checkpointing support. Combining
    already developed techniques will allow executing gradient segments across
    different cores or in heterogeneous computing systems. These techniques
    must be robust and user-friendly, and minimize required application code
    and build system changes.

    This project aims to improve the efficiency of the clad tape and generalize
    it into a tool-agnostic facility that could be used outside of clad as well.

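    A slab-based layout is one way to avoid reallocation on resize: the tape
    grows by chaining fixed-size slabs, so existing entries are never copied.
    A minimal structural sketch (invented names, not clad's actual tape, written
    in Python for brevity; a C++ version would use fixed-size arrays):

    ```python
    # Hypothetical sketch: a tape built from fixed-size slabs chained together,
    # so growth adds a new slab instead of moving old entries.
    class SlabTape:
        def __init__(self, slab_size=1024):
            self.slab_size = slab_size
            self.slabs = [[]]          # each slab holds at most slab_size entries

        def push(self, value):
            if len(self.slabs[-1]) == self.slab_size:
                self.slabs.append([])  # chain a new slab; older slabs stay put
            self.slabs[-1].append(value)

        def pop(self):                 # last-in first-out, as in the reverse pass
            if not self.slabs[-1] and len(self.slabs) > 1:
                self.slabs.pop()       # drop the exhausted top slab
            return self.slabs[-1].pop()

    t = SlabTape(slab_size=2)          # tiny slabs to exercise the chaining
    for i in range(5):
        t.push(i)
    assert [t.pop() for _ in range(5)] == [4, 3, 2, 1, 0]
    ```
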
|
  tasks: |
    * Optimize the current tape by avoiding reallocation on resize, in favor of
      connected slabs of memory
    * Enhance the existing benchmarks demonstrating the efficiency of the new tape
    * Make the tape thread safe
    * Implement a multilayer tape stored both in memory and on disk
    * [Stretch goal] Support CPU-GPU transfer of the tape
    * [Stretch goal] Add infrastructure to enable offloading checkpoints to the new tape
    * [Stretch goal] Performance benchmarks

- name: "Enabling CUDA compilation on Cppyy-Numba generated IR"
  description: |
    Cppyy is an automatic, run-time Python-C++ bindings generator for calling
    C++ from Python and Python from C++. Initial support has been added that
    allows Cppyy to hook into the high-performance Python compiler Numba,
    which compiles looped code containing C++ objects/methods/functions
    defined via Cppyy into fast machine code. Since Numba compiles the code in
    loops into machine code, it crosses the language barrier just once and
    avoids the large slowdowns that accumulate from repeated calls between the
    two languages. Numba uses its own lightweight version of the LLVM compiler
    toolkit (llvmlite) that generates an intermediate code representation
    (LLVM IR), which is also supported by the Clang compiler, itself capable of
    compiling CUDA C++ code.

    The project aims to demonstrate Cppyy's capability to provide CUDA
    paradigms to Python users without any compromise in performance. Upon
    successful completion, a proof of concept along the lines of the code
    snippet below can be expected:

    ```python
    import cppyy
    import cppyy.numba_ext
    import numba

    cppyy.cppdef('''
    __global__ void MatrixMul(float* A, float* B, float* out) {
        // kernel logic for matrix multiplication
    }
    ''')

    @numba.njit
    def run_cuda_mul(A, B, out):
        # Allocate memory for input and output arrays on the GPU
        # Define grid and block dimensions
        # Launch the kernel
        MatrixMul[griddim, blockdim](d_A, d_B, d_out)
    ```
  tasks: |
    * Add support for declaration and parsing of Cppyy-defined CUDA code in
      the Numba extension.
    * Design and develop a CUDA compilation and execution mechanism.
    * Prepare proper tests and documentation.

- name: "Cppyy STL/Eigen - Automatic conversion and plugins for Python based ML-backends"
  description: |
|
    Cppyy is an automatic, run-time Python-C++ bindings generator for calling
    C++ from Python and Python from C++. Cppyy uses pythonized wrappers of
    useful classes from libraries like STL and Eigen that allow the user to
    utilize them on the Python side. Current support covers container types in
    STL, such as std::vector, std::map, and std::tuple, and the matrix-based
    classes in Eigen/Dense. These cppyy objects can be plugged into idiomatic
    expressions that expect Python builtin types. This behaviour is achieved by
    growing pythonic methods like `__len__` while also retaining the C++
    methods like `size`.

    Efficient and automatic conversion between C++ and Python is essential for
    high-performance cross-language support. This approach eliminates
    overheads arising from iterative initialization, such as comma insertion
    in Eigen, and opens up new avenues for utilizing Cppyy's bindings in tools
    that perform numerical operations for transformations or optimization.

    The on-demand C++ infrastructure wrapped by idiomatic Python enables new
    techniques in ML tools like JAX/CUTLASS. This project puts that C++
    infrastructure at the service of users seeking high-performance library
    primitives that are unavailable in Python.

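    The pythonized-wrapper behaviour described above can be seen in a minimal
    session (assuming cppyy is installed):

    ```python
    import cppyy

    # A C++ std::vector instantiated from Python; cppyy grows pythonic
    # methods (__len__, iteration) while retaining the C++ API (size()).
    v = cppyy.gbl.std.vector[int]((1, 2, 3))
    assert v.size() == 3          # the retained C++ method
    assert len(v) == 3            # the grown pythonic __len__
    assert list(v) == [1, 2, 3]   # idiomatic iteration also works
    ```
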
  tasks: |
    * Extend STL support for std::vectors of arbitrary dimensions
    * Improve the initialization approach for Eigen classes

  status: completed
  responsible: Anubhab Ghosh
|
|
- name: "Tutorial development with clang-repl"
  description: |
    Incremental compilation pipelines process code chunk-by-chunk by building an
|
|