Commit c87e7a2

aaronj0 authored and vgvassilev committed
Add latest projects from GSoC CR and LLVM
1 parent fdef05f commit c87e7a2

1 file changed

+289
-1
lines changed

_data/openprojectlist.yml

Lines changed: 289 additions & 1 deletion
@@ -1,3 +1,86 @@
1+
- name: "Integrate a Large Language Model with the xeus-cpp Jupyter kernel"
2+
description: |
3+
xeus-cpp is a Jupyter kernel for C++ based on the native implementation
4+
of the Jupyter protocol xeus. This enables users to write and execute
5+
C++ code interactively, seeing the results immediately. This REPL
6+
(read-eval-print-loop) nature allows rapid prototyping and iterations
7+
without the overhead of compiling and running separate C++ programs.
8+
This also achieves C++ and Python integration within a single Jupyter
9+
environment.
10+
11+
This project aims to integrate a large language model, such as Bard/Gemini,
12+
with the xeus-cpp Jupyter kernel. This integration will enable users to
13+
interactively generate and execute code in C++ leveraging the assistance
14+
of the language model. Upon successful integration, users will have access
15+
to features such as code autocompletion, syntax checking, semantic
16+
understanding, and even code generation based on natural language prompts.
17+
18+
tasks: |
19+
* Design and implement mechanisms to interface the large language model with the xeus-cpp kernel. Jupyter-AI might be used as a motivating example.
20+
* Develop functionalities within the kernel to utilize the language model for code generation based on natural language descriptions and suggestions for autocompletion.
21+
* Comprehensive documentation and thorough testing/CI additions to ensure reliability.
22+
* [Stretch Goal] After achieving the previous milestones, the student can work on specializing the model for enhanced syntax and semantic understanding capabilities by using xeus notebooks as datasets.
23+
24+
25+
- name: "Implementing missing features in xeus-cpp"
26+
description: |
27+
xeus-cpp is a Jupyter kernel for C++ based on the native implementation
28+
of the Jupyter protocol xeus. This enables users to write and execute
29+
C++ code interactively, seeing the results immediately. This REPL
30+
(read-eval-print-loop) nature allows rapid prototyping and iterations
31+
without the overhead of compiling and running separate C++ programs.
32+
This also achieves C++ and Python integration within a single Jupyter
33+
environment.
34+
35+
xeus-cpp is a successor to xeus-clang-repl and xeus-cling. The project
36+
goal is to bring its feature support up to the level of what’s
37+
supported in xeus-clang-repl and xeus-cling.
38+
39+
tasks: |
40+
* Fix occasional bugs in clang-repl directly in upstream LLVM
41+
* Implement the value printing logic
42+
* Advance the wasm infrastructure
43+
* Write tutorials and demonstrators
44+
* Complete the transition of xeus-clang-repl to xeus-cpp
45+
46+
47+
- name: "Adoption of CppInterOp in ROOT"
48+
description: |
49+
Incremental compilation pipelines process code chunk-by-chunk by building
50+
an ever-growing translation unit. Code is then lowered into the LLVM IR
51+
and subsequently run by the LLVM JIT. Such a pipeline allows creation of
52+
efficient interpreters. The interpreter enables interactive exploration
53+
and makes the C++ language more user friendly. The incremental compilation
54+
mode is used by the interactive C++ interpreter, Cling, initially developed
55+
to enable interactive high-energy physics analysis in a C++ environment.
56+
The CppInterOp library provides a minimalist approach for other languages
57+
to identify C++ entities (variables, classes, etc.). This enables
58+
interoperability with C++ code, bringing the speed and efficiency of C++
59+
to simpler, more interactive languages like Python. CppInterOp provides
60+
primitives that are good for providing reflection information.
61+
62+
ROOT is an open-source data analysis framework used by high-energy
63+
physics and other fields to analyze petabytes of data scientifically. The
64+
framework provides support for data storage and processing by relying
65+
on Cling, Clang, and LLVM to automatically build an efficient I/O
66+
representation of the necessary C++ objects. The I/O properties of each
67+
object are described in a compilable C++ file called a /dictionary/.
68+
ROOT’s I/O dictionary system relies on reflection information provided
69+
by Cling and Clang. However, the reflection information system has grown
70+
organically, and ROOT’s core/metacling system has become hard to maintain
71+
and integrate.
72+
73+
The goal of this project is to integrate CppInterOp in ROOT where possible.
74+
75+
tasks: |
76+
* Complete the infrastructure items needed to achieve this goal, such as Windows support and WASM support
77+
* Make reusable GitHub Actions across multiple repositories
78+
* Sync the state of the dynamic library manager with the one in ROOT
79+
* Sync the state of callfunc/jitcall with the one in ROOT
80+
* Prepare the infrastructure for upstreaming to LLVM
81+
* Propose an RFC and make a presentation to the ROOT development team
82+
83+
184
- name: "Implement CppInterOp API exposing memory, ownership and thread safety information "
285
description: |
386
Incremental compilation pipelines process code chunk-by-chunk by building
@@ -82,7 +165,7 @@
82165
defined via Cppyy into fast machine code. Since Numba compiles the code in
83166
loops into machine code it crosses the language barrier just once and avoids
84167
large slowdowns accumulating from repeated calls between the two languages.
85-
Numba uses its own lightweight version of the LLVM compiler toolkit (llvmlite)
168+
Numba uses its own lightweight version of the LLVM compiler toolkit ([llvmlite](https://github.com/numba/llvmlite))
86169
that generates an intermediate code representation (LLVM IR) which is also
87170
supported by the Clang compiler capable of compiling CUDA C++ code.
88171
@@ -146,6 +229,211 @@
146229
* Work on integrating these plugins with toolkits like CUTLASS that
147230
utilise the bindings to provide a Python API
148231
232+
- name: "Improve the LLVM.org Website Look and Feel"
233+
description: |
234+
The llvm.org website serves as the central hub for information about the
235+
LLVM project, encompassing project details, current events, and relevant
236+
resources. Over time, the website has evolved organically, prompting the
237+
need for a redesign to enhance its modernity, structure, and ease of
238+
maintenance.
239+
240+
The goal of this project is to create a contemporary and coherent static
241+
website that reflects the essence of LLVM.org. This redesign aims to improve
242+
navigation, taxonomy, content discoverability, and overall usability. Given
243+
the critical role of the website in the community, efforts will be made to
244+
engage with community members, seeking consensus on the proposed changes.
245+
246+
LLVM's [current website](https://llvm.org) is a complicated mesh of uncoordinated pages with
247+
inconsistent, static links pointing to both internal and external sources.
248+
The website has grown substantially and haphazardly since its inception.
249+
250+
It requires a major UI and UX overhaul to be able to better serve the LLVM
251+
community.
252+
253+
Based on a preliminary site audit, the following are some of the problem areas
254+
that need to be addressed.
255+
256+
**Sub-Sites**: Many of the sections/sub-sites have a completely different UI/UX
257+
(e.g., [main](https://llvm.org), [clang](https://clang.llvm.org),
258+
[lists](https://lists.llvm.org/cgi-bin/mailman/listinfo),
259+
[foundation](https://foundation.llvm.org),
260+
[circt](https://circt.llvm.org/docs/GettingStarted/),
261+
[lnt](http://lnt.llvm.org), and [docs](https://llvm.org/docs)).
262+
Sub-sites are divided into 8 separate repos and use different technologies
263+
including [Hugo](https://github.com/llvm/circt-www/blob/main/website/config.toml),
264+
[Jekyll](https://github.com/llvm/clangd-www/blob/main/_config.yml), etc.
265+
266+
**Navigation**: On-page navigation is inconsistent and confusing. Cross-sub-site
267+
navigation is inconsistent, unintuitive, and sometimes non-existent. Important
268+
subsections often depend on static links within (seemingly random) pages.
269+
Multi-word menu items are center-aligned and flow out of margins.
270+
271+
**Pages**: Many [large write-ups](https://clang.llvm.org/docs/UsersManual.html)
272+
lack pagination, section boundaries, etc., making
273+
them seem more intimidating than they really are. Several placeholder pages
274+
re-route to [3rd party services](https://llvm.swoogo.com/2023devmtg),
275+
adding bloat and inconsistency.
276+
277+
**Search**: Search options are placed in unintuitive locations, like the bottom
278+
of the side panel, or from [static links](https://llvm.org/docs/) to
279+
[redundant pages](https://llvm.org/docs/search.html). Some pages have
280+
no search options at all. With multiple sections of the website hosted in
281+
separate projects/repos, cross-sub-site search doesn't seem possible.
282+
283+
**Expected results**: A modern, coherent-looking website that attracts new
284+
prospective users and empowers the existing community with better navigation,
285+
taxonomy, content discoverability, and overall usability. It should also
286+
include a more descriptive Contribution Guide ([example](https://kitian616.github.io/jekyll-TeXt-theme/docs/en/layouts)) to help novice
287+
contributors, as well as to help maintain a coherent site structure.
288+
289+
Since the website is critical infrastructure and most of the community
290+
will have an opinion, this project should try to engage with the community,
291+
building community consensus on the steps being taken.
292+
293+
tasks: |
294+
* Conduct a comprehensive content audit of the existing website.
295+
* Select appropriate technologies, preferably static site generators like
296+
Hugo or Jekyll.
297+
* Advocate for a separation of data and visualization, utilizing formats such
298+
as YAML and Markdown to facilitate content management without direct HTML
299+
coding.
300+
* Present three design mockups for the new website, fostering open discussions
301+
and allowing time for alternative proposals from interested parties.
302+
* Implement the chosen design, incorporating valuable feedback from the
303+
community.
304+
* Collaborate with content creators to integrate or update content as needed.
305+
306+
The successful candidate should commit to regular participation in weekly
307+
meetings, deliver presentations, and contribute blog posts as requested.
308+
Additionally, they should demonstrate the ability to navigate the community
309+
process with patience and understanding.
310+
311+
312+
- name: "On Demand Parsing in Clang"
313+
description: |
314+
Clang, like any C++ compiler, parses a sequence of characters as they appear,
315+
linearly. The linear character sequence is then turned into tokens and AST
316+
before lowering to machine code. In many cases the end-user code uses a small
317+
portion of the C++ entities from the entire translation unit but the user
318+
still pays the price for compiling all of the redundancies.
319+
320+
This project proposes to process expensive-to-compile C++ entities upon using
321+
them rather than eagerly. This approach is already adopted in Clang’s CodeGen
322+
where it allows Clang to produce code only for what is being used. On demand
323+
compilation is expected to significantly reduce the compilation peak memory
324+
and improve the compile time for translation units which sparsely use their
325+
contents. In addition, that would have a significant impact on interactive
326+
C++ where header inclusion essentially becomes a no-op and entities will be
327+
only parsed on demand.
328+
329+
The Cling interpreter implements a very naive but efficient cross-translation
330+
unit lazy compilation optimization which scales across hundreds of libraries
331+
in the field of high-energy physics.
332+
```cpp
// A.h
#include <string>
#include <vector>
template <class T, class U = int> struct AStruct {
  void doIt() { /*...*/ }
  const char* data;
  // ...
};

template<class T, class U = AStruct<T>>
inline void freeFunction() { /* ... */ }
inline void doit(unsigned N = 1) { /* ... */ }

// Main.cpp
#include "A.h"
int main() {
  doit();
  return 0;
}
```
354+
355+
This pathological example expands to 37253 lines of code to process. Cling
356+
builds an index (which it calls an autoloading map) that contains only
357+
forward declarations of these C++ entities. Their size is 3000 lines of code.
358+
359+
The index looks like:
360+
```cpp
// A.h.index
namespace std{inline namespace __1{template <class _Tp, class _Allocator> class __attribute__((annotate("$clingAutoload$vector"))) __attribute__((annotate("$clingAutoload$A.h"))) __vector_base;
}}
...
template <class T, class U = int> struct __attribute__((annotate("$clingAutoload$A.h"))) AStruct;
```
368+
369+
Upon requiring the complete type of an entity, Cling includes the relevant
370+
header file to get it. There are several trivial workarounds to deal with
371+
default arguments and default template arguments as they now appear on the
372+
forward declaration and then the definition. You can read more [here](https://github.com/root-project/root/blob/master/README/README.CXXMODULES.md#header-parsing-in-root).
373+
374+
Although the implementation could not be called a reference implementation,
375+
it shows that the Parser and the Preprocessor of Clang are relatively stateless
376+
and can be used to process character sequences which are not linear in their
377+
nature. In particular namespace-scope definitions are relatively easy to handle
378+
and it is not very difficult to return to namespace-scope when we lazily parse
379+
something. For other contexts such as local classes we will have lost some
380+
essential information such as name lookup tables for local entities. However,
381+
these cases are probably not very interesting as the lazy parsing granularity
382+
is probably worth doing only for top-level entities.
383+
384+
Such an implementation can help with already existing issues in the standard such
385+
as CWG2335, under which the delayed portions of classes get parsed immediately
386+
when they're first needed, if that first usage precedes the end of the class.
387+
That should give good motivation to upstream all the operations needed to
388+
return to an enclosing scope and parse something.
389+
390+
**Implementation approach**:
391+
392+
Upon seeing a tag definition during parsing we could create a forward declaration,
393+
record the token sequence and mark it as a lazy definition. Later upon complete
394+
type request, we could re-position the parser to parse the definition body.
395+
We already skip some of the template specializations in a similar way [[commit](https://github.com/llvm/llvm-project/commit/b9fa99649bc99), [commit](https://github.com/llvm/llvm-project/commit/0f192e89405ce)].
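
The first approach can be pictured with a toy model. This is only a sketch under loud assumptions: `LazyTag`, `LazyTagTable`, `recordDefinition` and `requireComplete` are hypothetical names, not Clang APIs, tokens are plain strings, and "parsing" is reduced to counting member declarations.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy model; none of these names exist in Clang.
struct LazyTag {
  std::vector<std::string> bodyTokens; // recorded, still unparsed body
  bool complete = false;               // true once the body was parsed
  std::size_t fieldCount = 0;          // stand-in for semantic information
};

class LazyTagTable {
public:
  // On seeing a tag definition: keep a forward declaration only and
  // stash the body tokens instead of parsing them.
  void recordDefinition(const std::string& name,
                        std::vector<std::string> body) {
    LazyTag& tag = tags_[name];
    tag.bodyTokens = std::move(body);
    tag.complete = false;
    tag.fieldCount = 0;
  }

  // On a complete-type request: parse the stored body, at most once.
  const LazyTag& requireComplete(const std::string& name) {
    LazyTag& tag = tags_.at(name);
    if (!tag.complete) {
      for (const std::string& tok : tag.bodyTokens)
        if (tok == ";") ++tag.fieldCount; // "parse": count declarations
      tag.complete = true;
      ++parsedBodies_;
    }
    return tag;
  }

  // Number of bodies that were actually parsed so far.
  std::size_t parsedBodies() const { return parsedBodies_; }

private:
  std::unordered_map<std::string, LazyTag> tags_;
  std::size_t parsedBodies_ = 0;
};
```

In this model a definition that is recorded but never required to be complete costs only the storage of its tokens, which is the saving the project is after.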
396+
397+
Another approach is for every lazily parsed entity to record its token stream and to change
398+
the Toks stored on LateParsedDeclarations to optionally refer to a subsequence of
399+
the externally-stored token sequence instead of storing its own sequence
400+
(or maybe change CachedTokens so it can do that transparently). One of the
401+
challenges would be that we currently modify the cached tokens list to append
402+
an "eof" token, but it should be possible to handle that in a different way.
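
One possible shape for the second approach is a view over an externally stored token sequence that produces the end marker on the fly. The sketch below is an assumption-laden illustration: `TokenSlice` is a hypothetical type, not the existing `CachedTokens`, and tokens are modeled as strings.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch; TokenSlice is not a Clang class.
struct TokenSlice {
  const std::vector<std::string>* buffer; // externally owned token sequence
  std::size_t begin;                      // index of the first token
  std::size_t end;                        // one past the last token

  // Number of tokens seen by a consumer, including the synthetic eof.
  std::size_t size() const { return end - begin + 1; }

  // Returns the i-th token of the slice. The last position yields a
  // synthetic "<eof>" so nothing is appended to the shared buffer.
  std::string at(std::size_t i) const {
    if (i > end - begin) throw std::out_of_range("TokenSlice::at");
    if (i == end - begin) return "<eof>";
    return buffer->at(begin + i);
  }
};
```

Because the slice never mutates the buffer, several lazily parsed entities could refer to overlapping or adjacent ranges of the same stored sequence.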
403+
404+
In some cases, a class definition can affect its surrounding context in a few
405+
ways you'll need to be careful about here:
406+
407+
1) `struct X` appearing inside the class can introduce the name `X` into the enclosing context.
408+
409+
2) `static inline` declarations can introduce global variables with non-constant initializers
410+
that may have arbitrary side-effects.
411+
412+
For point (2), there's a more general problem: parsing any expression can trigger
413+
a template instantiation of a class template that has a static data member with
414+
an initializer that has side-effects. Unlike the above two cases, I don't think
415+
there's any way we can correctly detect and handle such cases by some simple analysis
416+
of the token stream; actual semantic analysis is required to detect such cases. But
417+
perhaps if they happen only in code that is itself unused, it wouldn't be terrible
418+
for Clang to have a language mode that doesn't guarantee that such instantiations
419+
actually happen.
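
The two numbered cases and the template variant can be reproduced in a few lines. This is an illustrative sketch with hypothetical names (`Outer`, `Registry`, `sideEffect`, `g_effects`, `touch`); it assumes C++17 for `static inline` data members and an implementation that runs dynamic initializers before `main`, as mainstream compilers do.

```cpp
#include <cassert>

int g_effects = 0;                        // counts initializer side effects
inline int sideEffect() { return ++g_effects; }

struct Outer {
  struct X* link = nullptr;               // (1) declares X in the enclosing scope
  static inline int token = sideEffect(); // (2) initializer with a side effect
};

X* fromEnclosingScope = nullptr;          // compiles only because of (1)

template <class T> struct Registry {
  static inline int token = sideEffect(); // runs only if instantiated
};

// Referring to Registry<int>::token instantiates the member together with
// its initializer. No other specialization is used, so no other
// initializer is ever instantiated or run.
int touch() { return Registry<int>::token; }
```

Skipping the body of `Outer` would silently drop both the name `X` and one of the two side effects, and skipping the body of `touch` would drop the other, which is why token-level analysis alone cannot make lazy parsing fully transparent.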
420+
421+
An alternative and more efficient implementation could be to make the lookup tables
422+
range-based, but we do not even have a prototype proving this could be a feasible
423+
approach.
424+
425+
tasks: |
426+
* Design and implementation of on-demand compilation for non-templated functions
427+
* Support non-templated structs and classes
428+
* Run performance benchmarks on relevant codebases and prepare a report
429+
* Prepare a community RFC document
430+
* [Stretch goal] Support templates
431+
432+
The successful candidate should commit to regular participation in weekly
433+
meetings, deliver presentations, and contribute blog posts as requested.
434+
Additionally, they should demonstrate the ability to navigate the
435+
community process with patience and understanding.
436+
149437
- name: "Enable cross-talk between Python and C++ kernels in xeus-clang-REPL by using Cppyy"
150438
description: |
151439
xeus-clang-REPL is a C++ kernel for Jupyter notebooks using clang-REPL as
