|
| 1 | +- name: "Integrate a Large Language Model with the xeus-cpp Jupyter kernel" |
| 2 | + description: | |
| 3 | + xeus-cpp is a Jupyter kernel for C++ based on the native implementation |
| 4 | + of the Jupyter protocol, xeus. This enables users to write and execute |
| 5 | + C++ code interactively, seeing the results immediately. This REPL |
| 6 | + (read-eval-print-loop) nature allows rapid prototyping and iterations |
| 7 | + without the overhead of compiling and running separate C++ programs. |
| 8 | + This also achieves C++ and Python integration within a single Jupyter |
| 9 | + environment. |
| 10 | +
|
| 11 | + This project aims to integrate a large language model, such as Bard/Gemini, |
| 12 | + with the xeus-cpp Jupyter kernel. This integration will enable users to |
| 13 | + interactively generate and execute C++ code, leveraging the assistance |
| 14 | + of the language model. Upon successful integration, users will have access |
| 15 | + to features such as code autocompletion, syntax checking, semantic |
| 16 | + understanding, and even code generation based on natural language prompts. |
| 17 | +
|
| 18 | + tasks: | |
| 19 | + * Design and implement mechanisms to interface the large language model with the xeus-cpp kernel. Jupyter-AI might be used as a motivating example. |
| 20 | + * Develop functionalities within the kernel to utilize the language model for code generation based on natural language descriptions and suggestions for autocompletion. |
| 21 | + * Comprehensive documentation and thorough testing/CI additions to ensure reliability. |
| 22 | + * [Stretch Goal] After achieving the previous milestones, the student can work on specializing the model for enhanced syntax and semantic understanding capabilities by using xeus notebooks as datasets. |
| 23 | +
|
| 24 | +
|
| 25 | +- name: "Implementing missing features in xeus-cpp" |
| 26 | + description: | |
| 27 | + xeus-cpp is a Jupyter kernel for C++ based on the native implementation |
| 28 | + of the Jupyter protocol, xeus. This enables users to write and execute |
| 29 | + C++ code interactively, seeing the results immediately. This REPL |
| 30 | + (read-eval-print-loop) nature allows rapid prototyping and iterations |
| 31 | + without the overhead of compiling and running separate C++ programs. |
| 32 | + This also achieves C++ and Python integration within a single Jupyter |
| 33 | + environment. |
| 34 | +
|
| 35 | + xeus-cpp is the successor of xeus-clang-repl and xeus-cling. The goal of this |
| 36 | + project is to bring its feature support up to the level of what’s |
| 37 | + supported in xeus-clang-repl and xeus-cling. |
| 38 | +
|
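| | + One concrete example of the targeted feature parity is value printing. The |
| | + sketch below shows what a notebook cell is expected to look like, assuming |
| | + behaviour analogous to xeus-cling, where the value of a trailing expression |
| | + is rendered by the kernel (the exact rendering shown is an assumption): |
| | +
| | + ```cpp |
| | + // Notebook cell: xeus-cpp should render the value of a trailing expression, |
| | + // matching what users already get from xeus-cling. |
| | + #include <vector> |
| | + std::vector<int> v{1, 2, 3}; |
| | + v.size()   // expected rich output from the kernel: 3 |
| | + ``` |
| | +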
| 39 | + tasks: | |
| 40 | + * Fix occasional bugs in clang-repl directly in upstream LLVM |
| 41 | + * Implement the value printing logic |
| 42 | + * Advance the WASM infrastructure |
| 43 | + * Write tutorials and demonstrators |
| 44 | + * Complete the transition of xeus-clang-repl to xeus-cpp |
| 45 | +
|
| 46 | +
|
| 47 | +- name: "Adoption of CppInterOp in ROOT" |
| 48 | + description: | |
| 49 | + Incremental compilation pipelines process code chunk-by-chunk by building |
| 50 | + an ever-growing translation unit. Code is then lowered into the LLVM IR |
| 51 | + and subsequently run by the LLVM JIT. Such a pipeline allows creation of |
| 52 | + efficient interpreters. The interpreter enables interactive exploration |
| 53 | + and makes the C++ language more user friendly. The incremental compilation |
| 54 | + mode is used by the interactive C++ interpreter, Cling, initially developed |
| 55 | + to enable interactive high-energy physics analysis in a C++ environment. |
| 56 | + The CppInterOp library provides a minimalist approach for other languages |
| 57 | + to identify C++ entities (variables, classes, etc.). This enables |
| 58 | + interoperability with C++ code, bringing the speed and efficiency of C++ |
| 59 | + to simpler, more interactive languages like Python. CppInterOp provides |
| 60 | + primitives that are good for providing reflection information. |
| 61 | +
|
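| | + To make the reflection primitives more concrete, below is a rough sketch of |
| | + how a client might query an entity through CppInterOp. The header path and |
| | + the exact function names and signatures are assumptions and should be |
| | + verified against the CppInterOp repository: |
| | +
| | + ```cpp |
| | + #include "clang/Interpreter/CppInterOp.h"  // header path is an assumption |
| | + #include <iostream> |
| | +
| | + int main() { |
| | +   // Create an incremental compiler/interpreter instance to work against. |
| | +   Cpp::CreateInterpreter(); |
| | +
| | +   // Declare a C++ entity, then ask reflection questions about it; these are |
| | +   // the same kinds of queries ROOT's dictionary system needs to answer. |
| | +   Cpp::Declare("class MyClass { public: int fField; };"); |
| | +
| | +   Cpp::TCppScope_t klass = Cpp::GetScope("MyClass"); |
| | +   std::cout << "is class: " << Cpp::IsClass(klass) << "\n"; |
| | +   std::cout << "name: " << Cpp::GetQualifiedName(klass) << "\n"; |
| | +   return 0; |
| | + } |
| | + ``` |
| | +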
| 62 | + ROOT is an open-source data analysis framework used by high-energy |
| 63 | + physics and other fields to analyze petabytes of data scientifically. The |
| 64 | + framework provides support for data storage and processing by relying |
| 65 | + on Cling, Clang, and LLVM to automatically build an efficient I/O |
| 66 | + representation of the necessary C++ objects. The I/O properties of each |
| 67 | + object are described in a compilable C++ file called a /dictionary/. |
| 68 | + ROOT’s I/O dictionary system relies on reflection information provided |
| 69 | + by Cling and Clang. However, this reflection system has grown |
| 70 | + organically, and ROOT’s core/metacling component has become hard to |
| 71 | + maintain and integrate. |
| 72 | +
|
| 73 | + The goal of this project is to integrate CppInterOp in ROOT where possible. |
| 74 | +
|
| 75 | + tasks: | |
| 76 | + * Complete several prerequisite infrastructure items, such as Windows support and WASM support |
| 77 | + * Make GitHub Actions reusable across multiple repositories |
| 78 | + * Sync the state of the dynamic library manager with the one in ROOT |
| 79 | + * Sync the state of CallFunc/JitCall with the one in ROOT |
| 80 | + * Prepare the infrastructure for upstreaming to LLVM |
| 81 | + * Propose an RFC and make a presentation to the ROOT development team |
| 82 | +
|
| 83 | +
|
1 | 84 | - name: "Implement CppInterOp API exposing memory, ownership and thread safety information "
|
2 | 85 | description: |
|
3 | 86 | Incremental compilation pipelines process code chunk-by-chunk by building
|
|
82 | 165 | defined via Cppyy into fast machine code. Since Numba compiles the code in
|
83 | 166 | loops into machine code it crosses the language barrier just once and avoids
|
84 | 167 | large slowdowns accumulating from repeated calls between the two languages.
|
85 |
| - Numba uses its own lightweight version of the LLVM compiler toolkit (llvmlite) |
| 168 | + Numba uses its own lightweight version of the LLVM compiler toolkit ([llvmlite](https://github.com/numba/llvmlite)) |
86 | 169 | that generates an intermediate code representation (LLVM IR) which is also
|
87 | 170 | supported by the Clang compiler capable of compiling CUDA C++ code.
|
88 | 171 |
|
|
146 | 229 | * Work on integrating these plugins with toolkits like CUTLASS that
|
147 | 230 | utilise the bindings to provide a Python API
|
148 | 231 |
|
| 232 | +- name: "Improve the LLVM.org Website Look and Feel" |
| 233 | + description: | |
| 234 | + The llvm.org website serves as the central hub for information about the |
| 235 | + LLVM project, encompassing project details, current events, and relevant |
| 236 | + resources. Over time, the website has evolved organically, prompting the |
| 237 | + need for a redesign to enhance its modernity, structure, and ease of |
| 238 | + maintenance. |
| 239 | + |
| 240 | + The goal of this project is to create a contemporary and coherent static |
| 241 | + website that reflects the essence of LLVM.org. This redesign aims to improve |
| 242 | + navigation, taxonomy, content discoverability, and overall usability. Given |
| 243 | + the critical role of the website in the community, efforts will be made to |
| 244 | + engage with community members, seeking consensus on the proposed changes. |
| 245 | +
|
| 246 | + LLVM's [current website](https://llvm.org) is a complicated mesh of uncoordinated pages with |
| 247 | + inconsistent, static links pointing to both internal and external sources. |
| 248 | + The website has grown substantially and haphazardly since its inception. |
| 249 | +
|
| 250 | + It requires a major UI and UX overhaul to be able to better serve the LLVM |
| 251 | + community. |
| 252 | +
|
| 253 | + Based on a preliminary site audit, the following are some of the problem areas |
| 254 | + that need to be addressed. |
| 255 | +
|
| 256 | + **Sub-Sites**: Many of the sections/sub-sites have a completely different UI/UX |
| 257 | + (e.g., [main](https://llvm.org), [clang](https://clang.llvm.org), |
| 258 | + [lists](https://lists.llvm.org/cgi-bin/mailman/listinfo), |
| 259 | + [foundation](https://foundation.llvm.org), |
| 260 | + [circt](https://circt.llvm.org/docs/GettingStarted/), |
| 261 | + [lnt](http://lnt.llvm.org), and [docs](https://llvm.org/docs)). |
| 262 | + Sub-sites are divided into 8 separate repos and use different technologies |
| 263 | + including [Hugo](https://github.com/llvm/circt-www/blob/main/website/config.toml), |
| 264 | + [Jekyll](https://github.com/llvm/clangd-www/blob/main/_config.yml), etc. |
| 265 | +
|
| 266 | + **Navigation**: On-page navigation is inconsistent and confusing. Cross-sub-site |
| 267 | + navigation is inconsistent, unintuitive, and sometimes non-existent. Important |
| 268 | + subsections often depend on static links within (seemingly random) pages. |
| 269 | + Multi-word menu items are center-aligned and flow out of margins. |
| 270 | + |
| 271 | + **Pages**: Many [large write-ups](https://clang.llvm.org/docs/UsersManual.html) |
| 272 | + lack pagination, section boundaries, etc., making |
| 273 | + them seem more intimidating than they really are. Several placeholder pages |
| 274 | + re-route to [3rd party services](https://llvm.swoogo.com/2023devmtg), |
| 275 | + adding bloat and inconsistency. |
| 276 | +
|
| 277 | + **Search**: Search options are placed in unintuitive locations, like the bottom |
| 278 | + of the side panel, or from [static links](https://llvm.org/docs/) to |
| 279 | + [redundant pages](https://llvm.org/docs/search.html). Some pages have |
| 280 | + no search options at all. With multiple sections of the website hosted in |
| 281 | + separate projects/repos, cross-sub-site search doesn't seem possible. |
| 282 | + |
| 283 | + **Expected results**: A modern, coherent-looking website that attracts new |
| 284 | + prospective users and empowers the existing community with better navigation, |
| 285 | + taxonomy, content discoverability, and overall usability. It should also |
| 286 | + include a more descriptive Contribution Guide ([example](https://kitian616.github.io/jekyll-TeXt-theme/docs/en/layouts)) to help novice |
| 287 | + contributors, as well as to help maintain a coherent site structure. |
| 288 | +
|
| 289 | + Since the website is critical infrastructure and most of the community |
| 290 | + will have an opinion, this project should engage with the community to |
| 291 | + build consensus on the steps being taken. |
| 292 | +
|
| 293 | + tasks: | |
| 294 | + * Conduct a comprehensive content audit of the existing website. |
| 295 | + * Select appropriate technologies, preferably static site generators like |
| 296 | + Hugo or Jekyll. |
| 297 | + * Advocate for a separation of data and visualization, utilizing formats such |
| 298 | + as YAML and Markdown to facilitate content management without direct HTML |
| 299 | + coding. |
| 300 | + * Present three design mockups for the new website, fostering open discussions |
| 301 | + and allowing time for alternative proposals from interested parties. |
| 302 | + * Implement the chosen design, incorporating valuable feedback from the |
| 303 | + community. |
| 304 | + * Collaborate with content creators to integrate or update content as needed. |
| 305 | + |
| 306 | + The successful candidate should commit to regular participation in weekly |
| 307 | + meetings, deliver presentations, and contribute blog posts as requested. |
| 308 | + Additionally, they should demonstrate the ability to navigate the community |
| 309 | + process with patience and understanding. |
| 310 | +
|
| 311 | +
|
| 312 | +- name: "On Demand Parsing in Clang" |
| 313 | + description: | |
| 314 | + Clang, like any C++ compiler, parses a sequence of characters as they appear, |
| 315 | + linearly. The linear character sequence is then turned into tokens and an AST |
| 316 | + before being lowered to machine code. In many cases, the end-user code uses only a |
| 317 | + small portion of the C++ entities in the entire translation unit, but the user |
| 318 | + still pays the price of compiling all of the redundant ones. |
| 319 | +
|
| 320 | + This project proposes to process the expensive-to-compile C++ entities when |
| 321 | + they are used rather than eagerly. This approach is already adopted in Clang’s CodeGen, |
| 322 | + where it allows Clang to produce code only for what is being used. On-demand |
| 323 | + compilation is expected to significantly reduce peak compilation memory |
| 324 | + and improve the compile time for translation units which sparsely use their |
| 325 | + contents. In addition, it would have a significant impact on interactive |
| 326 | + C++, where header inclusion essentially becomes a no-op and entities are |
| 327 | + parsed only on demand. |
| 328 | +
|
| 329 | + The Cling interpreter implements a very naive but efficient cross-translation |
| 330 | + unit lazy compilation optimization which scales across hundreds of libraries |
| 331 | + in the field of high-energy physics. |
| 332 | +
|
| 333 | + ```cpp |
| 334 | + // A.h |
| 335 | + #include <string> |
| 336 | + #include <vector> |
| 337 | + template <class T, class U = int> struct AStruct { |
| 338 | + void doIt() { /*...*/ } |
| 339 | + const char* data; |
| 340 | + // ... |
| 341 | + }; |
| 342 | +
|
| 343 | + template<class T, class U = AStruct<T>> |
| 344 | + inline void freeFunction() { /* ... */ } |
| 345 | + inline void doit(unsigned N = 1) { /* ... */ } |
| 346 | +
|
| 347 | + // Main.cpp |
| 348 | + #include "A.h" |
| 349 | + int main() { |
| 350 | + doit(); |
| 351 | + return 0; |
| 352 | + } |
| 353 | + ``` |
| 354 | +
|
| 355 | + This pathological example expands to 37253 lines of code to process. Cling |
| 356 | + builds an index (it calls it an autoloading map) which contains only the |
| 357 | + forward declarations of these C++ entities, amounting to 3000 lines of code. |
| 358 | + |
| 359 | + The index looks like: |
| 360 | +
|
| 361 | + ```cpp |
| 362 | + // A.h.index |
| 363 | + namespace std{inline namespace __1{template <class _Tp, class _Allocator> class __attribute__((annotate("$clingAutoload$vector"))) __attribute__((annotate("$clingAutoload$A.h"))) __vector_base; |
| 364 | + }} |
| 365 | + ... |
| 366 | + template <class T, class U = int> struct __attribute__((annotate("$clingAutoload$A.h"))) AStruct; |
| 367 | + ``` |
| 368 | +
|
| 369 | + Upon requiring the complete type of an entity, Cling includes the relevant |
| 370 | + header file to get it. There are several trivial workarounds to deal with |
| 371 | + default arguments and default template arguments, as they now appear on both |
| 372 | + the forward declaration and the definition. You can read more [here](https://github.com/root-project/root/blob/master/README/README.CXXMODULES.md#header-parsing-in-root). |
| 373 | +
|
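| | + One way to see why such workarounds are needed (an illustrative snippet, not |
| | + Cling's actual mechanism): C++ forbids restating a default template argument, |
| | + so once the index has introduced the default on the forward declaration, the |
| | + definition that is parsed later must not repeat it: |
| | +
| | + ```cpp |
| | + // From the autoloading index: a forward declaration carrying the default. |
| | + template <class T, class U = int> struct AStruct; |
| | +
| | + // From A.h: repeating "U = int" here would be ill-formed ("redefinition of |
| | + // default argument"), so the default has to be dropped when the real header |
| | + // is parsed after the index. |
| | + template <class T, class U /* = int */> struct AStruct { |
| | +   void doIt() { /*...*/ } |
| | +   const char* data; |
| | + }; |
| | + ``` |
| | +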
| 374 | + Although the implementation could not be called a reference implementation, |
| 375 | + it shows that the Parser and the Preprocessor of Clang are relatively stateless |
| 376 | + and can be used to process character sequences which are not linear in |
| 377 | + nature. In particular, namespace-scope definitions are relatively easy to handle, |
| 378 | + and it is not very difficult to return to namespace scope when we lazily parse |
| 379 | + something. For other contexts, such as local classes, we will have lost some |
| 380 | + essential information, such as the name lookup tables for local entities. However, |
| 381 | + these cases are probably not very interesting, as lazy parsing is probably |
| 382 | + worth doing only at the granularity of top-level entities. |
| 383 | +
|
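| | + As a hypothetical illustration of the local-class case, re-parsing the class |
| | + body later would need the enclosing function's local lookup state, for |
| | + example to resolve a local alias: |
| | +
| | + ```cpp |
| | + void f() { |
| | +   using Elem = int;              // alias that lives only in f's scope |
| | +   struct Local { Elem value; };  // parsing this body later needs f's local |
| | +                                  // name-lookup table to resolve Elem |
| | +   Local l{42}; |
| | +   (void)l; |
| | + } |
| | + ``` |
| | +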
| 384 | + Such an implementation can help with existing issues in the standard, such |
| 385 | + as CWG2335, under which the delayed portions of classes get parsed immediately |
| 386 | + when they're first needed, if that first usage precedes the end of the class. |
| 387 | + That should give good motivation to upstream all the operations needed to |
| 388 | + return to an enclosing scope and parse something. |
| 389 | +
|
| 390 | + **Implementation approach**: |
| 391 | +
|
| 392 | + Upon seeing a tag definition during parsing, we could create a forward declaration, |
| 393 | + record the token sequence, and mark it as a lazy definition. Later, upon a |
| 394 | + complete-type request, we could reposition the parser to parse the definition body. |
| 395 | + We already skip some of the template specializations in a similar way [[commit](https://github.com/llvm/llvm-project/commit/b9fa99649bc99), [commit](https://github.com/llvm/llvm-project/commit/0f192e89405ce)]. |
| 396 | +
|
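| | + A very rough sketch of the bookkeeping this first approach implies is shown |
| | + below; the types and hooks are stand-ins, not Clang's actual Parser/Sema |
| | + interfaces: |
| | +
| | + ```cpp |
| | + #include <string> |
| | + #include <unordered_map> |
| | + #include <utility> |
| | + #include <vector> |
| | +
| | + // Stand-in for clang::Token; a real implementation would use CachedTokens. |
| | + struct Token { std::string spelling; }; |
| | +
| | + struct LazyDefinition { |
| | +   std::vector<Token> bodyToks;  // recorded tokens of the skipped body |
| | +   bool materialized = false;    // whether the definition was parsed already |
| | + }; |
| | +
| | + // Index from tag name to its recorded-but-unparsed definition. |
| | + static std::unordered_map<std::string, LazyDefinition> LazyDefs; |
| | +
| | + // On seeing "struct Foo { ... }": register only a forward declaration and |
| | + // stash the body tokens instead of parsing them. |
| | + void recordLazyDefinition(const std::string &Name, std::vector<Token> Body) { |
| | +   LazyDefs[Name] = LazyDefinition{std::move(Body), /*materialized=*/false}; |
| | + } |
| | +
| | + // When the complete type is required: replay the recorded tokens through the |
| | + // parser (hypothetical hook) and mark the definition as materialized. |
| | + void requireCompleteType(const std::string &Name) { |
| | +   auto It = LazyDefs.find(Name); |
| | +   if (It == LazyDefs.end() || It->second.materialized) |
| | +     return; |
| | +   // reenterScopeAndParse(It->second.bodyToks);  // hypothetical parser hook |
| | +   It->second.materialized = true; |
| | + } |
| | + ``` |
| | +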
| 397 | + Another approach is for every lazily parsed entity to record its token stream and |
| 398 | + to change the Toks stored on LateParsedDeclarations to optionally refer to a |
| 399 | + subsequence of the externally stored token sequence instead of storing its own sequence |
| 400 | + (or maybe change CachedTokens so it can do that transparently). One of the |
| 401 | + challenges would be that we currently modify the cached tokens list to append |
| 402 | + an "eof" token, but it should be possible to handle that in a different way. |
| 403 | +
|
| 404 | + In some cases, a class definition can affect its surrounding context in a few |
| 405 | + ways that need to be handled carefully (see the sketch after this list): |
| 406 | +
|
| 407 | + 1) `struct X` appearing inside the class can introduce the name `X` into the enclosing context. |
| 408 | +
|
| 409 | + 2) `static inline` declarations can introduce global variables with non-constant initializers |
| 410 | + that may have arbitrary side-effects. |
| 411 | +
|
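| | + A minimal illustration of both points (registerType is a hypothetical |
| | + function with an observable side effect): |
| | +
| | + ```cpp |
| | + // Hypothetical registration function with an observable side effect. |
| | + int registerType(const char *name); |
| | +
| | + struct Outer { |
| | +   // (1) This elaborated-type-specifier declares X in the enclosing |
| | +   //     namespace scope, not inside Outer. |
| | +   struct X *ptr; |
| | +
| | +   // (2) A static inline member with a non-constant initializer: the call to |
| | +   //     registerType must still run even if Outer's body is parsed lazily. |
| | +   static inline int token = registerType("Outer"); |
| | + }; |
| | + ``` |
| | +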
| 412 | + For point (2), there's a more general problem: parsing any expression can trigger |
| 413 | + a template instantiation of a class template that has a static data member with |
| 414 | + an initializer that has side-effects. Unlike the above two cases, I don't think |
| 415 | + there's any way we can correctly detect and handle such cases by some simple analysis |
| 416 | + of the token stream; actual semantic analysis is required to detect such cases. But |
| 417 | + perhaps if they happen only in code that is itself unused, it wouldn't be terrible |
| 418 | + for Clang to have a language mode that doesn't guarantee that such instantiations |
| 419 | + actually happen. |
| 420 | +
|
| 421 | + An alternative and potentially more efficient implementation could be to make the |
| 422 | + lookup tables range-based, but we do not yet have even a prototype proving this is |
| 423 | + a feasible approach. |
| 424 | +
|
| 425 | + tasks: | |
| 426 | + * Design and implementation of on-demand compilation for non-templated functions |
| 427 | + * Support non-templated structs and classes |
| 428 | + * Run performance benchmarks on relevant codebases and prepare a report |
| 429 | + * Prepare a community RFC document |
| 430 | + * [Stretch goal] Support templates |
| 431 | +
|
| 432 | + The successful candidate should commit to regular participation in weekly |
| 433 | + meetings, deliver presentations, and contribute blog posts as requested. |
| 434 | + Additionally, they should demonstrate the ability to navigate the |
| 435 | + community process with patience and understanding. |
| 436 | +
|
149 | 437 | - name: "Enable cross-talk between Python and C++ kernels in xeus-clang-REPL by using Cppyy"
|
150 | 438 | description: |
|
151 | 439 | xeus-clang-REPL is a C++ kernel for Jupyter notebooks using clang-REPL as
|
|