| 1 | +--- |
| 2 | +author: "Erick Velez (evelez7)" |
| 3 | +date: "2025-08-28" |
| 4 | +tags: ["GSoC", "clang-doc", "clang-tools-extra", "documentation"] |
| 5 | +title: "GSoC 2025: Improving Core Clang-Doc Functionality" |
| 6 | +--- |
| 7 | + |
| 8 | +I was selected as a contributor for GSoC 2025 under the project "Improving Core Clang-Doc Functionality" for LLVM. |
| 9 | +My mentors for the project were Paul Kirth and Petr Hosek. |
| 10 | + |
| 11 | +Clang-Doc is a tool in clang-tools-extra that generates documentation from Clang's AST. |
| 12 | +Clang-Doc can output documentation in Markdown, HTML, YAML, and JSON. |
| 13 | +The project started in 2018 but major development eventually slowed. |
| 14 | +Recently, there have been efforts to get it back on track. |
| 15 | + |
| 16 | +This year, the [GSoC project idea](https://discourse.llvm.org/t/improve-documentation-parsing-in-clang/84513) had a simple premise: improve core functionality. |
| 17 | + |
| 18 | +## The Issues |
| 19 | + |
| 20 | +The project idea proposed three main areas of focus to improve documentation quality. |
| 21 | + |
| 22 | +1. C++ support |
| 23 | +2. Doxygen comments |
| 24 | +3. Markdown support |
| 25 | + |
| 26 | +First, Clang-Doc did not support every C++ construct; friends and concepts, for example, were missing. |
| 27 | +A C++ documentation tool that cannot document core C++ constructs leaves gaps in its output. |
| 28 | +Second, Doxygen command support should be robust and cover as many commands as possible. |
| 29 | +Third and last, having Markdown available to developers for documentation would be useful. |
| 30 | +Markdown provides the power of expression in an area that is technically dense. |
| 31 | +It can be used to highlight critical information and warnings. |
| 32 | + |
| 33 | +### The Architecture |
| 34 | + |
| 35 | +Here's a quick overview of Clang-Doc's architecture, which follows a map-reduce pattern: |
| 36 | + |
| 37 | +1. Visit source declarations via Clang's ASTVisitor. |
| 38 | +2. Serialize relevant source information into an Info (Clang-Doc's main data entity). |
| 39 | +3. Once all source declarations are serialized, write them into bitcode, reduce, and read the reduced Infos. |
| 40 | +4. Serialize Infos into the desired format. |
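
The reduce step in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Clang-Doc's actual code: `SketchInfo` and `reduceAll` are hypothetical names, a plain string key stands in for whatever identifier the tool really uses, and the bitcode round trip is left out entirely.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an Info: one entity as seen in one translation unit.
struct SketchInfo {
  std::string Key;                   // assumed unique identifier for the entity
  std::string Name;                  // may be empty if this unit did not see it
  std::vector<std::string> Comments; // doc comments found in this unit
};

// Reduce: merge every partial Info that shares a key into a single Info.
std::map<std::string, SketchInfo>
reduceAll(const std::vector<SketchInfo> &Partials) {
  std::map<std::string, SketchInfo> Reduced;
  for (const SketchInfo &P : Partials) {
    assert(!P.Key.empty() && "every partial Info needs a key");
    auto It = Reduced.find(P.Key);
    if (It == Reduced.end()) {
      // First time this entity is seen: keep it as-is.
      Reduced.emplace(P.Key, P);
      continue;
    }
    // Already seen: fill in a missing name and append the new comments.
    SketchInfo &Merged = It->second;
    if (Merged.Name.empty())
      Merged.Name = P.Name;
    Merged.Comments.insert(Merged.Comments.end(), P.Comments.begin(),
                           P.Comments.end());
  }
  return Reduced;
}
```

Each backend would then consume the reduced map, which is where the per-backend duplication described next comes from.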
| 41 | + |
| 42 | +It seems fairly straightforward, but the architecture had a critical flaw. |
| 43 | +If a new C++ construct needed to be supported, it would be visited and serialized, but then support would have to be added to each backend individually. |
| 44 | +If you added serialization for a construct in YAML, you still had to implement the Markdown logic separately. |
| 45 | +This imposed a very high maintenance cost on extending basic functionality, even if you just wanted to add something simple. |
| 46 | +It also easily led to generator disparity; a construct might be serialized in YAML, but not in Markdown. |
| 47 | + |
| 48 | +Testing was also in an awkward spot because it was unclear what format would be used to verify if the documentation output was acceptable. |
| 49 | +YAML was the initial candidate for this, but my mentors had started to consider JSON instead. |
| 50 | +The backends were far from feature parity; some were tested for attributes that others didn't have. |
| 51 | + |
| 52 | +## The Good: Mustache |
| 53 | + |
| 54 | +Last year's GSoC brought in great improvements that became the basis of my summer. |
| 55 | +First, last year's GSoC contributor landed a large performance improvement. |
| 56 | +I might not have been able to test Clang-Doc on Clang itself without it. |
| 57 | + |
| 58 | +Another contribution that was essential to my summer is the [Mustache template implementation](https://mustache.github.io/) in LLVM. |
| 59 | +Mustache templates allow Clang-Doc to shift away from manually generating HTML tags and eliminate high maintenance burdens. |
| 60 | +Templates could also solve the feature parity problem by using JSON to feed templates. |
| 61 | + |
| 62 | + |
| 63 | +# Building a JSON Backend |
| 64 | + |
| 65 | +While familiarizing myself with the codebase during the Community Bonding Period, I quickly determined that implementing a JSON backend would be incredibly beneficial to the project and my summer plans. |
| 66 | +A JSON backend presented two immediate benefits: |
| 67 | + |
| 68 | +1. We could use it to feed the HTML Mustache templates and any templates added in the future. |
| 69 | +2. As the main feeder format, testing can be focused on the JSON output. |
| 70 | + |
| 71 | +The existing Mustache backend in Clang-Doc already contained logic to create JSON documents, but they were immediately discarded when the templates were rendered. |
| 72 | +I adapted most of the code into a separate generator to output JSON files and was able to land it within two weeks. |
| 73 | +This ended up accelerating my work because I could implement support for C++ constructs and test them in JSON instead of another format that we would probably be refactoring in the near future. |
| 74 | + |
| 75 | +### Pull Requests |
| 76 | +- [add tags to Mustache namespace template](https://github.com/llvm/llvm-project/pull/142045) |
| 77 | +- [add namespaces](https://github.com/llvm/llvm-project/pull/142483) |
| 78 | +- [removed default label on some switches](https://github.com/llvm/llvm-project/pull/143919) |
| 79 | +- [precommit](https://github.com/llvm/llvm-project/pull/144160) and [add support for concepts](https://github.com/llvm/llvm-project/pull/144430) |
| 80 | +- [precommit](https://github.com/llvm/llvm-project/pull/145069) and [document global variables](https://github.com/llvm/llvm-project/pull/145070) |
| 81 | +- [refactor JSONGenerator array usage](https://github.com/llvm/llvm-project/pull/145595) |
| 82 | +- [refactor BitcodeReader::readSubBlock](https://github.com/llvm/llvm-project/pull/145835) |
| 83 | +- [serialize isBuiltIn and IsTemplate](https://github.com/llvm/llvm-project/pull/146149) |
| 84 | +- [precommit](https://github.com/llvm/llvm-project/pull/146164) and [friends](https://github.com/llvm/llvm-project/pull/146165) |
| 85 | + |
| 86 | +# Comments |
| 87 | + |
| 88 | +## Groups and Order |
| 89 | + |
| 90 | +Comments weren't ordered in Clang-Doc's HTML documentation. |
| 91 | +They were just displayed in whatever order they were serialized in, which is the order that they're written in source. |
| 92 | +This meant comments would be extremely difficult to read - you don't want to search for another parameter comment after reading the first one, even if they're expected to be written in order in source. |
| 93 | + |
| 94 | +Funnily enough, Mustache made this a little more complicated. |
| 95 | +The only logic operation that Mustache has to check if a field exists is an iteration like `{{#Fields}}`, but any header that denotes a comment section would be duplicated. |
| 96 | + |
| 97 | +```html |
| 98 | +{{#Fields}} |
| 99 | +<h3>Field Header</h3> |
| 100 | + {{FieldInfo}} |
| 101 | +{{/Fields}} |
| 102 | +``` |
| 103 | + |
| 104 | +All of the logic to order them needs to be done in the serialization to JSON itself, so I overhauled our comment organization. |
| 105 | +Previously, Clang-Doc's comments were organized exactly as in Clang's AST like the following: |
| 106 | + |
| 107 | +- FullComment |
| 108 | + - BriefComment |
| 109 | + - ParagraphComment |
| 110 | + - TextComment |
| 111 | + - TextComment |
| 112 | + - BriefComment |
| 113 | + - ParagraphComment |
| 114 | + |
| 115 | +Everything was unnecessarily nested under a FullComment, and TextComments were also unnecessarily nested. |
| 116 | +Every non-verbatim comment's text was held in one ParagraphComment. |
| 117 | +Since there was only one, we could reduce some boilerplate by directly mapping to the array of TextComments. |
| 118 | + |
| 119 | +After the change, Clang-Doc's comments were structured like this: |
| 120 | + |
| 121 | +- BriefComments |
| 122 | + - TextCommentArray |
| 123 | + - TextCommentArray |
| 124 | +- ParagraphComments |
| 125 | + - TextCommentArray |
| 126 | + |
| 127 | +Now, we can just iterate over every type of comment, which means iterating over every array. |
| 128 | +This left our JSON documentation with a few more fields, since one is needed for every Doxygen command, but with easier identification of what comments exist in the documentation. |
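
The regrouping described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the real implementation: `CommentNode`, `collectText`, and `groupByKind` are hypothetical names, comment kinds are plain strings, and the group name is formed by appending "s" to the kind, mirroring the `BriefComments` and `ParagraphComments` labels in the list above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical nested comment node, loosely modeled on the "before" tree.
struct CommentNode {
  std::string Kind; // e.g. "FullComment", "BriefComment", "TextComment"
  std::string Text; // only meaningful for TextComment nodes
  std::vector<CommentNode> Children;
};

// Collect the text of every TextComment at or below Node, in source order.
void collectText(const CommentNode &Node, std::vector<std::string> &Out) {
  if (Node.Kind == "TextComment")
    Out.push_back(Node.Text);
  for (const CommentNode &Child : Node.Children)
    collectText(Child, Out);
}

// One entry per comment kind; each entry holds one text array per comment.
using GroupedComments =
    std::map<std::string, std::vector<std::vector<std::string>>>;

// Flatten the direct children of a FullComment into per-kind arrays.
GroupedComments groupByKind(const CommentNode &Full) {
  assert(Full.Kind == "FullComment" && "expected the root of a comment tree");
  GroupedComments Groups;
  for (const CommentNode &Child : Full.Children) {
    std::vector<std::string> Lines;
    collectText(Child, Lines);
    Groups[Child.Kind + "s"].push_back(Lines);
  }
  return Groups;
}
```

With comments grouped this way, a template only needs one section per kind, so a header is emitted once per group instead of once per comment.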
| 129 | + |
| 130 | + |
| 131 | +### Pull Requests |
| 132 | +- [add namespace references to VarInfo](https://github.com/llvm/llvm-project/pull/146964) |
| 133 | +- [Serialize record files with mangled name](https://github.com/llvm/llvm-project/pull/148021) |
| 134 | +- [fix ASan complaints from passing RepositoryURL as reference](https://github.com/llvm/llvm-project/pull/148923) |
| 135 | +- [refactor JSON for better Mustache compatibility](https://github.com/llvm/llvm-project/pull/149588) |
| 136 | +- [integrate JSON as the source for Mustache templates](https://github.com/llvm/llvm-project/pull/149589) |
| 137 | +- [separate comments into categories](https://github.com/llvm/llvm-project/pull/149590) |
| 138 | +- [enable comments in class templates](https://github.com/llvm/llvm-project/pull/149848) |
| 139 | +- [remove nesting of text comments inside paragraphs](https://github.com/llvm/llvm-project/pull/150451) |
| 140 | +- [generate comments for functions](https://github.com/llvm/llvm-project/pull/150570) |
| 141 | +- [add param comments to comment template](https://github.com/llvm/llvm-project/pull/150571) |
| 142 | + |
| 143 | +# Markdown |
| 144 | +Markdown was the most speculative aspect of the project. |
| 145 | +It wasn't clear whether we'd try to integrate a solution into Clang itself or whether we'd keep it in clang-tools-extra. |
| 146 | + |
| 147 | +## A JavaScript Solution |
| 148 | +The first option I explored, suggested by my mentor, was a JavaScript library called [Markdown-Tag](https://github.com/MarketingPipeline/Markdown-Tag). |
| 149 | +This would've been really convenient since all it requires is an HTML tag to enable rendering, so any comment text in a template can be easily rendered. |
| 150 | +Unfortunately, it requires all HTML to be sanitized, which defeats the purpose of a ready-made solution for us. |
| 151 | +We would have to parse any potential HTML in comments anyways. |
| 152 | + |
| 153 | +## A Parser Solution |
| 154 | +Without an out-of-the-box solution, we were left with implementing our own parser. |
| 155 | +When I considered this in my proposal, I knew an in-tree parser would want to conform to the simplest possible standard. |
| 156 | +Markdown has no official standard, so I opted for CommonMark conformance. |
| 157 | + |
| 158 | +The summer ended without a complete solution because a couple of weeks were spent researching whether this could be integrated directly into the Clang comment parser or whether we'd need to build our own solution. |
| 159 | +You can see my initial draft [here](https://github.com/llvm/llvm-project/pull/155887). |
| 160 | + |
| 161 | +# Overview |
| 162 | +I implemented a new JSON generator for Clang-Doc that will serve as the basis for documentation generation. |
| 163 | +This will vastly reduce overall lines of code and maintenance burdens. |
| 164 | +I refactored our comment handling to streamline its logic and produce better HTML output. |
| 165 | +I also explored options for rendering Markdown and began an implementation of a parser that I plan on working on in the future. |
| 166 | +Along the way, I also did some refactoring to improve code reuse and ease the burden on contributors by reducing boilerplate code. |
| 167 | + |
| 168 | +Over the summer, I addressed these issues: |
| 169 | +- [template operator T() produces a bad name](https://github.com/llvm/llvm-project/issues/59812) |
| 170 | +- [Add a JSON backend to clang-doc to better leverage mustache templates](https://github.com/llvm/llvm-project/issues/140094) |
| 171 | +- [Reconsider use of enum InfoType::IT_default](https://github.com/llvm/llvm-project/issues/142888) |
| 173 | + |
| 174 | +# Future Work |
| 175 | + |
| 176 | +## Doxygen Grouping |
| 177 | + |
| 178 | +Doxygen has a very useful [grouping](https://www.doxygen.nl/manual/grouping.html) feature that allows structures to be grouped under a custom heading or on separate pages. |
| 179 | +You can see it in [llvm::sys::path](https://llvm.org/doxygen/namespacellvm_1_1sys_1_1path.html). |
| 180 | +We [opened up an issue](https://github.com/llvm/llvm-project/issues/151184#issuecomment-3133596874) for Clang to track this issue, which ended up being a duplicate of [this issue](https://github.com/llvm/llvm-project/issues/123582). |
| 181 | + |
| 182 | +There would most likely have to be some major changes to Clang's comment parsing and Clang's own parsing. |
| 183 | +That's because a lot of the group opening tokens in Clang are free-floating, like so: |
| 184 | + |
| 185 | +```cpp |
| 186 | +/// @{ |
| 187 | + |
| 188 | +class Foo {}; |
| 189 | +``` |
| 190 | + |
| 191 | +That `@{` doesn't attach to a Decl; only comments directly above a declaration are attached to a Decl in the AST. |
| 192 | +My mentors wisely advised that this would be too much to even consider this summer, and could probably be its own GSoC project. |
| 193 | + |
| 194 | +## Cross-referencing |
| 195 | + |
| 196 | +In Doxygen you can use the `@copydoc` command to copy the documentation from one entity to another. |
| 197 | +Doxygen also displays where an entity is referenced, like where a function is invoked. |
| 198 | +Clang-Doc currently has no support for this kind of behavior. |
| 199 | + |
| 200 | +Clang-Doc would need a preprocessing step where any reference to another entity is identified and then resolved somewhere. |
| 201 | +One of my mentors pointed out that it would be great to do during the reduction step where every Info is being visited anyways. |
| 202 | +This actually wasn't something I had even considered in my proposal besides identifying that `@copydoc` wasn't supported by the comment parser. |
| 203 | +It's a common feature of modern documentation, so hopefully someday soon Clang-Doc can acquire it. |
| 204 | + |
| 205 | +# Acknowledgements |
| 206 | +Thank you very much to my mentors Paul Kirth and Petr Hosek for guiding me and advising me in this project. |
| 207 | +I learned so much from review feedback and our conversations. |