
Conversation

@dybucc (Contributor) commented Dec 7, 2025

The extractor needed some fine-tuning to only pick up top-level docstrings instead of (possibly wrong) locally scoped docstrings that are out of reach for library users.

The function signature parser should now be a bit more efficient. It performs no operations on the whole array it gets passed beyond those strictly required for the elements (lines) that actually belong to the function signature. Prior to this, there were a number of enumerate and join calls over the full array, which wouldn't exactly be efficient for, say, source files spanning thousands of lines.
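
To illustrate the idea with a minimal, hypothetical Python sketch (the real parser is a different piece of code; the helper name and the assumption that the source arrives as a list of lines are mine), the scan only touches lines from the function header until the argument list's parentheses balance out, instead of enumerating or joining the whole file:

```python
def signature_lines(lines, start):
    # Hypothetical helper, for illustration only: walk forward from the
    # line containing the function header and stop as soon as the
    # argument list's parentheses are balanced, so the rest of the file
    # is never touched. (Parentheses inside string literals are ignored
    # here for brevity.)
    depth = 0
    seen_open = False
    collected = []
    for line in lines[start:]:
        collected.append(line)
        for ch in line:
            if ch == "(":
                depth += 1
                seen_open = True
            elif ch == ")":
                depth -= 1
        if seen_open and depth == 0:
            break
    return collected
```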

Further work will continue on the function parameter parsing, possibly by changing the parameters of the parser itself, so that, without altering the resulting typst query output, we avoid performing two passes over the argument list: one during the initial function signature parsing and another during the parameter list parsing.

Once work on the function signature is done, the next step will be to fix the actual docstring parser so that it picks up on newlines in function parameter documentation. An example of a docstring that I expect the parser to handle nicely is given in #986. Only after this is done will I move on to seeing what can be done about the type syntax incompatibilities between the manual and the web documentation.
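
For reference, this is roughly what picking up on newlines would amount to, sketched in Python and assuming a tidy-style `/// - name (type): description` docstring format; the actual docstring in #986 and the real parser may look different:

```python
import re

# Hypothetical docstring with a parameter description spanning several lines.
DOC = """\
/// Draws a thing.
/// - radius (number): Radius of the thing,
///   continued on a second line that the parser should fold
///   into the same description instead of dropping it.
/// - fill (color): Fill color.
"""

PARAM = re.compile(r"^/// - (?P<name>[\w.-]+) \((?P<type>[^)]+)\):\s*(?P<desc>.*)$")
CONT = re.compile(r"^///\s+(?P<desc>\S.*)$")

def parse_params(doc):
    params, current = {}, None
    for line in doc.splitlines():
        if (m := PARAM.match(line)):
            current = m["name"]
            params[current] = {"type": m["type"], "desc": m["desc"]}
        elif current and (m := CONT.match(line)):
            # Continuation line: append to the previous parameter's
            # description rather than discarding it at the newline.
            params[current]["desc"] += " " + m["desc"]
    return params

print(parse_params(DOC)["radius"]["desc"])
```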

I also wanted to ask whether it's a good idea to be running the Python script for HTML generation directly rather than through an isolated environment, possibly using the nice integration Just has with uv for script recipes [1].

@dybucc changed the title from "refactor: Rework extractor and fn signature parser" to "refactor: Rework docstring parser" on Dec 7, 2025
@dybucc marked this pull request as draft on December 7, 2025 08:15
@dybucc force-pushed the docstring-parser-rework branch 2 times, most recently from ffddcbe to 05dd1fe on December 7, 2025 08:21
@dybucc force-pushed the docstring-parser-rework branch from 05dd1fe to 540ec7a on December 7, 2025 08:27
@johannes-wolf self-requested a review on December 7, 2025 14:32
@johannes-wolf (Member)

> I also wanted to ask whether it's a good idea to be running the Python script for HTML generation directly rather than through an isolated environment, possibly using the nice integration Just has with uv for script recipes [1].

What exactly do you mean?

@dybucc (Contributor, Author) commented Dec 7, 2025

> > I also wanted to ask whether it's a good idea to be running the Python script for HTML generation directly rather than through an isolated environment, possibly using the nice integration Just has with uv for script recipes [1].
>
> What exactly do you mean?

I mean the genhtml.py script could be part of the justfile as a script recipe instead of being run as a separate command. There's also the fact that the current recipe for documentation generation just tries to run whatever python executable is on the system.

The genhtml.py script could be made into a Python recipe directly inside the justfile, or imported through a Just module. The output of the typst query could then be stored in a Just variable.

For both of these, Just provides nice built-ins to (1) require a program to exist in the user's PATH and report an error if it doesn't, (2) isolate Python script execution with uv while also requiring uv to be installed, and (3) further isolate, in the structural sense, those aspects of the codebase that are mostly related to, or used by, our command runner of choice.

Granted, transitioning from an independent Python script to a script recipe in Just would be a bit of work, but it should reduce the cognitive load of the script by no longer expecting it to work independently of the justfile's string replacement facilities before running.

An example of this that I can quickly get my hands on is a justfile I made for a simple C++ project, which I attach below. See the `_compile` recipe for an example of what I mean by using Just's string facilities in conjunction with Python scripts, and see variables like `lldb` for an example of having Just require a program to exist in the user's PATH before attempting an invocation.

```
set unstable := true
set shell := ["fish", "-c"]
set script-interpreter := ["uv", "run", "--script"]
set quiet := true

alias c := clean
alias d := doc
alias cc := compile
alias r := run
alias dbg := debug

src_dir := if path_exists(justfile_directory() / "src") == "true" { justfile_directory() / "src" } else { error("src directory not found") }
build_dir := justfile_directory() / "build"
target_out := build_dir / "final_program"
lsd := require("lsd")
src_files := prepend(src_dir / "", replace(shell(lsd + " --icon=never -1 " + src_dir), "\n", ' '))
obj_files := replace(replace_regex(src_files, '([[:alpha:]]+)\.cc', '${1}.o'), src_dir, build_dir)
clangd_flags := if path_exists(justfile_directory() / "compile_flags.txt") == "true" { justfile_directory() / "compile_flags.txt" } else { error("compile_flags not found") }
cxx := require("clang++")
cxxflags := trim(replace(replace_regex(read(clangd_flags), '(?m)^-I(.*)?\n', ''), "\n", ' '))
ldflags := trim(env("LDFLAGS", "") + " -pie")
cppflags := env("CPPFLAGS", "") + " " + trim(replace_regex(read(clangd_flags), '(?m)^-[^I](.*)', ''))
doxygen := require("doxygen")
doc_dir := if path_exists(justfile_directory() / "doc") == "true" { justfile_directory() / "doc" } else { error("doc directory not found") }
doxyfile := if path_exists(doc_dir / "configDoxygen.cfg") == "true" { doc_dir / "configDoxygen.cfg" } else { error("doxyfile not found") }
lldb := require("lldb")

[private]
default:
    just --list --unsorted --justfile {{ justfile() }}

# generates doxygen documentation
[macos]
doc:
    {{ doxygen }} {{ doxyfile }}
    {{ doc_dir / "html" }} && pwd | pbcopy

# cleans up build artifacts and older docs
clean:
    rm -rf {{ build_dir }}
    rm -rf {{ doc_dir / "html" }}

# build current project (non-incrementally)
[macos]
compile: _compile
    {{ cxx }} \
    {{ obj_files }} \
    -o {{ target_out }} \
    {{ ldflags }}

[script]
_compile: clean
    # /// script
    # dependencies = ["sh"]
    # ///
    import sh

    cxx = sh.Command({{ quote(cxx) }})
    cppflags = [{{ replace(quote(cppflags), " ", "', '") }}]
    cxxflags = [{{ replace(quote(cxxflags), " ", "', '") }}]

    input = [{{ replace(quote(src_files), " ", "', '") }}]
    output = [{{ replace(quote(obj_files), " ", "', '") }}]

    sh.mkdir("-p", {{ quote(build_dir) }})

    for i, file in enumerate(input):
        cxx(*cppflags, *cxxflags, c=file, o=output[i])

# run the thing
[no-quiet]
run *args: compile
    {{ target_out }} {{ args }}

# debug the thing
[no-quiet]
debug: compile
    {{ lldb }} {{ target_out }}
```

@johannes-wolf (Member)

Hm, I have no opinion on this. But I would like to keep tools separate – the script should work without Just. I guess expecting a working Python executable on the host machine is fine, tbh.

@dybucc (Contributor, Author) commented Dec 7, 2025

> Hm, I have no opinion on this. But I would like to keep tools separate – the script should work without Just. I guess expecting a working Python executable on the host machine is fine, tbh.

Either way, there's still pending work on this PR before moving on to anything related to the web documentation. I'll see then whether I can make some changes to the justfile without making the genhtml.py script rely on it.

Following the plan in the PR this branch is part of, the function
signature parser rework is done. The result is that the signature is
now parsed in a single pass, instead of separately determining the
function span and the parameter span and then parsing the parameter
span.

The efficiency gains from the prior commit are kept, so no operations
are performed on the whole array; only those elements of the array
holding the source lines that correspond to the function are parsed.

There is, though, a small performance cost: a custom regex is built at
runtime so that the whitespace indentation of some parameters' default
values is represented accurately in the final output. Because the
project doesn't use an autoformatter, some lines are indented by more
than a single multiple of 2 (the width set in the .editorconfig file);
in that case the parser correctly recognizes that the contents of a
named argument, provided the argument is not a string or Typst content
value, should be deindented by as much whitespace as was detected at
the start of the parameter name.
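
As a rough illustration of that deindentation step (the Python below
is purely illustrative; the helper name and exact mechanics are made
up and are not the real implementation), the regex is built once per
parameter, at runtime, from the indentation detected before the
parameter name:

```python
import re

def deindent_default(default_lines, param_indent):
    # param_indent is the whitespace detected before the parameter name,
    # e.g. "    ". Build the regex at runtime and strip exactly that
    # prefix from every line of the default value.
    prefix = re.compile("^" + re.escape(param_indent))
    return [prefix.sub("", line) for line in default_lines]

# Lines indented two extra spaces get that prefix removed:
print(deindent_default(["  (a: 1,", "   b: 2)"], "  "))  # ['(a: 1,', ' b: 2)']
```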

The parameter parser now also correctly handles function default
arguments.
@johannes-wolf (Member)

I cherry-picked your commit onto master.

@dybucc (Contributor, Author) commented Jan 3, 2026

I just pushed the changes to the parser that I've been making over the last few weeks.

These include merging the function signature parser and the argument parser to extract information on the arguments within the string. I did this after realizing that the additional bits of information that would have been passed to the argument parser wouldn't really make much of a difference in reducing the number of passes over that specific string and its substrings.

I also introduced support for more Typst values, in case some future function uses an anonymous function as the default value of a named parameter. This should now work; I've tested it and it works just fine.
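
Just for illustration, this is the kind of default value that should now be recognized; the signature below is invented rather than taken from the CeTZ sources, and the classification helper is hypothetical:

```python
# Invented Typst-like signature with an anonymous function as a named
# parameter's default value; not from the CeTZ sources.
SIG = "#let draw-thing(pos, transform: (v) => v * 2, fill: none) = {"

def classify_default(value):
    # Very rough sketch of classifying a default value's Typst type,
    # only to show where an anonymous function value would fit in.
    value = value.strip()
    if "=>" in value:
        return "function"
    if value in ("none", "auto"):
        return value
    if value in ("true", "false"):
        return "boolean"
    if value.startswith('"'):
        return "string"
    return "unknown"

print(classify_default("(v) => v * 2"))  # function
print(classify_default("none"))          # none
```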

The docstring parser proper is mostly done. Unlike the above, it's not yet working seamlessly with the existing docstrings, but the only things left are to better support some edge cases such as badly documented parameters and Typst-native syntax lists.

I've also experimented with some diagnostics in the docstring parsing process, but I ended up keeping only one, as that seems like the only one that makes real sense without making too many assumptions. Once I get the unified type syntax working in both the PDF manual and the web docs, I'll look into adding some trace information to the docstring for panics that may get thrown.

The highlight of the refactored docstring parser is that it now also supports multiline parameter types, by way of a slightly larger grammar. What I've not yet added is support for multiline default parameter values, which would pair up nicely with the support for multiline default values already implemented in the function signature parser.
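
As a sketch of what the slightly larger grammar amounts to (again Python and an invented docstring; the real grammar may differ), a parameter's type list may now close on a later line, so lines are folded until the opening parenthesis of the type is balanced:

```python
import re

# Invented example of a multiline parameter type in a docstring.
LINES = [
    "/// - mark (none, string, dictionary,",
    "///   array): Mark style, or none to disable marks.",
]

def fold_param(lines):
    # Join docstring lines until the "(" opening the type annotation is
    # balanced again, then split the folded line into its parts.
    folded = lines[0]
    i = 1
    while folded.count("(") > folded.count(")") and i < len(lines):
        folded += " " + lines[i].lstrip("/").strip()
        i += 1
    m = re.match(r"/// - ([\w.-]+) \(([^)]*)\):\s*(.*)", folded)
    return {"name": m[1], "type": m[2], "desc": m[3]} if m else None

print(fold_param(LINES))
# {'name': 'mark', 'type': 'none, string, dictionary, array',
#  'desc': 'Mark style, or none to disable marks.'}
```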

To recap:

1. The docstring parser is almost done.
2. The unified type syntax is not yet done. I've not thought much about it, so ideas are very much appreciated.
3. The docstring parser is going to get two kinds of documentation:
   - Documentation for present and future contributors, so they can document CeTZ well.
   - Documentation for the parser itself, so the overhead for future maintainers of the parser is lower.

The parser is mostly done.

This commit will be amended/fixed up once the docstring parser is
completely done, so more details can be found in the accompanying PR.
@dybucc force-pushed the docstring-parser-rework branch from 454c568 to 8a2e9fb on January 4, 2026 18:31