
Conversation


@Rahban1 (Contributor) commented Jun 26, 2025

I have updated the tokenizer and improved the custom trimmer. This aims to close #1457 and close #2114.

@mortenpi mortenpi added Type: Enhancement Format: HTML Related to the default HTML output labels Jun 29, 2025

Rahban1 commented Jul 16, 2025

These are the benchmark results after these changes in the tokenizer:
[Screenshot: benchmark results, 2025-07-16]

@fingolfin (Collaborator) left a comment

Thank you for working on this!

As it is, while the changes here look reasonable, I don't feel I can evaluate them just from looking at the code.

So I wonder how we could go about testing this (and other search improvements) somewhat systematically? It might help if we agreed on a large enough test document (the Julia documentation?) and then collected a set of search queries whose results we compare before and after any tweaks. (Of course one could also compare additional queries, but those should then be added to the collection if they add a meaningful new perspective.)

My point here is: it is all too easy to run into a situation where search gives a bad result; make a change that improves that case; but never realize that it makes 5 other cases worse. Of course ultimately there is probably no way to be sure of that, but if we at least had a bunch of standard queries we always consider, that might help.

And of course if we agree to test on e.g. the Julia documentation, that means to evaluate this PR every reviewer has to build the Julia docs with this PR, which is a non-trivial amount of work. If we had an easy way to do that automatically for a PR, and put the result somewhere everyone can see it, that would make it tremendously easier.

(I'm not asking you to do that, by the way, just thinking out loud; perhaps others already have better answers to this than me, in which case I look forward to learning. :-)
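The "standing queries" idea above could be sketched as a tiny regression harness. Everything here is hypothetical (the names `standingQueries`, `runQuery`, `checkRegressions`, and the toy `Map` standing in for Documenter's real search index); it only illustrates the shape of such a check, not actual Documenter code:

```javascript
// Fixed collection of standing queries and the doc ids we expect each to hit.
const standingQueries = [
  { query: "__init__", expectIds: [1] },
  { query: "&&", expectIds: [2] },
];

// Toy index mapping a token to matching doc ids; a real harness would query
// the actual search index built from e.g. the Julia documentation.
const searchIndex = new Map([
  ["__init__", [1]],
  ["&&", [2]],
]);

function runQuery(q) {
  return searchIndex.get(q) ?? [];
}

// Re-run every standing query and report the ones whose expected hits vanished.
function checkRegressions() {
  const failures = [];
  for (const { query, expectIds } of standingQueries) {
    const got = runQuery(query);
    if (!expectIds.every((id) => got.includes(id))) failures.push(query);
  }
  return failures;
}

console.log(checkRegressions()); // empty array when no standing query regressed
```

Running this after every tokenizer tweak would catch the "fixed one case, broke five others" scenario described above.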

const tokens = [];
let remaining = string;

// julia specific patterns
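The excerpt above begins a pattern-first tokenizer loop. A minimal self-contained sketch of that approach (the patterns below are illustrative guesses, not the PR's actual ones) could look like this: Julia-specific tokens such as dunder names and operators are peeled off the front of the string before generic word splitting.

```javascript
// Ordered patterns, tried against the start of the remaining input.
// These specific regexes are hypothetical examples of "julia specific patterns".
const juliaPatterns = [
  /^__\w+__/,              // dunder identifiers like __init__
  /^(?:&&|\|\||\.\.\.)/,   // multi-character operators: &&, ||, ...
  /^[\^&|!~=<>+\-*\/]+/,   // runs of other operator characters, e.g. ^
  /^\w+/,                  // plain words as a fallback
];

function tokenize(string) {
  const tokens = [];
  let remaining = string;
  while (remaining.length > 0) {
    remaining = remaining.replace(/^\s+/, ""); // skip leading whitespace
    if (remaining.length === 0) break;
    let matched = false;
    for (const pat of juliaPatterns) {
      const m = remaining.match(pat);
      if (m) {
        tokens.push(m[0]);
        remaining = remaining.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) remaining = remaining.slice(1); // drop an unmatchable char
  }
  return tokens;
}

console.log(tokenize("x && __init__")); // ["x", "&&", "__init__"]
```

The key property is that operator-only queries survive tokenization instead of being stripped as punctuation, which is what lets search find entries like `&&` or `^`.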
Collaborator


Huh, it seems a lot of code is duplicated here (and hence might get out of sync)? I wonder if there is a way we could avoid that, e.g. by putting the shared code into a separate file that is included in both places?

(Of course this goes way beyond the scope of this PR, i.e., I am not asking you to take care of this here; I just wanted to put the thought out there.)
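The shared-file idea could be sketched as follows (all names hypothetical): keep a single `tokenize()` and have both the index-building side and the query side use it, so the two code paths cannot drift apart.

```javascript
// --- hypothetical tokenizer_shared.js: single source of truth ---
function tokenize(string) {
  return string.split(/\s+/).filter((t) => t.length > 0);
}

// --- index-building side: uses the shared tokenize() ---
function buildIndex(docs) {
  const index = new Map();
  docs.forEach((doc, id) => {
    for (const tok of tokenize(doc)) {
      if (!index.has(tok)) index.set(tok, []);
      index.get(tok).push(id);
    }
  });
  return index;
}

// --- query side: same tokenize(), so index and query tokens always agree ---
function search(index, query) {
  return tokenize(query).flatMap((tok) => index.get(tok) ?? []);
}

const idx = buildIndex(["foo bar", "bar baz"]);
console.log(search(idx, "bar")); // [0, 1]
```

With a shared module, a tokenizer fix made for indexing automatically applies to query parsing too, which is exactly the out-of-sync risk the duplication raises.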

Co-authored-by: Max Horn <max@quendi.de>

Rahban1 commented Oct 25, 2025

@fingolfin Thank you for reviewing this. You make a great point about testing the updates, and that is why I created the benchmarks in #2740 to test these changes.

These are the benchmark results after these changes in the tokenizer: [Screenshot: benchmark results, 2025-07-16]

This picture is from those benchmarks only; they also run in the CI/CD pipeline to check whether any new change causes a regression on these tests.

I also added more test cases in #2757.

@fingolfin
Collaborator

@Rahban1 ahhh thank you for your explanation, I wasn't aware of the benchmarking setup, that's fantastic!

So, does that mean this PR makes it so that search in the Julia manual, when built with your patch, will be able to find entries for ^ or && or ...?


Rahban1 commented Oct 27, 2025

Yessss 😄

CHANGELOG.md Outdated

## Version [v1.13.0] - 2025-06-19

### Changed
Collaborator


I think the Changed section should be after the "Added" section.

But in any case, this is in the wrong place here; we are already beyond 1.15.0 :-)

Member


I took the liberty of moving this to the right place.

@fingolfin fingolfin added the Status: Approved Should be OK to merge label Oct 29, 2025
Removed duplicate entry for search tokenizer improvement in version v1.13.0.
@fingolfin fingolfin merged commit f808909 into JuliaDocs:master Oct 31, 2025
26 checks passed


Development

Successfully merging this pull request may close these issues.

Search can't find e.g. __init__
Search often gives insufficient results
