Improving the tokenizer #2744
Conversation
fingolfin left a comment:
Thank you for working on this!
As it is, while the changes here look reasonable, I don't feel I can evaluate them just by looking at the code.
So I wonder how we could go about testing this (and other search improvements) somewhat systematically? It might help if we agreed on a large enough test document (the Julia documentation?) and then collected a bunch of search queries to be compared with any tweaks. (Of course one could also compare additional things, but those should then be added to the collection if they add a meaningful new perspective.)
My point here is: it is all too easy to run into a situation where search gives a bad result; make a change that improves that case; but never realize that it makes 5 other cases worse. Of course ultimately there is probably no way to be sure of that, but if we at least had a bunch of standard queries we always consider, that might help.
And of course if we agree to test on e.g. the Julia documentation, that means that to evaluate this PR every reviewer has to build the Julia docs with it, which is a non-trivial amount of work. If we had an easy way to do that automatically for a PR, and to put the result somewhere everyone can see it, that would make it tremendously easier.
(This is not asking you to do that, by the way, just thinking out loud; perhaps others already have better answers to this than I do, in which case I look forward to learning :-)
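As a rough sketch of what such a standing query collection could look like (the query list, the `report` helper, and the shape of the index object are invented for illustration, not part of this PR or of Documenter's actual search code):

```js
// compare_queries.js -- hypothetical helper for comparing search results
// before and after a tokenizer change against the same test document.
const QUERIES = [
  "sort!",        // mutating function with a trailing bang
  "@kwdef",       // macro name
  "Base.Threads", // module-qualified identifier
  "searchsorted", // plain identifier, as a control case
];

// `index` is assumed to be any search index exposing a search(query) method
// that returns ranked hits; the exact result shape depends on the library.
function report(index, label) {
  console.log(`== ${label} ==`);
  for (const query of QUERIES) {
    const top = index.search(query).slice(0, 5);
    console.log(query, "->", top.length ? top : "(no results)");
  }
}
```

Running the same report against the manual built with and without a patch would give reviewers a concrete before/after diff instead of anecdotal impressions.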
```js
const tokens = [];
let remaining = string;

// julia specific patterns
```
Huh, it seems a lot of code is duplicated here (and hence might get out of sync)? I wonder if there is a way we could avoid that, e.g. by putting the shared code into a separate file that is included in both places?
(Of course this goes way beyond the scope of this PR, i.e. I am not asking you to take care of it here, I just wanted to put the thought out there.)
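To make that thought concrete (the file name and the export shape below are invented for illustration; this is not what the PR currently does):

```js
// search_tokenizer_common.js -- hypothetical shared module, so the tokenizer
// rules live in exactly one place and the two copies cannot drift apart.
function juliaTokenize(string) {
  const tokens = [];
  let remaining = String(string);
  // ... the julia specific patterns from this PR would go here ...
  for (const word of remaining.split(/\s+/).filter(Boolean)) {
    tokens.push(word.toLowerCase());
  }
  return tokens;
}

// Node-based index generation could require() this file, while the
// client-side search script includes or imports the same source.
if (typeof module !== "undefined") {
  module.exports = { juliaTokenize };
}
```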
Co-authored-by: Max Horn <max@quendi.de>
@fingolfin Thank you for reviewing this. You have a great point about testing the updates, which is why I created the benchmarks in #2740 to test these changes.
The attached picture is from those benchmarks; they also run in the CI/CD pipeline to check whether any new change causes a regression, and I added more test cases in #2757.
@Rahban1 Ahhh, thank you for the explanation, I wasn't aware of the benchmarking setup, that's fantastic! So does that mean this PR makes it so that search in the Julia manual, when built with your patch, will be able to find entries for …?
Yessss 😄
CHANGELOG.md (outdated)

```markdown
## Version [v1.13.0] - 2025-06-19

### Changed
```
I think the Changed section should be after the "Added" section.
But in any case, this is in the wrong place here, we are already beyond 1.15.0 :-)
I took the liberty of moving this to the right place.
Removed duplicate entry for search tokenizer improvement in version v1.13.0.


I have updated the tokenizer and improved the custom trimmer. This aims to close #1457 and #2114.
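For readers who have not touched this code: the trimmer is the pipeline step that strips stray punctuation from each token before it is indexed. A minimal sketch of a Julia-friendly trimmer, with an invented character class rather than the exact one used in this patch:

```js
// A trimmer that keeps characters meaningful in Julia identifiers
// (!, @, ., _ and unicode letters/digits) instead of stripping all punctuation.
function juliaTrimmer(token) {
  return token
    .replace(/^[^\p{L}\p{N}_@!.]+/u, "")  // strip leading junk such as quotes or parentheses
    .replace(/[^\p{L}\p{N}_@!.]+$/u, ""); // strip trailing junk, but keep ! . @ _
}

// e.g. juliaTrimmer('"sort!",') === "sort!"
//      juliaTrimmer("(@kwdef)") === "@kwdef"
```

The motivation is that characters like `!` and `@` carry meaning in Julia names (`sort!`, `@kwdef`), so a generic trimmer that drops all punctuation makes such names hard or impossible to find.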