Improving the tokenizer #2744
Conversation
fingolfin left a comment:
Thank you for working on this!
As it is, while the changes here look reasonable, I don't feel I can evaluate them just by looking at the code.
So I wonder how we could go about testing this (and other search improvements) somewhat systematically? It might help if we agreed on a large enough test document (the Julia documentation?) and then collected a bunch of search queries to be compared with any tweaks. (Of course one could also compare additional things, but those should then be added to the collection if they add a meaningful new perspective.)
My point here is: it is all too easy to run into a situation where search gives a bad result; make a change that improves that case; but never realize that it makes 5 other cases worse. Of course ultimately there is probably no way to be sure of that, but if we at least had a bunch of standard queries we always consider, that might help.
And of course if we agree to test on e.g. the Julia documentation, that means that to evaluate this PR every reviewer has to build the Julia docs with it, which is a non-trivial amount of work. If we had an easy way to do that automatically for a PR, and to put the result somewhere everyone can see it, that would make it tremendously easier.
(This is not asking you to do that, by the way, just thinking out loud; perhaps others already have better answers to this than I do, in which case I look forward to learning :-)
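As a rough sketch of what such a standing query collection could look like (the query list, the `report` helper, and the shape of the index object are invented for illustration, not part of this PR or of Documenter's actual search code):

```js
// compare_queries.js -- hypothetical helper for comparing search results
// before and after a tokenizer change against the same test document.
const QUERIES = [
  "sort!",        // mutating function with a trailing bang
  "@kwdef",       // macro name
  "Base.Threads", // module-qualified identifier
  "searchsorted", // plain identifier, as a control case
];

// `index` is assumed to be any search index exposing a search(query) method
// that returns ranked hits; the exact result shape depends on the library.
function report(index, label) {
  console.log(`== ${label} ==`);
  for (const query of QUERIES) {
    const top = index.search(query).slice(0, 5);
    console.log(query, "->", top.length ? top : "(no results)");
  }
}
```

Running the same report against the manual built with and without a patch would give reviewers a concrete before/after diff instead of anecdotal impressions.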
```js
const tokens = [];
let remaining = string;

// julia specific patterns
```
Huh, it seems a lot of code is duplicated here (and hence might get out of sync)? I wonder if there is a way we could avoid that, e.g. by putting the shared code into a separate file that is included in both places?
(Of course this goes way beyond the scope of this PR, i.e. I am not asking you to take care of it here, I just wanted to put the thought out there.)
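To make that thought concrete (the file name and the export shape below are invented for illustration; this is not what the PR currently does):

```js
// search_tokenizer_common.js -- hypothetical shared module, so the tokenizer
// rules live in exactly one place and the two copies cannot drift apart.
function juliaTokenize(string) {
  const tokens = [];
  let remaining = String(string);
  // ... the julia specific patterns from this PR would go here ...
  for (const word of remaining.split(/\s+/).filter(Boolean)) {
    tokens.push(word.toLowerCase());
  }
  return tokens;
}

// Node-based index generation could require() this file, while the
// client-side search script includes or imports the same source.
if (typeof module !== "undefined") {
  module.exports = { juliaTokenize };
}
```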
Co-authored-by: Max Horn <max@quendi.de>
@fingolfin Thank you for reviewing this. You have a great point about testing the updates, which is why I created the benchmarks in #2740 to test these changes.
The attached picture is from those benchmarks; they also run in the CI/CD pipeline to check whether any new change causes a regression, and I added more test cases in #2757.
@Rahban1 Ahhh, thank you for the explanation, I wasn't aware of the benchmarking setup, that's fantastic! So does that mean this PR makes it so that search in the Julia manual, when built with your patch, will be able to find entries for …?
Yessss 😄
CHANGELOG.md (outdated)

```markdown
## Version [v1.13.0] - 2025-06-19

### Changed
```
I think the Changed section should be after the "Added" section.
But in any case, this is in the wrong place here, we are already beyond 1.15.0 :-)
I took the liberty of moving this to the right place.
Removed duplicate entry for search tokenizer improvement in version v1.13.0.


I have updated the tokenizer and improved the custom trimmer. This aims to close #1457 and #2114.
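For readers who have not touched this code: the trimmer is the pipeline step that strips stray punctuation from each token before it is indexed. A minimal sketch of a Julia-friendly trimmer, with an invented character class rather than the exact one used in this patch:

```js
// A trimmer that keeps characters meaningful in Julia identifiers
// (!, @, ., _ and unicode letters/digits) instead of stripping all punctuation.
function juliaTrimmer(token) {
  return token
    .replace(/^[^\p{L}\p{N}_@!.]+/u, "")  // strip leading junk such as quotes or parentheses
    .replace(/[^\p{L}\p{N}_@!.]+$/u, ""); // strip trailing junk, but keep ! . @ _
}

// e.g. juliaTrimmer('"sort!",') === "sort!"
//      juliaTrimmer("(@kwdef)") === "@kwdef"
```

The motivation is that characters like `!` and `@` carry meaning in Julia names (`sort!`, `@kwdef`), so a generic trimmer that drops all punctuation makes such names hard or impossible to find.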