-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Add StringScanner#scan, #check, and #skip overloads for Int
#16557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
straight-shoota
merged 7 commits into
crystal-lang:master
from
jneen:feature.string-scanner-int
Jan 22, 2026
Merged
Add StringScanner#scan, #check, and #skip overloads for Int
#16557
straight-shoota
merged 7 commits into
crystal-lang:master
from
jneen:feature.string-scanner-int
Jan 22, 2026
+228
−22
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
HertzDevil
reviewed
Jan 6, 2026
HertzDevil
reviewed
Jan 6, 2026
HertzDevil
reviewed
Jan 6, 2026
HertzDevil
reviewed
Jan 6, 2026
e174731 to
79c505e
Compare
HertzDevil
reviewed
Jan 6, 2026
4284abe to
c1094e0
Compare
ysbaddaden
approved these changes
Jan 12, 2026
c1094e0 to
e68381b
Compare
Contributor
Author
|
Apologies for the force-push, I think I applied a more commit-gardening kind of development style to this, so I've been trying to keep up-to-date with master. I'll use draft PRs in the future for early feedback. The only change in the force-push is a rebase - it is the same changeset. |
StringScanner#scan, #check, and #skip overloads for Int
straight-shoota
approved these changes
Jan 20, 2026
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Extracted from #16455
Rationale
Currently, the only way to move the scan head a specific number of characters forward (without using a dynamically compiled Regex) is to use the
offsetsetter, as inscanner.offset += n. This unfortunately has a major performance footgun, in that it can execute in linear time - causing two scans from the beginning of the string - in the presence of multi-byte characters.Feature
This PR implements
#scan(Int),#skip(Int), and#check(Int)which all try to move the scan head forward by the given number of characters, erroring on negative input. If there are not sufficient characters left in the string, this is treated as a match failure and the scan head is not moved.Notes
The full scans of the string came from
String#byte_index_to_char_indexand its inverseString#char_index_to_byte_index, which useChar::Readerto scan from the beginning of the string. This implementation is fine for small strings, but when called in a hot loop, as StringScanner-using code tends to be, this linearity can cause parsers to become quadratic only when multi-byte characters are present, which can be quite surprising and difficult to track down. This is probably unreasonable to fix in#offsetor at least#offset=directly, since without any other context their only option is to scan from the beginning of the string. In general we should encourage users not to use these methods in tight loops.As an alternative, this PR's
#scan(Int)and friends address the issue by usingChar::Readerdirectly, starting from the current byte offset instead of the beginning of the string (see#lookahead_byte_length(Int)). This requires trust that the byte offset is a valid character index, but the publicly available methods seem to all guarantee this anyways (presuming we trust PCRE). Currently#lookbehind_byte_lengthis included but unused, but is included here to support#rewind,#peek_behind, etc in the future.This also excludes
spec/std/string_scanner_spec.crfrom typo checking, since it tends to reference bits and pieces of example strings that are full of false positives - though I have somewhat ironically also included a grammatical fix in the same file.The documentation includes a warning on
#offset=about the performance issue, and a somewhat ugly workaround to #16499 in order to link specifically to theIntoverload to#skip, which would be the preferred alternative. Removing this workaround would cause the documentation to link to the Regex overload, which I think would be confusing. The workaround can be easily removed once linking to specific overloads is supported by the docs.I have also removed a line from the documentation for
#skip(Regex)that was incorrect and appeared to have been copy-pasted from#skip_untilor similar.