jneen commented Jan 6, 2026

Extracted from #16455

Rationale

Currently, the only way to move the scan head forward by a specific number of characters (without using a dynamically compiled Regex) is the offset setter, as in scanner.offset += n. Unfortunately this is a major performance footgun: in the presence of multi-byte characters it runs in linear time, because it scans from the beginning of the string twice, once to read the offset and once to set it.
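
As a rough illustration of why the setter has to scan, here is a minimal sketch using only the existing API. The example string is made up for illustration; the point is that character indices and byte indices diverge as soon as a multi-byte character appears, so each conversion has to walk the string.

```crystal
require "string_scanner"

# "\u00E9" (e with acute accent) takes 2 bytes in UTF-8, so the first
# 5 characters of this string occupy 10 bytes.
s = StringScanner.new("\u00E9" * 5 + "abc")

# Reading `offset` converts the internal byte offset to a character index;
# assigning it converts the character index back to a byte offset.
s.offset += 5

s.offset # => 5
s.rest   # => "abc"

# The two conversions involved, called directly on the String:
str = s.string
str.char_index_to_byte_index(5)  # => 10
str.byte_index_to_char_index(10) # => 5
```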

Feature

This PR implements #scan(Int), #skip(Int), and #check(Int), all of which try to move the scan head forward by the given number of characters and raise an error on negative input. If there are not enough characters left in the string, this is treated as a match failure and the scan head is not moved.
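
A short usage sketch of the overloads described above. It assumes a Crystal build that includes this PR, and it assumes that #check(Int) only looks ahead without consuming, by analogy with #check(Regex). Return values are deliberately not shown, since they are not spelled out in this description.

```crystal
require "string_scanner"

# Assumption: the Int overloads from this PR are available.
s = StringScanner.new("h\u00E9llo") # 5 characters, 6 bytes

s.check(2) # assumed to look ahead 2 characters without consuming them
s.scan(2)  # consumes 2 characters
s.skip(2)  # consumes 2 more characters
s.offset   # the scan head is now at character index 4

# Only 1 character remains, so asking for 3 is a match failure
# and the scan head stays where it is.
s.scan(3)
s.rest # still "o"
```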

Notes

The full scans of the string come from String#byte_index_to_char_index and its inverse String#char_index_to_byte_index, which use Char::Reader to scan from the beginning of the string. This is fine for small strings, but when these methods are called in a hot loop, as StringScanner-using code tends to do, the linear cost can make parsers quadratic, and only when multi-byte characters are present, which can be quite surprising and difficult to track down. This is probably unreasonable to fix in #offset, or at least in #offset=, directly, since without any other context their only option is to scan from the beginning of the string. In general we should encourage users not to use these methods in tight loops.
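
To make the hot-loop cost concrete, the following is a hedged timing sketch rather than a rigorous benchmark. The helper name walk_with_offset and the string sizes are made up for illustration; actual timings depend on the machine and build flags, so no output is shown.

```crystal
require "string_scanner"

# Hypothetical helper: advance one character at a time through the
# offset setter and report how long the whole walk takes.
def walk_with_offset(text : String) : Time::Span
  scanner = StringScanner.new(text)
  Time.measure do
    until scanner.eos?
      scanner.offset += 1
    end
  end
end

ascii = "a" * 5_000      # single-byte characters only
multi = "\u00E9" * 5_000 # every character is 2 bytes

puts "ASCII:      #{walk_with_offset(ascii)}"
puts "multi-byte: #{walk_with_offset(multi)}"
```

Increasing the string length should make the gap between the two cases grow, since each step on the multi-byte string walks from the start of the string.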

As an alternative, this PR's #scan(Int) and friends address the issue by using Char::Reader directly, starting from the current byte offset instead of the beginning of the string (see #lookahead_byte_length(Int)). This relies on the byte offset always lying on a valid character boundary, but the publicly available methods all seem to guarantee this anyway (presuming we trust PCRE). #lookbehind_byte_length is currently unused, but is included here to support #rewind, #peek_behind, etc. in the future.
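
The idea of reading forward from the current byte offset can be sketched as a standalone function. This is not the PR's implementation: the name lookahead_byte_length_sketch, its signature, and the choice of ArgumentError for negative input are all assumptions made for illustration.

```crystal
# Hypothetical sketch: return the number of bytes occupied by the next `n`
# characters of `str`, starting at `byte_offset`, or nil if fewer than `n`
# characters remain. `byte_offset` is assumed to lie on a character boundary.
def lookahead_byte_length_sketch(str : String, byte_offset : Int32, n : Int32) : Int32?
  # Assumption: the exception type used for negative input.
  raise ArgumentError.new("negative character count") if n < 0

  # Start decoding at the given byte position, not at the start of the string.
  reader = Char::Reader.new(str, byte_offset)
  n.times do
    return nil unless reader.has_next? # ran out of characters: match failure
    reader.next_char                   # step over one character
  end
  reader.pos - byte_offset
end

str = "h\u00E9llo" # 5 characters, 6 bytes

lookahead_byte_length_sketch(str, 0, 2) # => 3
lookahead_byte_length_sketch(str, 1, 4) # => 5
lookahead_byte_length_sketch(str, 1, 5) # => nil

# The byte length can then be used to slice without any char/byte conversion.
str.byte_slice(0, 3) # => "hé"
```

The work done is proportional to n rather than to the distance from the start of the string, which is what removes the quadratic behavior in scanning loops.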

This also excludes spec/std/string_scanner_spec.cr from typo checking, since it references fragments of example strings that produce many false positives. Somewhat ironically, I have also included a grammatical fix in the same file.

The documentation includes a warning on #offset= about the performance issue, and a somewhat ugly workaround for #16499 in order to link specifically to the Int overload of #skip, which would be the preferred alternative. Removing this workaround would cause the documentation to link to the Regex overload, which I think would be confusing. The workaround can be easily removed once the docs support linking to specific overloads.

I have also removed a line from the documentation for #skip(Regex) that was incorrect and appeared to have been copy-pasted from #skip_until or similar.

jneen force-pushed the feature.string-scanner-int branch 4 times, most recently from e174731 to 79c505e, January 6, 2026 17:05
jneen force-pushed the feature.string-scanner-int branch 3 times, most recently from 4284abe to c1094e0, January 7, 2026 17:10
jneen force-pushed the feature.string-scanner-int branch from c1094e0 to e68381b, January 12, 2026 14:42

jneen commented Jan 12, 2026

Apologies for the force-push. I applied a more commit-gardening style of development to this, so I've been trying to keep up to date with master. I'll use draft PRs in the future for early feedback. The only change in the force-push is a rebase; it is the same changeset.

straight-shoota changed the title from "Feature: Add StringScanner#scan, #check, and #skip overloads for Int" to "Add StringScanner#scan, #check, and #skip overloads for Int" on Jan 20, 2026
straight-shoota merged commit c7eacb7 into crystal-lang:master on Jan 22, 2026
46 of 47 checks passed