Skip to content

[prism] Don't panic on invalid UTF-8 strings in heredocs#758

Merged
reese merged 1 commit intotrunkfrom
reese-invalid-utf-8-panic
Jan 5, 2026
Merged

[prism] Don't panic on invalid UTF-8 strings in heredocs#758
reese merged 1 commit intotrunkfrom
reese-invalid-utf-8-panic

Conversation

@reese
Copy link
Collaborator

@reese reese commented Jan 5, 2026

The way we determine the "common indent" for a heredoc (that is, for squiggly heredocs, the farthest-left line that determines the indentation level) is to look for lines where the unescaped and the content_loc have different leading whitespaces. We previously did this by converting them to strings and looking at their length, but this panics on invalid UTF-8 because all the *_to_str functions assume they are given valid UTF-8 strings.

Instead, this PR just counts the whitespace chars directly at the beginning of each slice, which should achieve the same thing without panicking on invalid UTF-8.

Copy link
Collaborator

@froydnj froydnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! My efforts over the weekend to format individual files to try to find the invalid UTF8 were weirdly not crashing (vs. the format-in-one-go invocation that was), and I hadn't yet debugged what I did wrong. Ideally this will just magically fix things.

Comment on lines +915 to +918
let raw_leading = raw.iter().take_while(|&&b| b == b' ' || b == b'\t').count();
let unescaped_leading = unescaped
.iter()
.take_while(|&&b| b == b' ' || b == b'\t')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that Sorbet's heredoc bits in the parser have to take into account a tab being 8 spaces. Do we have to do that same accounting here as well?

Answering my own question, I guess not, because the number of spaces and tabs in the raw and unescaped are essentially coming from the same source, so things should just match up?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think in this case since they're the same source, it doesn't really matter, and I think if you had a tab that was beyond the common indent, it would be a part of the string contents and just get rendered anyways.

@reese reese merged commit 66b27c0 into trunk Jan 5, 2026
8 checks passed
@reese reese deleted the reese-invalid-utf-8-panic branch January 5, 2026 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants