Fix Tokenizer.prototype.tokenizeFrom string length after normalizing #1628
brandon-gong wants to merge 2 commits into brownplt:horizon
Conversation
Thanks! As you saw in the comment on the line you changed, the length property here isn't unicode-aware, for sure. I'm pretty sure I used the original length of the string because the lexer iterates character-by-character, and needs to supply source locations to tokens in such a way that […]. This is a particularly fiddly property to get right (see https://hsivonen.fi/string-length/ for an amusing example), and I mostly just punted on this when originally writing the lexer. Pyret gets this example weird, since CodeMirror doesn't handle the characters consistently with how they're output, either:
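(For context, the linked article's point can be reproduced directly in JavaScript: one visible "character" can have three different lengths depending on whether you count UTF-16 code units, code points, or grapheme clusters. This is a sketch, not Pyret's code; `Intl.Segmenter` requires Node 16+ with full ICU.)

```javascript
// A single facepalm emoji (man, medium-light skin tone), written with escapes:
// U+1F926 + U+1F3FC + U+200D (zero-width joiner) + U+2642 + U+FE0F.
const s = "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F";

console.log(s.length);       // 7 UTF-16 code units (what .length reports)
console.log([...s].length);  // 5 Unicode code points

// One grapheme cluster, i.e. what a user perceives as one character:
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...segmenter.segment(s)].length); // 1
```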


This pull request addresses #1627, in which I was getting strange bugs on a particular character.
Currently, `tokenizeFrom` normalizes the given source string to Unicode Normalization Form C, then stores the original string's length in a separate variable `this.len`. However, calling `.normalize()` on a string can change its length, so it's necessary that `this.len` reflect the length of the newly normalized string to avoid lexing errors.

This issue turns out to be pretty prevalent, and I've found a slew of characters that cause the same error in Pyret right now. Below is a small sample of them that I found with a small script, but there are a lot more (even common characters with accent marks, like é, may have this issue).
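(Independent of Pyret's lexer, the length change from `.normalize()` is easy to see in plain JavaScript, using `é` written as `e` plus a combining acute accent:)

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as "é" but occupies
// two code units; NFC composes it into the single code point U+00E9.
const decomposed = "e\u0301";
const composed = decomposed.normalize("NFC");

console.log(decomposed.length); // 2
console.log(composed.length);   // 1 -- shorter after normalization
```

So any code that normalizes a string but keeps the pre-normalization length is measuring a different string than the one it iterates over.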
In addition to fixing this issue by simply having `this.len` reflect the normalized string's length, I've also written two tests in the areas where I've found this to be an issue, namely block comments and string literals. If they're misplaced / unnecessary / not enough, I can certainly change them!
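(For readers skimming the diff, a minimal sketch of the shape of the fix; the `Tokenizer` / `tokenizeFrom` / `this.len` names follow the PR description, and this is not the actual Pyret source:)

```javascript
function Tokenizer() {}

Tokenizer.prototype.tokenizeFrom = function (str) {
  const normalized = str.normalize("NFC");
  this.str = normalized;
  // Before the fix: this.len = str.length; -- wrong whenever NFC changes the
  // number of code units. Measure the string we actually lex over instead:
  this.len = normalized.length;
  this.pos = 0;
};
```

With the old code, tokenizing `"e\u0301"` would set `this.len` to 2 while the stored, normalized string has length 1, so the position counter could run past the end of the string.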