(Originally reported as solidity-coverage issue 418)
("corrupt" might be overdramatizing this a little.)
It looks like ranges are calculated by character count rather than string length, and non-ASCII characters are 'wider' than length 1. This can introduce unexpected offset drift if you're using the parser's ranges to identify string injection points when modifying source files.
ASCII: length 36
contract A {
/// S
uint x;
}
Non-ASCII: length 37
contract A {
/// π
uint x;
}
These two contracts produce the same range data even though their lengths differ. I'm not sure this can (or should) be fixed here. A simple workaround for my case is to sanitize files before parsing.
The solidity-coverage issue that raised this involved scientific notation in a NatSpec comment.
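For anyone hitting the same thing, here is a rough sketch of the sanitize-before-parsing workaround in Node.js. The helper names (`describeLengths`, `sanitizeNonAscii`) and the `?` placeholder are made up for illustration and are not part of the parser or solidity-coverage. It assumes the one-unit difference comes from a mismatch between ways of measuring length (UTF-16 units, code points, UTF-8 bytes), which I have not confirmed is what happens inside the parser.

```javascript
// Sketch only: these helpers are hypothetical, not part of any parser API.

// Report three common ways of measuring the "length" of a source string.
function describeLengths(source) {
  return {
    utf16Units: source.length,                    // what String.prototype.slice indexes by
    codePoints: Array.from(source).length,        // one entry per Unicode code point
    utf8Bytes: Buffer.byteLength(source, "utf8"), // size when encoded as UTF-8
  };
}

// Replace every non-ASCII code point with a single ASCII placeholder, so that
// all three measures above agree for the sanitized text.
// The "u" flag makes the regex match whole code points, not surrogate halves.
function sanitizeNonAscii(source, placeholder = "?") {
  return source.replace(/[^\x00-\x7F]/gu, placeholder);
}

const asciiSource = "contract A {\n  /// S\n  uint x;\n}";
const nonAsciiSource = "contract A {\n  /// π\n  uint x;\n}";

console.log(describeLengths(asciiSource));
console.log(describeLengths(nonAsciiSource));
console.log(describeLengths(sanitizeNonAscii(nonAsciiSource)));
```

Once the text is sanitized, every character is one unit by any of these measures, so the ranges can't drift relative to the string that was parsed. The catch is that injection points then refer to the sanitized copy, so any edits have to be applied to that copy and not to the original file contents.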