Commit 1e51703

Python: Allow escaped quotes/backslashes in raw strings
Quoting the Python documentation (last paragraph of https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences):

"Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes)."

We did not handle this correctly in the scanner: we consumed the backslash but not the single or double quote that followed it, so that quote was interpreted as the end of the string.

To fix this, we do a second lookahead after consuming the backslash, and if the next character is the end character for the string, we advance the lexer across it as well.

Similarly, backslashes in raw strings can escape other backslashes. Thus, for a string like r'\\' we must consume the second backslash, otherwise we'll interpret it as escaping the end quote.
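The rule quoted above can be checked against CPython itself. The following is a minimal sketch, not part of the commit: `is_valid_literal` and the variable names are hypothetical helpers introduced only for illustration, and the literals are assembled from pieces so that the backslashes are unambiguous.

```python
import ast

BACKSLASH = chr(92)  # a single backslash character


def is_valid_literal(source: str) -> bool:
    """Return True if CPython accepts `source` as an expression."""
    try:
        compile(source, "<literal>", "eval")
    except SyntaxError:
        return False
    return True


# Source text of r"\"" : a raw string containing an escaped double quote.
escaped_quote_src = 'r"' + BACKSLASH + '""'
# Source text of r"\" : a raw string that ends in an odd number of backslashes.
lone_backslash_src = 'r"' + BACKSLASH + '"'
# Source text of r'\\' : a raw string containing an escaped backslash.
escaped_backslash_src = "r'" + BACKSLASH * 2 + "'"

# Values of the two valid literals; the backslashes remain in the result.
escaped_quote_value = ast.literal_eval(escaped_quote_src)
escaped_backslash_value = ast.literal_eval(escaped_backslash_src)
```

Under these assumptions, the first and third literals are accepted and keep their backslashes, while the second is rejected as unterminated, which is the behaviour the scanner has to reproduce.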
1 parent 5db601a commit 1e51703

2 files changed: 25 additions, 0 deletions

python/extractor/tests/parser/strings.py

Lines changed: 9 additions & 0 deletions
@@ -77,3 +77,12 @@
     b'\xc5\xe5'
 if 35:
     f"{x=}"
+if 36:
+    r"a\"a"
+if 37:
+    r'a\'a'
+if 38:
+    r'a\\'
+if 39:
+    r'a\
+'
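As a cross-check on what these four test cases should parse to, the sketch below feeds them to CPython's own parser through the standard `ast` module. It is an illustration only: the `SOURCE` constant and the `raw_literal_values` helper are hypothetical names, and the four-space indentation of the test bodies is assumed.

```python
import ast

# The four cases added by the commit, reproduced as module source.
# The last literal deliberately spans two lines.
SOURCE = r"""
if 36:
    r"a\"a"
if 37:
    r'a\'a'
if 38:
    r'a\\'
if 39:
    r'a\
'
"""


def raw_literal_values(source: str) -> list[str]:
    """Collect the string value of each bare literal inside a top-level `if`."""
    values = []
    for statement in ast.parse(source).body:
        if not isinstance(statement, ast.If):
            continue
        for inner in statement.body:
            if isinstance(inner, ast.Expr) and isinstance(inner.value, ast.Constant):
                if isinstance(inner.value.value, str):
                    values.append(inner.value.value)
    return values


values = raw_literal_values(SOURCE)
```

Each literal is a single string constant, and in every case the backslash is retained in the value, including the one followed by a newline.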

python/extractor/tsg-python/tsp/src/scanner.cc

Lines changed: 16 additions & 0 deletions
@@ -161,6 +161,22 @@ struct Scanner {
 } else if (lexer->lookahead == '\\') {
     if (delimiter.is_raw()) {
         lexer->advance(lexer, false);
+        // In raw strings, backslashes _can_ escape the same kind of quotes as the outer
+        // string, so we must take care to traverse any such escaped quotes now. If we don't do
+        // this, we will mistakenly consider the string to end at that escaped quote.
+        // Likewise, this also extends to escaped backslashes.
+        if (lexer->lookahead == end_character || lexer->lookahead == '\\') {
+            lexer->advance(lexer, false);
+        }
+        // Newlines after backslashes also cause issues, so we explicitly step over them here.
+        if (lexer->lookahead == '\r') {
+            lexer->advance(lexer, false);
+            if (lexer->lookahead == '\n') {
+                lexer->advance(lexer, false);
+            }
+        } else if (lexer->lookahead == '\n') {
+            lexer->advance(lexer, false);
+        }
         continue;
     } else if (delimiter.is_bytes()) {
         lexer->mark_end(lexer);
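The control flow of the added branch can be modelled in a few lines of Python. This is a simplified sketch rather than the scanner itself: `raw_string_end` is a hypothetical function, it handles only single-character delimiters, and it ignores everything the real scanner does outside the backslash branch (for example, treating an unescaped newline as an error).

```python
def raw_string_end(text: str, start: int, end_character: str) -> int:
    """Return the index of the closing quote of a raw string whose body
    begins at `start`, or -1 if the string is unterminated."""
    i = start
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == "\\":
            i += 1  # consume the backslash itself
            # Second lookahead: an escaped quote or escaped backslash
            # must be traversed so it is not mistaken for the terminator.
            if i < n and text[i] in (end_character, "\\"):
                i += 1
            # Step over a newline (CRLF or LF) that follows.
            if i < n and text[i] == "\r":
                i += 1
                if i < n and text[i] == "\n":
                    i += 1
            elif i < n and text[i] == "\n":
                i += 1
            continue
        if ch == end_character:
            return i
        i += 1
    return -1
```

For the source text `r"a\"a"` with the body starting at index 2, this model reports the final quote at index 6 rather than the escaped one at index 4, and for `r"\"` it reports an unterminated string, matching the Python documentation quoted in the commit message.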
