Commit cc39ae5

Python: Fix dataset check error for string encoding
Here's an example of one of these errors:

```
INVALID_KEY predicate py_cobjectnames(@py_cobject obj, string name)
The key set {obj} does not functionally determine all fields.
Here is a pair of tuples that agree on the key set but differ at index 1:
Tuple 1 in row 63874: (72088,"u'<X>'")
Tuple 2 in row 63875: (72088,"u'<?>'")
```

(Here, the substring `X` should really be the Unicode character U+FFFD, but for some reason I'm not allowed to put that in this commit message.)

Inside the extractor, we assign IDs based on the string type (bytestring or Unicode) and a hash of the UTF-8 encoded content of the string. In this case, however, certain _different_ strings were receiving the same hash, due to replacement characters in the encoding process. In particular, we were converting unencodable characters to question marks in one place, and to U+FFFD in another place. This caused a discrepancy that led to the dataset check error.

To fix this, we put in a custom error handler that always puts the U+FFFD character in place of unencodable characters. With this, the strings now agree, and hence there is no clash.
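The effect of the built-in `'replace'` handler on encoding can be reproduced with the standard library alone. The following is a minimal sketch, not the extractor's code: `label_with_replace` and the `sketch$` prefix are hypothetical names chosen for illustration, and only the `encode(..., errors='replace')` plus SHA-1 step mirrors what the message describes.

```python
import hashlib

# Hypothetical prefix for this sketch; the extractor uses its own prefixes.
PREFIX = "sketch$"

def label_with_replace(s):
    """Hash the UTF-8 encoding of s, using the built-in 'replace' handler."""
    data = s.encode("utf8", errors="replace")
    return PREFIX + hashlib.sha1(data).hexdigest()

lone_surrogate = "\ud800"   # a lone surrogate, not encodable as UTF-8
question_mark = "?"         # an ordinary, encodable string

# When encoding, 'replace' substitutes an ASCII question mark, so both
# strings yield identical bytes and therefore identical labels.
same_bytes = (lone_surrogate.encode("utf8", errors="replace")
              == question_mark.encode("utf8"))
same_label = label_with_replace(lone_surrogate) == label_with_replace(question_mark)
print(same_bytes, same_label)
```

Any other place that renders the same unencodable character as U+FFFD will then disagree with the `?` produced here.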
1 parent d01593e commit cc39ae5

1 file changed, +18 -1 lines changed

python/extractor/semmle/python/passes/objects.py

Lines changed: 18 additions & 1 deletion
@@ -43,6 +43,23 @@
 
 LITERALS = (ast.Num, ast.Str)
 
+# A variant of the 'replace' error handler that replaces unencodable characters with U+FFFD
+# rather than '?'. Without this, a string like '\uD800' (which is not encodable) would get mapped
+# to '?', and potentially clash with the regular string '?' if it appeared elsewhere in the source
+# code. Used in 'get_label_for_object' below. Based on code from https://peps.python.org/pep-0293/
+def fffd_replace(exc):
+    if isinstance(exc, UnicodeEncodeError):
+        return ((exc.end-exc.start)*u"\\ufffd", exc.end)
+    elif isinstance(exc, UnicodeDecodeError):
+        return (u"\\ufffd", exc.end)
+    elif isinstance(exc, UnicodeTranslateError):
+        return ((exc.end-exc.start)*u"\\ufffd", exc.end)
+    else:
+        raise TypeError("can't handle %s" % exc.__name__)
+
+import codecs
+codecs.register_error("fffdreplace", fffd_replace)
+
 class _CObject(object):
     '''Utility class to wrap arbitrary C objects.
     Treat all objects as unique. Rely on naming in the
@@ -239,7 +256,7 @@ def get_label_for_object(self, obj, default_label, obj_type):
             else:
                 prefix = u"C_bytes$"
             if t is str:
-                obj = obj.encode("utf8", errors='replace')
+                obj = obj.encode("utf8", errors='fffdreplace')
                 return prefix + hashlib.sha1(obj).hexdigest()
             if t is bytes:
                 return prefix + hashlib.sha1(obj).hexdigest()
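
To see how a registered error handler changes the resulting label, here is a small standalone sketch of the same mechanism (`codecs.register_error` plus `str.encode(..., errors=...)`). It is not the committed handler: `fffd_bytes_replace`, the handler name `fffd_bytes_sketch`, `label_for_str`, and the `sketch$` prefix are all hypothetical, and this version returns the UTF-8 bytes of U+FFFD directly so the codec copies them into the output without re-encoding the replacement.

```python
import codecs
import hashlib

# UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER): b'\xef\xbf\xbd'
FFFD_UTF8 = "\ufffd".encode("utf8")

def fffd_bytes_replace(exc):
    """Hypothetical handler: emit U+FFFD once per unencodable character.

    It returns UTF-8 bytes, so it is only meaningful with the UTF-8 codec.
    """
    if isinstance(exc, UnicodeEncodeError):
        return (FFFD_UTF8 * (exc.end - exc.start), exc.end)
    raise TypeError("can't handle %s" % type(exc).__name__)

# Register under a name that is an assumption of this sketch.
codecs.register_error("fffd_bytes_sketch", fffd_bytes_replace)

def label_for_str(s):
    """Hash the UTF-8 encoding of s, using the handler registered above."""
    data = s.encode("utf8", errors="fffd_bytes_sketch")
    return "sketch$" + hashlib.sha1(data).hexdigest()

# The lone surrogate no longer collapses to '?', so the labels differ.
distinct = label_for_str("\ud800") != label_for_str("?")
print(distinct)
```

With a handler like this, an unencodable string and the literal string `?` hash to different labels, which is the property the commit message is after.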
