
Commit 5164a72

fix: do not make the encoded document larger than expected
The Unicode replacement character (U+FFFD) becomes 3 bytes in UTF-8 (0xEF 0xBF 0xBD). Replacing \0 with this character makes the encoded string two bytes longer for each replaced character, so the encoded document can end up longer than the maximum document size. Use the ASCII substitute character (0x1A) instead: it is only 1 byte long in UTF-8, so it does not make the encoded document grow.
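The byte counts involved can be checked directly in Python. The snippet below is a minimal sketch and is not part of the commit; the constant names are chosen here for illustration only.

```python
# Compare how the three characters involved encode in UTF-8.
NUL = "\0"              # the byte the API rejects
REPLACEMENT = "\uFFFD"  # Unicode replacement character
SUBSTITUTE = "\x1a"     # ASCII substitute character (SUB)

nul_bytes = NUL.encode("utf-8")
replacement_bytes = REPLACEMENT.encode("utf-8")
substitute_bytes = SUBSTITUTE.encode("utf-8")

# Print each encoded form together with its length in bytes.
for name, encoded in [
    ("NUL", nul_bytes),
    ("U+FFFD", replacement_bytes),
    ("SUB", substitute_bytes),
]:
    print(f"{name}: {encoded!r} ({len(encoded)} byte(s))")
```

Only the substitute character has the same encoded length as the `\0` it replaces, which is why the swap keeps the document size unchanged.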
1 parent da5d95d commit 5164a72

4 files changed: 14 additions and 7 deletions

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+### Fixed
+
+- Do not make documents longer when preparing them to be sent to the API.

pygitguardian/client.py

Lines changed: 4 additions & 4 deletions
@@ -312,8 +312,8 @@ def content_scan(
         """
         content_scan handles the /scan endpoint of the API.
 
-        If document contains `0` bytes, they will be replaced with the Unicode
-        replacement character.
+        If document contains `0` bytes, they will be replaced with the ASCII substitute
+        character.
 
         :param filename: name of file, example: "intro.py"
         :param document: content of file
@@ -355,8 +355,8 @@ def multi_content_scan(
         """
         multi_content_scan handles the /multiscan endpoint of the API.
 
-        If documents contain `0` bytes, they will be replaced with the Unicode
-        replacement character.
+        If documents contain `0` bytes, they will be replaced with the ASCII substitute
+        character.
 
         :param documents: List of dictionaries containing the keys document
         and, optionally, filename.

pygitguardian/models.py

Lines changed: 6 additions & 2 deletions
@@ -94,8 +94,12 @@ def validate_size(document: Dict[str, Any], maximum_size: int) -> None:
     @post_load
     def replace_0_bytes(self, in_data: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]:
         doc = in_data["document"]
-        # Our API does not accept 0 bytes in documents, so replace them with the replacement character
-        in_data["document"] = doc.replace("\0", "\uFFFD")
+        # Our API does not accept 0 bytes in documents, so replace them with
+        # the ASCII substitute character.
+        # We no longer use the Unicode replacement character (U+FFFD) because
+        # it makes the encoded string two bytes longer per replaced character,
+        # making it possible to hit the maximum size limit.
+        in_data["document"] = doc.replace("\0", "\x1a")
         return in_data
 
     @post_load
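
To see how the old replacement could push a document past a size limit, here is a rough standalone sketch. It does not use the marshmallow schema from `pygitguardian/models.py`; the helper names and `HYPOTHETICAL_MAX_SIZE` are assumptions made for illustration, and the real limit and the body of `validate_size` are not shown in this commit.

```python
NUL = "\0"
REPLACEMENT = "\uFFFD"  # character used before this commit
SUBSTITUTE = "\x1a"     # character used after this commit

# Hypothetical limit, picked only to keep the example small.
HYPOTHETICAL_MAX_SIZE = 16


def utf8_size(text: str) -> int:
    """Return the size of text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def replace_0_bytes_old(text: str) -> str:
    """Old behaviour: swap each NUL for the Unicode replacement character."""
    return text.replace(NUL, REPLACEMENT)


def replace_0_bytes_new(text: str) -> str:
    """New behaviour: swap each NUL for the ASCII substitute character."""
    return text.replace(NUL, SUBSTITUTE)


# A document that sits exactly at the hypothetical limit: 8 letters + 8 NULs.
document = "a\0" * 8

original_size = utf8_size(document)
old_size = utf8_size(replace_0_bytes_old(document))
new_size = utf8_size(replace_0_bytes_new(document))

print(f"original: {original_size} bytes")
print(f"old replacement: {old_size} bytes, over limit: {old_size > HYPOTHETICAL_MAX_SIZE}")
print(f"new replacement: {new_size} bytes, over limit: {new_size > HYPOTHETICAL_MAX_SIZE}")
```

With the old replacement each `\0` adds two bytes, so a document that passed a size check before the substitution could fail it afterwards; with the new one the size is unchanged.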

tests/test_models.py

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ def test_document_handle_0_bytes(self):
         document = Document.SCHEMA.load(
             {"filename": "name", "document": "hello\0world"}
         )
-        assert document["document"] == "hello\uFFFDworld"
+        assert document["document"] == "hello\x1aworld"
 
     def test_document_handle_surrogates(self):
         document = Document.SCHEMA.load(
