Skip to content

Commit 345c9a5

Browse files
frankslinclaude
andauthored
Fix two out-of-bounds read issues when handling truncated UTF-8 input (#1005)
Two independent out-of-bounds read issues were identified in OpenCC's UTF-8 processing logic when handling malformed or truncated UTF-8 sequences. 1) MaxMatchSegmentation: NextCharLength() could return a value larger than the remaining input size. The previous logic subtracted this value from a size_t length counter, potentially causing underflow and subsequent out-of-bounds reads. 2) Conversion: Similar length handling could allow reads past the end of the input buffer during dictionary matching, potentially propagating unintended bytes to the conversion output. This patch fixes both issues by: - Explicitly tracking the end of the input buffer - Recomputing remaining length on each iteration - Clamping matched character and key lengths to the remaining buffer size - Preventing reads past the null terminator The changes preserve existing behavior for valid UTF-8 input and add test coverage for truncated UTF-8 sequences. These issues may have security implications when processing untrusted input and are classified as heap out-of-bounds reads (CWE-125). Co-authored-by: Claude <noreply@anthropic.com>
1 parent 0277545 commit 345c9a5

File tree

4 files changed

+110
-4
lines changed

4 files changed

+110
-4
lines changed

src/Conversion.cpp

Lines changed: 17 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,14 +23,30 @@ using namespace opencc;
2323

2424
std::string Conversion::Convert(const char* phrase) const {
2525
std::ostringstream buffer;
26+
// Calculate string end to prevent reading beyond null terminator
27+
const char* phraseEnd = phrase;
28+
while (*phraseEnd != '\0') {
29+
phraseEnd++;
30+
}
31+
2632
for (const char* pstr = phrase; *pstr != '\0';) {
27-
Optional<const DictEntry*> matched = dict->MatchPrefix(pstr);
33+
size_t remainingLength = phraseEnd - pstr;
34+
Optional<const DictEntry*> matched = dict->MatchPrefix(pstr, remainingLength);
2835
size_t matchedLength;
2936
if (matched.IsNull()) {
3037
matchedLength = UTF8Util::NextCharLength(pstr);
38+
// Ensure we don't read beyond the null terminator
39+
if (matchedLength > remainingLength) {
40+
matchedLength = remainingLength;
41+
}
3142
buffer << UTF8Util::FromSubstr(pstr, matchedLength);
3243
} else {
3344
matchedLength = matched.Get()->KeyLength();
45+
// Defensive: ensure dictionary key length does not exceed remaining input
46+
// (MatchPrefix should already guarantee this, but defense in depth)
47+
if (matchedLength > remainingLength) {
48+
matchedLength = remainingLength;
49+
}
3450
buffer << matched.Get()->GetDefault();
3551
}
3652
pstr += matchedLength;

src/ConversionTest.cpp

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -47,4 +47,44 @@ TEST_F(ConversionTest, ConvertCString) {
4747
EXPECT_EQ(expected, converted);
4848
}
4949

50+
TEST_F(ConversionTest, TruncatedUtf8Sequence) {
51+
// This test specifically triggers the information disclosure vulnerability
52+
// in the old code. The bug occurs when a string ends with an incomplete
53+
// UTF-8 sequence.
54+
//
55+
// Background: UTF8Util::NextCharLength() examines only the first byte to
56+
// determine the expected character length (1-6 bytes), but doesn't verify
57+
// that enough bytes actually remain before the null terminator.
58+
//
59+
// Trigger condition: When the expected UTF-8 character length exceeds
60+
// the actual remaining bytes before null, the old code would:
61+
// 1. Call FromSubstr with a length crossing the null terminator
62+
// 2. Advance pstr beyond the null terminator
63+
// 3. Continue reading heap memory on next iteration
64+
// 4. Output leaked heap data to conversion result (INFORMATION DISCLOSURE)
65+
66+
// Construct a string ending with a truncated 3-byte UTF-8 sequence:
67+
// - Normal text: "干" (valid 3-byte UTF-8: 0xE5 0xB9 0xB2)
68+
// - Followed by: 0xE5 0xB9 (incomplete 3-byte sequence - missing last byte)
69+
std::string malformed;
70+
malformed += utf8(""); // Valid character
71+
malformed += '\xE5'; // Start of 3-byte UTF-8 (NextCharLength returns 3)
72+
malformed += '\xB9'; // Second byte
73+
// Missing third byte - only 2 bytes remain but NextCharLength expects 3
74+
// Old code would jump over null, read heap memory, and leak it in output
75+
76+
// The fixed code should handle this gracefully without information disclosure
77+
EXPECT_NO_THROW({
78+
const std::string converted = conversion->Convert(malformed);
79+
// Should convert "干" to "幹" (first candidate in dict) and preserve incomplete sequence
80+
std::string expected;
81+
expected += utf8(""); // Converted from "干" (dict has ["幹", "乾", "干"])
82+
expected += '\xE5'; // Incomplete sequence preserved as-is
83+
expected += '\xB9';
84+
EXPECT_EQ(expected, converted);
85+
// Should NOT contain garbage heap data beyond the input
86+
// (ASan would catch any out-of-bounds reads during conversion)
87+
});
88+
}
89+
5090
} // namespace opencc

src/MaxMatchSegmentation.cpp

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -30,12 +30,17 @@ SegmentsPtr MaxMatchSegmentation::Segment(const std::string& text) const {
3030
segLength = 0;
3131
}
3232
};
33-
size_t length = text.length();
33+
const char* textEnd = text.c_str() + text.length();
3434
for (const char* pstr = text.c_str(); *pstr != '\0';) {
35-
const Optional<const DictEntry*>& matched = dict->MatchPrefix(pstr, length);
35+
size_t remainingLength = textEnd - pstr;
36+
const Optional<const DictEntry*>& matched = dict->MatchPrefix(pstr, remainingLength);
3637
size_t matchedLength;
3738
if (matched.IsNull()) {
3839
matchedLength = UTF8Util::NextCharLength(pstr);
40+
// Ensure we don't advance beyond the string boundary
41+
if (matchedLength > remainingLength) {
42+
matchedLength = remainingLength;
43+
}
3944
segLength += matchedLength;
4045
} else {
4146
clearBuffer();
@@ -44,7 +49,6 @@ SegmentsPtr MaxMatchSegmentation::Segment(const std::string& text) const {
4449
segStart = pstr + matchedLength;
4550
}
4651
pstr += matchedLength;
47-
length -= matchedLength;
4852
}
4953
clearBuffer();
5054
return segments;

src/MaxMatchSegmentationTest.cpp

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,4 +43,50 @@ TEST_F(MaxMatchSegmentationTest, Segment) {
4343
EXPECT_EQ(utf8("干燥"), std::string(segments->At(3)));
4444
}
4545

46+
TEST_F(MaxMatchSegmentationTest, EmptyString) {
47+
const auto& segments = segmenter->Segment("");
48+
EXPECT_EQ(0, segments->Length());
49+
}
50+
51+
TEST_F(MaxMatchSegmentationTest, SingleCharacter) {
52+
const auto& segments = segmenter->Segment(utf8(""));
53+
EXPECT_EQ(1, segments->Length());
54+
EXPECT_EQ(utf8(""), std::string(segments->At(0)));
55+
}
56+
57+
TEST_F(MaxMatchSegmentationTest, TruncatedUtf8Sequence) {
58+
// This test specifically triggers the buffer overflow bug in the old code.
59+
// The bug occurs when a string ends with an incomplete UTF-8 sequence.
60+
//
61+
// Background: UTF8Util::NextCharLength() examines only the first byte to
62+
// determine the expected character length (1-6 bytes), but doesn't verify
63+
// that enough bytes actually remain in the buffer.
64+
//
65+
// Trigger condition: When the expected UTF-8 character length exceeds
66+
// the actual remaining bytes, the old code's "length -= matchedLength"
67+
// causes integer underflow (size_t wraps around to a huge value), leading
68+
// to out-of-bounds reads in the next MatchPrefix() call.
69+
70+
// Construct a string ending with a truncated 3-byte UTF-8 sequence:
71+
// - Normal text: "一" (valid 3-byte UTF-8: 0xE4 0xB8 0x80)
72+
// - Followed by: 0xE4 0xB8 (incomplete 3-byte sequence - missing last byte)
73+
std::string malformed;
74+
malformed += utf8(""); // Valid character
75+
malformed += '\xE4'; // Start of 3-byte UTF-8 (NextCharLength returns 3)
76+
malformed += '\xB8'; // Second byte
77+
// Missing third byte - only 2 bytes remain but NextCharLength expects 3
78+
// Old code: length=2, matchedLength=3 → length = 2-3 = SIZE_MAX (underflow)
79+
80+
// The fixed code should handle this gracefully without buffer overflow
81+
EXPECT_NO_THROW({
82+
const auto& segments = segmenter->Segment(malformed);
83+
// The valid character "一" plus the incomplete sequence form a single segment
84+
// (incomplete sequence doesn't match dictionary, gets accumulated with previous)
85+
EXPECT_EQ(1, segments->Length());
86+
// Output should preserve all input bytes (including incomplete sequence)
87+
// This is correct behavior - we don't discard data, we preserve it
88+
EXPECT_EQ(malformed, std::string(segments->At(0)));
89+
});
90+
}
91+
4692
} // namespace opencc

0 commit comments

Comments
 (0)