[libc] Scanf shouldn't match just "0x" for hex int #112440

michaelrj-google · 2024-10-15T21:44:36Z

Scanf parsing reads the longest possibly valid prefix for a given
conversion, then performs the conversion on that string. For the input
"0xZ" with a hex conversion (either "%x" or "%i"), the longest possibly
valid prefix is "0x", which makes it the "input item" (per the
standard). The sequence "0x" is not a "matching sequence" for a hex
conversion, so the directive results in a matching failure and parsing
ends. The "0x" is consumed anyway: to learn that no valid digit follows
"0x", scanf has to read the 'Z', and it can only put back one character
(the 'Z'), so the invalid sequence "0x" stays consumed.

(Inspired by a thread on the libc-coord mailing list:
https://www.openwall.com/lists/libc-coord/2024/10/15/1. See 7.23.6.2 in
the standard for more details.)
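
The rule described above can be modeled in a few lines. This is only a sketch under stated assumptions: `HexScan` and `scan_hex` are hypothetical names invented for illustration, not part of llvm-libc, and the model ignores leading whitespace, signs, and overflow. It shows why "0x" with no digit after it is a matching failure even though two characters have been consumed.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch; not the llvm-libc API.
struct HexScan {
  bool matched;         // false models a matching failure
  unsigned long value;  // only meaningful when matched is true
  std::size_t consumed; // characters taken from the input
};

// Models reading one "%x" input item, limited to max_width characters.
// Whitespace, signs, and overflow are deliberately left out.
inline HexScan scan_hex(const std::string &input,
                        std::size_t max_width = static_cast<std::size_t>(-1)) {
  // Returns the character at index i, or -1 past the input or field width.
  auto peek = [&](std::size_t i) -> int {
    return (i < input.size() && i < max_width)
               ? static_cast<unsigned char>(input[i])
               : -1;
  };

  std::size_t pos = 0;
  unsigned long value = 0;
  bool is_number = false;

  if (peek(0) == '0') {
    // A lone "0" is already a valid number.
    is_number = true;
    pos = 1;
    if (peek(1) == 'x' || peek(1) == 'X') {
      // Once the prefix is consumed, at least one hex digit must follow.
      is_number = false;
      pos = 2;
    }
  }

  for (int c = peek(pos); c != -1 && std::isxdigit(c); c = peek(++pos)) {
    int digit = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
    value = value * 16 + static_cast<unsigned long>(digit);
    is_number = true;
  }

  return {is_number, is_number ? value : 0, pos};
}
```

In this model `scan_hex("0xZ")` reports a failure with two characters consumed, while `scan_hex("0x1", 3)` matches with value 1, which mirrors the expectations in the updated tests.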

llvmbot · 2024-10-15T21:45:11Z

@llvm/pr-subscribers-libc

Author: Michael Jones (michaelrj-google)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/112440.diff

2 Files Affected:

(modified) libc/src/stdio/scanf_core/int_converter.cpp (+17-4)
(modified) libc/test/src/stdio/sscanf_test.cpp (+26-10)

diff --git a/libc/src/stdio/scanf_core/int_converter.cpp b/libc/src/stdio/scanf_core/int_converter.cpp
index 136db2a3773e11..cc3ab9da0bbd4f 100644
--- a/libc/src/stdio/scanf_core/int_converter.cpp
+++ b/libc/src/stdio/scanf_core/int_converter.cpp
@@ -124,13 +124,25 @@ int convert_int(Reader *reader, const FormatSection &to_conv) {
 
       if (to_lower(cur_char) == 'x') {
         // This is a valid hex prefix.
+
+        is_number = false;
+        // A valid hex prefix is not necessarily a valid number. For the
+        // conversion to be valid it needs to use all of the characters it
+        // consumes. From the standard:
+        // 7.23.6.2 paragraph 9: "An input item is defined as the longest
+        // sequence of input characters which does not exceed any specified
+        // field width and which is, or is a prefix of, a matching input
+        // sequence."
+        // 7.23.6.2 paragraph 10: "If the input item is not a matching sequence,
+        // the execution of the directive fails: this condition is a matching
+        // failure"
         base = 16;
         if (max_width > 1) {
           --max_width;
           cur_char = reader->getc();
         } else {
-          write_int_with_length(0, to_conv);
-          return READ_OK;
+          // write_int_with_length(0, to_conv);
+          return MATCHING_FAILURE;
         }
 
       } else {
@@ -198,6 +210,9 @@ int convert_int(Reader *reader, const FormatSection &to_conv) {
   // last one back.
   reader->ungetc(cur_char);
 
+  if (!is_number)
+    return MATCHING_FAILURE;
+
   if (has_overflow) {
     write_int_with_length(MAX, to_conv);
   } else {
@@ -207,8 +222,6 @@ int convert_int(Reader *reader, const FormatSection &to_conv) {
     write_int_with_length(result, to_conv);
   }
 
-  if (!is_number)
-    return MATCHING_FAILURE;
   return READ_OK;
 }
 
diff --git a/libc/test/src/stdio/sscanf_test.cpp b/libc/test/src/stdio/sscanf_test.cpp
index 33bb0acba3e662..18addb632067c9 100644
--- a/libc/test/src/stdio/sscanf_test.cpp
+++ b/libc/test/src/stdio/sscanf_test.cpp
@@ -177,13 +177,25 @@ TEST(LlvmLibcSScanfTest, IntConvMaxLengthTests) {
   EXPECT_EQ(ret_val, 1);
   EXPECT_EQ(result, 0);
 
+  result = -999;
+
+  // 0x is a valid prefix, but not a valid number. This should be a matching
+  // failure and should not modify the values.
   ret_val = LIBC_NAMESPACE::sscanf("0x1", "%2i", &result);
-  EXPECT_EQ(ret_val, 1);
-  EXPECT_EQ(result, 0);
+  EXPECT_EQ(ret_val, 0);
+  EXPECT_EQ(result, -999);
 
   ret_val = LIBC_NAMESPACE::sscanf("-0x1", "%3i", &result);
+  EXPECT_EQ(ret_val, 0);
+  EXPECT_EQ(result, -999);
+
+  ret_val = LIBC_NAMESPACE::sscanf("0x1", "%3i", &result);
   EXPECT_EQ(ret_val, 1);
-  EXPECT_EQ(result, 0);
+  EXPECT_EQ(result, 1);
+
+  ret_val = LIBC_NAMESPACE::sscanf("-0x1", "%4i", &result);
+  EXPECT_EQ(ret_val, 1);
+  EXPECT_EQ(result, -1);
 
   ret_val = LIBC_NAMESPACE::sscanf("-0x123", "%4i", &result);
   EXPECT_EQ(ret_val, 1);
@@ -212,7 +224,7 @@ TEST(LlvmLibcSScanfTest, IntConvNoWriteTests) {
   EXPECT_EQ(result, 0);
 
   ret_val = LIBC_NAMESPACE::sscanf("0x1", "%*2i", &result);
-  EXPECT_EQ(ret_val, 1);
+  EXPECT_EQ(ret_val, 0);
   EXPECT_EQ(result, 0);
 
   ret_val = LIBC_NAMESPACE::sscanf("a", "%*i", &result);
@@ -679,13 +691,17 @@ TEST(LlvmLibcSScanfTest, CombinedConv) {
   EXPECT_EQ(result, 123);
   ASSERT_STREQ(buffer, "abc");
 
+  result = -1;
+
+  // 0x is a valid prefix, but not a valid number. This should be a matching
+  // failure and should not modify the values.
   ret_val = LIBC_NAMESPACE::sscanf("0xZZZ", "%i%s", &result, buffer);
-  EXPECT_EQ(ret_val, 2);
-  EXPECT_EQ(result, 0);
-  ASSERT_STREQ(buffer, "ZZZ");
+  EXPECT_EQ(ret_val, 0);
+  EXPECT_EQ(result, -1);
+  ASSERT_STREQ(buffer, "abc");
 
   ret_val = LIBC_NAMESPACE::sscanf("0xZZZ", "%X%s", &result, buffer);
-  EXPECT_EQ(ret_val, 2);
-  EXPECT_EQ(result, 0);
-  ASSERT_STREQ(buffer, "ZZZ");
+  EXPECT_EQ(ret_val, 0);
+  EXPECT_EQ(result, -1);
+  ASSERT_STREQ(buffer, "abc");
 }
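
From the caller's side, the updated tests imply a simple pattern: trust the output variable only when the return value says the conversion happened. The sketch below assumes a hypothetical helper named `parse_hex`; it uses the standard `std::sscanf` and nothing specific to llvm-libc. How a given C library treats an input such as "0xZ" may differ between implementations and versions, so the helper relies only on the return value.

```cpp
#include <cstdio>

// Hypothetical helper for illustration. Parses a hexadecimal value from
// text and stores it in *out only if sscanf reports one conversion.
inline bool parse_hex(const char *text, unsigned *out) {
  unsigned tmp = 0;
  // Anything other than 1 means a matching failure or an input failure.
  if (std::sscanf(text, "%x", &tmp) != 1)
    return false;
  *out = tmp;
  return true;
}
```

Scanning into a temporary keeps the caller's variable untouched on failure regardless of what the underlying library writes, which is the same guarantee the new tests check with the -999 and -1 sentinel values.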

nickdesaulniers

Consider linking to https://www.openwall.com/lists/libc-coord/2024/10/15/1 in the commit message.

nickdesaulniers · 2024-10-17T16:46:14Z

libc/src/stdio/scanf_core/int_converter.cpp

        } else {
-          write_int_with_length(0, to_conv);
-          return READ_OK;
+          return MATCHING_FAILURE;


Consider making this an early return if you reverse the conditional.

michaelrj-google · 2024-10-17

The if (max_width > 1) condition is a common pattern in scanf, so I'd rather keep it consistent and not reverse this conditional.

llvmbot added the libc label Oct 15, 2024

michaelrj-google requested review from lntue and nickdesaulniers October 15, 2024 22:55

nickdesaulniers reviewed Oct 15, 2024

delete commented-out code

d6b2da9

nickdesaulniers approved these changes Oct 17, 2024

lntue approved these changes Oct 18, 2024

michaelrj-google merged commit 0afe6e4 into llvm:main Oct 18, 2024
7 checks passed

michaelrj-google deleted the libcScanfHexMatch branch October 18, 2024 22:48
