Skip to content

Conversation

@bozicrHT
Copy link
Contributor

@bozicrHT bozicrHT commented Sep 27, 2025

Annotate printf/scanf and related builtins with the nonnull attribute on their format string parameters. This enables diagnostics when NULL is passed, matching GCC behavior. Updated existing Sema tests and added new ones for coverage. Closes issue #33923

Annotate printf/scanf and related builtins with the nonnull attribute on
their format string parameters. This enables diagnostics when NULL is
passed, matching GCC behavior. Updated existing Sema tests and added new
one for coverage. Closes issue llvm#33923
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Sep 27, 2025
@llvmbot
Copy link
Member

llvmbot commented Sep 27, 2025

@llvm/pr-subscribers-clang-static-analyzer-1

@llvm/pr-subscribers-clang

Author: Radovan Božić (bozicrHT)

Changes

Annotate printf/scanf and related builtins with the nonnull attribute on their format string parameters. This enables diagnostics when NULL is passed, matching GCC behavior. Updated existing Sema tests and added new one for coverage. Closes issue #33923


Full diff: https://github.com/llvm/llvm-project/pull/160988.diff

10 Files Affected:

  • (modified) clang/include/clang/Basic/Builtins.h (+4)
  • (modified) clang/include/clang/Basic/Builtins.td (+16-15)
  • (modified) clang/include/clang/Basic/BuiltinsBase.td (+1-1)
  • (modified) clang/lib/Basic/Builtins.cpp (+33-15)
  • (modified) clang/lib/Sema/SemaDecl.cpp (+9)
  • (added) clang/test/Sema/format-strings-nonnull.c (+74)
  • (added) clang/test/Sema/format-strings-nonnull.cpp (+40)
  • (modified) clang/test/Sema/format-strings.c (+2-4)
  • (modified) clang/test/SemaCXX/format-strings-0x.cpp (+2)
  • (modified) clang/test/SemaObjC/format-strings-objc.m (+2-1)
diff --git a/clang/include/clang/Basic/Builtins.h b/clang/include/clang/Basic/Builtins.h
index 3a5e31de2bc50..6d26a7b92a0fd 100644
--- a/clang/include/clang/Basic/Builtins.h
+++ b/clang/include/clang/Basic/Builtins.h
@@ -392,6 +392,10 @@ class Context {
   bool performsCallback(unsigned ID,
                         llvm::SmallVectorImpl<int> &Encoding) const;
 
+  /// Return true if this builtin has parameters that must be non-null.
+  /// The parameter indices are appended into 'Indxs'.
+  bool isNonNull(unsigned ID, llvm::SmallVectorImpl<int> &Indxs) const;
+
   /// Return true if this function has no side effects and doesn't
   /// read memory, except for possibly errno or raising FP exceptions.
   ///
diff --git a/clang/include/clang/Basic/Builtins.td b/clang/include/clang/Basic/Builtins.td
index 9bc70ea5e5858..5f9843f482276 100644
--- a/clang/include/clang/Basic/Builtins.td
+++ b/clang/include/clang/Basic/Builtins.td
@@ -3095,104 +3095,105 @@ def StrLen : LibBuiltin<"string.h"> {
 // FIXME: This list is incomplete.
 def Printf : LibBuiltin<"stdio.h"> {
   let Spellings = ["printf"];
-  let Attributes = [PrintfFormat<0>];
+  let Attributes = [PrintfFormat<0>, NonNull<[0]>];
   let Prototype = "int(char const*, ...)";
 }
 
 // FIXME: The builtin and library function should have the same signature.
 def BuiltinPrintf : Builtin {
   let Spellings = ["__builtin_printf"];
-  let Attributes = [NoThrow, PrintfFormat<0>, FunctionWithBuiltinPrefix];
+  let Attributes = [NoThrow, PrintfFormat<0>, FunctionWithBuiltinPrefix,
+                    NonNull<[0]>];
   let Prototype = "int(char const* restrict, ...)";
 }
 
 def FPrintf : LibBuiltin<"stdio.h"> {
   let Spellings = ["fprintf"];
-  let Attributes = [NoThrow, PrintfFormat<1>];
+  let Attributes = [NoThrow, PrintfFormat<1>, NonNull<[1]>];
   let Prototype = "int(FILE* restrict, char const* restrict, ...)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def SnPrintf : LibBuiltin<"stdio.h"> {
   let Spellings = ["snprintf"];
-  let Attributes = [NoThrow, PrintfFormat<2>];
+  let Attributes = [NoThrow, PrintfFormat<2>, NonNull<[2]>];
   let Prototype = "int(char* restrict, size_t, char const* restrict, ...)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def SPrintf : LibBuiltin<"stdio.h"> {
   let Spellings = ["sprintf"];
-  let Attributes = [NoThrow, PrintfFormat<1>];
+  let Attributes = [NoThrow, PrintfFormat<1>, NonNull<[1]>];
   let Prototype = "int(char* restrict, char const* restrict, ...)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def VPrintf : LibBuiltin<"stdio.h"> {
   let Spellings = ["vprintf"];
-  let Attributes = [NoThrow, VPrintfFormat<0>];
+  let Attributes = [NoThrow, VPrintfFormat<0>, NonNull<[0]>];
   let Prototype = "int(char const* restrict, __builtin_va_list)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def VfPrintf : LibBuiltin<"stdio.h"> {
   let Spellings = ["vfprintf"];
-  let Attributes = [NoThrow, VPrintfFormat<1>];
+  let Attributes = [NoThrow, VPrintfFormat<1>, NonNull<[1]>];
   let Prototype = "int(FILE* restrict, char const* restrict, __builtin_va_list)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def VsnPrintf : LibBuiltin<"stdio.h"> {
   let Spellings = ["vsnprintf"];
-  let Attributes = [NoThrow, VPrintfFormat<2>];
+  let Attributes = [NoThrow, VPrintfFormat<2>, NonNull<[2]>];
   let Prototype = "int(char* restrict, size_t, char const* restrict, __builtin_va_list)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def VsPrintf : LibBuiltin<"stdio.h"> {
   let Spellings = ["vsprintf"];
-  let Attributes = [NoThrow, VPrintfFormat<1>];
+  let Attributes = [NoThrow, VPrintfFormat<1>, NonNull<[1]>];
   let Prototype = "int(char* restrict, char const* restrict, __builtin_va_list)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def Scanf : LibBuiltin<"stdio.h"> {
   let Spellings = ["scanf"];
-  let Attributes = [ScanfFormat<0>];
+  let Attributes = [ScanfFormat<0>, NonNull<[0]>];
   let Prototype = "int(char const* restrict, ...)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def FScanf : LibBuiltin<"stdio.h"> {
   let Spellings = ["fscanf"];
-  let Attributes = [ScanfFormat<1>];
+  let Attributes = [ScanfFormat<1>, NonNull<[1]>];
   let Prototype = "int(FILE* restrict, char const* restrict, ...)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def SScanf : LibBuiltin<"stdio.h"> {
   let Spellings = ["sscanf"];
-  let Attributes = [ScanfFormat<1>];
+  let Attributes = [ScanfFormat<1>, NonNull<[1]>];
   let Prototype = "int(char const* restrict, char const* restrict, ...)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def VScanf : LibBuiltin<"stdio.h"> {
   let Spellings = ["vscanf"];
-  let Attributes = [VScanfFormat<0>];
+  let Attributes = [VScanfFormat<0>, NonNull<[0]>];
   let Prototype = "int(char const* restrict, __builtin_va_list)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def VFScanf : LibBuiltin<"stdio.h"> {
   let Spellings = ["vfscanf"];
-  let Attributes = [VScanfFormat<1>];
+  let Attributes = [VScanfFormat<1>, NonNull<[1]>];
   let Prototype = "int(FILE* restrict, char const* restrict, __builtin_va_list)";
   let AddBuiltinPrefixedAlias = 1;
 }
 
 def VSScanf : LibBuiltin<"stdio.h"> {
   let Spellings = ["vsscanf"];
-  let Attributes = [VScanfFormat<1>];
+  let Attributes = [VScanfFormat<1>, NonNull<[1]>];
   let Prototype = "int(char const* restrict, char const* restrict, __builtin_va_list)";
   let AddBuiltinPrefixedAlias = 1;
 }
diff --git a/clang/include/clang/Basic/BuiltinsBase.td b/clang/include/clang/Basic/BuiltinsBase.td
index 09bc9f89059fe..73918ab167b8d 100644
--- a/clang/include/clang/Basic/BuiltinsBase.td
+++ b/clang/include/clang/Basic/BuiltinsBase.td
@@ -32,7 +32,6 @@ def Const : Attribute<"c">;
 def NoThrow : Attribute<"n">;
 def Pure : Attribute<"U">;
 def ReturnsTwice : Attribute<"j">;
-//  FIXME: gcc has nonnull
 
 // builtin-specific attributes
 // ---------------------------
@@ -85,6 +84,7 @@ def Consteval : Attribute<"EG">;
 // Callback behavior: the first index argument is called with the arguments
 // indicated by the remaining indices.
 class Callback<list<int> ArgIndices> : MultiIndexAttribute<"C", ArgIndices>;
+class NonNull<list<int> ArgIndices> : MultiIndexAttribute<"N", ArgIndices>;
 
 // Prefixes
 // ========
diff --git a/clang/lib/Basic/Builtins.cpp b/clang/lib/Basic/Builtins.cpp
index acd98fe84adf5..a813726b7b848 100644
--- a/clang/lib/Basic/Builtins.cpp
+++ b/clang/lib/Basic/Builtins.cpp
@@ -293,30 +293,48 @@ bool Builtin::Context::isScanfLike(unsigned ID, unsigned &FormatIdx,
   return isLike(ID, FormatIdx, HasVAListArg, "sS");
 }
 
-bool Builtin::Context::performsCallback(unsigned ID,
-                                        SmallVectorImpl<int> &Encoding) const {
-  const char *CalleePos = ::strchr(getAttributesString(ID), 'C');
-  if (!CalleePos)
-    return false;
-
-  ++CalleePos;
-  assert(*CalleePos == '<' &&
-         "Callback callee specifier must be followed by a '<'");
-  ++CalleePos;
+static void parseCommaSeparatedIndices(const char *CurrPos,
+                                       llvm::SmallVectorImpl<int> &Indxs) {
+  assert(*CurrPos == '<' && "Expected '<' to start index list");
+  ++CurrPos;
 
   char *EndPos;
-  int CalleeIdx = ::strtol(CalleePos, &EndPos, 10);
-  assert(CalleeIdx >= 0 && "Callee index is supposed to be positive!");
-  Encoding.push_back(CalleeIdx);
+  int PosIdx = ::strtol(CurrPos, &EndPos, 10);
+  assert(PosIdx >= 0 && "Index is supposed to be positive!");
+  Indxs.push_back(PosIdx);
 
   while (*EndPos == ',') {
     const char *PayloadPos = EndPos + 1;
 
     int PayloadIdx = ::strtol(PayloadPos, &EndPos, 10);
-    Encoding.push_back(PayloadIdx);
+    Indxs.push_back(PayloadIdx);
   }
 
-  assert(*EndPos == '>' && "Callback callee specifier must end with a '>'");
+  assert(*EndPos == '>' && "Index list must end with '>'");
+}
+
+bool Builtin::Context::isNonNull(unsigned ID,
+                                 llvm::SmallVectorImpl<int> &Indxs) const {
+
+  const char *AttrPos = ::strchr(getAttributesString(ID), 'N');
+  if (!AttrPos)
+    return false;
+
+  ++AttrPos;
+  parseCommaSeparatedIndices(AttrPos, Indxs);
+
+  return true;
+}
+
+bool Builtin::Context::performsCallback(unsigned ID,
+                                        SmallVectorImpl<int> &Encoding) const {
+  const char *CalleePos = ::strchr(getAttributesString(ID), 'C');
+  if (!CalleePos)
+    return false;
+
+  ++CalleePos;
+  parseCommaSeparatedIndices(CalleePos, Encoding);
+
   return true;
 }
 
diff --git a/clang/lib/Sema/SemaDecl.cpp b/clang/lib/Sema/SemaDecl.cpp
index 9ef7a2698913d..d9df60af114fb 100644
--- a/clang/lib/Sema/SemaDecl.cpp
+++ b/clang/lib/Sema/SemaDecl.cpp
@@ -17141,6 +17141,15 @@ void Sema::AddKnownFunctionAttributes(FunctionDecl *FD) {
       }
     }
 
+    SmallVector<int, 4> Indxs;
+    if (Context.BuiltinInfo.isNonNull(BuiltinID, Indxs) &&
+        !FD->hasAttr<NonNullAttr>()) {
+      llvm::SmallVector<ParamIdx, 4> ParamIndxs;
+      for (int I : Indxs)
+        ParamIndxs.push_back(ParamIdx(I + 1, FD));
+      FD->addAttr(NonNullAttr::CreateImplicit(Context, ParamIndxs.data(),
+                                              ParamIndxs.size()));
+    }
     if (Context.BuiltinInfo.isReturnsTwice(BuiltinID) &&
         !FD->hasAttr<ReturnsTwiceAttr>())
       FD->addAttr(ReturnsTwiceAttr::CreateImplicit(Context,
diff --git a/clang/test/Sema/format-strings-nonnull.c b/clang/test/Sema/format-strings-nonnull.c
new file mode 100644
index 0000000000000..86ce24a27ca11
--- /dev/null
+++ b/clang/test/Sema/format-strings-nonnull.c
@@ -0,0 +1,74 @@
+// RUN: %clang_cc1 -fsyntax-only --std=c23 -verify -Wnonnull -Wno-format-security %s
+
+#define NULL  (void*)0
+
+typedef struct _FILE FILE;
+typedef __SIZE_TYPE__ size_t;
+typedef __builtin_va_list va_list;
+int printf(char const* restrict, ...);
+int __builtin_printf(char const* restrict, ...);
+int fprintf(FILE* restrict, char const* restrict, ...);
+int snprintf(char* restrict, size_t, char const* restrict, ...);
+int sprintf(char* restrict, char const* restrict, ...);
+int vprintf(char const* restrict, __builtin_va_list);
+int vfprintf(FILE* restrict, char const* restrict, __builtin_va_list);
+int vsnprintf(char* restrict, size_t, char const* restrict, __builtin_va_list);
+int vsprintf(char* restrict, char const* restrict, __builtin_va_list);
+
+int scanf(char const* restrict, ...);
+int fscanf(FILE* restrict, char const* restrict, ...);
+int sscanf(char const* restrict, char const* restrict, ...);
+int vscanf(char const* restrict, __builtin_va_list);
+int vfscanf(FILE* restrict, char const* restrict, __builtin_va_list);
+int vsscanf(char const* restrict, char const* restrict, __builtin_va_list);
+
+
+void check_format_string(FILE *fp, va_list ap) {
+    char buf[256];
+    char* const fmt = NULL;
+
+    printf(fmt);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    __builtin_printf(NULL, "xxd");
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    fprintf(fp, NULL, 25);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    sprintf(buf, NULL, 42);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    snprintf(buf, 10, 0, 42);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    vprintf(fmt, ap);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    vfprintf(fp, 0, ap);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    vsprintf(buf, nullptr, ap);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    vsnprintf(buf, 10, fmt, ap);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    scanf(NULL);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    fscanf(fp, nullptr);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    sscanf(buf, fmt);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    vscanf(NULL, ap);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    vfscanf(fp, fmt, ap);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+
+    vsscanf(buf, NULL, ap);
+    // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
+}
\ No newline at end of file
diff --git a/clang/test/Sema/format-strings-nonnull.cpp b/clang/test/Sema/format-strings-nonnull.cpp
new file mode 100644
index 0000000000000..55a3ed1788c79
--- /dev/null
+++ b/clang/test/Sema/format-strings-nonnull.cpp
@@ -0,0 +1,40 @@
+// RUN: %clang_cc1 -fsyntax-only -verify -Wnonnull %s
+
+#ifdef __cplusplus
+# define EXTERN_C extern "C"
+#else
+# define EXTERN_C extern
+#endif
+
+typedef struct _FILE FILE;
+typedef __SIZE_TYPE__ size_t;
+typedef __builtin_va_list va_list;
+
+EXTERN_C int printf(const char *, ...);
+EXTERN_C int fprintf(FILE *, const char *restrict, ...);
+EXTERN_C int sprintf(char* restrict, char const* res, ...);
+
+EXTERN_C int scanf(char const *restrict, ...);
+EXTERN_C int fscanf(FILE* restrict, char const* res, ...);
+
+void test(FILE *fp) {
+  char buf[256];
+
+  __builtin_printf(__null, "x");
+  // expected-warning@-1 {{null passed to a callee that requires a non-null argument}}
+
+  printf(__null, "xxd");
+  // expected-warning@-1 {{null passed to a callee that requires a non-null argument}}
+
+  fprintf(fp, __null, 42);
+  // expected-warning@-1 {{null passed to a callee that requires a non-null argument}}
+
+  sprintf(buf, __null);
+  // expected-warning@-1 {{null passed to a callee that requires a non-null argument}}
+
+  scanf(__null);
+  // expected-warning@-1 {{null passed to a callee that requires a non-null argument}}
+
+  fscanf(fp, __null);
+  // expected-warning@-1 {{null passed to a callee that requires a non-null argument}}
+}
diff --git a/clang/test/Sema/format-strings.c b/clang/test/Sema/format-strings.c
index 103dd8ab5a85c..164db45fa2053 100644
--- a/clang/test/Sema/format-strings.c
+++ b/clang/test/Sema/format-strings.c
@@ -480,11 +480,9 @@ void pr7981(wint_t c, wchar_t c2) {
 #endif
 }
 
-// -Wformat-security says NULL is not a string literal
 void rdar8269537(void) {
-  // This is likely to crash in most cases, but -Wformat-nonliteral technically
-  // doesn't warn in this case.
-  printf(0); // no-warning
+  printf(0);
+  // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
 }
 
 // Handle functions with multiple format attributes.
diff --git a/clang/test/SemaCXX/format-strings-0x.cpp b/clang/test/SemaCXX/format-strings-0x.cpp
index 7d37f8276f29f..e0ca7a270c993 100644
--- a/clang/test/SemaCXX/format-strings-0x.cpp
+++ b/clang/test/SemaCXX/format-strings-0x.cpp
@@ -14,6 +14,7 @@ void f(char **sp, float *fp) {
   printf("%a", 1.0);
   scanf("%afoobar", fp);
   printf(nullptr);
+  // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
   printf(*sp); // expected-warning {{not a string literal}}
   // expected-note@-1{{treat the string as an argument to avoid this}}
 
@@ -32,4 +33,5 @@ void f(char **sp, float *fp) {
   printf("init list: %d", { 0 }); // expected-error {{cannot pass initializer list to variadic function; expected type from format string was 'int'}}
   printf("void: %d", f(sp, fp)); // expected-error {{cannot pass expression of type 'void' to variadic function; expected type from format string was 'int'}}
   printf(0, { 0 }); // expected-error {{cannot pass initializer list to variadic function}}
+  // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
 }
diff --git a/clang/test/SemaObjC/format-strings-objc.m b/clang/test/SemaObjC/format-strings-objc.m
index 40c1d31b1fd4c..babbb40394267 100644
--- a/clang/test/SemaObjC/format-strings-objc.m
+++ b/clang/test/SemaObjC/format-strings-objc.m
@@ -130,7 +130,7 @@ void rdar10743758(id x) {
   printf(s2); // expected-warning {{more '%' conversions than data arguments}}
 
   const char * const s3 = (const char *)0;
-  printf(s3); // no-warning (NULL is a valid format string)
+  printf(s3); // expected-warning {{null passed to a callee that requires a non-null argument}}
 
   NSString * const ns1 = @"constant string %s"; // expected-note {{format string is defined here}}
   NSLog(ns1); // expected-warning {{more '%' conversions than data arguments}}
@@ -259,6 +259,7 @@ void testByValueObjectInFormat(Foo *obj) {
   printf("%d %d %d", 1L, *obj, 1L); // expected-error {{cannot pass object with interface type 'Foo' by value to variadic function; expected type from format string was 'int'}} expected-warning 2 {{format specifies type 'int' but the argument has type 'long'}}
   printf("%!", *obj); // expected-error {{cannot pass object with interface type 'Foo' by value through variadic function}} expected-warning {{invalid conversion specifier}}
   printf(0, *obj); // expected-error {{cannot pass object with interface type 'Foo' by value through variadic function}}
+  // expected-warning@-1{{null passed to a callee that requires a non-null argument}}
 
   [Bar log2:@"%d", *obj]; // expected-error {{cannot pass object with interface type 'Foo' by value to variadic method; expected type from format string was 'int'}}
 }

@bozicrHT
Copy link
Contributor Author

@Sirraide @erichkeane

Copy link
Collaborator

@erichkeane erichkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very much in @AaronBallman 's range here, particularly with the C stuff. I don't see anything of question, so it is just a matter of whether he thinks this is a good idea. he's supposed to come back any day now (maybe next week?) so please ping him in another week+ if you don't hear from him.

@philnik777
Copy link
Contributor

For reference, this was #158626 before.


vsscanf(buf, NULL, ap);
// expected-warning@-1{{null passed to a callee that requires a non-null argument}}
} No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smallest nit: newline


SmallVector<int, 4> Indxs;
if (Context.BuiltinInfo.isNonNull(BuiltinID, Indxs) &&
!FD->hasAttr<NonNullAttr>()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the hasAttr<NonNullAttr>() check be per parameter?

This reads as if

__attribute__((nonnull)) void printf(const char*, ...)

Would not get the attribute annotation, and similarly

void printf(const char* __attribute__((nonnull)), ...)

would have the attribute added again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ideally it could check at the parameter level, but this works equally well because nonnull can be applied at the function level (e.g. __attribute__((nonnull(idx1, idx2))) int printf(...);). Since I didn’t find a parameter-specific equivalent to AddKnownFunctionAttributes, I went with this approach. I could alternatively iterate over the ParmVarDecls and attach attributes individually if needed.

@Sirraide Sirraide requested a review from AaronBallman October 9, 2025 15:18
Copy link
Collaborator

@AaronBallman AaronBallman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this! I kind of wonder if we want a different approach though.

In C, any invalid argument is UB unless otherwise specified, and null is an invalid argument. 7.1.4p1:

Each of the following statements applies unless explicitly stated otherwise in the detailed descriptions that follow:
— If an argument to a function has an invalid value (such as a value outside the domain of
the function, or a pointer outside the address space of the program, or a null pointer, or a
pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after default argument promotion) not expected by a variadic function, the behavior is undefined.

So I almost wonder if a better approach is to go the opposite way and require marking arguments which can be null? (We don't have GCC's nonnull_if_nonzero attribute which would also be very helpful if we go this direction; there are a number of functions where null is fine only so long as some count variable is nonzero.)

def FPrintf : LibBuiltin<"stdio.h"> {
let Spellings = ["fprintf"];
let Attributes = [NoThrow, PrintfFormat<1>];
let Attributes = [NoThrow, PrintfFormat<1>, NonNull<[1]>];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the FILE * and the const char * must be nonnull.

def SPrintf : LibBuiltin<"stdio.h"> {
let Spellings = ["sprintf"];
let Attributes = [NoThrow, PrintfFormat<1>];
let Attributes = [NoThrow, PrintfFormat<1>, NonNull<[1]>];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the char * and the const char * must be nonnull.

def VfPrintf : LibBuiltin<"stdio.h"> {
let Spellings = ["vfprintf"];
let Attributes = [NoThrow, VPrintfFormat<1>];
let Attributes = [NoThrow, VPrintfFormat<1>, NonNull<[1]>];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FILE * as well.

def VsPrintf : LibBuiltin<"stdio.h"> {
let Spellings = ["vsprintf"];
let Attributes = [NoThrow, VPrintfFormat<1>];
let Attributes = [NoThrow, VPrintfFormat<1>, NonNull<[1]>];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

char * as well.

def FScanf : LibBuiltin<"stdio.h"> {
let Spellings = ["fscanf"];
let Attributes = [ScanfFormat<1>];
let Attributes = [ScanfFormat<1>, NonNull<[1]>];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FILE * as well.

I'll stop commenting on this, you should take a pass through the rest and fix up accordingly.

@bozicrHT
Copy link
Contributor Author

Thank you for working on this! I kind of wonder if we want a different approach though.

In C, any invalid argument is UB unless otherwise specified, and null is an invalid argument. 7.1.4p1:

Each of the following statements applies unless explicitly stated otherwise in the detailed descriptions that follow: — If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after default argument promotion) not expected by a variadic function, the behavior is undefined.

So I almost wonder if a better approach is to go the opposite way and require marking arguments which can be null? (We don't have GCC's nonnull_if_nonzero attribute which would also be very helpful if we go this direction; there are a number of functions where null is fine only so long as some count variable is nonzero.)

Thank you for the thoughtful feedback! I was wondering - are you suggesting introducing a new attribute (e.g. nullable) to explicitly mark arguments that can accept null? Wouldn’t that approach potentially break existing code and tests by triggering new warnings, since every unannotated pointer would be treated as non-null by default?

It also seems like a larger-scope change and would diverge from GCC’s current behavior, which only checks for null arguments when marked with nonnull.

Extend nonnull coverage to include not only format string parameters,
but also FILE* and char* arguments (e.g. for sscanf, fprintf, etc.).
@AaronBallman
Copy link
Collaborator

Thank you for the thoughtful feedback! I was wondering - are you suggesting introducing a new attribute (e.g. nullable) to explicitly mark arguments that can accept null? Wouldn’t that approach potentially break existing code and tests by triggering new warnings, since every unannotated pointer would be treated as non-null by default?

No, but yes. :-D

I think most pointers in standard library functions cannot be null. So it might be easier to maintain for our tablegen to assume all pointers are nonnull unless we add an annotation to say which parameters can be null. Then when generating the headers from tablegen, we'd add __attribute__((nonnull)) to the parameters which must be nonnull.

But the attribute we should add, but not as part of this PR, is one from GCC: nonnull_if_nonzero. We need that for a number of the APIs where you can pass a null pointer but only when the size argument is zero.

def SnPrintf : LibBuiltin<"stdio.h"> {
let Spellings = ["snprintf"];
let Attributes = [NoThrow, PrintfFormat<2>];
let Attributes = [NoThrow, PrintfFormat<2>, NonNull<[0, 2]>];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one is wrong (as are a few others); the first argument can be null if the second argument is 0. This is a case where we'd want nonnull_if_nonzero.

You should verify these annotations against the latest C standard draft: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3685.pdf

According to the C standard, the first argument to snprintf and
vsnprintf may be null if the size argument is zero.
@bozicrHT
Copy link
Contributor Author

I think most pointers in standard library functions cannot be null. So it might be easier to maintain for our tablegen to assume all pointers are nonnull unless we add an annotation to say which parameters can be null. Then when generating the headers from tablegen, we'd add __attribute__((nonnull)) to the parameters which must be nonnull.

But the attribute we should add, but not as part of this PR, is one from GCC: nonnull_if_nonzero. We need that for a number of the APIs where you can pass a null pointer but only when the size argument is zero.

Ah, I see your point now — that makes sense. The idea is to treat pointer parameters as nonnull by default, and only allow null explicitly in cases like sprintf/vsnprintf, where the buffer can be null if the size argument is 0.

@bozicrHT
Copy link
Contributor Author

bozicrHT commented Nov 4, 2025

ping

Copy link
Collaborator

@AaronBallman AaronBallman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Do you need someone to land the changes on your behalf?

@bozicrHT
Copy link
Contributor Author

bozicrHT commented Nov 5, 2025

LGTM! Do you need someone to land the changes on your behalf?

Yes I do, I don't have commit access!

@philnik777
Copy link
Contributor

@AaronBallman I don't think #158626 (review) has really been addressed yet. If you think it's fine to basically undefine the behaviour of implementations that's fine with me, but I think we should acknowledge that we do that.

@AaronBallman
Copy link
Collaborator

@AaronBallman I don't think #158626 (review) has really been addressed yet.

Thanks for raising the concern!

If you think it's fine to basically undefine the behaviour of implementations that's fine with me, but I think we should acknowledge that we do that.

The behavior is undefined according to the standard, so this is 1) ensuring we get diagnostics for misuse, 2) improving optimization behavior. So it's not really the implementation undefining the behavior, it's the implementation admitting the behavior was already undefined (maybe too find of a distinction?).

gcc seems to treat the parameters as being marked nonnull: https://godbolt.org/z/YTb1ejh8W

@philnik777
Copy link
Contributor

@AaronBallman I don't think #158626 (review) has really been addressed yet.

Thanks for raising the concern!

If you think it's fine to basically undefine the behaviour of implementations that's fine with me, but I think we should acknowledge that we do that.

The behavior is undefined according to the standard, so this is 1) ensuring we get diagnostics for misuse,

I'm 100% on board with this. It's most likely user error which should be diagnosed. However, this can be achieved by adding _Nonnull instead. That doesn't have any optimization implications as opposed to [[gnu::nonnull]].

  1. improving optimization behavior. So it's not really the implementation undefining the behavior, it's the implementation admitting the behavior was already undefined (maybe too find of a distinction?).

What I mean is that some libc implementations seem define their behaviour when a nullptr is passed. This is effectively undone by the compiler adding the [[gnu::nonnull]] and the actual libc has no say in this. For hardened implementations this is a very real issue, see e.g. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121052. If this were to happen to a function libc++ defines I'd fight tooth and nail.

gcc seems to treat the parameters as being marked nonnull: https://godbolt.org/z/YTb1ejh8W

@AaronBallman
Copy link
Collaborator

@AaronBallman I don't think #158626 (review) has really been addressed yet.

Thanks for raising the concern!

If you think it's fine to basically undefine the behaviour of implementations that's fine with me, but I think we should acknowledge that we do that.

The behavior is undefined according to the standard, so this is 1) ensuring we get diagnostics for misuse,

I'm 100% on board with this. It's most likely user error which should be diagnosed. However, this can be achieved by adding _Nonnull instead. That doesn't have any optimization implications as opposed to [[gnu::nonnull]].

True.

  1. improving optimization behavior. So it's not really the implementation undefining the behavior, it's the implementation admitting the behavior was already undefined (maybe too find of a distinction?).

What I mean is that some libc implementations seem define their behaviour when a nullptr is passed.

[citation needed] -- do you have evidence that they intentionally define the behavior? When we did an audit of implementations for deciding whether you can pass null pointer and a zero count, we saw tons of evidence that null with a zero count was intentionally supported. I don't recall seeing anything about situations where there's no count provided, and passing a null pointer in those cases has been undefined behavior for 40+ years.

llvm-libc doesn't define the behavior:

return reinterpret_cast<LIBC_NAMESPACE::File *>(f)->write_unlocked(

musl doesn't define the behavior: https://git.musl-libc.org/cgit/musl/tree/src/stdio/vfprintf.c#n681
BSD CRT doesn't define the behavior: https://github.com/freebsd/freebsd-src/blob/e878ba8eea7206b3a435338c6eed0e4264e0ce14/lib/libc/stdio/vfprintf.c#L277
I couldn't figure out where the actual implementation lives in glibc, but I would be surprised if they defined the behavior either. I stopped looking at libc implementations after that.

This is effectively undone by the compiler adding the [[gnu::nonnull]] and the actual libc has no say in this. For hardened implementations this is a very real issue, see e.g. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121052. If this were to happen to a function libc++ defines I'd fight tooth and nail.

gcc seems to treat the parameters as being marked nonnull: https://godbolt.org/z/YTb1ejh8W

Again, this is matching GCC's behavior and I think that is reasonable given how bad our codegen is otherwise: https://godbolt.org/z/9zh7KTGvG

CC @michaelrj-google for additional input

@philnik777
Copy link
Contributor

  1. improving optimization behavior. So it's not really the implementation undefining the behavior, it's the implementation admitting the behavior was already undefined (maybe too find of a distinction?).

What I mean is that some libc implementations seem define their behaviour when a nullptr is passed.

[citation needed] -- do you have evidence that they intentionally define the behavior?

Is this intentional enough: https://github.com/bminor/glibc/blob/26e48102108284d2474f83f5afee56b994c86d54/stdio-common/vfprintf-internal.c#L1533 ?

The macro expands to

  do
    {
      CHECK_FILE (s, -1);
      if (s->_flags & _IO_NO_WRITES)
	{
	  s->_flags |= 0x0020;
	  __set_errno (9);
	  return -1;
	}
      if (format == ((void *) 0))
	{
	  __set_errno (22);
	  return -1;
	}
    }
  while (0);

This is effectively undone by the compiler adding the [[gnu::nonnull]] and the actual libc has no say in this. For hardened implementations this is a very real issue, see e.g. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121052. If this were to happen to a function libc++ defines I'd fight tooth and nail.

gcc seems to treat the parameters as being marked nonnull: https://godbolt.org/z/YTb1ejh8W

Again, this is matching GCC's behavior and I think that is reasonable given how bad our codegen is otherwise: https://godbolt.org/z/9zh7KTGvG

Does this really matter though? Sure, the CodeGen doesn't look great, but I'd expect by far most format strings are literals. If they're generated by some function that could be annotated with [[gnu::returns_nonnull]] as well to get the same optimization behaviour, but is much safer, since a function can usually actually guarantee that it returns a non-null pointer. I also want to note that this is a single null check after a relatively heavy function, so I doubt this will have much of an impact. libcs can add the attributes themselves if it's deemed important enough. If you think this isn't a problem I don't want to block this, especially since I don't have any stake in this. I mostly want to make sure that this is thought through and we don't set this kind of precedent for C++ functions.

@michaelrj-google
Copy link
Contributor

From the LLVM-libc side: I doubt we'll ever define the behavior for what happens when the format string for a printf call is null beyond adding a LIBC_CRASH_ON_NULLPTR check. As shown in the godbolt link there's a warning when the format string isn't a literal because passing a variable as the format string is considered bad practice.

That being said it does appear that glibc doesn't crash on a null format string. I tried the following: https://godbolt.org/z/rW7o1b5se
It does crash on a null pointer passed to puts though.

Overall conclusion: You're not supposed to pass variables as format strings in general, assuming that they're non-null is probably fine. Some embedded systems have 0 as a valid address though, so we should be careful to only make this optimization when 0 is an invalid address.

@AaronBallman
Copy link
Collaborator

  1. improving optimization behavior. So it's not really the implementation undefining the behavior, it's the implementation admitting the behavior was already undefined (maybe too find of a distinction?).

What I mean is that some libc implementations seem define their behaviour when a nullptr is passed.

[citation needed] -- do you have evidence that they intentionally define the behavior?

Is this intentional enough: https://github.com/bminor/glibc/blob/26e48102108284d2474f83f5afee56b994c86d54/stdio-common/vfprintf-internal.c#L1533 ?

The macro expands to

  do
    {
      CHECK_FILE (s, -1);
      if (s->_flags & _IO_NO_WRITES)
	{
	  s->_flags |= 0x0020;
	  __set_errno (9);
	  return -1;
	}
      if (format == ((void *) 0))
	{
	  __set_errno (22);
	  return -1;
	}
    }
  while (0);

I believe it is intentionally not defined. It's relying on CHECK_FILE which is defined as:

#ifdef IO_DEBUG
# define CHECK_FILE(FILE, RET) do {				\
    if ((FILE) == NULL						\
	|| ((FILE)->_flags & _IO_MAGIC_MASK) != _IO_MAGIC)	\
      {								\
	__set_errno (EINVAL);					\
	return RET;						\
      }								\
  } while (0)
#else
# define CHECK_FILE(FILE, RET) do { } while (0)
#endif

so it depends on whether IO_DEBUG is defined as to whether you get any check for null: https://github.com/bminor/glibc/blob/26e48102108284d2474f83f5afee56b994c86d54/libio/libioP.h#L974

Again, this is matching GCC's behavior and I think that is reasonable given how bad our codegen is otherwise: https://godbolt.org/z/9zh7KTGvG

Does this really matter though?

To me? Not really. To someone who is wondering why our codegen is worse than GCC's? Probably.

Sure, the CodeGen doesn't look great, but I'd expect by far most format strings are literals.

I'd like to agree, but there are common counterexamples too. e.g., localization often requires non-literal format strings. And this isn't just about the format strings, it's also about the FILE * being passed (neither can be null), and those are never literals.

If they're generated by some function that could be annotated with [[gnu::returns_nonnull]] as well to get the same optimization behaviour, but is much safer, since a function can usually actually guarantee that it returns a non-null pointer. I also want to note that this is a single null check after a relatively heavy function, so I doubt this will have much of an impact. libcs can add the attributes themselves if it's deemed important enough. If you think this isn't a problem I don't want to block this, especially since I don't have any stake in this. I mostly want to make sure that this is thought through and we don't set this kind of precedent for C++ functions.

I think this is already the precedent? If the standard defines something as being UB, we've always treated as something to diagnose but is valid to optimize on unless it's going to break a lot of existing real world code. That's why I think it's reasonable to add __attribute__((nonnull)) to it for diagnostic and optimization purposes when inferring the library function semantics. That's just following the standard, after all. I think there are cases where it's reasonable to argue for _Nonnull semantics instead, but I think the default is to optimize on UB rather than to leave the optimizations on the table on the chance there's a hardened library somewhere which wants to define the behavior.

That said, it's certainly more conservative to use _Nonnull instead of __attribute__((nonnull)). But I'm struggling to imagine what valid code would break from being aggressive. It would delete null pointer checks that happen after the pointer has already been passed to the library function, but that code wasn't correct to begin with and I don't imagine it's common that people do:

fprintf(SomeFile, "something");
if (SomeFile) {
  // Good stuff
}

Do you have some usage pattern you're worried about? If we have evidence this will be disruptive, that would be compelling to know about.

@philnik777
Copy link
Contributor

I believe it is intentionally not defined. It's relying on CHECK_FILE which is defined as:

#ifdef IO_DEBUG
# define CHECK_FILE(FILE, RET) do {				\
    if ((FILE) == NULL						\
	|| ((FILE)->_flags & _IO_MAGIC_MASK) != _IO_MAGIC)	\
      {								\
	__set_errno (EINVAL);					\
	return RET;						\
      }								\
  } while (0)
#else
# define CHECK_FILE(FILE, RET) do { } while (0)
#endif

so it depends on whether IO_DEBUG is defined as to whether you get any check for null: https://github.com/bminor/glibc/blob/26e48102108284d2474f83f5afee56b994c86d54/libio/libioP.h#L974

Ah, I didn't see that part. In that case I don't have any examples.

Again, this is matching GCC's behavior and I think that is reasonable given how bad our codegen is otherwise: https://godbolt.org/z/9zh7KTGvG

Does this really matter though?

To me? Not really. To someone who is wondering why our codegen is worse than GCC's? Probably.

Well, I'm not convinced there is a case where this actually matters, given that no implementation I could find annotates their declaration with [[gnu::nonnull]] themselves. Anyways, I don't think this is the fundamental contention here.

Sure, the CodeGen doesn't look great, but I'd expect by far most format strings are literals.

I'd like to agree, but there are common counterexamples too. e.g., localization often requires non-literal format strings. And this isn't just about the format strings, it's also about the FILE * being passed (neither can be null), and those are never literals.

I'm aware that there are cases where it doesn't work (localization was actually the main case I was thinking about). In these cases I'd argue that the function returning the format string should be marked [[gnu::returns_nonnull]] instead, as I said below. I would also agree that it's not always feasible to do such things as well though.

I think this is already the precedent? [...]

I don't think so. AFAICT the currently added attributes simply add semantics guaranteed by some standard/are fundamental parts of the contract of a function (e.g. noreturn), allow removing dead code or some amount of constant folding (e.g. pure), or enable diagnostics (e.g. format). None of these allow exploitation of UB the library could define I think.

[...]
Do you have some usage pattern you're worried about? If we have evidence this will be disruptive, that would be compelling to know about.

I don't. I mostly didn't see any discussion of whether it's a good idea to potentially introduce UB here, which worried me somewhat, especially since I think this patch does introduce a novel idea.

@AaronBallman
Copy link
Collaborator

I think this is already the precedent? [...]

I don't think so. AFAICT the currently added attributes simply add semantics guaranteed by some standard/are fundamental parts of the contract of a function (e.g. noreturn), allow removing dead code or some amount of constant folding (e.g. pure), or enable diagnostics (e.g. format). None of these allow exploitation of UB the library could define I think.

I consider whether a pointer is allowed to be null or not to be a fundamental part of the contract of the function; you don't? Noreturn is a good example -- if a noreturn function actually returns, that's UB and we do "exploit" it: https://godbolt.org/z/5YcMnbf11

I think builtins should model as much semantic information as we would like to glean, both for diagnostics and for optimizations. But if there are good reasons to ignore some of that information, then I think it's reasonable for us to do so. I don't think there are good reasons to ignore that information in this specific case, but we've ignored it in the past.

Do you have some usage pattern you're worried about? If we have evidence this will be disruptive, that would be compelling to know about.

I don't. I mostly didn't see any discussion of whether it's a good idea to potentially introduce UB here, which worried me somewhat, especially since I think this patch does introduce a novel idea.

This does not introduce UB here. It's already UB and we have evidence that standard library implementations treat it as behaviorally undefined: https://godbolt.org/z/bqn16fWTs and optimize on it, even without the standard library marking anything explicitly: https://godbolt.org/z/3zY6MzcY9 So I don't think what this patch is doing is actually novel.

That said, I don't think we have a good plan for hardened libraries. We have no notion of conditionally applying these markings in tablegen, and perhaps we need to think of something for that. e.g., if the user passes -fhardened to the compiler (or whatever we end up exposing), I can imagine we'd want to switch from __attribute__((nonnull)) to _Nonnull semantics in that case. But I'm also not certain we need such a mechanism today for this PR, either.

CC @rjmccall @rnk @efriedma-quic @jyknight for some additional opinions, in case my stance is off-base.

@efriedma-quic
Copy link
Collaborator

There are good reasons to be conservative with nonnull markings... but these particular markings seem very unlikely to cause issues. Assuming this interacts correctly with -fno-delete-null-pointer-checks .

@philnik777
Copy link
Contributor

I think this is already the precedent? [...]

I don't think so. AFAICT the currently added attributes simply add semantics guaranteed by some standard/are fundamental parts of the contract of a function (e.g. noreturn), allow removing dead code or some amount of constant folding (e.g. pure), or enable diagnostics (e.g. format). None of these allow exploitation of UB the library could define I think.

I consider whether a pointer is allowed to be null or not to be a fundamental part of the contract of the function; you don't?

Not necessarily. The standard doesn't make any guarantees, but libraries are allowed to. That's the part the below example doesn't match. __builtin_unreachable() isn't part of any standard, so the compiler defines the contract from scratch. If you take a function that's actually defined in a standard, e.g. exit, that is guaranteed by the standard to not return. We're adding semantic information guaranteed by the standard - we're not assuming what an implementation actually does when it's called out of contract.

Noreturn is a good example -- if a noreturn function actually returns, that's UB and we do "exploit" it: https://godbolt.org/z/5YcMnbf11

I think builtins should model as much semantic information as we would like to glean, both for diagnostics and for optimizations. But if there are good reasons to ignore some of that information, then I think it's reasonable for us to do so. I don't think there are good reasons to ignore that information in this specific case, but we've ignored it in the past.

Do you have some usage pattern you're worried about? If we have evidence this will be disruptive, that would be compelling to know about.

I don't. I mostly didn't see any discussion of whether it's a good idea to potentially introduce UB here, which worried me somewhat, especially since I think this patch does introduce a novel idea.

This does not introduce UB here. It's already UB and we have evidence that standard library implementations treat it as behaviorally undefined: https://godbolt.org/z/bqn16fWTs and optimize on it, even without the standard library marking anything explicitly: https://godbolt.org/z/3zY6MzcY9 So I don't think what this patch is doing is actually novel.

Yes, I agree now that it doesn't introduce UB in this case. What I meant is that that wasn't clear to me given the lack of discussion about it on the PR, and IMO nonnull is an attribute that should always require good evidence that there is no library defining the behaviour of a nullptr. Now that we have that evidence I don't think this particular patch is a bad idea.

That said, I don't think we have a good plan for hardened libraries. We have no notion of conditionally applying these markings in tablegen, and perhaps we need to think of something for that. e.g., if the user passes -fhardened to the compiler (or whatever we end up exposing), I can imagine we'd want to switch from __attribute__((nonnull)) to _Nonnull semantics in that case. But I'm also not certain we need such a mechanism today for this PR, either.

CC @rjmccall @rnk @efriedma-quic @jyknight for some additional opinions, in case my stance is off-base.

I agree.

@philnik777
Copy link
Contributor

I think this is already the precedent? [...]

I don't think so. AFAICT the currently added attributes simply add semantics guaranteed by some standard/are fundamental parts of the contract of a function (e.g. noreturn), allow removing dead code or some amount of constant folding (e.g. pure), or enable diagnostics (e.g. format). None of these allow exploitation of UB the library could define I think.

I consider whether a pointer is allowed to be null or not to be a fundamental part of the contract of the function; you don't?

Not necessarily. The standard doesn't make any guarantees, but libraries are allowed to. That's the part the below example doesn't match. __builtin_unreachable() isn't part of any standard, so the compiler defines the contract from scratch. If you take a function that's actually defined in a standard, e.g. exit, that is guaranteed by the standard to not return. We're adding semantic information guaranteed by the standard - we're not assuming what an implementation actually does when it's called out of contract.

To expand a bit on the __builtin_unreachable part: If you take std::unreachable, the contract basically is that there is no guarantee what happens when it's actually executed. However, in some modes libc++ actually defines what happens when you call std::unreachable, namely that we trap or abort. If Clang would hijack std::unreachable and made it unconditionally equivalent to __builtin_unreachable it would override what the library wanted to happen.

@AaronBallman
Copy link
Collaborator

I consider whether a pointer is allowed to be null or not to be a fundamental part of the contract of the function; you don't?

Not necessarily. The standard doesn't make any guarantees, but libraries are allowed to. That's the part the below example doesn't match. __builtin_unreachable() isn't part of any standard,

It's the implementation for std::unreachable so it does follow the standard. I used it as an example because the point to library builtins is to recognize calls to standard library functions and translate those into __builtin_whatever calls so the rest of the toolchain can reason about the call more easily. So for the compiler, there is potentially no difference between calling std::unreachable and calling __builtin_unreachable because they may the same thing. (We don't have library builtins for most of the C++ standard library because most of the C++ standard library does not lend itself to translation to builtins, but it does happen.)

If Clang would hijack std::unreachable and made it unconditionally equivalent to __builtin_unreachable it would override what the library wanted to happen.

Yes, but this is true of all semantic information for all of our builtins. That's kind of the point to the way our builtins work. We hijack things we think we can do more efficiently than the library can. Any difference between the library semantics and what we predetermine is a problem.

...but more below.

... IMO nonnull is an attribute that should always require good evidence that there is no library defining the behaviour of a nullptr.

I understand that's your assertion but what I don't understand is the justification for it. To me, this is the opposite of how things usually work: if the standard defines it as UB, we optimize based on that unless there's a good reason not to. Your argument is that we need to provide a good reason to optimize based on UB otherwise we shouldn't be doing so. That's a valid stance to take, but isn't how we've approached things in the past and I think requires wider community buy-in.

That said, I don't think we have a good plan for hardened libraries. We have no notion of conditionally applying these markings in tablegen, and perhaps we need to think of something for that. e.g., if the user passes -fhardened to the compiler (or whatever we end up exposing), I can imagine we'd want to switch from __attribute__((nonnull)) to _Nonnull semantics in that case. But I'm also not certain we need such a mechanism today for this PR, either.
CC @rjmccall @rnk @efriedma-quic @jyknight for some additional opinions, in case my stance is off-base.

I agree.

To me, I think this is the crux of the problem. There's a tension between the notion of a builtin statically knowing the semantics of the API and the notion of a specific library implementation wanting to have different (maybe stronger) semantics. We don't have a way for a library to signal "we're doing something beyond the standard semantics", so we have no way from the compiler to know which markings are reasonable. For example, maybe we want to encode whether an API sets errno internally or not (think: using the pure marking), but some libraries do set errno while others do not. But we can't necessarily rely on the library to mark things for us (or opt out of our inferences) because some popular libraries provide no markings (for example, musl or MSVC CRT). I'm not certain what a good solution for this is, so I think we're left doing what we think is best on a case-by-case basis with the static information. I think what I'm hearing on this PR is that we're all comfortable optimizing on it for the formatted IO functions (or am I reading the room wrong)?

Specific to this PR, are we happy with the design of Nonnull in tablegen meaning __attribute__((nonnull)) or do we want something more explicit so we can distinguish between _Nonnull and __attribute__((nonnull)) on a case-by-case basis (in the future)?

@philnik777
Copy link
Contributor

I consider whether a pointer is allowed to be null or not to be a fundamental part of the contract of the function; you don't?

Not necessarily. The standard doesn't make any guarantees, but libraries are allowed to. That's the part the below example doesn't match. __builtin_unreachable() isn't part of any standard,

It's the implementation for std::unreachable so it does follow the standard.

I don't really agree here. It's one possible implementation. libc++'s std::unreachable isn't necessarily equivalent to __builtin_unreachable. That's only not a problem because the compiler doesn't hijack std::unreachable.

I used it as an example because the point to library builtins is to recognize calls to standard library functions and translate those into __builtin_whatever calls so the rest of the toolchain can reason about the call more easily. So for the compiler, there is potentially no difference between calling std::unreachable and calling __builtin_unreachable because they may the same thing. (We don't have library builtins for most of the C++ standard library because most of the C++ standard library does not lend itself to translation to builtins, but it does happen.)

I think it's not a particularly useful example here, since it doesn't hijack anything. If you only modified __builtin_printf and not printf I think the idea of this patch would be much less contentious. Feel free to go wild on the API provided by the compiler.

I understand that's your assertion but what I don't understand is the justification for it. To me, this is the opposite of how things usually work: if the standard defines it as UB, we optimize based on that unless there's a good reason not to. Your argument is that we need to provide a good reason to optimize based on UB otherwise we shouldn't be doing so. That's a valid stance to take, but isn't how we've approached things in the past and I think requires wider community buy-in.

I don't think that's my stance at all. My stance is that the compiler shouldn't assume what UB libraries define and what not, unless there is good evidence that libraries do in fact not define the behaviour. AFAICT this is the first attribute where we assume libraries don't define certain UB. All the other attributes tell the compiler something a standard already guarantees.

To me, I think this is the crux of the problem. There's a tension between the notion of a builtin statically knowing the semantics of the API and the notion of a specific library implementation wanting to have different (maybe stronger) semantics. We don't have a way for a library to signal "we're doing something beyond the standard semantics", so we have no way from the compiler to know which markings are reasonable. For example, maybe we want to encode whether an API sets errno internally or not (think: using the pure marking), but some libraries do set errno while others do not. But we can't necessarily rely on the library to mark things for us (or opt out of our inferences) because some popular libraries provide no markings (for example, musl or MSVC CRT). I'm not certain what a good solution for this is, so I think we're left doing what we think is best on a case-by-case basis with the static information.

I understand that there are libraries that don't annotate their APIs and that may mean the compiler needs to add attributes by hijacking the declarations. That doesn't mean I have to like it, and I'd be especially unhappy if that happened to C++ APIs. OTOH I'd be quite happy if we had some way to opt-in to function hijacking, i.e. have some attribute to say "this function has identical semantics to __builtin_whatever". I realize this requires library buy-in which seems to not be an option for some libcs, but I do think it's feasible for the C++ libraries.

I think what I'm hearing on this PR is that we're all comfortable optimizing on it for the formatted IO functions (or am I reading the room wrong)?

I'm fine with this, yes. I'm also happy to move the discussion somewhere else, maybe your next office hours?

@AaronBallman
Copy link
Collaborator

I'm fine with this, yes. I'm also happy to move the discussion somewhere else, maybe your next office hours?

We had that discussion at my office hours this week, and this is a recap of what we decided on for a way forward (not all of this applies to the changes in this PR):

  1. it should be clear from the marking in Buildins.td whether the nonnull is one that's purely for diagnostic purposes or also includes optimization purposes. e.g., maybe we have NonNull as a base class with an OptimizationsAllowed field, and then have base classes for diagnosing vs optimizing markings. This helps ensure folks in the future know what behavior they're opting the builtin into; currently it is unclear.
  2. we think the default marking should be using the _Nonnull type qualifier instead of __attribute__((nonnull)) because that's the default with the least amount of surprise behind it. However, if there's a case to be made for why optimizing on the null pointer case is important, then we should be fine accepting the optimizing version.
  3. When there's a "hardening mode" in the compiler, it might make sense for it to control whether any optimizing nonnull markings are honored.

Specific to this PR, it sounds like the sentiment is now to use _Nonnull markings because there's not really a case to be made for why optimizing on null matters in real world code for the formatted IO APIs. So it might make sense to rework the changes so that users can write NonNull<NotOptimizing, [0]> in Builtins.td (where NotOptimizing is an enumeration) and then have that marking produce _Nonnull qualifiers instead.

Does this capture what you were expecting @philnik777?

@philnik777
Copy link
Contributor

Yes, sounds good to me.

@AaronBallman
Copy link
Collaborator

Yes, sounds good to me.

@bozicrHT sorry for the incredibly long back-and-forth on this; does this plan also make sense to you?

@bozicrHT
Copy link
Contributor Author

Hi Aaron, thanks - the discussion was very helpful. I agree with the conclusions.
For this PR, I’ll rework the implementation to:

  • switch from emitting __attribute__((nonnull)) to emitting _Nonnull
  • introduce NonNull<NotOptimizing, [0]> (with NotOptimizing as an enum) in Builtins.td, and have that form produce _Nonnull qualifiers on the specified parameters instead of the optimizing attribute.

If this sounds like the right scope for this PR, I’ll proceed.

@AaronBallman
Copy link
Collaborator

Hi Aaron, thanks - the discussion was very helpful. I agree with the conclusions. For this PR, I’ll rework the implementation to:

* switch from emitting `__attribute__((nonnull))` to emitting `_Nonnull`

* introduce `NonNull<NotOptimizing, [0]>` (with `NotOptimizing` as an enum) in `Builtins.td`, and have that form produce `_Nonnull` qualifiers on the specified parameters instead of the optimizing attribute.

If this sounds like the right scope for this PR, I’ll proceed.

That sounds correct to me!

Introduce two modes for the builtin nonnull attribute. Instead of always
emitting the GNU-style `nonnull` attribute, Clang now distinguishes
between:

* NonOptimizing: attaches Clang's `_Nonnull` type qualifier.
* Optimizing: emits GNU-style `nonnull` attribute.
@bozicrHT
Copy link
Contributor Author

Hi @AaronBallman, could you take a look at the updated implementation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:static analyzer clang Clang issues not falling into any other category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants