Skip to content

Conversation

DKLoehr
Copy link
Contributor

@DKLoehr DKLoehr commented Jul 21, 2025

This changes the glob matcher for the sanitizer special case format so that it treats / as matching both forward and back slashes.

When dealing with cross-compiles or build systems that don't normalize slashes, it's possible to run into file paths with inconsistent slashiness, e.g. ../..\v8/include\v8-internal.h when building chromium.

We can match this using the current syntax using this ugly kludge: src:*{/,\\}v8{/,\\}*. However, since the format is explicitly for listing file paths, it makes sense to treat / as denoting a path separator rather than a literal forward slash. This allows us to write the much more natural form src:*/v8/* and have it work on any platform.

This is technically a behavior change, but it seems very unlikely to come up in practice. It will only make a difference if a user has a system where they want to catch a/b but not a\b. Even in the worst case, they can still regain the previous behavior by escaping the slash character in their pattern: src:a\/b.

@llvmbot llvmbot added clang Clang issues not falling into any other category llvm:support labels Jul 21, 2025
@llvmbot
Copy link
Member

llvmbot commented Jul 21, 2025

@llvm/pr-subscribers-compiler-rt-sanitizer
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-llvm-support

Author: Devon Loehr (DKLoehr)

Changes

This changes the glob matcher for the sanitizer special case format so that it treats / as matching both forward and back slashes.

When dealing with cross-compiles or build systems that don't normalize slashes, it's possible to run into file paths with inconsistent slashiness, e.g. ../..\v8/include\v8-internal.h when building chromium.

We can match this using the current syntax using this ugly kludge: src:*{/,\\}v8{/,\\}*. However, since the format is explicitly for listing file paths, it makes sense to treat / as denoting a path separator rather than a literal forward slash. This allows us to write the much more natural form src:*/v8/* and have it work on any platform.

This is technically a behavior change, but it seems very unlikely to come up in practice. It will only make a difference if a user has a system where both a/b and a\b exist, are different files, and they only want to capture the first of them. Even in the worst case, they can still regain the previous behavior by escaping the slash character in their pattern: src:a\/b.


Full diff: https://github.com/llvm/llvm-project/pull/149886.diff

5 Files Affected:

  • (modified) clang/docs/SanitizerSpecialCaseList.rst (+1)
  • (modified) clang/unittests/Basic/DiagnosticTest.cpp (+23)
  • (modified) llvm/docs/ReleaseNotes.md (+4)
  • (modified) llvm/include/llvm/Support/GlobPattern.h (+1)
  • (modified) llvm/lib/Support/GlobPattern.cpp (+4)
diff --git a/clang/docs/SanitizerSpecialCaseList.rst b/clang/docs/SanitizerSpecialCaseList.rst
index 194f2fc5a7825..3aea40ce0715d 100644
--- a/clang/docs/SanitizerSpecialCaseList.rst
+++ b/clang/docs/SanitizerSpecialCaseList.rst
@@ -174,6 +174,7 @@ tool-specific docs.
     # Lines starting with # are ignored.
     # Turn off checks for the source file
     # Entries without sections are placed into [*] and apply to all sanitizers
+    # "/" matches both windows and unix path separators ("/" and "\")
     src:path/to/source/file.c
     src:*/source/file.c
     # Turn off checks for this main file, including files included by it.
diff --git a/clang/unittests/Basic/DiagnosticTest.cpp b/clang/unittests/Basic/DiagnosticTest.cpp
index b0a034e9af1cd..3ab3cc918f5dc 100644
--- a/clang/unittests/Basic/DiagnosticTest.cpp
+++ b/clang/unittests/Basic/DiagnosticTest.cpp
@@ -360,4 +360,27 @@ TEST_F(SuppressionMappingTest, ParsingRespectsOtherWarningOpts) {
   clang::ProcessWarningOptions(Diags, Diags.getDiagnosticOptions(), *FS);
   EXPECT_THAT(diags(), IsEmpty());
 }
+
+TEST_F(SuppressionMappingTest, ForwardSlashMatchesBothDirections) {
+  llvm::StringLiteral SuppressionMappingFile = R"(
+  [unused]
+  src:*clang/*
+  src:*clang/lib/Sema/*=emit
+  src:*clang/lib\\Sema/foo*)";
+  Diags.getDiagnosticOptions().DiagnosticSuppressionMappingsFile = "foo.txt";
+  FS->addFile("foo.txt", /*ModificationTime=*/{},
+              llvm::MemoryBuffer::getMemBuffer(SuppressionMappingFile));
+  clang::ProcessWarningOptions(Diags, Diags.getDiagnosticOptions(), *FS);
+  EXPECT_THAT(diags(), IsEmpty());
+
+  EXPECT_TRUE(Diags.isSuppressedViaMapping(
+      diag::warn_unused_function, locForFile(R"(clang/lib/Basic/foo.h)")));
+  EXPECT_FALSE(Diags.isSuppressedViaMapping(
+      diag::warn_unused_function, locForFile(R"(clang/lib/Sema\bar.h)")));
+  EXPECT_TRUE(Diags.isSuppressedViaMapping(
+      diag::warn_unused_function, locForFile(R"(clang\lib\Sema/foo.h)")));
+  // The third pattern requires a literal backslash before Sema
+  EXPECT_FALSE(Diags.isSuppressedViaMapping(
+      diag::warn_unused_function, locForFile(R"(clang/lib/Sema/foo.h)")));
+}
 } // namespace
diff --git a/llvm/docs/ReleaseNotes.md b/llvm/docs/ReleaseNotes.md
index bb1f88e8480f1..8a883b7329343 100644
--- a/llvm/docs/ReleaseNotes.md
+++ b/llvm/docs/ReleaseNotes.md
@@ -338,6 +338,10 @@ Changes to BOLT
 Changes to Sanitizers
 ---------------------
 
+* The [sanitizer special case list format](https://clang.llvm.org/docs/SanitizerSpecialCaseList.html#format)
+  now treats forward slashes as either a forward or a backslash, to handle
+  paths with mixed unix and window styles.
+
 Other Changes
 -------------
 
diff --git a/llvm/include/llvm/Support/GlobPattern.h b/llvm/include/llvm/Support/GlobPattern.h
index 62ed4a0f23fd9..af92c63331282 100644
--- a/llvm/include/llvm/Support/GlobPattern.h
+++ b/llvm/include/llvm/Support/GlobPattern.h
@@ -35,6 +35,7 @@ namespace llvm {
 ///   expansions are not supported. If \p MaxSubPatterns is empty then
 ///   brace expansions are not supported and characters `{,}` are treated as
 ///   literals.
+/// * `/` matches both unix and windows path separators: `/` and `\`.
 /// * `\` escapes the next character so it is treated as a literal.
 ///
 /// Some known edge cases are:
diff --git a/llvm/lib/Support/GlobPattern.cpp b/llvm/lib/Support/GlobPattern.cpp
index 7004adf461a0c..26b3724863ee8 100644
--- a/llvm/lib/Support/GlobPattern.cpp
+++ b/llvm/lib/Support/GlobPattern.cpp
@@ -231,6 +231,10 @@ bool GlobPattern::SubGlobPattern::match(StringRef Str) const {
         ++S;
         continue;
       }
+    } else if (*P == '/' && (*S == '/' || *S == '\\')) {
+      ++P;
+      ++S;
+      continue;
     } else if (*P == *S || *P == '?') {
       ++P;
       ++S;

@DKLoehr DKLoehr requested review from MaskRay and DavidSpickett July 21, 2025 20:15
@MaskRay
Copy link
Member

MaskRay commented Jul 22, 2025

GlobPattern is also used in binary utilities' symbol/section name matching where we want to be rigid. While I do not oppose to this change, I think it should be opt-in if we add it.

@DavidSpickett
Copy link
Collaborator

I don't have an opinion either way but there are more users of GlobPattern so you should check the impact on them.

@DKLoehr
Copy link
Contributor Author

DKLoehr commented Jul 22, 2025

GlobPattern is also used in binary utilities' symbol/section name matching where we want to be rigid

Thanks for pointing that out! I narrowed the scope of the change so that the parts of the code which want it have to explicitly enable it. Right now that's just the special case list, which is the target of these changes in any case.

@DKLoehr
Copy link
Contributor Author

DKLoehr commented Jul 28, 2025

Friendly ping on this

@MaskRay
Copy link
Member

MaskRay commented Jul 29, 2025

You might consider adding more reviewers to gather broader perspectives on this feature. I'm neutral about it personally.

@jyknight
Copy link
Member

I think this is reasonable to do -- but should only be done on Windows hosts.

@rnk
Copy link
Collaborator

rnk commented Jul 30, 2025

I think this is reasonable. The other alternative would be canonicalizing slashes to forward slashes before comparing against the special case list, but I think I like this approach better.

@DKLoehr DKLoehr force-pushed the slashy branch 2 times, most recently from 90a69e2 to 057324e Compare September 5, 2025 17:21
@@ -231,6 +233,10 @@ bool GlobPattern::SubGlobPattern::match(StringRef Str) const {
++S;
continue;
}
} else if (IsSlashAgnostic && *P == '/' && (*S == '/' || *S == '\\')) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*S == '/' is redundant here and can be removed; it'd be handled in the next clause anyhow.

For completeness, it may make sense on windows to also match \\ against /. Then we can document as "now treats forward and backward slashes as equivalent when matching paths"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should match both ways, because backslashes in patterns already have a meaning ("exactly the following character"). If we escaped the backslash, then we should expect to match exactly a backslash. It's not clear how to also match '/' in that case and still be clear on the rules.

Letting '/' match both seems good to me because it allows us to use forward slash as a generic "path separator" marker, and that's all we need here.

@llvmbot llvmbot added the clang:frontend Language frontend issues, e.g. anything involving "Sema" label Sep 8, 2025
Copy link

github-actions bot commented Sep 8, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@DKLoehr
Copy link
Contributor Author

DKLoehr commented Sep 18, 2025

@jyknight Friendly ping on this.

Copy link
Collaborator

@vitalybuka vitalybuka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to optimize SanitizerSpecialCaseList so we sort by prefixes/suffixes to limit applied matches.

IsSlashAgnostic will be needed to taken into account soring.
Which is possible.

Would be simple just normalize patters on load?
Same for queries?

Copy link
Collaborator

@vitalybuka vitalybuka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this patch will not handle pattern:
path "clang\something" agains patter "src:clang/".

Can you please extend the test?

@vitalybuka
Copy link
Collaborator

vitalybuka commented Oct 2, 2025

There is llvm/unittests/Support/GlobPatternTest.cpp which should be extended.

@DKLoehr
Copy link
Contributor Author

DKLoehr commented Oct 2, 2025

I believe this patch will not handle pattern: path "clang\something" agains patter "src:clang/".
...
There is llvm/unittests/Support/GlobPatternTest.cpp which should be extended.

Thanks for pointing that out, although it took me a little bit to figure out why it wasn't matching. I've added a unit test to the requested file and fixed the prefix bug.


// Store the prefix that does not contain any metacharacter.
size_t PrefixSize = S.find_first_of("?*[{\\");
size_t PrefixSize = S.find_first_of("?*[{\\/");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it breaks optimization for all platforms.

Maybe add a member of GlobPattern::IsSlashAgnostic?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And use it to pessimize only for IsSlashAgnostic here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a member of GlobPattern::IsSlashAgnostic?

This was the original strategy, IIRC I changed it because we wanted to restrict it only to filenames and that information wasn't readily available when we're making it a glob. I suppose we could thread it through if we have to.

@vitalybuka
Copy link
Collaborator

Would be simple just normalize patters on load? Same for queries?

WDYT?

@DKLoehr
Copy link
Contributor Author

DKLoehr commented Oct 6, 2025

Would be simple just normalize patters on load? Same for queries?

I don't know what the implications of this are, but it seems much simpler at this point so we can try that instead. It also makes your sorting much easier.

I don't know what a query is in this context.

@vitalybuka
Copy link
Collaborator

vitalybuka commented Oct 6, 2025

Would be simple just normalize patters on load? Same for queries?

I don't know what the implications of this are, but it seems much simpler at this point so we can try that instead.

I don't know what a query is in this context.

It's query parameter SpecialCaseList::Matcher::match(StringRef Query)

But it needs to be normalized on higher levels, when we know this is path.

Annoying part is that to normalize globs you need to '\' -> '/'.
But to do that you needs some glob parsing to distinguish escape sequence \ and real \ .

And then I don't know which approach will be better.

NOTE: There is very simple option, no sure how acceptable.

  1. Special list always use "/" (for windows and posix)
  2. 'query' on windows replace \ to / in file name.

This one must be easy to do.

And maybe makes sense: pre-processor on windows, e.g. #include " also relies on / only, why not to require the same in ignore lists?

@DKLoehr
Copy link
Contributor Author

DKLoehr commented Oct 6, 2025

I've been experimenting with the simple option (canonicalize file names before matching, and require the patterns to use / as file separators on all systems), but it requires anyone who uses a case list with backslashes to rewrite it, which I suspect would be considered too onerous to merge.

If we don't impose that requirement, then we'd need to edit the globs as well, but I'm hesitant to mess with user-provided patterns because it seems easy to produce unintuitive behavior (and makes debugging harder on their end), e.g. they wonder why their \ in a pattern isn't matching anything.

@vitalybuka
Copy link
Collaborator

I've been experimenting with the simple option (canonicalize file names before matching, and require the patterns to use / as file separators on all systems), but it requires anyone who uses a case list with backslashes to rewrite it, which I suspect would be considered too onerous to merge.

Seems like /\ ambiguity is also error prone. It's glob, how often \ will be confused with escape sequence?

If we don't impose that requirement, then we'd need to edit the globs as well, but I'm hesitant to mess with user-provided patterns because it seems easy to produce unintuitive behavior (and makes debugging harder on their end), e.g. they wonder why their \ in a pattern isn't matching anything.

You have a difference problem now, why "\" does not match as expected (what ever is user expected)?

Given that, I'd prefer we just make windows also rely on /.

@DKLoehr
Copy link
Contributor Author

DKLoehr commented Oct 6, 2025

For globs, there's a very simple rule that's spelled out in GlobPattern.h:

\ escapes the next character so it is treated as a literal.

If we canonicalize everything to forward slashes, then we'd be violating that rule, since foo\\bar would match foo/bar, even though the escaped \ should have been treated as a literal.

The current version of the PR introduces another simple rule:

/ matches both \ and / in filenames on windows

which is (hopefully) easy to understand, and doesn't break anything existing.

Seems like /\ ambiguity is also error prone. It's glob, how often \ will be confused with escape sequence?
You have a difference problem now, why "" does not match as expected (what ever is user expected)?

I'm not sure I totally understand the argument here, but needing to escape backslashes is very common so I would expect people who use windows are used to it.

@vitalybuka
Copy link
Collaborator

vitalybuka commented Oct 6, 2025

For globs, there's a very simple rule that's spelled out in GlobPattern.h:

\ escapes the next character so it is treated as a literal.

If we canonicalize everything to forward slashes, then we'd be violating that rule, since foo\\bar would match foo/bar, even though the escaped \ should have been treated as a literal.

how to match "foo\[bar]" ?

@DKLoehr
Copy link
Contributor Author

DKLoehr commented Oct 6, 2025

how to match "foo[bar]" ?

I'm not sure if I'm interpreting the question correctly, but if you're asking about a file named [bar] in a directory foo on windows, then:

  1. With LLVM trunk, you would write src:foo\\\[bar] (I think you don't need to escape the ])
  2. With the current PR, you would write src:foo/\[bar], OR src:foo\\\[bar] if you know the path will be windows-style (it will fail to match unix-style paths)
  3. If we canonicalized file names before matching them, you would have to write src:foo/\[bar] (src:foo\\\[bar] would not work, and you'd have to update your special case files to use the new syntax)
  4. If we "canonicalized" globs, you could write either src:foo/\[bar] OR src:foo\\\[bar], but the latter would also (incorrectly) match unix-style paths. This is ok as long as you're canonicalizing your paths, but if you apply it to something that isn't a path (or which shouldn't be canonicalized for some reason) then it's wrong.

Basically, if you manipulate the globs you're effectively adding a new rule that \\ matches either / or \, but that contradicts with the existing rule that it should only match exactly \.

I'm entirely ok with option (3); it's simple to implement, simple to state how it works, and doesn't make the glob parser slower or more complicated. The downside is it puts burden on downstream folks to change their files to the new syntax.

That burden might not be that large, because it only applies to files which are windows-specific. If you support multiple platforms, you'll have to use a workaround anyway (in chromium we use {\\,/} to match either separator). So only a subset of people will actually be impacted, but I don't know how to evaluate that so I'd like to get opinions from more experienced community members.

@vitalybuka vitalybuka requested a review from zmodem October 6, 2025 23:35
@vitalybuka
Copy link
Collaborator

vitalybuka commented Oct 6, 2025

how to match "foo[bar]" ?

I'm not sure if I'm interpreting the question correctly, but if you're asking about a file named [bar] in a directory foo on windows, then:

You interpretation is correct.

  1. With LLVM trunk, you would write src:foo\\\[bar] (I think you don't need to escape the ])
  2. With the current PR, you would write src:foo/\[bar], OR src:foo\\\[bar] if you know the path will be windows-style (it will fail to match unix-style paths)

Seems counting \ in Windows version is not very friendly: "src:foo\\\[bar]

  1. If we canonicalized file names before matching them, you would have to write src:foo/\[bar] (src:foo\\\[bar] would not work, and you'd have to update your special case files to use the new syntax)

Yes, I like this one.

I think we should not afraid update special list.
They exist because they are easy to update, and because we don't want to edit source code, which is hard to update.

  1. If we "canonicalized" globs, you could write either src:foo/\[bar] OR src:foo\\\[bar], but the latter would also (incorrectly) match unix-style paths. This is ok as long as you're canonicalizing your paths, but if you apply it to something that isn't a path (or which shouldn't be canonicalized for some reason) then it's wrong.

Basically, if you manipulate the globs you're effectively adding a new rule that \\ matches either / or \, but that contradicts with the existing rule that it should only match exactly \.

I'm entirely ok with option (3); it's simple to implement, simple to state how it works, and doesn't make the glob parser slower or more complicated. The downside is it puts burden on downstream folks to change their files to the new syntax.

That burden might not be that large, because it only applies to files which are windows-specific. If you support multiple platforms, you'll have to use a workaround anyway (in chromium we use {\\,/} to match either separator). So only a subset of people will actually be impacted, but I don't know how to evaluate that so I'd like to get opinions from more experienced community members.

I am in favor of 3.

@rnk @jyknight @zmodem Please confirm that this is acceptable?

@vitalybuka vitalybuka requested review from rnk and jyknight October 6, 2025 23:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category compiler-rt:sanitizer llvm:support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants