Skip to content

Conversation

@ameerj
Copy link
Contributor

@ameerj ameerj commented Sep 30, 2024

Adds ** support for .clang-format-ignore to support similar pattern matching to .gitignore

closes #110160

@llvmbot
Copy link
Member

llvmbot commented Sep 30, 2024

@llvm/pr-subscribers-clang-format

Author: Ameer J (ameerj)

Changes

Adds ** support for .clang-format-ignore to support similar pattern matching to .gitignore

closes #110160


Full diff: https://github.com/llvm/llvm-project/pull/110560.diff

2 Files Affected:

  • (modified) clang/lib/Format/MatchFilePath.cpp (+20-2)
  • (modified) clang/unittests/Format/MatchFilePathTest.cpp (+34)
diff --git a/clang/lib/Format/MatchFilePath.cpp b/clang/lib/Format/MatchFilePath.cpp
index 062b334dcdd8fd..3d7861499b9a46 100644
--- a/clang/lib/Format/MatchFilePath.cpp
+++ b/clang/lib/Format/MatchFilePath.cpp
@@ -49,11 +49,29 @@ bool matchFilePath(StringRef Pattern, StringRef FilePath) {
         return false;
       break;
     case '*': {
-      while (++I < EOP && Pattern[I] == '*') { // Skip consecutive stars.
+      if (I + 1 < EOP && Pattern[I + 1] == '*') {
+        // Handle '**' pattern
+        while (++I < EOP && Pattern[I] == '*') { // Skip consecutive stars.
+        }
+        if (I == EOP)
+          return true; // '**' at the end matches everything
+        if (Pattern[I] == Separator) {
+          // Try to match the rest of the pattern without consuming the
+          // separator for the case where we want to match "zero" directories
+          // e.g. "a/**/b" matches "a/b"
+          if (matchFilePath(Pattern.substr(I + 1), FilePath.substr(J)))
+            return true;
+        }
+        while (J < End) {
+          if (matchFilePath(Pattern.substr(I), FilePath.substr(J)))
+            return true;
+          ++J;
+        }
+        return false;
       }
       const auto K = FilePath.find(Separator, J); // Index of next `Separator`.
       const bool NoMoreSeparatorsInFilePath = K == StringRef::npos;
-      if (I == EOP) // `Pattern` ends with a star.
+      if (++I == EOP) // `Pattern` ends with a star.
         return NoMoreSeparatorsInFilePath;
       // `Pattern` ends with a lone backslash.
       if (Pattern[I] == '\\' && ++I == EOP)
diff --git a/clang/unittests/Format/MatchFilePathTest.cpp b/clang/unittests/Format/MatchFilePathTest.cpp
index 28f665635718e5..a6df090a802128 100644
--- a/clang/unittests/Format/MatchFilePathTest.cpp
+++ b/clang/unittests/Format/MatchFilePathTest.cpp
@@ -164,6 +164,40 @@ TEST_F(MatchFilePathTest, Path) {
   EXPECT_FALSE(match("foo\\", R"(foo*\)"));
 }
 
+TEST_F(MatchFilePathTest, DoubleAsterisk) {
+  EXPECT_TRUE(match("a/b/c/d.cpp", "**b**"));
+  EXPECT_TRUE(match("a/b/c/d.cpp", "**/b/**"));
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "**d_*"));
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "**/d_*"));
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "**d_**"));
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "**/d_**"));
+
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "**/b/c/**"));
+
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "a/**/b/c/d_e.cpp"));
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "a/**/c/d_e.cpp"));
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "a/**/b/**/d_e.cpp"));
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "**/b/**/d_e.cpp"));
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "a/**/**/b/**"));
+
+  EXPECT_FALSE(match("a/b/c/d_e.cpp", "**/d"));
+  EXPECT_FALSE(match("a/b/c/d_e.cpp", "**/b/d"));
+  EXPECT_FALSE(match("a/b/c/d_e.cpp", "**/b/d/**"));
+  EXPECT_FALSE(match("a/b/c/d_e.cpp", "**/b/c/"));
+
+  // Multiple consecutive asterisks are treated as **
+  EXPECT_TRUE(match("a/b/c/d.cpp", "***b****"));
+  EXPECT_TRUE(match("a/b/c/d.cpp", "****/b/***"));
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "***d_**"));
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "****/d_*"));
+  EXPECT_TRUE(match("a/b/c/d_e.cpp", "***/b/c/*****"));
+
+  EXPECT_FALSE(match("a/b/c/d_e.cpp", "*****/d"));
+  EXPECT_FALSE(match("a/b/c/d_e.cpp", "***/b/d"));
+  EXPECT_FALSE(match("a/b/c/d_e.cpp", "*****/b/d/***"));
+  EXPECT_FALSE(match("a/b/c/d_e.cpp", "***/b/c"));
+}
+
 } // namespace
 } // namespace format
 } // namespace clang

@mydeveloperday mydeveloperday requested a review from owenca October 2, 2024 13:14
EXPECT_FALSE(match("a/b/c/d_e.cpp", "*****/d"));
EXPECT_FALSE(match("a/b/c/d_e.cpp", "***/b/d"));
EXPECT_FALSE(match("a/b/c/d_e.cpp", "*****/b/d/***"));
EXPECT_FALSE(match("a/b/c/d_e.cpp", "***/b/c"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be some tests with repetition?

project/tests/lib/tests/module/test_abc.cpp

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are doing this an saying "Its like .gitignore" should we support

/*
!/foo/bar

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be some tests with repetition?

I'm not sure I understand what you are asking.
Can you provide examples of the path and the ignore pattern that you expect to be problematic?

if we are doing this an saying "Its like .gitignore" should we support

I am only indicating that the ** glob pattern is implemented in .gitignore, and we follow the same rules they do for this specific pattern.

I do not think the current implementation could easily support theese glob patterns the same way .gitignore does.
My understanding is that there is some "state" where a lower level ! negates an earlier match/ignore?
I think the current implementation only considers one pattern in the file at a time, in isolation of all other lines in the .clang-format-ignore file.

@owenca may understand better and have an opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are doing this an saying "Its like .gitignore" should we support

I am only indicating that the ** glob pattern is implemented in .gitignore, and we follow the same rules they do for this specific pattern.

When I added .clang-format-ignore, I chose a POSIX glob-like model and intentionally stayed clear of the .gitignore model, which is, IMO, confusing and overcomplicated. I prefer to not add ** now to partially match .gitignore unless there's a good reason and strong demand for it. There's also a possibility of regression as ** is currently matched the same way as a single *.

I think the current implementation only considers one pattern in the file at a time, in isolation of all other lines in the .clang-format-ignore file.

Yes, the current behavior is to check the patterns in the .clang-format-ignore file one pattern at a time until a match is found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless there's a good reason

This came up as I have been working on adding support for clang-format within AMD's PAL repository and found clang-format-ignore to be extremely limiting.

There are 2 key reasons I am adding this feature:

  • Simplifying Subdirectory Exclusion: ** allows for easy exclusion of entire subdirectories. Currently you need to specify each depth level in a subdirectory individually.
  • Pattern-Based File Ignoring: ** simplifies ignoring specific file patterns that may appear at any level within a directory tree, which is especially useful when the exact depth of these files can vary

These two cases came up in the PAL repo in the following manner:

  • There are various imported libraries added throughout the codebase. We would need to ignore all of the imported subdirs in their entirety
  • Various files exist throughout the project that should always be excluded. Generated files, shader files, etc.

The current way to ignore these kinds of patterns is something like this:

imported/*
imported/*/*
imported/*/*/*
imported/*/*/*/*
imported/*/*/*/*/*
imported/*/*/*/*/*/*
...

src/*/*/ignore_me.cpp
src/*/*/*/ignore_me.cpp
src/*/*/*/*/ignore_me.cpp
src/*/*/*/*/*/ignore_me.cpp
....

Which can be simplified into

imported/**
src/**/ignore_me.cpp

and strong demand for it.

How do we gauge how much of a demand there is for this feature?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless there's a good reason

This came up as I have been working on adding support for clang-format within AMD's PAL repository and found clang-format-ignore to be extremely limiting.

How is it extremely limited? Just because it doesn't support ** in the middle of the pathnames? How did you manage before clang-format gained the .clang-format-ignore functionality?

There are 2 key reasons I am adding this feature:

  • Simplifying Subdirectory Exclusion: ** allows for easy exclusion of entire subdirectories. Currently you need to specify each depth level in a subdirectory individually.

We have DisableFormat for that.

  • Pattern-Based File Ignoring: ** simplifies ignoring specific file patterns that may appear at any level within a directory tree, which is especially useful when the exact depth of these files can vary

I was conscious of this use case, but you can do something like find . -name dont-format-me >> .clang-format-ignore even though it's less than ideal.

and strong demand for it.

How do we gauge how much of a demand there is for this feature?

See #52975 (comment).

If we were to add ** to match gitignore, we should also handle it at the beginning and the end of a pathname.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you manage before clang-format gained the .clang-format-ignore functionality?

PAL has not used clang-format at all, my contributions have been an effort to support the PAL formatting style so we can start to use the tool.

  • Simplifying Subdirectory Exclusion: ** allows for easy exclusion of entire subdirectories. Currently you need to specify each depth level in a subdirectory individually.

We have DisableFormat for that.

I was not aware of DisableFormat, it's not as intuitive as having everything in the ignore file but it can be a compromise if you're completely opposed to adding the ** change.

  • Pattern-Based File Ignoring: ** simplifies ignoring specific file patterns that may appear at any level within a directory tree, which is especially useful when the exact depth of these files can vary

I was conscious of this use case, but you can do something like find . -name dont-format-me >> .clang-format-ignore even though it's less than ideal.

Like you said, this is not ideal as it relies on anyone adding/moving files that may need to be ignored to regenerate the clang-format-ignore file.

** could handle these cases without a need for manual intervention.

If we were to add ** to match gitignore, we should also handle it at the beginning and the end of a pathname.

Are you seeing a gap in the unit tests I have? I think I have these cases covered but can add more specific unit tests if you have an example.

@ameerj
Copy link
Contributor Author

ameerj commented Oct 11, 2024

@mydeveloperday @owenca
bump for review feedback

@ameerj
Copy link
Contributor Author

ameerj commented Oct 23, 2024

@HazardyKnusperkeks Can I get your eyes on this PR please?

case '*': {
while (++I < EOP && Pattern[I] == '*') { // Skip consecutive stars.
if (I + 1 < EOP && Pattern[I + 1] == '*') {
// Handle '**' pattern
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Handle '**' pattern
// Handle '**' pattern.

Always end in full stop.

while (++I < EOP && Pattern[I] == '*') { // Skip consecutive stars.
if (I + 1 < EOP && Pattern[I + 1] == '*') {
// Handle '**' pattern
while (++I < EOP && Pattern[I] == '*') { // Skip consecutive stars.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a find_if.

@GloryOfNight
Copy link

I want to express support for this feature. Just today I attempted to integrate .clang-format-ignore and was really confused by fact that it cannot ignore recursively. Project I wanted to integrate it to has a lot of generated files which is all over the place. It's very inconvenient have to use that pattern for every directory I want to ignore:

imported/*
imported/*/*
imported/*/*/*

DisableFormat is also inconvenient, since placing .clang-format with disabled formatting all over the project not exactly a solution I would want to go with either.

Just having single .clang-format-ignore file with imported/** would be much more convenient way to do things. IMO.

@owenca
Copy link
Contributor

owenca commented Dec 31, 2024

See #121404.

@ameerj
Copy link
Contributor Author

ameerj commented Dec 31, 2024

Closing in favor of #121404

@ameerj ameerj closed this Dec 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[clang-format] .clang-format-ignore does not match / with * wildcard

6 participants