Skip to content

Conversation

@ilya-biryukov
Copy link
Contributor

@ilya-biryukov ilya-biryukov commented Oct 11, 2024

Currently, any FileID that references a module map file that was required for a compilation is considered as affecting. This misses an important opportunity to reduce the source location space taken by the resulting PCM.

In particular, consider the situation where the same module map file is passed multiple times in the dependency chain:

$ clang -fmodule-map-file=foo.modulemap ... -o mod1.pcm
$ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod1.pcm ... -o mod2.pcm
...
$ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod$((N-1)).pcm ... -o mod$N.pcm

Because foo.modulemap is read before reading any of the .pcm files, we have to create a unique FileID for it when creating each module. However, when reading the .pcm files, we will reuse the FileID loaded from it for the same module map file and the FileID we created can never be used again, but we will still mark it as affecting and it will take the source location space in the output PCM.

For a chain of N dependencies, this results in the file taking N * (size of file) source location space, which could be significant. For examples, we observer internally that some targets that run out of 2GB of source location space end up wasting up to 20% of that space in module maps as described above.

I take extra care to still write the InputFile entries for those files, which has not been done before. It is required for ClangScanDeps to properly function.

The change in the output of one of the ClangScanDeps tests is attributed to that: we now add a bit more module maps to the output in some tricky cases. E.g. in the test case the module map file is required and is referenced by another top-level module map, adding it is redundant, but should not be breaking.

…f source locations

Currently, any FileID that references a module map file that was
required for a compilation is considered as affecting. This misses an
important opportunity to reduce the source location space taken by the
resulting PCM.

In particular, consider the situation where the same module map file is
passed multiple times in the dependency chain:

```shell
$ clang -fmodule-map-file=foo.modulemap ... -o mod1.pcm
$ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod1.pcm ... -o mod2.pcm
...
$ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod$((N-1)).pcm ... -o mod$N.pcm
```

Because `foo.modulemap` is read before reading any of the `.pcm` files,
we have to create a unique `FileID` for it when creating each module.
However, when reading the `.pcm` files, we will reuse the `FileID`
loaded from it for the same module map file and the `FileID` we created
can never be used again, but we will still not mark it as affecting and
it will take the source location space in the output PCM.

For a chain of N dependencies, this results in the file taking
`N * (size of file)` source location space, which could be significant.
For examples, we observer internally that some targets that run out of
2GB of source location space end up wasting up to 20% of that space in
module maps as described above.

I take extra care to still write the InputFile entries for those files,
which has not been done before. It is required for ClangScanDeps to properly
function.

The change in the output of one of the ClangScanDeps tests is attributed
to that: we now add a bit more module maps to the output in some tricky
cases. E.g. in the test case the module map file is required and is
referenced by another top-level module map, adding it is redundant, but
should not be breaking.
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:modules C++20 modules and Clang Header Modules labels Oct 11, 2024
@llvmbot
Copy link
Member

llvmbot commented Oct 11, 2024

@llvm/pr-subscribers-clang-modules

@llvm/pr-subscribers-clang

Author: Ilya Biryukov (ilya-biryukov)

Changes

Currently, any FileID that references a module map file that was required for a compilation is considered as affecting. This misses an important opportunity to reduce the source location space taken by the resulting PCM.

In particular, consider the situation where the same module map file is passed multiple times in the dependency chain:

$ clang -fmodule-map-file=foo.modulemap ... -o mod1.pcm
$ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod1.pcm ... -o mod2.pcm
...
$ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod$((N-1)).pcm ... -o mod$N.pcm

Because foo.modulemap is read before reading any of the .pcm files, we have to create a unique FileID for it when creating each module. However, when reading the .pcm files, we will reuse the FileID loaded from it for the same module map file and the FileID we created can never be used again, but we will still not mark it as affecting and it will take the source location space in the output PCM.

For a chain of N dependencies, this results in the file taking N * (size of file) source location space, which could be significant. For examples, we observer internally that some targets that run out of 2GB of source location space end up wasting up to 20% of that space in module maps as described above.

I take extra care to still write the InputFile entries for those files, which has not been done before. It is required for ClangScanDeps to properly function.

The change in the output of one of the ClangScanDeps tests is attributed to that: we now add a bit more module maps to the output in some tricky cases. E.g. in the test case the module map file is required and is referenced by another top-level module map, adding it is redundant, but should not be breaking.


Full diff: https://github.com/llvm/llvm-project/pull/112015.diff

5 Files Affected:

  • (modified) clang/include/clang/Serialization/ASTWriter.h (+3)
  • (modified) clang/lib/Serialization/ASTReader.cpp (+6)
  • (modified) clang/lib/Serialization/ASTWriter.cpp (+31-10)
  • (modified) clang/test/ClangScanDeps/modules-extern-submodule.c (+1-1)
  • (added) clang/test/Modules/prune-non-affecting-module-map-repeated.cpp (+85)
diff --git a/clang/include/clang/Serialization/ASTWriter.h b/clang/include/clang/Serialization/ASTWriter.h
index 10a50b711043a8..6b03300d59adda 100644
--- a/clang/include/clang/Serialization/ASTWriter.h
+++ b/clang/include/clang/Serialization/ASTWriter.h
@@ -491,6 +491,9 @@ class ASTWriter : public ASTDeserializationListener,
 
   /// Mapping from a source location entry to whether it is affecting or not.
   llvm::BitVector IsSLocAffecting;
+  /// Mapping from a source location entry to whether it must be included as
+  /// input file.
+  llvm::BitVector IsSLocFileEntryAffecting;
 
   /// Mapping from \c FileID to an index into the FileID adjustment table.
   std::vector<FileID> NonAffectingFileIDs;
diff --git a/clang/lib/Serialization/ASTReader.cpp b/clang/lib/Serialization/ASTReader.cpp
index e5a1e20a265616..a4ce57c6b2d9c5 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -5862,6 +5862,12 @@ llvm::Error ASTReader::ReadSubmoduleBlock(ModuleFile &F,
       }
 
       CurrentModule->Kind = Kind;
+      // Note that we may be rewriting an existing location and it is important
+      // to keep doing that. In particular, we would like to prefer a
+      // `DefinitionLoc` loaded from the module file instead of the location
+      // created in the current source manager, because it allows the new
+      // location to be marked as "unaffecting" when writing and avoid creating
+      // duplicate locations for the same module map file.
       CurrentModule->DefinitionLoc = DefinitionLoc;
       CurrentModule->Signature = F.Signature;
       CurrentModule->IsFromModuleFile = true;
diff --git a/clang/lib/Serialization/ASTWriter.cpp b/clang/lib/Serialization/ASTWriter.cpp
index c6289493fce1de..135e6308b36390 100644
--- a/clang/lib/Serialization/ASTWriter.cpp
+++ b/clang/lib/Serialization/ASTWriter.cpp
@@ -38,6 +38,7 @@
 #include "clang/AST/TypeLocVisitor.h"
 #include "clang/Basic/Diagnostic.h"
 #include "clang/Basic/DiagnosticOptions.h"
+#include "clang/Basic/FileEntry.h"
 #include "clang/Basic/FileManager.h"
 #include "clang/Basic/FileSystemOptions.h"
 #include "clang/Basic/IdentifierTable.h"
@@ -81,6 +82,7 @@
 #include "llvm/ADT/APSInt.h"
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseSet.h"
 #include "llvm/ADT/Hashing.h"
 #include "llvm/ADT/PointerIntPair.h"
 #include "llvm/ADT/STLExtras.h"
@@ -166,7 +168,12 @@ static TypeCode getTypeCodeForTypeClass(Type::TypeClass id) {
 
 namespace {
 
-std::optional<std::set<const FileEntry *>>
+struct AffectingModuleMaps {
+  llvm::DenseSet<FileID> DefinitionFileIDs;
+  llvm::DenseSet<const FileEntry *> DefinitionFiles;
+};
+
+std::optional<AffectingModuleMaps>
 GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
   if (!PP.getHeaderSearchInfo()
            .getHeaderSearchOpts()
@@ -174,10 +181,12 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
     return std::nullopt;
 
   const HeaderSearch &HS = PP.getHeaderSearchInfo();
+  const SourceManager &SM = PP.getSourceManager();
   const ModuleMap &MM = HS.getModuleMap();
 
-  std::set<const FileEntry *> ModuleMaps;
-  std::set<const Module *> ProcessedModules;
+  llvm::DenseSet<FileID> ModuleMaps;
+
+  llvm::DenseSet<const Module *> ProcessedModules;
   auto CollectModuleMapsForHierarchy = [&](const Module *M) {
     M = M->getTopLevelModule();
 
@@ -192,13 +201,13 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
 
       // The containing module map is affecting, because it's being pointed
       // into by Module::DefinitionLoc.
-      if (auto FE = MM.getContainingModuleMapFile(Mod))
-        ModuleMaps.insert(*FE);
+      if (auto F = MM.getContainingModuleMapFileID(Mod); F.isValid())
+        ModuleMaps.insert(F);
       // For inferred modules, the module map that allowed inferring is not
       // related to the virtual containing module map file. It did affect the
       // compilation, though.
-      if (auto FE = MM.getModuleMapFileForUniquing(Mod))
-        ModuleMaps.insert(*FE);
+      if (auto F = MM.getModuleMapFileIDForUniquing(Mod); F.isValid())
+        ModuleMaps.insert(F);
 
       for (auto *SubM : Mod->submodules())
         Q.push(SubM);
@@ -268,7 +277,16 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
   // just ban module map hierarchies where module map defining a (sub)module X
   // includes a module map defining a module that's not a submodule of X.
 
-  return ModuleMaps;
+  llvm::DenseSet<const FileEntry *> ModuleFileEntries;
+  for (FileID MM : ModuleMaps) {
+    if (auto *FE = SM.getFileEntryForID(MM))
+      ModuleFileEntries.insert(FE);
+  }
+
+  AffectingModuleMaps R;
+  R.DefinitionFileIDs = std::move(ModuleMaps);
+  R.DefinitionFiles = std::move(ModuleFileEntries);
+  return std::move(R);
 }
 
 class ASTTypeWriter {
@@ -1772,7 +1790,7 @@ void ASTWriter::WriteInputFiles(SourceManager &SourceMgr,
       continue;
 
     // Do not emit input files that do not affect current module.
-    if (!IsSLocAffecting[I])
+    if (!IsSLocFileEntryAffecting[I])
       continue;
 
     InputFileEntry Entry(*Cache->OrigEntry);
@@ -4924,6 +4942,7 @@ void ASTWriter::computeNonAffectingInputFiles() {
   unsigned N = SrcMgr.local_sloc_entry_size();
 
   IsSLocAffecting.resize(N, true);
+  IsSLocFileEntryAffecting.resize(N, true);
 
   if (!WritingModule)
     return;
@@ -4960,10 +4979,12 @@ void ASTWriter::computeNonAffectingInputFiles() {
       continue;
 
     // Don't prune module maps that are affecting.
-    if (llvm::is_contained(*AffectingModuleMaps, *Cache->OrigEntry))
+    if (AffectingModuleMaps->DefinitionFileIDs.contains(FID))
       continue;
 
     IsSLocAffecting[I] = false;
+    IsSLocFileEntryAffecting[I] =
+        AffectingModuleMaps->DefinitionFiles.contains(*Cache->OrigEntry);
 
     FileIDAdjustment += 1;
     // Even empty files take up one element in the offset table.
diff --git a/clang/test/ClangScanDeps/modules-extern-submodule.c b/clang/test/ClangScanDeps/modules-extern-submodule.c
index 92f2c749dd2c30..e0cfb210d3a33a 100644
--- a/clang/test/ClangScanDeps/modules-extern-submodule.c
+++ b/clang/test/ClangScanDeps/modules-extern-submodule.c
@@ -43,7 +43,7 @@ module third {}
 // CHECK-NEXT:       "command-line": [
 // CHECK-NEXT:         "-cc1",
 // CHECK:              "-fmodule-map-file=[[PREFIX]]/second/second/module.modulemap"
-// CHECK-NOT:          "-fmodule-map-file=[[PREFIX]]/second/second/sub.modulemap"
+// CHECK-NEXT:         "-fmodule-map-file=[[PREFIX]]/second/second/sub.modulemap"
 // CHECK-NOT:          "-fmodule-map-file=[[PREFIX]]/third/module.modulemap"
 // CHECK:              "-fmodule-file=second=[[PREFIX]]/cache/{{.*}}/second-{{.*}}.pcm"
 // CHECK:            ],
diff --git a/clang/test/Modules/prune-non-affecting-module-map-repeated.cpp b/clang/test/Modules/prune-non-affecting-module-map-repeated.cpp
new file mode 100644
index 00000000000000..2c09bc4b01b026
--- /dev/null
+++ b/clang/test/Modules/prune-non-affecting-module-map-repeated.cpp
@@ -0,0 +1,85 @@
+// Check that the same module map file passed to -fmodule-map-file *and*
+// available from one of the `-fmodule-file` does not allocate extra source
+// location space. This optimization is important for using module maps in
+// large codebases to avoid running out of source location space.
+
+// RUN: rm -rf %t && mkdir %t
+// RUN: split-file %s %t
+
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules -fmodule-map-file=%t/base.map -fmodule-name=base -emit-module %t/base.map -o %t/base.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod1.map \
+// RUN:   -fmodule-file=%t/base.pcm \
+// RUN:   -fmodule-name=mod1 -emit-module %t/mod1.map -o %t/mod1.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod1.map -fmodule-map-file=%t/mod2.map \
+// RUN:   -fmodule-file=%t/mod1.pcm \
+// RUN:   -fmodule-name=mod2 -emit-module %t/mod2.map -o %t/mod2.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod2.map -fmodule-map-file=%t/mod3.map \
+// RUN:   -fmodule-file=%t/mod2.pcm \
+// RUN:   -fmodule-name=mod3 -emit-module %t/mod3.map -o %t/mod3.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod3.map -fmodule-map-file=%t/mod4.map \
+// RUN:   -fmodule-file=%t/mod3.pcm \
+// RUN:   -fmodule-name=mod4 -emit-module %t/mod4.map -o %t/mod4.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod4.map -fmodule-file=%t/mod4.pcm -fsyntax-only -verify %t/check_slocs.cc
+
+//--- base.map
+module base { header "vector.h" }
+//--- mod1.map
+module mod1 { header "mod1.h" }
+//--- mod2.map
+module mod2 { header "mod2.h" }
+//--- mod3.map
+module mod3 { header "mod3.h" }
+//--- mod4.map
+module mod4 { header "mod4.h" }
+//--- check_slocs.cc
+#include "mod4.h"
+#pragma clang __debug sloc_usage // expected-remark {{source manager location address space usage}}
+// expected-note@* {{% of available space}}
+
+// Module map files files that were specified on the command line are entered twice (once when parsing command-line, once loaded from the .pcm)
+// Those that  not specified on the command line must be entered once.
+
+// [email protected]:1 {{file entered 2 times}}
+// [email protected]:1 {{file entered 2 times}}
+// [email protected]:1 {{file entered 1 time}}
+// [email protected]:1 {{file entered 1 time}}
+// [email protected]:1 {{file entered 1 time}}
+// expected-note@* + {{file entered}}
+
+
+//--- vector.h
+#ifndef VECTOR_H
+#define VECTOR_H
+#endif
+
+//--- mod1.h
+#ifndef MOD1
+#define MOD1
+#include "vector.h"
+int mod1();
+#endif
+//--- mod2.h
+#ifndef MOD2
+#define MOD2
+#include "vector.h"
+#include "mod1.h"
+int mod2();
+#endif
+//--- mod3.h
+#ifndef MOD3
+#define MOD3
+#include "vector.h"
+#include "mod2.h"
+int mod3();
+#endif
+//--- mod4.h
+#ifndef MOD4
+#define MOD4
+#include "vector.h"
+#include "mod3.h"
+int mod4();
+#endif

@ilya-biryukov
Copy link
Contributor Author

This is a change that should remove the need for a hack from #88893, reaching the same effect while keeping the command-line interface as it is today.

Copy link
Member

@ChuanqiXu9 ChuanqiXu9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me. Please leave some time to @jansvoboda11

@ilya-biryukov
Copy link
Contributor Author

looks good to me.

Thanks!

Please leave some time to @jansvoboda11

Will do, @jansvoboda11 PTAL when you have a spare minute.

…ation of source locations

Rename second occurrence of F.
Copy link
Contributor

@jansvoboda11 jansvoboda11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for trying to find an alternative solution!

However, when reading the .pcm files, we will reuse the FileID loaded from it for the same module map file and the FileID we created can never be used again, but we will still not mark it as affecting and it will take the source location space in the output PCM.

Did you mean "mark it as affecting" / "not mark it as non-affecting" instead of "not mark it as affecting"?

I take extra care to still write the InputFile entries for those files, which has not been done before. It is required for ClangScanDeps to properly function.

Can you expand on that? Why doesn't it work properly if you stop writing those input files as well?

// CHECK-NEXT: "-cc1",
// CHECK: "-fmodule-map-file=[[PREFIX]]/second/second/module.modulemap"
// CHECK-NOT: "-fmodule-map-file=[[PREFIX]]/second/second/sub.modulemap"
// CHECK-NEXT: "-fmodule-map-file=[[PREFIX]]/second/second/sub.modulemap"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fairly important to not report those on command lines. On Darwin, we have one module map that includes lots of other module maps, so listing them all blows up the command line length (and is unnecessary).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out, I did not realize this was important.

That would require digging into scan-deps a bit, let me try to see if it's possible to match the previous behavior.
In general, I feel that rewriting ModuleDepCollector in a way that walks over the module dependency graph rather than the InputFile is probably the right solution in the long run, but I'm not sure if it's an easy change to make. (I still don't have a good grasp on what outputs are generally expected here, relying on something as internal as InputFile in the serialized AST was a surprise).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the code to keep the old behavior in that case, PTAL.

@jansvoboda11
Copy link
Contributor

I was looking at the affecting module map logic for a reason unrelated to yours, but ended up needing something similar to what you're trying to achieve. Here's an example:

//--- X1.modulemap
module X1 { header "X1.h" }
//--- X2.modulemap
module X2 { header "X2.h" }
//--- A.modulemap
module A { header "A.h" }
//--- B.modulemap
module B { header "B.h" }
//--- X1.h
#include "A.h"
#include "B.h"
//--- X2.h
#include "B.h"
#include "A.h"
//--- A.h
#include "B.h"
//--- B.h

Intuitively, I would expect the set of affecting files between modules X1 and X2 to be the same (at least when it comes to files related to modules A and B).

Assuming implicit module maps, the problem is as follows:

  • In module X1:

    • The first include in X1.h reads A.modulemap and creates local SLocEntry.
    • A.pcm is loaded and Module::DefinitionLoc is re-associated with the loaded SLocEntry for A.modulemap.
    • The second include in X2.h reads B.modulemap and creates local SLocEntry.
    • B.pcm is loaded and Module::DefinitionLoc is re-associated with the loaded SLocEntry for B.modulemap.
    • When computing the set of affecting input files, we start out assuming all files tied to local SLocEntries are affecting, so both A.modulemap and B.modulemap are candidates at the start.
    • Then we say only module map files that are not in the set derived from Module::DefinitionLoc are non-affecting.
    • Both A.modulemap and B.modulemap are in the set, so both end up being affecting.
  • In module X2:

    • The first include in X2.h reads B.modulemap and creates local SLocEntry.
    • B.pcm is loaded and Module::DefinitionLoc is re-associated with the loaded SLocEntry for B.modulemap.
    • A.pcm is transitively loaded and ModuleDefinitionLoc is associated with the loaded SLocEntry for A.modulemap.
    • The second include in X2.h doesn't read any module map file, since all necessary information is known thanks to A.pcm.
    • When computing the set of affecting input files, the initial set only contains B.modulemap (because it's associated with the local SLocEntry), but not A.modulemap (because it's only associated with the loaded SLocEntry).
    • B.modulemap is in the set derived from Module::DefinitionLoc so it ends up being affecting.

This is unexpected and I think it breaks the correctness of clang-scan-deps. I think we might need to take your patch a bit further and make it so that ASTWriter::WriteInputFiles() doesn't care whether the SLocEntry associated with the file through Module::DefinitionLoc is loaded or local.

@ilya-biryukov
Copy link
Contributor Author

Did you mean "mark it as affecting" / "not mark it as non-affecting" instead of "not mark it as affecting"?

Yes, sorry, that was a mistake on my part.

Can you expand on that? Why doesn't it work properly if you stop writing those input files as well?

Because ClangScanDeps relies on the information from those InputFIle to determine which module map files the code depends on, the relevant code is here:

MDC.ScanInstance.getASTReader()->visitInputFileInfos(

If we stop producing those input files, it should not affect the compiler in any way, but clang-scan-deps will be broken (some tests will fail, including the ones that check the final commands can compile the underlying modules)

This is unexpected and I think it breaks the correctness of clang-scan-deps. I think we might need to take your patch a bit further and make it so that ASTWriter::WriteInputFiles() doesn't care whether the SLocEntry associated with the file through Module::DefinitionLoc is loaded or local.

Hm, that would actually make the final outputs much more predictable, so I like this approach. Let me try to prototype it and get back to you.
The only downside is more InputFile entries being reported, but I bet they are a very small fraction of the serialized AST, so it should not really matter.

…ation of source locations

Make sure we do not mark files as top-level unnecessarily
@ilya-biryukov
Copy link
Contributor Author

ilya-biryukov commented Nov 5, 2024

This is unexpected and I think it breaks the correctness of clang-scan-deps. I think we might need to take your patch a bit further and make it so that ASTWriter::WriteInputFiles() doesn't care whether the SLocEntry associated with the file through Module::DefinitionLoc is loaded or local.

I want to dig into it a little more, but the problem is not introduced with this PR, so I suggest to do this in a follow-up in the interest of keeping this PR focused.

The particular challenge on my mind that makes it worth exploring in a different PR is that InputFileRefs are written to the serialized ASTs for two purposes:

  • as ways to implement clang-scan-deps (e.g. IsModuleMapFile and IsTopLevel flags seem to be used solely by clang-scan-deps),
  • as a way to write internal SourceManager data structures (SLocEntry). This seems to be the primary purposes for which they were introduced into the serialized format (I didn't check the history, though, it's just a guess).

It might be desirable to write some loaded locations for the former, but it goes against the idea for the latter.

@jansvoboda11 wdyt about accepting the current PR and continuing the discussion for your case in a separate PR/issue?

@ilya-biryukov
Copy link
Contributor Author

ilya-biryukov commented Nov 7, 2024

@jansvoboda11 friendly ping. We have seen an increased number of files hitting the source location limits with the compiler form the last months or so, having this change in would really help us reduce the pressure.
Please let me know if getting this change in the current form makes sense to you, I don't think it regress any use-case people might see in practice in its current form.

Copy link
Contributor

@jansvoboda11 jansvoboda11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thank you!

@ilya-biryukov ilya-biryukov merged commit f02b1cc into llvm:main Nov 8, 2024
8 checks passed
@ilya-biryukov ilya-biryukov deleted the dedup2 branch November 8, 2024 08:10
@llvm-ci
Copy link
Collaborator

llvm-ci commented Nov 8, 2024

LLVM Buildbot has detected a new failure on builder openmp-s390x-linux running on systemz-1 while building clang at step 6 "test-openmp".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/88/builds/4381

Here is the relevant piece of the build log for the reference
Step 6 (test-openmp) failure: test (failure)
******************** TEST 'libomp :: tasking/issue-94260-2.c' FAILED ********************
Exit Code: -11

Command Output (stdout):
--
# RUN: at line 1
/home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.build/./bin/clang -fopenmp   -I /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -I /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.src/openmp/runtime/test -L /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -fno-omit-frame-pointer -mbackchain -I /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.src/openmp/runtime/test/ompt /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.src/openmp/runtime/test/tasking/issue-94260-2.c -o /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp -lm -latomic && /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp
# executed command: /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.build/./bin/clang -fopenmp -I /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -I /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.src/openmp/runtime/test -L /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -fno-omit-frame-pointer -mbackchain -I /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.src/openmp/runtime/test/ompt /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.src/openmp/runtime/test/tasking/issue-94260-2.c -o /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp -lm -latomic
# executed command: /home/uweigand/sandbox/buildbot/openmp-s390x-linux/llvm.build/runtimes/runtimes-bins/openmp/runtime/test/tasking/Output/issue-94260-2.c.tmp
# note: command had no output on stdout or stderr
# error: command failed with exit status: -11

--

********************


ilya-biryukov added a commit to ilya-biryukov/llvm-project that referenced this pull request Nov 15, 2024
…d only for textual headers

This is a follow up to llvm#112015 and it reduces the duplication of source
locations further.

We do not need to allocate source location space in the serialized PCMs
for module maps used only to find textual headers. Those module maps are
never referenced from anywhere in the serialized ASTs and are re-read in
other compilations.

We do need the InputFile entry for the in the serialized AST because
clang-scan-deps relies on it. The previous patch introduced a mechanism
to do exactly that.
Groverkss pushed a commit to iree-org/llvm-project that referenced this pull request Nov 15, 2024
…on duplication (llvm#112015)

Currently, any FileID that references a module map file that was
required for a compilation is considered as affecting. This misses an
important opportunity to reduce the source location space taken by the
resulting PCM.

In particular, consider the situation where the same module map file is
passed multiple times in the dependency chain:

```shell
$ clang -fmodule-map-file=foo.modulemap ... -o mod1.pcm
$ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod1.pcm ... -o mod2.pcm
...
$ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod$((N-1)).pcm ... -o mod$N.pcm
```

Because `foo.modulemap` is read before reading any of the `.pcm` files,
we have to create a unique `FileID` for it when creating each module.
However, when reading the `.pcm` files, we will reuse the `FileID`
loaded from it for the same module map file and the `FileID` we created
can never be used again, but we will still mark it as affecting and it
will take the source location space in the output PCM.

For a chain of N dependencies, this results in the file taking `N *
(size of file)` source location space, which could be significant. For
examples, we observer internally that some targets that run out of 2GB
of source location space end up wasting up to 20% of that space in
module maps as described above.

I take extra care to still write the InputFile entries for those files that occupied
source location space before. It is required for correctness of clang-scan-deps.
ilya-biryukov added a commit that referenced this pull request Dec 5, 2024
…d only for textual headers (#116374)

This is a follow up to #112015 and it reduces the unnecessary
duplication of source locations further.

We do not need to allocate source location space in the serialized PCMs
for module maps used only to find textual headers. Those module maps are
never referenced from anywhere in the serialized ASTs and are re-read in
other compilations.
This change should not affect correctness of Clang compilations or
clang-scan-deps in any way.

We do need the InputFile entry in the serialized AST because
clang-scan-deps relies on it. The previous patch introduced a mechanism
to do exactly that.

We have found that to finally remove any duplication of module maps we
use internally in our build system.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clang:modules C++20 modules and Clang Header Modules clang Clang issues not falling into any other category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants