-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[ASTWriter] Detect more non-affecting FileIDs to reduce source location duplication #112015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…f source locations Currently, any FileID that references a module map file that was required for a compilation is considered as affecting. This misses an important opportunity to reduce the source location space taken by the resulting PCM. In particular, consider the situation where the same module map file is passed multiple times in the dependency chain: ```shell $ clang -fmodule-map-file=foo.modulemap ... -o mod1.pcm $ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod1.pcm ... -o mod2.pcm ... $ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod$((N-1)).pcm ... -o mod$N.pcm ``` Because `foo.modulemap` is read before reading any of the `.pcm` files, we have to create a unique `FileID` for it when creating each module. However, when reading the `.pcm` files, we will reuse the `FileID` loaded from it for the same module map file and the `FileID` we created can never be used again, but we will still not mark it as affecting and it will take the source location space in the output PCM. For a chain of N dependencies, this results in the file taking `N * (size of file)` source location space, which could be significant. For examples, we observer internally that some targets that run out of 2GB of source location space end up wasting up to 20% of that space in module maps as described above. I take extra care to still write the InputFile entries for those files, which has not been done before. It is required for ClangScanDeps to properly function. The change in the output of one of the ClangScanDeps tests is attributed to that: we now add a bit more module maps to the output in some tricky cases. E.g. in the test case the module map file is required and is referenced by another top-level module map, adding it is redundant, but should not be breaking.
|
@llvm/pr-subscribers-clang-modules @llvm/pr-subscribers-clang Author: Ilya Biryukov (ilya-biryukov) ChangesCurrently, any FileID that references a module map file that was required for a compilation is considered as affecting. This misses an important opportunity to reduce the source location space taken by the resulting PCM. In particular, consider the situation where the same module map file is passed multiple times in the dependency chain: $ clang -fmodule-map-file=foo.modulemap ... -o mod1.pcm
$ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod1.pcm ... -o mod2.pcm
...
$ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod$((N-1)).pcm ... -o mod$N.pcmBecause For a chain of N dependencies, this results in the file taking I take extra care to still write the InputFile entries for those files, which has not been done before. It is required for ClangScanDeps to properly function. The change in the output of one of the ClangScanDeps tests is attributed to that: we now add a bit more module maps to the output in some tricky cases. E.g. in the test case the module map file is required and is referenced by another top-level module map, adding it is redundant, but should not be breaking. Full diff: https://github.com/llvm/llvm-project/pull/112015.diff 5 Files Affected:
diff --git a/clang/include/clang/Serialization/ASTWriter.h b/clang/include/clang/Serialization/ASTWriter.h
index 10a50b711043a8..6b03300d59adda 100644
--- a/clang/include/clang/Serialization/ASTWriter.h
+++ b/clang/include/clang/Serialization/ASTWriter.h
@@ -491,6 +491,9 @@ class ASTWriter : public ASTDeserializationListener,
/// Mapping from a source location entry to whether it is affecting or not.
llvm::BitVector IsSLocAffecting;
+ /// Mapping from a source location entry to whether it must be included as
+ /// input file.
+ llvm::BitVector IsSLocFileEntryAffecting;
/// Mapping from \c FileID to an index into the FileID adjustment table.
std::vector<FileID> NonAffectingFileIDs;
diff --git a/clang/lib/Serialization/ASTReader.cpp b/clang/lib/Serialization/ASTReader.cpp
index e5a1e20a265616..a4ce57c6b2d9c5 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -5862,6 +5862,12 @@ llvm::Error ASTReader::ReadSubmoduleBlock(ModuleFile &F,
}
CurrentModule->Kind = Kind;
+ // Note that we may be rewriting an existing location and it is important
+ // to keep doing that. In particular, we would like to prefer a
+ // `DefinitionLoc` loaded from the module file instead of the location
+ // created in the current source manager, because it allows the new
+ // location to be marked as "unaffecting" when writing and avoid creating
+ // duplicate locations for the same module map file.
CurrentModule->DefinitionLoc = DefinitionLoc;
CurrentModule->Signature = F.Signature;
CurrentModule->IsFromModuleFile = true;
diff --git a/clang/lib/Serialization/ASTWriter.cpp b/clang/lib/Serialization/ASTWriter.cpp
index c6289493fce1de..135e6308b36390 100644
--- a/clang/lib/Serialization/ASTWriter.cpp
+++ b/clang/lib/Serialization/ASTWriter.cpp
@@ -38,6 +38,7 @@
#include "clang/AST/TypeLocVisitor.h"
#include "clang/Basic/Diagnostic.h"
#include "clang/Basic/DiagnosticOptions.h"
+#include "clang/Basic/FileEntry.h"
#include "clang/Basic/FileManager.h"
#include "clang/Basic/FileSystemOptions.h"
#include "clang/Basic/IdentifierTable.h"
@@ -81,6 +82,7 @@
#include "llvm/ADT/APSInt.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/Hashing.h"
#include "llvm/ADT/PointerIntPair.h"
#include "llvm/ADT/STLExtras.h"
@@ -166,7 +168,12 @@ static TypeCode getTypeCodeForTypeClass(Type::TypeClass id) {
namespace {
-std::optional<std::set<const FileEntry *>>
+struct AffectingModuleMaps {
+ llvm::DenseSet<FileID> DefinitionFileIDs;
+ llvm::DenseSet<const FileEntry *> DefinitionFiles;
+};
+
+std::optional<AffectingModuleMaps>
GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
if (!PP.getHeaderSearchInfo()
.getHeaderSearchOpts()
@@ -174,10 +181,12 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
return std::nullopt;
const HeaderSearch &HS = PP.getHeaderSearchInfo();
+ const SourceManager &SM = PP.getSourceManager();
const ModuleMap &MM = HS.getModuleMap();
- std::set<const FileEntry *> ModuleMaps;
- std::set<const Module *> ProcessedModules;
+ llvm::DenseSet<FileID> ModuleMaps;
+
+ llvm::DenseSet<const Module *> ProcessedModules;
auto CollectModuleMapsForHierarchy = [&](const Module *M) {
M = M->getTopLevelModule();
@@ -192,13 +201,13 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
// The containing module map is affecting, because it's being pointed
// into by Module::DefinitionLoc.
- if (auto FE = MM.getContainingModuleMapFile(Mod))
- ModuleMaps.insert(*FE);
+ if (auto F = MM.getContainingModuleMapFileID(Mod); F.isValid())
+ ModuleMaps.insert(F);
// For inferred modules, the module map that allowed inferring is not
// related to the virtual containing module map file. It did affect the
// compilation, though.
- if (auto FE = MM.getModuleMapFileForUniquing(Mod))
- ModuleMaps.insert(*FE);
+ if (auto F = MM.getModuleMapFileIDForUniquing(Mod); F.isValid())
+ ModuleMaps.insert(F);
for (auto *SubM : Mod->submodules())
Q.push(SubM);
@@ -268,7 +277,16 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
// just ban module map hierarchies where module map defining a (sub)module X
// includes a module map defining a module that's not a submodule of X.
- return ModuleMaps;
+ llvm::DenseSet<const FileEntry *> ModuleFileEntries;
+ for (FileID MM : ModuleMaps) {
+ if (auto *FE = SM.getFileEntryForID(MM))
+ ModuleFileEntries.insert(FE);
+ }
+
+ AffectingModuleMaps R;
+ R.DefinitionFileIDs = std::move(ModuleMaps);
+ R.DefinitionFiles = std::move(ModuleFileEntries);
+ return std::move(R);
}
class ASTTypeWriter {
@@ -1772,7 +1790,7 @@ void ASTWriter::WriteInputFiles(SourceManager &SourceMgr,
continue;
// Do not emit input files that do not affect current module.
- if (!IsSLocAffecting[I])
+ if (!IsSLocFileEntryAffecting[I])
continue;
InputFileEntry Entry(*Cache->OrigEntry);
@@ -4924,6 +4942,7 @@ void ASTWriter::computeNonAffectingInputFiles() {
unsigned N = SrcMgr.local_sloc_entry_size();
IsSLocAffecting.resize(N, true);
+ IsSLocFileEntryAffecting.resize(N, true);
if (!WritingModule)
return;
@@ -4960,10 +4979,12 @@ void ASTWriter::computeNonAffectingInputFiles() {
continue;
// Don't prune module maps that are affecting.
- if (llvm::is_contained(*AffectingModuleMaps, *Cache->OrigEntry))
+ if (AffectingModuleMaps->DefinitionFileIDs.contains(FID))
continue;
IsSLocAffecting[I] = false;
+ IsSLocFileEntryAffecting[I] =
+ AffectingModuleMaps->DefinitionFiles.contains(*Cache->OrigEntry);
FileIDAdjustment += 1;
// Even empty files take up one element in the offset table.
diff --git a/clang/test/ClangScanDeps/modules-extern-submodule.c b/clang/test/ClangScanDeps/modules-extern-submodule.c
index 92f2c749dd2c30..e0cfb210d3a33a 100644
--- a/clang/test/ClangScanDeps/modules-extern-submodule.c
+++ b/clang/test/ClangScanDeps/modules-extern-submodule.c
@@ -43,7 +43,7 @@ module third {}
// CHECK-NEXT: "command-line": [
// CHECK-NEXT: "-cc1",
// CHECK: "-fmodule-map-file=[[PREFIX]]/second/second/module.modulemap"
-// CHECK-NOT: "-fmodule-map-file=[[PREFIX]]/second/second/sub.modulemap"
+// CHECK-NEXT: "-fmodule-map-file=[[PREFIX]]/second/second/sub.modulemap"
// CHECK-NOT: "-fmodule-map-file=[[PREFIX]]/third/module.modulemap"
// CHECK: "-fmodule-file=second=[[PREFIX]]/cache/{{.*}}/second-{{.*}}.pcm"
// CHECK: ],
diff --git a/clang/test/Modules/prune-non-affecting-module-map-repeated.cpp b/clang/test/Modules/prune-non-affecting-module-map-repeated.cpp
new file mode 100644
index 00000000000000..2c09bc4b01b026
--- /dev/null
+++ b/clang/test/Modules/prune-non-affecting-module-map-repeated.cpp
@@ -0,0 +1,85 @@
+// Check that the same module map file passed to -fmodule-map-file *and*
+// available from one of the `-fmodule-file` does not allocate extra source
+// location space. This optimization is important for using module maps in
+// large codebases to avoid running out of source location space.
+
+// RUN: rm -rf %t && mkdir %t
+// RUN: split-file %s %t
+
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules -fmodule-map-file=%t/base.map -fmodule-name=base -emit-module %t/base.map -o %t/base.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN: -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod1.map \
+// RUN: -fmodule-file=%t/base.pcm \
+// RUN: -fmodule-name=mod1 -emit-module %t/mod1.map -o %t/mod1.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN: -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod1.map -fmodule-map-file=%t/mod2.map \
+// RUN: -fmodule-file=%t/mod1.pcm \
+// RUN: -fmodule-name=mod2 -emit-module %t/mod2.map -o %t/mod2.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN: -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod2.map -fmodule-map-file=%t/mod3.map \
+// RUN: -fmodule-file=%t/mod2.pcm \
+// RUN: -fmodule-name=mod3 -emit-module %t/mod3.map -o %t/mod3.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN: -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod3.map -fmodule-map-file=%t/mod4.map \
+// RUN: -fmodule-file=%t/mod3.pcm \
+// RUN: -fmodule-name=mod4 -emit-module %t/mod4.map -o %t/mod4.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod4.map -fmodule-file=%t/mod4.pcm -fsyntax-only -verify %t/check_slocs.cc
+
+//--- base.map
+module base { header "vector.h" }
+//--- mod1.map
+module mod1 { header "mod1.h" }
+//--- mod2.map
+module mod2 { header "mod2.h" }
+//--- mod3.map
+module mod3 { header "mod3.h" }
+//--- mod4.map
+module mod4 { header "mod4.h" }
+//--- check_slocs.cc
+#include "mod4.h"
+#pragma clang __debug sloc_usage // expected-remark {{source manager location address space usage}}
+// expected-note@* {{% of available space}}
+
+// Module map files files that were specified on the command line are entered twice (once when parsing command-line, once loaded from the .pcm)
+// Those that not specified on the command line must be entered once.
+
+// [email protected]:1 {{file entered 2 times}}
+// [email protected]:1 {{file entered 2 times}}
+// [email protected]:1 {{file entered 1 time}}
+// [email protected]:1 {{file entered 1 time}}
+// [email protected]:1 {{file entered 1 time}}
+// expected-note@* + {{file entered}}
+
+
+//--- vector.h
+#ifndef VECTOR_H
+#define VECTOR_H
+#endif
+
+//--- mod1.h
+#ifndef MOD1
+#define MOD1
+#include "vector.h"
+int mod1();
+#endif
+//--- mod2.h
+#ifndef MOD2
+#define MOD2
+#include "vector.h"
+#include "mod1.h"
+int mod2();
+#endif
+//--- mod3.h
+#ifndef MOD3
+#define MOD3
+#include "vector.h"
+#include "mod2.h"
+int mod3();
+#endif
+//--- mod4.h
+#ifndef MOD4
+#define MOD4
+#include "vector.h"
+#include "mod3.h"
+int mod4();
+#endif
|
|
This is a change that should remove the need for a hack from #88893, reaching the same effect while keeping the command-line interface as it is today. |
ChuanqiXu9
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me. Please leave some time to @jansvoboda11
Thanks!
Will do, @jansvoboda11 PTAL when you have a spare minute. |
…ation of source locations Rename second occurrence of F.
jansvoboda11
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for trying to find an alternative solution!
However, when reading the .pcm files, we will reuse the FileID loaded from it for the same module map file and the FileID we created can never be used again, but we will still not mark it as affecting and it will take the source location space in the output PCM.
Did you mean "mark it as affecting" / "not mark it as non-affecting" instead of "not mark it as affecting"?
I take extra care to still write the InputFile entries for those files, which has not been done before. It is required for ClangScanDeps to properly function.
Can you expand on that? Why doesn't it work properly if you stop writing those input files as well?
| // CHECK-NEXT: "-cc1", | ||
| // CHECK: "-fmodule-map-file=[[PREFIX]]/second/second/module.modulemap" | ||
| // CHECK-NOT: "-fmodule-map-file=[[PREFIX]]/second/second/sub.modulemap" | ||
| // CHECK-NEXT: "-fmodule-map-file=[[PREFIX]]/second/second/sub.modulemap" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fairly important to not report those on command lines. On Darwin, we have one module map that includes lots of other module maps, so listing them all blows up the command line length (and is unnecessary).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out, I did not realize this was important.
That would require digging into scan-deps a bit, let me try to see if it's possible to match the previous behavior.
In general, I feel that rewriting ModuleDepCollector in a way that walks over the module dependency graph rather than the InputFile is probably the right solution in the long run, but I'm not sure if it's an easy change to make. (I still don't have a good grasp on what outputs are generally expected here, relying on something as internal as InputFile in the serialized AST was a surprise).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the code to keep the old behavior in that case, PTAL.
|
I was looking at the affecting module map logic for a reason unrelated to yours, but ended up needing something similar to what you're trying to achieve. Here's an example: Intuitively, I would expect the set of affecting files between modules X1 and X2 to be the same (at least when it comes to files related to modules A and B). Assuming implicit module maps, the problem is as follows:
This is unexpected and I think it breaks the correctness of clang-scan-deps. I think we might need to take your patch a bit further and make it so that |
Yes, sorry, that was a mistake on my part.
Because
If we stop producing those input files, it should not affect the compiler in any way, but clang-scan-deps will be broken (some tests will fail, including the ones that check the final commands can compile the underlying modules)
Hm, that would actually make the final outputs much more predictable, so I like this approach. Let me try to prototype it and get back to you. |
…ation of source locations Make sure we do not mark files as top-level unnecessarily
I want to dig into it a little more, but the problem is not introduced with this PR, so I suggest to do this in a follow-up in the interest of keeping this PR focused. The particular challenge on my mind that makes it worth exploring in a different PR is that
It might be desirable to write some loaded locations for the former, but it goes against the idea for the latter. @jansvoboda11 wdyt about accepting the current PR and continuing the discussion for your case in a separate PR/issue? |
|
@jansvoboda11 friendly ping. We have seen an increased number of files hitting the source location limits with the compiler form the last months or so, having this change in would really help us reduce the pressure. |
jansvoboda11
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, thank you!
|
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/88/builds/4381 Here is the relevant piece of the build log for the reference |
…d only for textual headers This is a follow up to llvm#112015 and it reduces the duplication of source locations further. We do not need to allocate source location space in the serialized PCMs for module maps used only to find textual headers. Those module maps are never referenced from anywhere in the serialized ASTs and are re-read in other compilations. We do need the InputFile entry for the in the serialized AST because clang-scan-deps relies on it. The previous patch introduced a mechanism to do exactly that.
…on duplication (llvm#112015) Currently, any FileID that references a module map file that was required for a compilation is considered as affecting. This misses an important opportunity to reduce the source location space taken by the resulting PCM. In particular, consider the situation where the same module map file is passed multiple times in the dependency chain: ```shell $ clang -fmodule-map-file=foo.modulemap ... -o mod1.pcm $ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod1.pcm ... -o mod2.pcm ... $ clang -fmodule-map-file=foo.modulemap -fmodule-file=mod$((N-1)).pcm ... -o mod$N.pcm ``` Because `foo.modulemap` is read before reading any of the `.pcm` files, we have to create a unique `FileID` for it when creating each module. However, when reading the `.pcm` files, we will reuse the `FileID` loaded from it for the same module map file and the `FileID` we created can never be used again, but we will still mark it as affecting and it will take the source location space in the output PCM. For a chain of N dependencies, this results in the file taking `N * (size of file)` source location space, which could be significant. For examples, we observer internally that some targets that run out of 2GB of source location space end up wasting up to 20% of that space in module maps as described above. I take extra care to still write the InputFile entries for those files that occupied source location space before. It is required for correctness of clang-scan-deps.
…d only for textual headers (#116374) This is a follow up to #112015 and it reduces the unnecessary duplication of source locations further. We do not need to allocate source location space in the serialized PCMs for module maps used only to find textual headers. Those module maps are never referenced from anywhere in the serialized ASTs and are re-read in other compilations. This change should not affect correctness of Clang compilations or clang-scan-deps in any way. We do need the InputFile entry in the serialized AST because clang-scan-deps relies on it. The previous patch introduced a mechanism to do exactly that. We have found that to finally remove any duplication of module maps we use internally in our build system.
Currently, any FileID that references a module map file that was required for a compilation is considered as affecting. This misses an important opportunity to reduce the source location space taken by the resulting PCM.
In particular, consider the situation where the same module map file is passed multiple times in the dependency chain:
Because
foo.modulemapis read before reading any of the.pcmfiles, we have to create a uniqueFileIDfor it when creating each module. However, when reading the.pcmfiles, we will reuse theFileIDloaded from it for the same module map file and theFileIDwe created can never be used again, but we will still mark it as affecting and it will take the source location space in the output PCM.For a chain of N dependencies, this results in the file taking
N * (size of file)source location space, which could be significant. For examples, we observer internally that some targets that run out of 2GB of source location space end up wasting up to 20% of that space in module maps as described above.I take extra care to still write the InputFile entries for those files, which has not been done before. It is required for ClangScanDeps to properly function.
The change in the output of one of the ClangScanDeps tests is attributed to that: we now add a bit more module maps to the output in some tricky cases. E.g. in the test case the module map file is required and is referenced by another top-level module map, adding it is redundant, but should not be breaking.