-
Notifications
You must be signed in to change notification settings - Fork 15k
[llvm-strings] Use small buffer instead of reading whole file #163073
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@llvm/pr-subscribers-llvm-binary-utilities Author: None (aokblast) ChangesTo prevent large memory footprint, we use small buffer and a seperated string to print. This can largely reduce the memory consumption when processing. In the following example, the test.file is 1G regular file, which is possible in the daily workload. cat ./test.file | llvm-strings --radix=o > out && Reading whole file: Small buffer implementation: Also, it does not affect too much runtime: time llvm-strings --radix=o > out < ./test.file Reading whole file: Small buffer implementation: Full diff: https://github.com/llvm/llvm-project/pull/163073.diff 1 Files Affected:
diff --git a/llvm/tools/llvm-strings/llvm-strings.cpp b/llvm/tools/llvm-strings/llvm-strings.cpp
index 9979b93de8427..2be070d43f294 100644
--- a/llvm/tools/llvm-strings/llvm-strings.cpp
+++ b/llvm/tools/llvm-strings/llvm-strings.cpp
@@ -18,7 +18,9 @@
#include "llvm/Option/ArgList.h"
#include "llvm/Option/Option.h"
#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/Errno.h"
#include "llvm/Support/Error.h"
+#include "llvm/Support/FileSystem.h"
#include "llvm/Support/Format.h"
#include "llvm/Support/InitLLVM.h"
#include "llvm/Support/MemoryBuffer.h"
@@ -88,7 +90,9 @@ static void parseIntArg(const opt::InputArgList &Args, int ID, T &Value) {
}
}
-static void strings(raw_ostream &OS, StringRef FileName, StringRef Contents) {
+static void strings(raw_ostream &OS, StringRef FileName,
+ sys::fs::file_t FileHandle) {
+ SmallString<sys::fs::DefaultReadChunkSize> Buffer;
auto print = [&OS, FileName](unsigned Offset, StringRef L) {
if (L.size() < static_cast<size_t>(MinLength))
return;
@@ -110,19 +114,39 @@ static void strings(raw_ostream &OS, StringRef FileName, StringRef Contents) {
OS << L << '\n';
};
- const char *B = Contents.begin();
- const char *P = nullptr, *E = nullptr, *S = nullptr;
- for (P = Contents.begin(), E = Contents.end(); P < E; ++P) {
- if (isPrint(*P) || *P == '\t') {
- if (S == nullptr)
- S = P;
- } else if (S) {
- print(S - B, StringRef(S, P - S));
- S = nullptr;
+ unsigned Offset = 0, LocalOffset = 0, CurSize = 0;
+ Buffer.resize(sys::fs::DefaultReadChunkSize);
+ auto FillBuffer = [&Buffer, FileHandle, &FileName]() -> unsigned {
+ Expected<size_t> ReadBytesOrErr = sys::fs::readNativeFile(
+ FileHandle,
+ MutableArrayRef(Buffer.begin(), sys::fs::DefaultReadChunkSize));
+ if (!ReadBytesOrErr) {
+ errs() << FileName << ": "
+ << errorToErrorCode(ReadBytesOrErr.takeError()).message() << '\n';
+ return 0;
+ }
+ return *ReadBytesOrErr;
+ };
+ std::string StringBuffer;
+ while (true) {
+ if (LocalOffset == CurSize) {
+ CurSize = FillBuffer();
+ if (CurSize == 0)
+ break;
+ LocalOffset = 0;
+ }
+ char C = Buffer[LocalOffset++];
+ if (isPrint(C) || C == '\t') {
+ StringBuffer.push_back(C);
+ } else if (StringBuffer.size()) {
+ print(Offset, StringRef(StringBuffer.c_str(), StringBuffer.size()));
+ Offset += StringBuffer.size();
+ StringBuffer.clear();
+ ++Offset;
+ } else {
+ ++Offset;
}
}
- if (S)
- print(S - B, StringRef(S, E - S));
}
int main(int argc, char **argv) {
@@ -174,13 +198,22 @@ int main(int argc, char **argv) {
InputFileNames.push_back("-");
for (const auto &File : InputFileNames) {
- ErrorOr<std::unique_ptr<MemoryBuffer>> Buffer =
- MemoryBuffer::getFileOrSTDIN(File, /*IsText=*/true);
- if (std::error_code EC = Buffer.getError())
- errs() << File << ": " << EC.message() << '\n';
- else
- strings(llvm::outs(), File == "-" ? "{standard input}" : File,
- Buffer.get()->getMemBufferRef().getBuffer());
+ sys::fs::file_t FD;
+ if (File == "-") {
+ FD = sys::fs::getStdinHandle();
+ } else {
+ Expected<sys::fs::file_t> FDOrErr =
+ sys::fs::openNativeFileForRead(File, sys::fs::OF_None);
+ ;
+ if (!FDOrErr) {
+ errs() << File << ": "
+ << errorToErrorCode(FDOrErr.takeError()).message() << '\n';
+ continue;
+ }
+ FD = *FDOrErr;
+ }
+
+ strings(llvm::outs(), File == "-" ? "{standard input}" : File, FD);
}
return EXIT_SUCCESS;
|
To prevent large memory footprint, we use small buffer and a seperated string to print. This can largely reduce the memory consumption when processing. In the following example, the test.file is 1G regular file, which is possible in the daily workload. cat ./test.file | llvm-strings --radix=o > out && ps -o command,vsz,rss | grep strings Reading whole file: llvm-strings 5129824 3999920 Small buffer implementation: llvm-strings 17604 5476 Also, it does not affect too much runtime: time llvm-strings --radix=o > out < ./test.file Reading whole file: ________________________________________________________ Executed in 5.67 secs fish external usr time 4.38 secs 2.40 millis 4.38 secs sys time 1.28 secs 0.00 millis 1.28 secs Small buffer implementation: ________________________________________________________ Executed in 5.87 secs fish external usr time 5.65 secs 2.59 millis 5.65 secs sys time 0.21 secs 0.00 millis 0.21 secs
ace3d74
to
42984b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial thought is that this is overcomplicating things. Could you have strings
just take a std::istream
object and use std::istream::read
or something similar? You'd pass in either std::cin
or std::ifstream
, if using a file.
That is really strange. In my experiment, it takes 10sec to process a 1GB file. Even if I use basic_streambuf::putsetbuf to set the buffer size to 4k * 4 like my original implementation. |
Without seeing your code, I can't see what's going on, but I wouldn't expect there to be a need to call
Are you sure you're blaming the right function (keeping in mind that inlining is potentially going on)? I note that the flame graph has cut off the names of most of the functions, so I can't really see the detail. More useful would be looking at a line-by-line breakdown of the llvm-project/libcxx/include/istream Line 664 in cf55dfb
|
Here is my code.
Thanks! Yes, that is what I am doing now. Since most of the function is inlined, I have to build llvm-strings in debug build to figure it out. |
Hopefully this goes without saying, but debug builds are not going to provide the same performance characteristics as release builds. |
To prevent large memory footprint, we use small buffer and a seperated string to print. This can largely reduce the memory consumption when processing.
In the following example, the test.file is 1G regular file, which is possible in the daily workload.
cat ./test.file | llvm-strings --radix=o > out &&
ps -o command,vsz,rss | grep strings
Reading whole file:
llvm-strings 5129824 (5G) 3999920 (3G)
Small buffer implementation:
llvm-strings 17604 (16M) 5476 (5M)
Also, it does not affect too much runtime:
time llvm-strings --radix=o > out < ./test.file
Reading whole file:
Executed in 5.67 secs fish external
usr time 4.38 secs 2.40 millis 4.38 secs
sys time 1.28 secs 0.00 millis 1.28 secs
Small buffer implementation:
Executed in 5.87 secs fish external
usr time 5.65 secs 2.59 millis 5.65 secs
sys time 0.21 secs 0.00 millis 0.21 secs