Skip to content

Conversation

aokblast
Copy link
Contributor

@aokblast aokblast commented Oct 5, 2025

When stdin is executed like ./prog < file, we are able to mmap the buffer so that we can reduce the total memory usage if we try to open a large file.

@llvmbot
Copy link
Member

llvmbot commented Oct 5, 2025

@llvm/pr-subscribers-llvm-binary-utilities

@llvm/pr-subscribers-llvm-support

Author: None (aokblast)

Changes

When stdin is executed like ./prog < file, we are able to mmap the buffer so that we can reduce the total memory usage if we try to open a large file.


Full diff: https://github.com/llvm/llvm-project/pull/162013.diff

1 Files Affected:

  • (modified) llvm/lib/Support/MemoryBuffer.cpp (+21)
diff --git a/llvm/lib/Support/MemoryBuffer.cpp b/llvm/lib/Support/MemoryBuffer.cpp
index 1c4645ad83641..97ea5f205b0ee 100644
--- a/llvm/lib/Support/MemoryBuffer.cpp
+++ b/llvm/lib/Support/MemoryBuffer.cpp
@@ -573,6 +573,27 @@ ErrorOr<std::unique_ptr<MemoryBuffer>> MemoryBuffer::getSTDIN() {
   // FIXME: That isn't necessarily true, we should try to mmap stdin and
   // fallback if it fails.
   sys::ChangeStdinMode(sys::fs::OF_Text);
+  std::error_code EC;
+  sys::fs::file_type Type;
+  sys::fs::file_status Status;
+  EC = sys::fs::status(sys::fs::getStdinHandle(), Status);
+  if (EC)
+    return EC;
+
+  Type = Status.type();
+
+  /*
+   * If the FD is regular file or block file,
+   * we try to create a mmap buffer first.
+   * If failed, rollback to read and copy.
+   */
+  if (Type == sys::fs::file_type::regular_file ||
+      Type == sys::fs::file_type::block_file) {
+    std::unique_ptr<MemoryBuffer> Result(new MemoryBufferMMapFile<MemoryBuffer>(
+        false, sys::fs::getStdinHandle(), Status.getSize(), 0, EC));
+    if (!EC)
+      return std::move(Result);
+  }
 
   return getMemoryBufferForStream(sys::fs::getStdinHandle(), "<stdin>");
 }

@aokblast
Copy link
Contributor Author

aokblast commented Oct 5, 2025

One thing that I am not sure is if we need to overwrite operator new like new (NamedBufferAlloc(BufferName)). If so, what is the proper BufferName? stdin?

@aokblast aokblast force-pushed the support/add_mmap_backend_when_possible branch 4 times, most recently from ccaa772 to a9d7f6c Compare October 6, 2025 08:32
@aokblast
Copy link
Contributor Author

aokblast commented Oct 6, 2025

Currently, some programs rely *c = '\0' to find the end of a buffer instead of the provided size.
In my experiment, therefore, it is impossible for us to mmap in all cases since some file may not have '\0' in the end of the file and requires copy buffer and append '\0' manually.
However, it would be a huge performace improvement if we use mmap when processing a huge file.
Can I rely on the test-suite's results to make sure which program rely on *c = '\0' and therefore provide mmap acceleration for other programs?

@jh7370
Copy link
Collaborator

jh7370 commented Oct 6, 2025

Currently, some programs rely *c = '\0' to find the end of a buffer instead of the provided size. In my experiment, therefore, it is impossible for us to mmap in all cases since some file may not have '\0' in the end of the file and requires copy buffer and append '\0' manually. However, it would be a huge performace improvement if we use mmap when processing a huge file. Can I rely on the test-suite's results to make sure which program rely on *c = '\0' and therefore provide mmap acceleration for other programs?

Without knowing the context, it sounds to me like a tool that ignores the provided size and looks for a trailing null terminator has a bug in it (what if the input contains null bytes other than at the end of the data?), so the tool should be updated. What tools rely on this behaviour?

@aokblast
Copy link
Contributor Author

aokblast commented Oct 6, 2025

Currently, some programs rely *c = '\0' to find the end of a buffer instead of the provided size. In my experiment, therefore, it is impossible for us to mmap in all cases since some file may not have '\0' in the end of the file and requires copy buffer and append '\0' manually. However, it would be a huge performace improvement if we use mmap when processing a huge file. Can I rely on the test-suite's results to make sure which program rely on *c = '\0' and therefore provide mmap acceleration for other programs?

Without knowing the context, it sounds to me like a tool that ignores the provided size and looks for a trailing null terminator has a bug in it (what if the input contains null bytes other than at the end of the data?), so the tool should be updated. What tools rely on this behaviour?

I am checking that. When I add the trailing zero test to judge if we can mmap, it passes the test-suite. When I remove the test, CI failed. I think these programs just have some behavior like to find the end of a string in string sections.

@aokblast
Copy link
Contributor Author

aokblast commented Oct 6, 2025

This is the CI result for removing trailing zero check.
https://github.com/llvm/llvm-project/actions/runs/18273318718/job/52019747511
It seems that Linux failed in a DAP test.
And windows failed in llc, however, the test results in Windows shows no symbol name and I don't have a Windows machine for debugging.

@aokblast
Copy link
Contributor Author

aokblast commented Oct 6, 2025

I plan to run the CI that turns off RequiresNullTermination by defualt again after sucess this time.

When stdin is executed like ./prog < file, we are possible to mmap the
buffer so that we can reduce the total memory usage if we try to open a
large file.

For example, programs like llvm-strings iterates the buffer until the
end of given size.
@aokblast aokblast force-pushed the support/add_mmap_backend_when_possible branch from f15f5b4 to c56c44d Compare October 6, 2025 09:52
@aokblast
Copy link
Contributor Author

aokblast commented Oct 6, 2025

Oops, I am so sorry for the noise... I forget to rebase main before push to this branch.

@aokblast
Copy link
Contributor Author

aokblast commented Oct 6, 2025

Emm, the problem only happens on Windows this time. I think l need to get a windows machine since it is definately a WIndows related bug unrelated with my patch.

Copy link
Collaborator

@jh7370 jh7370 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've given this a bit of thought and while there may be some use to the change you're trying to make for other tools, I actually think it's not the right solution for llvm-strings, since it doesn't fix the general case of reading from stdin. I actually think the correct thing to do would be to change llvm-strings to not try to read in the whole file up front, but rather read it in as required (presumably using some kind of stream implementation or similar). This will require changes to the core logic of the program, but it will reduce the memory usage for files as well as stdin.

@aokblast
Copy link
Contributor Author

aokblast commented Oct 7, 2025

I've given this a bit of thought and while there may be some use to the change you're trying to make for other tools, I actually think it's not the right solution for llvm-strings, since it doesn't fix the general case of reading from stdin. I actually think the correct thing to do would be to change llvm-strings to not try to read in the whole file up front, but rather read it in as required (presumably using some kind of stream implementation or similar). This will require changes to the core logic of the program, but it will reduce the memory usage for files as well as stdin.

Thanks for your reply! Yes, you are right. These are two seperate problems. So I am considering doing two things:

  1. Fix the Windows problem in this patch since this patch is useful in other cases besides llvm-strings.
  2. Change llvm-strings logic to use read directly instead of Memory Buffer. I think it is not really hard since llvm-strings is a 200 lines program.

@jh7370
Copy link
Collaborator

jh7370 commented Oct 7, 2025

Make sure to do the two things in separate PRs, so that they can be reviewed on their own merits (you can keep this one for the non-llvm-strings case, if you want). You should also give performance examples of how the fixes improve memory consumption in some example cases. In the llvm-strings case, you probably also want to show the runtime impact of the overall change you make.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants