sort: deduplicate file descriptors in merge mode #11961
nonontb wants to merge 10 commits into uutils:main
Conversation
New tests are added for merging duplicate files.
GNU testsuite comparison:
Merging this PR will not alter performance.
@@ -0,0 +1,3 @@
1
please generate the files on the fly
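(For context: fixtures can be written at test time with the existing at_and_ucmd! helper; a minimal sketch, with a made-up file name:)

let (at, mut ucmd) = at_and_ucmd!();
// Create the fixture when the test runs instead of committing it to the repo.
at.write("merge_input.txt", "1\n2\n3\n");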
Force-pushed from ebda4db to a9f249e
… alias Mmap as MemoryMap
Force-pushed from a9f249e to 3517780
@@ -0,0 +1,6 @@
1
please generate this one on the fly too
Yes, I forgot to delete it, but the test is "on the fly":
#[test]
fn test_merge_mixed_stdin_and_files() {
let (at, mut ucmd) = at_and_ucmd!();
at.write("merge_duplicates_1.txt", "1\n3\n5\n");
// Verify that sort -m allows mixing stdin with files (GNU Coreutils compatible)
ucmd.arg("-m")
.arg("-")
.arg("merge_duplicates_1.txt")
.pipe_in("apricot\nelderberry\nkiwi\n")
.succeeds()
.stdout_is("1\n3\n5\napricot\nelderberry\nkiwi\n");
}
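A duplicate-input case could be covered the same way; a sketch with a hypothetical test name, assuming sort -m interleaves equal lines from each copy of the file:

#[test]
fn test_merge_same_file_three_times() {
    let (at, mut ucmd) = at_and_ucmd!();
    at.write("dup.txt", "1\n2\n3\n");
    // With this PR, the three mentions of dup.txt should share one FD.
    ucmd.arg("-m")
        .arg("dup.txt")
        .arg("dup.txt")
        .arg("dup.txt")
        .succeeds()
        .stdout_is("1\n1\n1\n2\n2\n2\n3\n3\n3\n");
}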
// it gets opened for writing. This allows reading the original content
// via memory-map while writing to the same file, without needing a temp copy.
let output_as_input = if let Some(name) = output.as_output_name() {
    let output_path = Path::new(name).canonicalize()?;
maybe move this into a function?
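One possible shape for such a helper; a sketch only, with canonical_output_path as a made-up name:

use std::io;
use std::path::{Path, PathBuf};

// Hypothetical helper: resolve the output name (if any) to a canonical
// path so it can later be compared against each input path.
fn canonical_output_path(output_name: Option<&str>) -> io::Result<Option<PathBuf>> {
    output_name
        .map(|name| Path::new(name).canonicalize())
        .transpose()
}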
did you check whether we have a benchmark covering this? thanks
I just did, and it seems there is no relevant benchmark test. I suppose it would be better to add that bench test in a separate issue, so we have some reference numbers before benchmarking this PR?
What This Does
This PR makes sort -m (merge mode) keep fewer (ideally the minimum number of) files open.
The Problem
Before:
If you ran sort -m file.txt file.txt file.txt, the program eagerly opened file.txt three times, once for each time it appeared on the command line.
With lots of duplicates or a tight system limit on open files, this could fail.
If you tried to merge a file that was also your output file, the program had to create a temporary copy behind the scenes, using one more file.
The GNU version has no issue running the test in #5714.
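For illustration, the #5714 scenario boils down to something like this (a sketch with a hypothetical test name; assumes the usual at_and_ucmd! and at.read test helpers):

#[test]
fn test_merge_output_file_is_also_input() {
    let (at, mut ucmd) = at_and_ucmd!();
    at.write("f.txt", "a\nb\nc\n");
    // Merge f.txt into itself: the output file is also the only input.
    ucmd.arg("-m").arg("-o").arg("f.txt").arg("f.txt").succeeds();
    assert_eq!(at.read("f.txt"), "a\nb\nc\n");
}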
The Fix
Now the program opens each unique file only once, and lazily, and uses Mmap (via memmap2, which is unsafe) so that a single FD backs every duplicate of an input file, including the case where the output file is reused as an input.
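A minimal sketch of the idea (not the PR's actual code; the InputCache type and its get method are made up here):

use std::collections::HashMap;
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};

use memmap2::Mmap;

// Hypothetical cache: one memory-map per unique (canonicalized) path,
// created lazily on first use and shared by every duplicate mention.
#[derive(Default)]
struct InputCache {
    maps: HashMap<PathBuf, Mmap>,
}

impl InputCache {
    fn get(&mut self, name: &str) -> io::Result<&Mmap> {
        let key = Path::new(name).canonicalize()?;
        if !self.maps.contains_key(&key) {
            let file = File::open(&key)?;
            // SAFETY: the map is only read; mutating an input while sort
            // runs is already unsupported, which is the usual memmap2 caveat.
            let map = unsafe { Mmap::map(&file)? };
            // `file` can be dropped here: on Unix the mapping outlives the
            // descriptor, so duplicates cost no extra FDs at all.
            self.maps.insert(key.clone(), map);
        }
        Ok(&self.maps[&key])
    }
}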
Result
Fixes #5714