Skip to content

Conversation

@rezib
Copy link
Contributor

@rezib rezib commented Mar 21, 2025

Dear mpifileutils developers,

This is my proposal to add support of hardlinks in many mpifileutils commands: dwalk, dcp, dcmp, dsync and dtar.

During tree walk with details, regular files with more than one nlink are temporarily placed in a hardlinks flist. This flist is then globally ordered by names and ranked to select one reference path per inode, and flag all other paths to this inodes as hardlinks. The sorted hardlinks flist is finally merged in global flist with all other items. The paths name ordering is performed to ensure reproducibility between two similar trees, thus minimizing the differences for dcmp and dsync eventually.

Note

You may find more implementations details in respective commits messages.

The pull request introduces a cache format v5, to support encoding of files nlink and hardlinks references paths.

This pull request also includes a functional test suite that relies on Python standard unittest library. This suite is designed to be easy to execute:

  • Set two environment variables to define respectively the path to mpifileutils binaries and arguments provided to mpirun, eg:
$ export MFU_BIN=~/dev/bin
$ export MFU_MPIRUN_ARGS="--bind-to none --oversubscribe -N 4"
  • And run all the tests:
$ python3 -m unittest discover -v test
  • Or:
$ pytest  # require pytest

It is also designed to be easy to integrate in continuous integration systems. The pull request even provides a GitHub action workflow to execute this test suite on every pull requests and merges in main branch (example run).

For the record, this test suite has already helped detect and fix the following bugs:

Please let me know what you think! I can also remove the tests and GitHub actions workflow if you don't like the technical approach.

Important

Note this feature does not work properly without this fix for a bug in DTCMP: LLNL/dtcmp#20

Important

There is one limitation with dcp/dsync --dereference when symlinks point to path with more than one link. In this specific case, mpifileutils will consider the symlink as one more additional path to the same inode and create one more hardlink on this inodes in destination directory. For reference, this case is coverered by test test_dsync_symlink_dereference_target_nlinks.

Note

I would like to emphasize that this work is sponsored by @cea-hpc.

fix #417 #336

@carbonneau1
Copy link
Collaborator

Going through your changes.

@rezib
Copy link
Contributor Author

rezib commented Jun 13, 2025

Important

There is one limitation with dcp/dsync --dereference when symlinks point to path with more than one link. In this specific case, mpifileutils will consider the symlink as one more additional path to the same inode and create one more hardlink on this inodes in destination directory. For reference, this case is coverered by test test_dsync_symlink_dereference_target_nlinks.

FWIW, I checked the behavior of GNU cp and rsync with similar options and it produces the same results.

For reference, here is my setup-test.sh script for this test:

#!/bin/sh

# Purge previous testing data
if [ -d orig ]; then
  rm orig/*
  rmdir orig
fi

if [ -d dest ]; then
  rm dest/*
  rmdir dest
fi

# Create fresh testing data
mkdir orig
echo foo > orig/foo
ln orig/foo orig/bar
ln -s foo orig/baz

#cp -R --dereference --preserve=all orig dest
#rsync -a --copy-links --hard-links orig/ dest
  • With GNU cp:
$ sh setup-test.sh
$ cp -R --dereference --preserve=all orig dest && ls -lirn orig dest
orig:
total 8
28201162 -rw-rw-r-- 2 1000 1000 4 Jun 13 16:53 foo
28201168 lrwxrwxrwx 1 1000 1000 3 Jun 13 16:53 baz -> foo
28201162 -rw-rw-r-- 2 1000 1000 4 Jun 13 16:53 bar

dest:
total 12
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:53 foo
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:53 baz
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:53 bar
  • With rsync:
$ sh setup-test.sh
$ rsync -a --copy-links --hard-links orig/ dest && ls -lirn orig dest
orig:
total 8
28201162 -rw-rw-r-- 2 1000 1000 4 Jun 13 16:54 foo
28201168 lrwxrwxrwx 1 1000 1000 3 Jun 13 16:54 baz -> foo
28201162 -rw-rw-r-- 2 1000 1000 4 Jun 13 16:54 bar

dest:
total 12
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:54 foo
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:54 baz
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:54 bar

In all cases, the dereferenced symlink baz is transformed into an additional hardlink on the inode pointed by foo and bar.

Copy link

@ajmes ajmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CEA uses this patch to migrate data between 2 Lustre FSs. The patch seems to work fine for the initial sync, but it can crash when updating hardlinks.

uint64_t count = mfu_flist_size(dst_list);
char* recvptr = (char*) recvbuf;
for (int i = 0; i < (int) ranks; i++) {
for (int j = 0; j<recvcounts[i]; j++) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line causes a segfault, when all the refs of a file are removed from the FS source.

(gdb) bt
#0  00007f77898a7332 in __strcmp_avx2 () from /lib64/libc.so.6
#1  000055dca3bc0109 in dsync_remove_hardlinks_with_removed_refs (mfu_src_file=..., mfu_dst_file=..., dst_map=..., dst_remove_list=...,
    dst_list=..., src_map=..., src_cp_list=..., src_list=...) at /usr/src/debug/mpifileutils-0.12-1.el8.x86_64/src/dsync/dsync.c:1697
#2  dsync_strmap_compare (src_list=..., src_map=..., dst_list=..., dst_map=..., link_list=..., link_map=..., strlen_prefix=...,
    copy_opts=..., src_path=..., dest_path=..., link_path=..., mfu_src_file=..., mfu_dst_file=...)
    at /usr/src/debug/mpifileutils-0.12-1.el8.x86_64/src/dsync/dsync.c:2057
#3  000055dca3bbc950 in main (argc=..., argv=...) at /usr/src/debug/mpifileutils-0.12-1.el8.x86_64/src/dsync/dsync.c:3649

(gdb) frame 1
#1  000055dca3bc0109 in dsync_remove_hardlinks_with_removed_refs (mfu_src_file=..., mfu_dst_file=..., dst_map=..., dst_remove_list=...,
    dst_list=..., src_map=..., src_cp_list=..., src_list=...) at /usr/src/debug/mpifileutils-0.12-1.el8.x86_64/src/dsync/dsync.c:1697
1697                    if(strcmp(ref, removed_ref))
(gdb) l
1692                    if(type != MFU_TYPE_HARDLINK)
1693                        continue;
1694
1695                    /* skip item if reference does not match */
1696                    const char *ref = mfu_flist_file_get_ref(dst_list, dst_index);
1697                    if(strcmp(ref, removed_ref))
1698                        continue;
1699
1700                    /* skip item if type already differs */
1701                    dsync_state state;

Here we iterate on the byte array instead of the reference array.

This should look like:

    for (int i = 0; i < (int) ranks; i++) {
        size_t ref_size = char_extent * chars;
        char *rank_next = recvptr + recvcounts[i];

        for (;recvptr < rank_next; recvptr+=ref_size) {
...
        }
        recvptr = rank_next;

Maybe we should add a regression test for this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the pointer arithmetic over the the MPI_allgatherv() was wrong with chars. Fixed in 79b2219.

@rezib rezib force-pushed the for-upstream/hardlinks-support branch from be4954f to 8ddf45e Compare September 11, 2025 14:38
@rezib
Copy link
Contributor Author

rezib commented Sep 11, 2025

Hello @ajmes, thanks for the review and the report!

The bug was actually covered by the tests but for reason I can't explain I have never seen it fail. That's why real world test is always better that small CI test cases :)

I added some test cases to increase probability of failure. I also fixed another bug discovered while adding these test cases where modification on source hardlink reference could lead to double copy (and error eventually) of hardlinks in destination.

Copy link

@ajmes ajmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the patch update. I’ve noticed some additional corner case issues

while (current != NULL) {
elem_t* next = current->next;
mfu_free(&current->file);
mfu_free(&current->ref);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This segfault here during the tests with v5 cache format. If detail=0, ref pointer is never init (dwalk --input, dcp --input):

(gdb) bt
#0  0x00007fc31b08e675 in __GI___libc_free (mem=0x70616d6d206d6574) at malloc.c:3375
#1  0x00007fc31b88d374 in mfu_free (p=0xab73860) at /home/eaujames/mpifileutils/src/common/mfu_util.c:205
#2  0x00007fc31b85f4d1 in list_delete (flist=0xac5bce0) at /home/eaujames/mpifileutils/src/common/mfu_flist.c:410
#3  0x00007fc31b85f926 in mfu_flist_free (pbflist=0x7ffca0c8c1c0)
    at /home/eaujames/mpifileutils/src/common/mfu_flist.c:568
#4  0x0000000000402b0d in main (argc=3, argv=0x7ffca0c8c7a8) at /home/eaujames/mpifileutils/src/dwalk/dwalk.c:712
#2  0x00007fc31b85f4d1 in list_delete (flist=0xac5bce0) at /home/eaujames/mpifileutils/src/common/mfu_flist.c:410
410             mfu_free(&current->ref);
$13 = {file = 0x0, depth = 2, type = MFU_TYPE_DIR, detail = 0, mode = 8223706451981198701, 
  uid = 8386109825618634597, gid = 7379464839710535529, atime = 8390891584458421024, 
  atime_nsec = 7595413268515939429, mtime = 8247607283625304179, mtime_nsec = 8223706426295346546, 
  ctime = 2334102048954741093, ctime_nsec = 8316291872320417652, size = 2330684568022181236, 
  nlink = 8320808369521453414, 
  ref = 0x70616d6d206d6574 <error: Cannot access memory at address 0x70616d6d206d6574>, 
  next = 0xab720f0, obj_id_lo = 7378413688017811826, obj_id_hi = 2967907589870613857}

test:

test_dwalk.py:36: in run_dwalk                     
    return self.run_cmd(cmd)                       
...
E           AssertionError: Command error:                                                                                 
E            - command: mpirun --bind-to none --oversubscribe -N 8 /usr/local/bin/dwalk --input /tmp/tmpobbg88zs           
E            - exit code: 139                                                                                              
E            - stdout:                                                                                                     
E           [2025-09-11T11:42:18] Reading from input file: /tmp/tmpobbg88zs                                                
E           [2025-09-11T11:42:18] Read 9 files in 0.005 seconds (1762.452 files/sec)                                       
E           [2025-09-11T11:42:18] Items: 9                                                                                 
E           [2025-09-11T11:42:18]   Directories: 2                                                                         
E           [2025-09-11T11:42:18]   Files: 5                                                                               
E           [2025-09-11T11:42:18]   Links: 2                                                                               
E           [2025-09-11T11:42:18]   Hardlinks: 0                                                                           
E                                                                                                                          
E            - stderr:                                                                                                     
E           [eaujamesFR0130:322806:0:322806] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))    
E           ==== backtrace (tid: 322806) ====                                                                              
E            0  /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc31a7b5a74]                                                
E            1  /lib64/libucs.so.0(+0x1876d) [0x7fc31a7b776d]                                                              
E            2  /lib64/libucs.so.0(+0x1893d) [0x7fc31a7b793d]                                                              
E            3  /lib64/libc.so.6(+0x1a490) [0x7fc31b026490]                                                                
E            4  /lib64/libc.so.6(__libc_free+0x25) [0x7fc31b08e675]                                                        
E            5  /usr/local/lib64/libmfu.so.4.0.0(mfu_free+0x31) [0x7fc31b88d374]                                           
E            6  /usr/local/lib64/libmfu.so.4.0.0(+0xb4d1) [0x7fc31b85f4d1]                                                 
E            7  /usr/local/lib64/libmfu.so.4.0.0(mfu_flist_free+0x23) [0x7fc31b85f926]                                     
E            8  /usr/local/bin/dwalk(main+0xb98) [0x402b0d]                                                                
E            9  /lib64/libc.so.6(+0x3488) [0x7fc31b00f488]                                                                 
E           10  /lib64/libc.so.6(__libc_start_main+0x8b) [0x7fc31b00f54b]                                                  
E           11  /usr/local/bin/dwalk(_start+0x25) [0x401385]                                                               
E           =================================                                                                              
E           --------------------------------------------------------------------------                                     
E           prterun noticed that process rank 0 with PID 322806 on node eaujamesFR0130 exited on                           
E           signal 11 (Segmentation fault).                                                                                
E           --------------------------------------------------------------------------                                     
                                                                                                                           
lib.py:408: AssertionError                                                                                                 

With detail=0, ref is not set to NULL in list_elem_unpack(), so there is garbage in current->ref and mfu_free() try to free a random address.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can definitely add:

diff --git a/src/common/mfu_flist_io.c b/src/common/mfu_flist_io.c
index 37039bf..13ebefb 100644
--- a/src/common/mfu_flist_io.c
+++ b/src/common/mfu_flist_io.c
@@ -356,7 +356,8 @@ static size_t list_elem_unpack(const void* buf, int detail, uint64_t chars, elem
         mfu_unpack_io_uint64(&ptr, &elem->size);
         mfu_unpack_io_uint64(&ptr, &elem->nlink);
         mfu_unpack_sized_str(&ptr, &elem->ref, chars);
-    }
+    } else
+        elem->ref = NULL;
 
     size_t bytes = (size_t)(ptr - start);
     return bytes;

But I don't see a code path leading to list_elem_unpack() being called with detail=0… It is called in read_cache_v5()list_insert_ptr(detail=1)list_elem_unpack(detail=1). I can't find any other code path to this function.

How did you manage to produce this segfault with the tests? I'm not able to reproduce it 😟

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't think this came from read_cache_v5().

    def test_walk_input_lite(self):                                            
        with tempfile.NamedTemporaryFile(mode="w+") as fh:                     
            cache = Path(fh.name)                                              
            self.run_dwalk(output=cache, lite=True)                            
            proc = self.run_dwalk(input=cache)                                 
            # lite cache do not contain hardlinks, then dwalk misses it.       
            self.assertInProcStdout(                                           

This test uses "--lite" option (without stat), so detail is set to 0 in the cache file:

int mfu_flist_walk_paths(uint64_t num_paths, const char** paths,            
                          mfu_walk_opts_t* walk_opts, mfu_flist bflist,     
                          mfu_file_t* mfu_file)                             
{                                                                           
...
    flist->detail = 0;                             
    if (walk_opts->use_stat) {                     

Then we use write_cache_readdir_variable to write the cache file and read_cache_variable() to read it. read_cache_variable does not set ref:
read_cache_variable ->list_insert_decode() -> list_elem_decode().

So, I am more in favor of:

@@ -407,7 +407,8 @@ static void list_delete(flist_t* flist)          
     while (current != NULL) {                                       
         elem_t* next = current->next;                               
         mfu_free(&current->file);                                   
-        mfu_free(&current->ref);                                    
+       if (current->detail)                                         
+               mfu_free(&current->ref);                             
         mfu_free(&current);                                         
         current = next;                                             
     }                                                               

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed:

MFU_TYPE_FILE = 2, /* regular file */
MFU_TYPE_DIR = 3, /* directory */
MFU_TYPE_LINK = 4, /* symlink */
MFU_TYPE_HARDLINK = 5, /* hardlink */
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting a new type that does not reflect posix type can have side effect:
e.g:

$ touch test_tree0/file{0..9}
$ ln test_tree0/file0 test_tree0/file0_ln1
$ ln test_tree0/file0 test_tree0/file0_ln2
$ mpirun -np 8 --hostfile hosts.mpi dsync -S test_tree0/ test_tree1/
...
[2025-09-11T18:47:28]   Directories: 0
[2025-09-11T18:47:28]   Files: 10
[2025-09-11T18:47:28]   Links: 0
[2025-09-11T18:47:28]   Hardlinks: 2

# unlink a secondary ref (MFU_TYPE_HARDLINK)
# rm test_tree0/file0_ln2
$ mpirun -np 8 --hostfile hosts.mpi dcmp test_tree0/ test_tree1/
Number of items that exist in both directories: 12 (Src: 12 Dest: 12)
Number of items that exist only in one directory: N/A (Src: 0 Dest: 1)
Number of items that exist in both directories and have the same type: 12 (Src: 12 Dest: 12)
Number of items that exist in both directories and have different types: 0 (Src: 0 Dest: 0)
Number of items that exist in both directories and have the same content: 12 (Src: 12 Dest: 12)
Number of items that exist in both directories and have different contents: 0 (Src: 0 Dest: 0)

# unlink the primary ref (MFU_TYPE_FILE)
$ ln test_tree0/file0 test_tree0/file0_ln2
$ rm test_tree0/file0
$ mpirun -np 8 --hostfile hosts.mpi dcmp  --out EXIST=COMMON@TYPE=DIFFER:out --text test_tree0/ test_tree1/
Number of items that exist in both directories and have different types: 1 (Src: 1 Dest: 1), dumped to "out"
$ cat out
-rw-r--r-- eaujames eaujames   0.000   B Sep 11 2025 18:45 /tmp/test_tree0/file0_ln1
-rw-r--r-- eaujames eaujames   0.000   B Sep 11 2025 18:45 /tmp/test_tree1/file0_ln1
$ stat --printf "%n %i %F\n" test_tree*/*_ln*
test_tree0/file0_ln1 195619 regular empty file
test_tree0/file0_ln2 195619 regular empty file
test_tree1/file0_ln1 195758 regular empty file
test_tree1/file0_ln2 195758 regular empty file

Here we mark abitrary an entry as primary (MFU_TYPE_FILE) and the others as secondary (MFU_TYPE_HARDLINK). But that does not make sense for a Posix FS (all the entries point to the same inode), the type should be unchanged.

Can we save this information somewhere else? For example, we can use elem_t::detail to store detail flags.
e.g:

enum mfu_detail_e {
    MFU_DETAIL_NULL     = 0x0, /* detail not set */
    MFU_DETAIL_SET  = 0x1, /* element contains stat data */
    MFU_DETAIL_LINK = 0x2, /* element contains a valid link ref*/
};

This would enable to support hardlink for other types of inode (e.g: fifo).

Copy link
Contributor Author

@rezib rezib Sep 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting a new type that does not reflect posix type can have side effect: e.g:

$ touch test_tree0/file{0..9}
$ ln test_tree0/file0 test_tree0/file0_ln1
$ ln test_tree0/file0 test_tree0/file0_ln2
$ mpirun -np 8 --hostfile hosts.mpi dsync -S test_tree0/ test_tree1/
...
[2025-09-11T18:47:28]   Directories: 0
[2025-09-11T18:47:28]   Files: 10
[2025-09-11T18:47:28]   Links: 0
[2025-09-11T18:47:28]   Hardlinks: 2

# unlink a secondary ref (MFU_TYPE_HARDLINK)
# rm test_tree0/file0_ln2
$ mpirun -np 8 --hostfile hosts.mpi dcmp test_tree0/ test_tree1/
Number of items that exist in both directories: 12 (Src: 12 Dest: 12)
Number of items that exist only in one directory: N/A (Src: 0 Dest: 1)
Number of items that exist in both directories and have the same type: 12 (Src: 12 Dest: 12)
Number of items that exist in both directories and have different types: 0 (Src: 0 Dest: 0)
Number of items that exist in both directories and have the same content: 12 (Src: 12 Dest: 12)
Number of items that exist in both directories and have different contents: 0 (Src: 0 Dest: 0)

This output looks good to me so far, doesn't it?

# unlink the primary ref (MFU_TYPE_FILE)
$ ln test_tree0/file0 test_tree0/file0_ln2
$ rm test_tree0/file0
$ mpirun -np 8 --hostfile hosts.mpi dcmp  --out EXIST=COMMON@TYPE=DIFFER:out --text test_tree0/ test_tree1/
Number of items that exist in both directories and have different types: 1 (Src: 1 Dest: 1), dumped to "out"
$ cat out
-rw-r--r-- eaujames eaujames   0.000   B Sep 11 2025 18:45 /tmp/test_tree0/file0_ln1
-rw-r--r-- eaujames eaujames   0.000   B Sep 11 2025 18:45 /tmp/test_tree1/file0_ln1
$ stat --printf "%n %i %F\n" test_tree*/*_ln*
test_tree0/file0_ln1 195619 regular empty file
test_tree0/file0_ln2 195619 regular empty file
test_tree1/file0_ln1 195758 regular empty file
test_tree1/file0_ln2 195758 regular empty file

Yes, with the current logic:

  • test_tree0/file0_ln1 is considered a MFU_TYPE_FILE in source while test_tree1/file0_ln1 is considered MFU_TYPE_HARDLINK in destination → type difference
  • test_tree0/file0_ln2 has a reference to test_tree0/file0_ln1 in source while test_tree1/file0_ln2 has a reference to test_tree1/file0 → content difference

Sure, this leads to false positive differences in this case but hey, how could we best handle this case with a distributed algorithm?

Here we mark abitrary an entry as primary (MFU_TYPE_FILE) and the others as secondary (MFU_TYPE_HARDLINK). But that does not make sense for a Posix FS (all the entries point to the same inode), the type should be unchanged.

Can we save this information somewhere else? For example, we can use elem_t::detail to store detail flags. e.g:

enum mfu_detail_e {
    MFU_DETAIL_NULL     = 0x0, /* detail not set */
    MFU_DETAIL_SET  = 0x1, /* element contains stat data */
    MFU_DETAIL_LINK = 0x2, /* element contains a valid link ref*/
};

This would enable to support hardlink for other types of inode (e.g: fifo).

You're right in theory, except mpifileutils doesn't support fifo yet 😅 What's opinion of mpifileutils maintainers (@ofaaland, @carbonneau1) on this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# unlink the primary ref (MFU_TYPE_FILE)
$ ln test_tree0/file0 test_tree0/file0_ln2
$ rm test_tree0/file0
$ mpirun -np 8 --hostfile hosts.mpi dcmp  --out EXIST=COMMON@TYPE=DIFFER:out --text test_tree0/ test_tree1/
Number of items that exist in both directories and have different types: 1 (Src: 1 Dest: 1), dumped to "out"
$ cat out
-rw-r--r-- eaujames eaujames   0.000   B Sep 11 2025 18:45 /tmp/test_tree0/file0_ln1
-rw-r--r-- eaujames eaujames   0.000   B Sep 11 2025 18:45 /tmp/test_tree1/file0_ln1
$ stat --printf "%n %i %F\n" test_tree*/*_ln*
test_tree0/file0_ln1 195619 regular empty file
test_tree0/file0_ln2 195619 regular empty file
test_tree1/file0_ln1 195758 regular empty file
test_tree1/file0_ln2 195758 regular empty file

Yes, with the current logic:

* `test_tree0/file0_ln1` is considered a `MFU_TYPE_FILE` in source while `test_tree1/file0_ln1` is considered `MFU_TYPE_HARDLINK` in destination → type difference

* `test_tree0/file0_ln2` has a reference to `test_tree0/file0_ln1` in source while `test_tree1/file0_ln2` has a reference to `test_tree1/file0` → content difference

Sure, this leads to false positive differences in this case but hey, how could we best handle this case with a distributed algorithm?

FYI, I added in a test in ca9896f to document the behavior in this case.

During tree walk with details, regular files with more than one nlink
are temporarily placed in a hardlinks flist. This flist is then globally
ordered by names and ranked to select one reference path per inode, and
flag all other paths to this inodes as hardlinks. The sorted hardlinks
flist is finally merged in global flist with all other items.

The paths name ordering is performed to ensure reproducibility between
two similar trees, thus minimizing the differences for dcmp and dsync
eventually.

This commit introduces a new structure inodes_hardlink_map_t used to
temporarily associate paths to inodes in reference/hardlinks solving
logic.

The type elem_t receives 2 new members: nlink, the number of links on an
inode, and ref, the reference path to this inode. The ref is NULL except
on hardlinks.

This commit also introduces a new filetype MFU_TYPE_HARDLINK, which is
used to distinguish hardlinks to inodes from reference paths which have
MFU_TYPE_FILE type.

The packed flist element now contains the filetype, even when details
are enabled, as there is now way to determine if an element is a regular
file or a hardlink based on stat result.

New functions mfu_[un]pack_sized_str() are introduced to manage packing
and unpackaging of optional strings with maximum length.

Signed-off-by: Rémi Palancher <[email protected]>
Add support for hardlinks in dcp.

This function renames existing functions mfu_create_hardlink[s]() to
mfu_create_hardlink[s]_dest() to reflect their purpose related to
--link-dest option.

Two new functions mfu_create_hardlink[s]() are introduced to create all
hardlinks in destination directory with the appropriate link
destination. The summary at the end of copy is modified to mention
hardlinks operations.

Signed-off-by: Rémi Palancher <[email protected]>
Add support of hardlinks in dcmp. The reference paths of hardlinks in
source and destination are compared. If not equal, strmap is updated to
flag them as different.

The branchs in items comparison logic is now based on filetype recorded
in flist rather than the file mode as there is no way to distinguish
reference paths and hardlinks with just the mode, both are regular
files.

Signed-off-by: Rémi Palancher <[email protected]>
Add support of hardlinks in dsync. The reference paths of hardlinks in
source and destination are compared. If not equal, strmap is updated to
flag them as different.

The branchs in items comparison logic is now based on filetype recorded
in flist rather than the file mode as there is no way to distinguish
reference paths and hardlinks with just the mode, both are regular
files.

Additional logic is added with dsync_remove_hardlinks_with_removed_ref()
function to detect hardlinks whose references paths are marked for
deletion in destination. In this case, all the hardlinks pointing to
this reference are also marked for being replaced to avoid residual
links pointing to wrong inodes.

Signed-off-by: Rémi Palancher <[email protected]>
Add support for hardlinks in dtar, in all supported create and extract
algorithms.

New structure entry_list_t is introduced, it is used in some extract
algorithms to fill a temporary a list of hardlinks entries to create in
a second pass, after all other files are created.

Signed-off-by: Rémi Palancher <[email protected]>
Introduce cache format v5 which supports hardlinks encoding with nlink
and reference paths.

New read_cache_v5() is basically similar to read_cache_v4() except the
calls to list_elem_pack_size[_le4]() and list_insert_ptr[_le4]().

Signed-off-by: Rémi Palancher <[email protected]>
When dcp reads input list from cache, place files with more than one
links in a temporary list and resolve hardlinks, similary to the logic
implemented in walk with details.

Signed-off-by: Rémi Palancher <[email protected]>
This command adds many functional tests of dcmp, dcp, dsync, dtar and
dwalk, executed and automatically validated with Python standard
unittest library. This is designed to be easy to execute and integrate
in continuous integration systems.

Set two environment variables to define respectively the path to
mpifileutils binaries and arguments provided to mpirun, eg:

  $ export MFU_BIN=~/dev/bin
  $ export MFU_MPIRUN_ARGS="--bind-to none --oversubscribe -N 4"

And run all the tests:

  $ python3 -m unittest discover -v test

Or:

  $ pytest  # require pytest

The suite has utilities to check similarity between two trees, with the
possibility to specific paths and attributes (eg. mtime). It is also
possible to assert specific command outputs.

Most tests are run against a specific testing file tree to cover many
cases. Other tests are run with a file tree generated by dfilemaker.

Signed-off-by: Rémi Palancher <[email protected]>
Add continuous integration workflow to build and install lwgrp,
libcircle, dtcmp and mpifileutils and execute Python test suite in
github actions for all pull requests and merges in main branch.

Signed-off-by: Rémi Palancher <[email protected]>
Fix gathered removed hardlinks references buffer read logic and pointer
arithmetic with chars.

Signed-off-by: Rémi Palancher <[email protected]>
Check content of hardlinks do not already differ before adding them in
source copy list, in order to avoid multiple copies of this hardlink
eventually.

Signed-off-by: Rémi Palancher <[email protected]>
Add more dsync test cases to cover more corner cases when reference
files are changed in source tree.

Signed-off-by: Rémi Palancher <[email protected]>
Set char* ref to NULL when elem_t are decoded without detail.

Signed-off-by: Rémi Palancher <[email protected]>
The elem_t->ref attribute is assigned only when the element is
initialized with detail. Skip mfu_free() when the structure is
initializd without detail.

Signed-off-by: Rémi Palancher <[email protected]>
@rezib rezib force-pushed the for-upstream/hardlinks-support branch from 8ddf45e to a1bd23d Compare September 12, 2025 12:46
Copy link

@ajmes ajmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We hit another SEGFAULT in dsync_remove_hardlinks_with_removed_refs().

The crash occured at:

1715                 dsync_strmap_item_update(src_map, key, DCMPF_CONTENT, DCMPS_DIFFER);
static void dsync_strmap_item_update(
    strmap* map,
    const char *key,
    dsync_field field,
    dsync_state state)
{
    /* Should be long enough for 64 bit number and DCMPF_MAX */
    char new_val[21 + DCMPF_MAX];
    /* lookup item from map */
    const char* val = strmap_get(map, key);
    /* copy existing index over */
    assert(field < DCMPF_MAX);
assert(strlen(val) + 1 <= sizeof(new_val));

strmap_get() returned a NULL pointer. Then strlen(NULL) crashed dsync.

I have applied the following hot-fix for our migrations:

diff --git a/src/dsync/dsync.c b/src/dsync/dsync.c
index a84aa92..7f33ae2 100644
--- a/src/dsync/dsync.c
+++ b/src/dsync/dsync.c
@@ -418,6 +418,8 @@ static void dsync_strmap_item_update(
 
     /* lookup item from map */
     const char* val = strmap_get(map, key);
+    if (!val)
+           return;
 
     /* copy existing index over */
     assert(field < DCMPF_MAX);

This occcured on a "trash" entry: we use a robinhood policy to automaticaly put file in a trash folder. This policy use "lfs fid2path" to find all the hardlinks in Lustre and then it executes "mv" for each entry. I think the crash can be caused by racy entry renames.

dsync_strmap_item_update(dst_map, key, DCMPF_CONTENT, DCMPS_DIFFER);

if (!options.dry_run) {
mfu_flist_file_copy(src_list, src_index, src_cp_list);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A segmentation fault happens in here.

Add tests for dsync corner cases when hardlink reference is different in
source and new hardlink is present in destination. Test both with and
without --delete option.

Signed-off-by: Rémi Palancher <[email protected]>
When the content of an hardlink reference is different in destination,
dsync iterates over the other hardlinks to mark them with different
content as well. In this loop, we must skip the hardlink which do not
have corresponding entry in source tree. They are handled in
dsync_only_dst() when --delete is set eventually.

Signed-off-by: Rémi Palancher <[email protected]>
@rezib
Copy link
Contributor Author

rezib commented Sep 16, 2025

We hit another SEGFAULT in dsync_remove_hardlinks_with_removed_refs().

The crash occured at:

1715                 dsync_strmap_item_update(src_map, key, DCMPF_CONTENT, DCMPS_DIFFER);

[…]

Thanks @ajmes for reporting it!

Indeed, this segfault is triggered when a reference is different between source and destination and a new hardlink is present on this reference in destination. I was able to reproduce it with tests introduced in 90ff407.

This is fixed in d9d0cde.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dsync and probably dcp doesn't handle hardlinks

3 participants