luxiaoyong
Contributor

I have a problem with tree execution mode: when I use the --rcopy option to copy a large file (more than 12 MB) from two remote hosts to the local host, I get the error below:

command: clush -o -q -w host1,host2 -b -S --rcopy /home/collect.tar.gz --dest /home/tmp/

output:
Exception in thread Task-2:
Traceback (most recent call last):
  File "env/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "env/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "env/lib/python3.8/site-packages/ClusterShell/Task.py", line 390, in _thread_start
    self.excepthook(*sys.exc_info())
  File "env/lib/python3.8/site-packages/ClusterShell/CLI/Clush.py", line 822, in clush_excepthook
    raise exp
  File "env/lib/python3.8/site-packages/ClusterShell/Task.py", line 388, in _thread_start
    self._resume()
  File "env/lib/python3.8/site-packages/ClusterShell/Task.py", line 790, in _resume
    self._run(self.timeout)
  File "env/lib/python3.8/site-packages/ClusterShell/Task.py", line 403, in _run
    self._engine.run(timeout)
  File "env/lib/python3.8/site-packages/ClusterShell/Engine/Engine.py", line 723, in run
    self.runloop(timeout)
  File "env/lib/python3.8/site-packages/ClusterShell/Engine/EPoll.py", line 157, in runloop
    client._handle_read(sname)
  File "env/lib/python3.8/site-packages/ClusterShell/Worker/Exec.py", line 192, in _handle_read
    node_msgline(key, msg, sname) # handle full msg line
  File "env/lib/python3.8/site-packages/ClusterShell/Worker/Exec.py", line 166, in _on_nodeset_msgline
    self.worker._on_node_msgline(nodes, msg, sname)
  File "env/lib/python3.8/site-packages/ClusterShell/Worker/Worker.py", line 277, in _on_node_msgline
    self.eh.ev_read(self, node, sname, msg)
  File "env/lib/python3.8/site-packages/ClusterShell/Communication.py", line 258, in ev_read
    self.recv(msg)
  File "env/lib/python3.8/site-packages/ClusterShell/Propagation.py", line 270, in recv
    self.recv_ctl(msg)
  File "env/lib/python3.8/site-packages/ClusterShell/Propagation.py", line 376, in recv_ctl
    metaworker._on_remote_node_close(node, rc, self.gateway)
  File "env/lib/python3.8/site-packages/ClusterShell/Worker/Tree.py", line 459, in _on_remote_node_close
    bnode, len(tmptar.getmembers()),
  File "env/lib/python3.8/tarfile.py", line 1791, in getmembers
    self._load() # all members, we first have to
  File "env/lib/python3.8/tarfile.py", line 2379, in _load
    tarinfo = self.next()
  File "env/lib/python3.8/tarfile.py", line 2312, in next
    raise ReadError("unexpected end of data")
tarfile.ReadError: unexpected end of data

When files are copied from multiple remote nodes to the local node and the files are large (for example, 12 MB), the data is transferred in fragments, and the nodes do not all finish transmitting at the same time. When one node finishes, its RET message is received and the _on_remote_node_close function is triggered. At that point, files are also extracted for the nodes that have not finished transmitting, which leads to the error.
I fixed this bug by modifying this code. It is not clear to me whether these changes could cause other problems, so I hope you can help review the code. Thank you very much.
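
For illustration only, the failure mode described above can be reproduced with the standard tarfile module alone. This is a minimal sketch, not the ClusterShell code: PerNodeTarBuffers, make_tar and demo are hypothetical names, the archives are uncompressed and held in memory, and the 1536-byte cut point is arbitrary. It assumes the intended behavior is that each node's archive is read only when that node's own close event arrives.

```python
import io
import tarfile


def make_tar(name, payload):
    """Return the bytes of an uncompressed tar archive holding one file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:") as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


class PerNodeTarBuffers:
    """Hypothetical helper (not part of ClusterShell): one buffer per node."""

    def __init__(self):
        self._buffers = {}

    def on_chunk(self, node, data):
        # Append a received fragment to that node's own buffer.
        self._buffers.setdefault(node, io.BytesIO()).write(data)

    def on_node_close(self, node):
        # Read only the archive of the node that closed; buffers of
        # nodes that are still transmitting are left untouched.
        buf = self._buffers.pop(node)
        buf.seek(0)
        with tarfile.open(fileobj=buf, mode="r:") as tar:
            return [member.name for member in tar.getmembers()]


def demo():
    tar1 = make_tar("host1/collect.bin", b"a" * 4096)
    tar2 = make_tar("host2/collect.bin", b"b" * 4096)

    rx = PerNodeTarBuffers()
    rx.on_chunk("host1", tar1)          # host1 has sent everything
    rx.on_chunk("host2", tar2[:1536])   # host2 is still mid-transfer

    done = rx.on_node_close("host1")    # safe: only host1 is read

    # Reading host2's partial data at this point is what fails.
    try:
        with tarfile.open(fileobj=io.BytesIO(tar2[:1536]), mode="r:") as tar:
            tar.getmembers()
        premature = "ok"
    except tarfile.ReadError:
        premature = "ReadError"

    rx.on_chunk("host2", tar2[1536:])   # remaining fragments arrive
    late = rx.on_node_close("host2")
    return done, premature, late
```

Calling demo() shows that the archive of the node that has closed is readable, that reading the other node's partial archive raises tarfile.ReadError, and that the same archive is readable once its remaining fragments have arrived.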

@thiell thiell self-requested a review August 3, 2025 07:09
@thiell thiell self-assigned this Aug 3, 2025
@thiell thiell added the Lib/Tree label Aug 3, 2025
@thiell thiell added this to the 1.9.4 milestone Aug 3, 2025
luxiaoyong and others added 2 commits August 3, 2025 00:13
Fix defect in Tree._on_remote_node_close() for rcopy.
Now when a remote node closes, just finish extracting locally
without impacting other nodes.

Closes cea-hpc#545.
@thiell thiell removed their request for review August 3, 2025 07:24
@thiell thiell merged commit 815a8c2 into cea-hpc:master Aug 3, 2025
7 checks passed
@thiell
Collaborator

thiell commented Aug 3, 2025

@luxiaoyong I just simplified the code a bit further but kept your logic. Sorry for the delay with this and thank you!

thiell added a commit to thiell/clustershell that referenced this pull request Aug 4, 2025
Add test for rcopy with file as the source and with multiple targets.

Test the change from cea-hpc#545.
@thiell thiell mentioned this pull request Aug 4, 2025
github-merge-queue bot pushed a commit that referenced this pull request Aug 4, 2025
* tests: use mock hostname command remotely

As `hostname` is used by rcopy in tree mode via tar, mock it remotely
so we can simulate multiple target nodes in tests.

* tests: improve rcopy tree tests

Add test for rcopy with file as the source and with multiple targets.

Test the change from #545.