Skip to content

Conversation

@LanderOtto
Copy link
Collaborator

@LanderOtto LanderOtto commented Mar 21, 2025

This commit fixes a race condition in data transfers and fixes the duplication of the same output multiple times.
This error occurs when it is necessary to get a unique output path. If this method raises the FilExistsError, it does not wait for the data location of the source as happens in the transfer_data method. Thus, the data could not exist when the checksum method is performed.

A similar issue was fixed in the #328. It reappeared after the introduction of the StreamFlowPath class. Before starting a data transfer, it is necessary to wait for the source data location that becomes available. When the StreamFlowPath class was introduced, this wait was moved inside the resolve method. However, only the RemoteStreamFlowPath performed the wait.
Now, the resolve method of the LocalStreamFlowPath also performs the wait before resolving the path, and the same optimization of the resolve method of the RemoteStreamFlowPath is done. If the path is already present in the DataManager and is PRIMARY, it is returned directly.

Moreover, this commit introduces a fix for future similar errors. Many StreamFlowPath methods, including the checksum, check if the file is available in the DataLocation. An attribute that stores this information was added to reduce the need to query the DataManager each time.

This race condition issue has another problem. We said that before computing the operation on the path, we must wait for the DataLocation to become available. However, having as outputs the same file twice, it is possible to have the following case: the two tasks perform the get_unique_output_path, the first registers the path in the workflow dictionary, and the second will raise the FileExistsError. After that, the second should wait for the first to complete, but maybe the first did not create the DataLocation yet. Before this commit, the second continued the execution, raising errors.

@codecov
Copy link

codecov bot commented Mar 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.22%. Comparing base (b228997) to head (8b09893).

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #703      +/-   ##
==========================================
+ Coverage   72.15%   72.22%   +0.06%     
==========================================
  Files          89       89              
  Lines       11778    11812      +34     
  Branches     2074     2079       +5     
==========================================
+ Hits         8499     8531      +32     
- Misses       2797     2798       +1     
- Partials      482      483       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@LanderOtto LanderOtto force-pushed the fix/wait-resolve branch 7 times, most recently from 1cf7405 to 026b741 Compare March 21, 2025 18:06
The same problem was resolved in #328. This issue reappeared after the introduction of the `StreamFlowPath` class. When a file transfer is needed, before it is necessary to wait for the source data location to become available. With the introduction of the `StreamFlowPath` class, this wait was moved inside the `resolve` method. However, only the `RemoteStreamFlowPath` was performing the wait. Now, the `resolve` method of the `LocalStreamFlowPath` also performs the wait before resolving the path.
@LanderOtto
Copy link
Collaborator Author

LanderOtto commented Mar 22, 2025

Another solution can be to move the wait inside the checksum method of the StreamFlowPath. In this case, the wait should be performed in each method of the StreamFlowPath to avoid the same problem in future.

However, a deadlock can occur. In the current implementation of the DefaultDataManager, the DataLocation is created and registered, and after that, it is checked if the file is a symbolic link. If we add the wait in each method of the StreamFlowPath, it creates a cycle when the DataLocation is not available until the symbol link check is not performed.

edit. resolved and implemented

@LanderOtto LanderOtto marked this pull request as draft March 22, 2025 23:07
@LanderOtto LanderOtto changed the title Fixed race condition when a local copied is performed Fixed race condition when a copied is performed Mar 23, 2025
@LanderOtto LanderOtto marked this pull request as ready for review March 23, 2025 04:34
@LanderOtto LanderOtto changed the title Fixed race condition when a copied is performed Fixed race condition when a copies is performed Apr 4, 2025
@LanderOtto LanderOtto changed the title Fixed race condition when a copies is performed Fixed race condition when a copy is performed Apr 4, 2025
@LanderOtto LanderOtto marked this pull request as draft July 4, 2025 14:50
@GlassOfWhiskey GlassOfWhiskey force-pushed the master branch 7 times, most recently from b21f362 to e7820f1 Compare January 12, 2026 08:25
@GlassOfWhiskey GlassOfWhiskey force-pushed the master branch 3 times, most recently from 256e723 to b30eaba Compare January 23, 2026 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants