Commit 1453b6e

Don't create checkpoint in multiprocessing case if the job failed (#692)
* don't create checkpoint in multiprocessing case if the job failed
* update changelog
1 parent 29dda58 commit 1453b6e

3 files changed, +12 -5 lines changed

cluster_tools/Changelog.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ For upgrade instructions, please check the respective *Breaking Changes* section
 ### Changed
 
 ### Fixed
-
+- Fixed that the ProcessPoolExecutor by the cluster tools would also create a checkpoint if the job failed. This was a regression introduced by #686. [#692](https://github.com/scalableminds/webknossos-libs/pull/692)
 
 ## [0.9.18](https://github.com/scalableminds/webknossos-libs/releases/tag/v0.9.18) - 2022-04-06
 [Commits](https://github.com/scalableminds/webknossos-libs/compare/v0.9.17...v0.9.18)

cluster_tools/cluster_tools/__init__.py

Lines changed: 9 additions & 3 deletions
@@ -168,10 +168,16 @@ def _execute_and_persist_function(output_pickle_path, *args, **kwargs):
         result = False, exc
         logging.warning(f"Job computation failed with:\n{exc.__repr__()}")
 
-    with open(output_pickle_path, "wb") as file:
-        pickling.dump(result, file)
-
     if result[0]:
+        # Only pickle the result in the success case, since the output
+        # is used as a checkpoint.
+        # Note that this behavior differs a bit from the cluster executor
+        # which will always serialize the output (even exceptions) to
+        # disk. However, the output will have a .preliminary prefix at first
+        # which is only removed in the success case so that a checkpoint at
+        # the desired target only exists if the job was successful.
+        with open(output_pickle_path, "wb") as file:
+            pickling.dump(result, file)
         return result[1]
     else:
         raise result[1]
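
The behavior introduced by this hunk can be illustrated with a minimal, standalone sketch. It is not the cluster_tools implementation: the standard-library `pickle` module stands in for the project's `pickling` module, `func` is passed explicitly, and the names `execute_and_persist` and `checkpoint_exists_after` are hypothetical helpers made up for this example.

```python
import logging
import os
import pickle
import tempfile


def execute_and_persist(func, output_pickle_path, *args, **kwargs):
    """Run func and write a checkpoint only if it succeeded (illustrative)."""
    try:
        result = True, func(*args, **kwargs)
    except Exception as exc:
        result = False, exc
        logging.warning(f"Job computation failed with:\n{exc!r}")

    if result[0]:
        # The pickle doubles as a checkpoint, so it must only exist on success.
        with open(output_pickle_path, "wb") as file:
            pickle.dump(result, file)
        return result[1]
    else:
        raise result[1]


def checkpoint_exists_after(func, *args):
    """Hypothetical helper: report whether running func left a checkpoint."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "job.pickle")
        try:
            execute_and_persist(func, path, *args)
        except Exception:
            # The failure itself is expected here; only the file matters.
            pass
        return os.path.exists(path)
```

Under these assumptions, a job that returns normally leaves a pickle file behind, while a job that raises leaves none, so a later run cannot mistake a failed job for a finished one.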

cluster_tools/cluster_tools/schedulers/cluster_executor.py

Lines changed: 2 additions & 1 deletion
@@ -224,13 +224,14 @@ def _completion(self, jobid, failed_early):
 
         if success:
             # Remove the .preliminary postfix since the job was finished
-            # successfully. # Therefore, the result can be used as a checkpoint
+            # successfully. Therefore, the result can be used as a checkpoint
             # by users of the clustertools.
             os.rename(preliminary_outfile_name, outfile_name)
             logging.debug("Pickle file renamed to {}.".format(outfile_name))
 
             fut.set_result(result)
         else:
+            # Don't remove the .preliminary postfix since the job failed.
             fut.set_exception(RemoteException(result, jobid))
 
         # Clean up communication files.
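
The cluster executor's approach, where the output is always written but only promoted to a checkpoint on success, can be sketched roughly as follows. This is an assumption-laden illustration rather than the project's code: `finalize_output` and `demo` are hypothetical names, the standard `pickle` module is used in place of the project's serializer, and the failure payload is a plain string instead of a real exception.

```python
import os
import pickle
import tempfile


def finalize_output(preliminary_outfile_name, outfile_name):
    """Promote a preliminary output to a checkpoint only on success."""
    with open(preliminary_outfile_name, "rb") as file:
        success, result = pickle.load(file)

    if success:
        # Dropping the .preliminary postfix makes the file a usable checkpoint.
        os.rename(preliminary_outfile_name, outfile_name)
    # On failure the file keeps its .preliminary postfix.
    return success, result


def demo(success):
    """Return (checkpoint exists, preliminary file exists) after finalizing."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        outfile_name = os.path.join(tmp_dir, "job.pickle")
        preliminary = outfile_name + ".preliminary"
        payload = (True, 42) if success else (False, "boom")
        with open(preliminary, "wb") as file:
            pickle.dump(payload, file)
        finalize_output(preliminary, outfile_name)
        return os.path.exists(outfile_name), os.path.exists(preliminary)
```

In this sketch the target path only ever appears through the rename, so its existence can be read as "the job finished successfully".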
