* cancel all running slurm and pbs jobs in case the executor is killed
* avoid using the logging module during shutdown to avoid additional errors
* try to close multiprocessing logging handler on SystemExit
* revert logger closing
* do not react to sigterm and do not call sys.exit to allow clean shutdown by calling process
* update changelog
* adapt test
* use logging during shutdown since it shouldn't cause additional errors
* apply PR feedback
* format
* fix exception during shutdown for non-array jobs
* add args and kwargs to ignored-argument-names for pylint
* Merge branch 'master' of github.com:scalableminds/webknossos-libs into cancel-cluster-jobs
* signal jobs with SIGINT instead of SIGTERM so that recursively scheduled jobs can be canceled
* Only send SIGINT to running jobs as scancel stalls otherwise. Use scancel without a signal parameter to cancel pending jobs.
* Cancel pending jobs even if canceling running jobs did not yield exit code 0
* Do not interfere with existing SIGINT handlers. Call it after signal handling in case one exists.
* Avoid deadlock in executor shutdown
* First cancel the pending slurm jobs, then the running ones to avoid race conditions
* fix handle_kill call in tests
* Merge branch 'master' of github.com:scalableminds/webknossos-libs into cancel-cluster-jobs
* improve troubleshooting instructions in dockered slurm README
* Add test for slurm job cancellation and prepare slurm version update
* fix typing
* use new slurm docker image with updated slurm version
* fix linting
* correctly restore SLURM_MAX_RUNNING_SIZE env variable in tests
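
The cancellation order described in the commits above (pending jobs first, without a signal parameter, then SIGINT to running jobs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `build_scancel_commands` and `cancel_jobs` are hypothetical names, and the only `scancel` options used are `--state` and `--signal`.

```python
import subprocess
from typing import Callable, List, Sequence


def build_scancel_commands(job_ids: Sequence[str]) -> List[List[str]]:
    """Build scancel invocations: pending jobs first, then running jobs.

    Hypothetical helper for illustration; the real implementation may differ.
    """
    if not job_ids:
        return []
    id_args = list(job_ids)
    return [
        # 1. Pending jobs are canceled without a signal parameter.
        ["scancel", "--state=PENDING", *id_args],
        # 2. Running jobs receive SIGINT so they get a chance to cancel
        #    jobs they scheduled themselves.
        ["scancel", "--state=RUNNING", "--signal=INT", *id_args],
    ]


def cancel_jobs(
    job_ids: Sequence[str],
    run: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
) -> List[int]:
    """Run every scancel command and collect the exit codes.

    check=False means a non-zero exit code of one command does not
    prevent the next command from running.
    """
    exit_codes = []
    for command in build_scancel_commands(job_ids):
        result = run(command, check=False)
        exit_codes.append(result.returncode)
    return exit_codes
```

Splitting command construction from execution keeps the ordering logic testable on machines without a Slurm installation; the `run` parameter exists only for that purpose.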
cluster_tools/Changelog.md: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ For upgrade instructions, please check the respective *Breaking Changes* section
 ### Added

 ### Changed
+- When using the slurm or pbs distribution strategy, scheduled jobs are automatically canceled when aborting a run, i.e. if the SIGINT signal is received. [#838](https://github.com/scalableminds/webknossos-libs/pull/838)
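
To illustrate the behavior in this changelog entry, here is a minimal sketch of a SIGINT handler that cancels scheduled jobs and then calls any handler that was installed before it, as the commit list describes. The names `make_chained_sigint_handler` and `install_sigint_cancellation` are hypothetical, and the `cancel_jobs` callback stands in for whatever the executor uses to cancel its cluster jobs; the actual code in cluster_tools may be structured differently.

```python
import signal
from types import FrameType
from typing import Any, Callable, Optional


def make_chained_sigint_handler(
    cancel_jobs: Callable[[], None],
    previous_handler: Any,
) -> Callable[[int, Optional[FrameType]], None]:
    """Return a handler that cancels jobs, then defers to the previous handler."""

    def handler(signum: int, frame: Optional[FrameType]) -> None:
        # Cancel the scheduled cluster jobs first.
        cancel_jobs()
        # signal.getsignal() may return SIG_DFL, SIG_IGN or None, none of
        # which is callable, so only real Python handlers are chained.
        if callable(previous_handler):
            previous_handler(signum, frame)

    return handler


def install_sigint_cancellation(cancel_jobs: Callable[[], None]) -> None:
    """Install the chained handler for SIGINT (call from the main thread)."""
    previous_handler = signal.getsignal(signal.SIGINT)
    signal.signal(
        signal.SIGINT,
        make_chained_sigint_handler(cancel_jobs, previous_handler),
    )
```

Because Python's default SIGINT handler (`signal.default_int_handler`) is callable, chaining to it still raises `KeyboardInterrupt` after the jobs have been canceled, so the existing abort behavior of the calling process is preserved.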
cluster_tools/dockered-slurm/README.md: 4 additions & 0 deletions
@@ -32,6 +32,8 @@ Run `docker-compose` to instantiate the cluster:
 $ docker-compose up -d
 ```

+> Note: If you encounter permission errors (`Failed to check keyfile "/etc/munge/munge.key": Permission denied`), follow the steps from the "Deleting the Cluster" section and run the previous command again.
+
 ## Register the Cluster with SlurmDBD

 To register the cluster to the slurmdbd daemon, run the `register_cluster.sh`
@@ -48,6 +50,8 @@ $ ./register_cluster.sh
 > You can check the status of the cluster by viewing the logs: `docker-compose
 > logs -f`

+> Note: If you encounter an error that the daemon is not running (`Error response from daemon: Container <...> is not running`), the start of the containers was not successful. Check the logs using `docker-compose logs -f` and revisit the last step.
+
 ## Accessing the Cluster

 Use `docker exec` to run a bash shell on the controller container: