Skip to content

Conversation

@AlecThomson
Copy link

Closes #640

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me! I do wonder if we should expose cancel_command as a keyword argument on the SLURMCluster and pass that down to SLURMJob, in case users want to override it with something else.

@AlecThomson
Copy link
Author

I found that the HTCondor class already had something similar. I've added this to the base Job class in core. The downside is that this adds some extra boilerplate.

I've reworked the HTCondorJob and SLURMJob to make use of the new functionality

@AlecThomson AlecThomson changed the title Graceful Slurm job cancellation Generalise submission and cancellation arguments May 25, 2024
use_stdin = dask.config.get("jobqueue.%s.use-stdin" % self.config_name)
self.use_stdin = use_stdin

if self.submit_command_extra is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is all this code both in the Core class and the class inheriting it?

@guillaumeeb
Copy link
Member

Also, the test failures in slurm (and maybe others) looks related to this change.

@jacobtomlinson
Copy link
Member

CI is now mostly happy on main so I've just merged in so we can see up to date CI failures.

@AlecThomson
Copy link
Author

Sorry for the long lead time on this, everyone. I got myself tripped up between the methods on the base and inheriting classes - as noted by @guillaumeeb's comment. I believe I've got this all sorted now. Hopefully the CI will catch any lingering issues.

@jacobtomlinson
Copy link
Member

Thanks @AlecThomson! Looks like there are some linting issues (make sure you run pre-commit install) and some slurm issues.

@AlecThomson
Copy link
Author

Hmm - looks like some kind of timing error on the test. I don't quite understand why it's failing... 🤔

>                   assert time() < start + QUEUE_WAIT
E                   assert 1723021595.3378592 < (1723021535.2718754 + 60)
E                    +  where 1723021595.3378592 = time()

https://github.com/dask/dask-jobqueue/actions/runs/10278490598/job/28450471395#step:7:425

@jacobtomlinson
Copy link
Member

jacobtomlinson commented Aug 7, 2024

It is calling cluster.scale(n) and then waiting for the cluster to scale. The time assertion is just a timeout, so it's not scaling to the correct number in the time allowed.

Note: We don't use client.wait_for_workers(n) because that checks for "at least n workers" so doesn't wait when scaling down (xref dask/distributed#6374).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

More graceful job cancellation

3 participants