-
Notifications
You must be signed in to change notification settings - Fork 135
feat ( queue ) : add queue validator #857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| } | ||
| } | ||
|
|
||
| // Validate total children quotas don't exceed parent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually not an invalid state to have over-provisioning of quotas in child queues.
The scheduler enforces all the quota restrictions on both levels, so sometimes admins like to over-provision child queue under a parent queue, with the assumption that not all child-queues are fully utilized all the time. If they ARE requesting more resources than the parent queue allows, the scheduler simply resolves it within that queue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining.
In that case please correct me if im wrong, Child CPU/GPU/Memory quotas > parent quota -> This can be a WARNING instead of an ERROR.
cc @enoodle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @itsomri - it is a valid state to have a queue limiting the quota of all of its sub queues. I would also not warn that it "may cause pod scheduling failures" because it might be the intention:
For example I may have a queue for many researchers to use interactive jobs and I might want to limit their total GPU request to 5, but each one can also request up to 5. The pods will not fail to schedule but will just have to wait for their turn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this whole validation should be behind a feature flag that will be opt-in
| } | ||
| return nil, nil | ||
| } | ||
| // TODO: Remove after QueueValidator is fully integrated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will be missing after this PR for it to be fully integrated?
Description
Summary of Working Validation:
Related Issues
Fixes #81
Checklist
Breaking Changes
Additional Notes
cc @enoodle