Open
Feature
Description
ReFrame Version: 4.8.4
Python version: 3.12.3
Scheduler: slurm
What I want to achieve: I want to test every node in a node list. I have more than 20 tests per node and 500 nodes.
Problem:
- Some of the nodes I want to test are already allocated to other users.
- If I launch my tests with --distribute=idle, the nodes that are already allocated will not be tested.
- If I launch my tests with --distribute=avail and I am unlucky, ReFrame might queue a lot of tests on the nodes allocated to other users. At some point, even if system.partitions.max_jobs is high enough, I reach the limit of jobs that I am allowed to submit (MaxSubmit in sacctmgr show association). I end up with a lot of jobs queued because ReFrame tried to launch the tests on nodes that were already allocated:
squeue -u $USER --state=pending | wc -l
568
but almost no tests are running:
squeue -u $USER --state=running | wc -l
2
In short: my test pipeline is completely stuck even though some nodes are idle! To put it another way: ReFrame submitted jobs across the node list without prioritizing the nodes that were idle. Because some nodes were allocated and ReFrame queued jobs on them anyway, I ended up reaching the maximum number of jobs allowed for my Slurm account.
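To make the situation concrete, here is a minimal sketch of how idle and busy nodes could be told apart before launching anything. It is not something ReFrame provides; the helper names `parse_node_states` and `query_node_states` are hypothetical, and it assumes that `sinfo -h -N -o "%N %T"` prints one `<node> <state>` pair per line on the cluster in question.

```python
import subprocess
from collections import defaultdict


def parse_node_states(sinfo_output):
    """Group node names by state, given the text printed by
    `sinfo -h -N -o "%N %T"` (one "<node> <state>" pair per line)."""
    by_state = defaultdict(set)
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        node, state = parts
        # sinfo may append a flag character to the state (e.g. "idle*");
        # strip it so that such nodes are grouped with the base state.
        by_state[state.rstrip("*~#!%$@^+-")].add(node)
    return by_state


def query_node_states(partition):
    """Run sinfo for one partition and return {state: set_of_nodes}."""
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-p", partition, "-o", "%N %T"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_node_states(out)
```

With something like `states = query_node_states("my_partition")` (the partition name is a placeholder), `states["idle"]` would hold the nodes that can run tests right away, while `states["allocated"] | states["mixed"]` would hold the nodes where submitted jobs are likely to stay pending and count against MaxSubmit.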
I did not find a similar problem in the existing issues.
Does anyone have an idea how to overcome this issue? That would be really helpful!