Test with a large run #1
base: master
Conversation
accepting review suggestion - thanks. Co-authored-by: James A. Fellows Yates <[email protected]>
Update pipeline configs
Updated the configuration file due to recent simplification of the seadragon clusters.
Correct the inconsistency between the maximum parameter and the queue limit.
Correct the inconsistency between the maximum parameter and the queue limit.
Update seadragon.config
Consensus is to have cleanup off by default
Update icr_alma.config
…nto lvclark-patch1 Need to pull automated linting
Update Seattle Children's profile for new HPC
Fix the bug where Nextflow fails to retrieve resource values when they are not explicitly set in the task.
Update seadragon.config
…s and adding queue size
Update engaging.config by changing partition, updating resource limit…
Update roslin.config - Remove -l rl9=false option
Removed cacheDir as we do not have a common directory anymore
Update mjolnir_globe.config
Updated unibe_ibu institutional profile
Update eva.config
- add error strategy
- match process time to the max time limit
- limit queue size, not submit rate
- as we are not pulling, singularity does not need binding, a hard-coded cache, or a timeout
```groovy
maxErrors     = '-1'
errorStrategy = { task.exitStatus in [137,255] ? 'retry' : 'terminate' }
withName: ".*" { time = 5.d }
```
Setting a fixed time limit of 5 days for all processes could lead to inefficient resource usage. Some tasks might finish much more quickly...
Additionally, processes that request more resources (time, memory, or CPUs) are given lower priority in the queue, so if the cluster is busy you risk waiting a long time before getting a chance to run.
Agreed. There may be a better way to increase the time limit without hard-coding it to 5.d, for example by scaling it with task.attempt.
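For illustration, a sketch of what that might look like, assuming retries are triggered on exit codes 137/255 as in this PR; the 24.h base and the 5.d cap are placeholder values, not agreed defaults:

```groovy
// Sketch only: scale walltime with the retry attempt instead of a flat 5 days.
// Base value (24.h) and ceiling (5.d) are illustrative placeholders.
process {
    errorStrategy = { task.exitStatus in [137, 255] ? 'retry' : 'terminate' }
    maxRetries    = 3

    withName: '.*' {
        // 24 h on the first attempt, 48 h on the second, capped at the 5-day limit
        time = { 24.h * task.attempt < 5.d ? 24.h * task.attempt : 5.d }
    }
}
```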
```groovy
maxRetries    = 3
maxErrors     = '-1'
errorStrategy = { task.exitStatus in [137,255] ? 'retry' : 'terminate' }
```
Before merging to nf-core, we had
`errorStrategy = { task.exitStatus in [143,137,104,134,139,140,247,255] ? 'retry' : 'finish' }`
The reviewer told us that we don't need to set this, as each pipeline usually has its own errorStrategy.
To be honest, I have run into error code 255, and retrying was a waste of resources.
The only solution was to increase the memory/CPUs or time!
I see. The issue here is that the current config ties the allocated memory to the number of CPUs, which is not what some processes expect. This resulted in error 255 in my case, and I could only increase the memory/CPUs by using a retry error strategy with task.attempt in a custom config.
In addition, a retry does request more resources when the process resource allocation is multiplied by task.attempt, which is the case in sarek.
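For example, a user-side override along those lines might look like this; the process name, base values and exit codes are illustrative only, not what sarek ships:

```groovy
// Hypothetical custom config (-c custom.config): decouple memory from the CPU
// count and let it grow with each retry. Process name and values are examples.
process {
    withName: 'GATK4_MARKDUPLICATES' {
        errorStrategy = { task.exitStatus in [137, 255] ? 'retry' : 'terminate' }
        maxRetries    = 3
        cpus          = 4
        memory        = { 30.GB * task.attempt }   // 30 GB, then 60 GB, then 90 GB
    }
}
```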
```diff
- max_time = 5.d
+ // max_memory = 256.GB
+ // max_cpus = 30
+ // max_time = 5.d
```
These values specify resource limits for tasks running on the compute node. This is useful for setting upper bounds on resource usage, especially when using dynamic resource allocation.
Any process asking for excessive resources will fail!
They also allow for dynamic resource allocation within these limits.
My understanding is that the max_* params apply globally to the whole workflow. I could be wrong. Do we want to limit a run to these resources?
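For reference, the block in question; my (possibly wrong) reading is that nf-core pipelines use these values to clamp each individual process request (via their check_max-style helper), so they act as per-process ceilings applied across the whole run rather than a total budget:

```groovy
// Ceilings read by the pipeline, not reservations made by the scheduler:
// a process requesting more than this is clamped down to these values
// (or fails, if the pipeline does not clamp); the run as a whole is not
// restricted to a single 30-CPU / 256 GB allocation.
params {
    max_memory = 256.GB
    max_cpus   = 30
    max_time   = 5.d
}
```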
```diff
- pullTimeout = 2.h
- cacheDir = '/data/scratch/shared/SINGULARITY-DOWNLOAD/nextflow/.singularity'
+ // pullTimeout = 2.h
+ // cacheDir = '/data/scratch/shared/SINGULARITY-DOWNLOAD/nextflow/.singularity'
```
I'm confused: why would you avoid using the cacheDir?
To my understanding, runOptions, pullTimeout and cacheDir are only taken into account when pulling an image.
If there are no images to pull, nothing will happen.
Again, this is a generic config that is meant to work for most cases.
True. The existence of these options gave the impression that I could just execute the run without pre-downloading the containers, which still fails. I needed to pre-download the containers to a separate cache and override the cacheDir setting in a custom config.
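For anyone hitting the same issue, the override was roughly of this shape; the path is a placeholder, and setting the NXF_SINGULARITY_CACHEDIR environment variable would have the same effect:

```groovy
// Hypothetical user config (-c my_cache.config): point Nextflow at a Singularity
// cache the user can actually write to, holding the pre-downloaded images.
singularity {
    enabled  = true
    cacheDir = '/home/username/singularity_cache'   // placeholder path
}
```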
```diff
  // singularity containers launching at once, they
  // cause an singularity error with exit code 255.
- submitRateLimit = "2 sec"
+ // submitRateLimit = "2 sec"
```
As the above comment states, Alma gets overwhelmed when many processes are fired simultaneously.
Submitting one job every 2 seconds seems acceptable.
By unsetting it, we allow an "unlimited" number of jobs to be launched simultaneously, which is not good!
I'm curious why you would want to remove this, and how beneficial it could be for long runs.
I see the need to limit simultaneous launches. I just think "2 sec" penalizes small, quick processes. queueSize may be a suitable alternative.
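Concretely, a sketch of that alternative (50 matches the value proposed in the diff just below; whether any rate limit should stay is still open):

```groovy
// Cap how many jobs sit in the scheduler at once instead of throttling submission
// speed, so short-lived processes are not held back by a fixed 2-second gap.
executor {
    queueSize = 50
    // submitRateLimit = '2 sec'   // could be kept as a gentler safeguard if needed
}
```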
```diff
  // cause an singularity error with exit code 255.
- submitRateLimit = "2 sec"
+ // submitRateLimit = "2 sec"
+ queueSize = 50
```
OK for queueSize, but again I don't see how it could be beneficial for long runs?
See above.
The following changes were made to work in a large sarek run:

- Commented out the global max_* limits (`params`). I suspect the intention is to limit resources per process only.
- Added an `errorStrategy`, otherwise `maxRetries` does not kick in. I've only encountered `137` and `255`, which are related to memory. Some sarek processes would return other codes, but these are not included. Since we do modify the memory assignment relative to what sarek wants, it makes sense to handle them.
- Set the `time` limit using a wildcard (`".*"`).
- Limit `queueSize`, not submit rate. This seems to be a preferred solution across other configs to avoid penalizing simple processes.
- `singularity` is still not pulling and I had to reuse local images. If this is the case, I think `pullTimeout` wouldn't be needed.
- `cacheDir` is a bit restrictive without write permission to it.
- `autoMounts` handles binding.
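A sketch of the Singularity scope those last three points describe, assuming images are pre-downloaded; the commented values are the ones this PR disables, kept here only to show what autoMounts replaces:

```groovy
// Illustrative only: with images pre-downloaded and reused locally, autoMounts
// takes care of binding host paths, so explicit binds, a shared hard-coded
// cacheDir and a pullTimeout are not strictly needed.
singularity {
    enabled    = true
    autoMounts = true
    // runOptions  = '--bind /data'   // hypothetical explicit bind, superseded by autoMounts
    // cacheDir    = '/data/scratch/shared/SINGULARITY-DOWNLOAD/nextflow/.singularity'
    // pullTimeout = 2.h
}
```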