Conversation

@MahShaaban

The following changes were made to work in a large sarek run

  • Remove global limits (defined in params). I suspect the intention is to limit resources per process only
  • Add an errorStrategy, otherwise maxRetries does not kick in. I've only encountered 137 and 255, which are related to memory. Some sarek processes would raise other codes, but these are not included. Since we do modify the memory assignment relative to what sarek wants, it makes sense to handle them.
  • Match processes to the time limit using the wildcard ".*"
  • Limit container launching using queueSize, not submitRateLimit. This seems to be the preferred solution across other configs to avoid penalizing simple processes.
  • In my test, singularity is still not pulling and I had to reuse local images. If this is the case, I think
    • pullTimeout wouldn't be needed
    • Hard-coding cacheDir is a bit restrictive without write permission to it
    • autoMounts handles binding

A combined sketch of these changes follows this list.
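A minimal sketch of how these points could be combined in an institutional config (the retry codes, time cap, and queue size mirror the values discussed below, but treat them as illustrative rather than the final merged settings):

process {
    // retry only on the out-of-memory related exit codes seen so far
    errorStrategy = { task.exitStatus in [137, 255] ? 'retry' : 'terminate' }
    maxRetries    = 3
    maxErrors     = '-1'

    // apply the walltime cap to every process via the ".*" wildcard
    withName: ".*" {
        time = 5.d
    }
}

executor {
    // cap the number of concurrent jobs instead of throttling the submission rate
    queueSize = 50
}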

rachelicr and others added 30 commits February 19, 2025 17:23
accepting review suggestion - thanks.

Co-authored-by: James A. Fellows Yates <[email protected]>
Updated the configuration file due to recent simplification of the seadragon clusters.
Correct the inconsistency between the maximum parameter and the queue limit.
Correct the inconsistency between the maximum parameter and the queue limit.
Consensus is to have cleanup off by default
…nto lvclark-patch1

Need to pull automated linting
Update Seattle Children's profile for new HPC
Fix the bug where Nextflow fails to retrieve resource values when they are not explicitly set in the task.
bumproo and others added 14 commits March 4, 2025 07:30
Update engaging.config by changing partition, updating resource limit…
Update roslin.config - Remove -l rl9=false option
Removed cacheDir as we do not have a common directory anymore
Updated unibe_ibu institutional profile
	- add error strategy
	- match process to max time limit
	- limit queue size, not submit rate
	- as not pulling, singularity does not need binding, a hard-coded cache, or a timeout
maxErrors = '-1'

errorStrategy = { task.exitStatus in [137,255] ? 'retry' : 'terminate' }
withName: ".*" { time = 5.d }

@msarkis-icr Mar 10, 2025


Setting a fixed time limit of 5 days for all processes could lead to inefficient resource usage. Some tasks might finish much quicker...
Additionally, processes that request more resources (time, memory, or CPUs) are given lower priority in the queue. This means that if the cluster is busy, you risk waiting a long time before getting a chance to run.

@MahShaaban (Author)


Agreed. There may be a better way to increase the time without hard-coding it to 5.d, for example by scaling it with task.attempt.
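Something along these lines could work (a minimal sketch; the 24.h base is a placeholder value, not something taken from this config):

process {
    // grow the walltime with each retry instead of fixing it at 5 days
    time = { 24.h * task.attempt }
}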

maxRetries = 3
maxErrors = '-1'

errorStrategy = { task.exitStatus in [137,255] ? 'retry' : 'terminate' }


before merging to nf-core, we had
errorStrategy = { task.exitStatus in [143,137,104,134,139,140,247,255] ? 'retry' : 'finish' }
The reviewer told us that we don't need to set this, as usually each pipeline has its own errorStrategy.

To be honest, I have run into the exit code 255 error, and retrying was a waste of resources.
The only solution was to increase the mem/cpu or time!

@MahShaaban (Author)


I see. The issue here is that the current config ties the allocated memory to the number of CPUs, which is not what some processes expect. This resulted in an error 255 in my case, and I was able to increase the mem/cpu by using a retry error strategy and task.attempt in a custom config, roughly as sketched below.
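A minimal sketch of that kind of custom config (the process name and the 8.GB base are hypothetical, not values taken from this PR):

process {
    withName: 'SOME_MEMORY_HUNGRY_PROCESS' {
        errorStrategy = { task.exitStatus in [137, 255] ? 'retry' : 'finish' }
        maxRetries    = 3
        // request more memory on each retry instead of tying it to the CPU count
        memory        = { 8.GB * task.attempt }
    }
}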

@MahShaaban (Author)


In addition, retry does request more resources when the process resource allocation is written as a base value multiplied by task.attempt, which is the case in sarek.

max_time = 5.d
// max_memory = 256.GB
// max_cpus = 30
// max_time = 5.d

@msarkis-icr Mar 10, 2025


These values specify resource limits for tasks running on the compute node. This is useful for setting upper bounds on resource usage, especially when using dynamic resource allocation.
Any process asking for excessive resources will fail!

Also, this allows for dynamic resource allocation within these limits.

@MahShaaban (Author)


My understanding is that the max_* values in params apply globally to the whole workflow. I could be wrong. Do we want to limit a run to these resources?
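For context, in nf-core DSL2 pipelines such as sarek these params are typically consumed per task through the pipeline's check_max helper in its base config, so they cap individual task requests rather than the run as a whole. A simplified sketch of that convention (the 6.GB / 4-cpu / 8.h bases are illustrative):

params {
    max_memory = 256.GB
    max_cpus   = 30
    max_time   = 5.d
}

// inside the pipeline's own base.config, each per-task request is capped against the params above, e.g.
// memory = { check_max( 6.GB * task.attempt, 'memory' ) }
// cpus   = { check_max( 4 * task.attempt, 'cpus' ) }
// time   = { check_max( 8.h * task.attempt, 'time' ) }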

pullTimeout = 2.h
cacheDir = '/data/scratch/shared/SINGULARITY-DOWNLOAD/nextflow/.singularity'
// pullTimeout = 2.h
// cacheDir = '/data/scratch/shared/SINGULARITY-DOWNLOAD/nextflow/.singularity'


I'm confused: why would you avoid using the cacheDir?
To my understanding, runOptions, pullTimeout, and cacheDir will only be taken into account when pulling an image.
In the case where there are no images to pull, nothing will happen.

Again, this is a generic config that is meant to work for most cases.

@MahShaaban (Author)


True. The existence of these run options gave the impression that I could just execute the run without pre-downloading the containers, which still fails. I needed to pre-download the containers to a separate cache and override the cacheDir setting in a custom config, along the lines of the sketch below.
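A minimal sketch of that kind of override (the path is a hypothetical example, not the shared location from this config):

singularity {
    enabled    = true
    autoMounts = true
    // point Nextflow at a user-writable cache holding the pre-downloaded images
    cacheDir   = '/path/to/my/own/singularity-cache'   // hypothetical path
}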

// singularity containers launching at once, they
// cause an singularity error with exit code 255.
submitRateLimit = "2 sec"
// submitRateLimit = "2 sec"

@msarkis-icr Mar 10, 2025


As the above comment states, Alma gets overwhelmed when many processes are fired simultaneously.
Submitting one job every 2 sec seems acceptable.

By unsetting it, we allow an "unlimited" number of jobs to be launched simultaneously, which is not good!

I'm curious why you would want to remove this, and how beneficial it could be for long runs?

@MahShaaban (Author)


I see the need to limit simultaneous launches. I just think "2 sec" penalizes smaller, quick processes. queueSize may be a suitable alternative, roughly as in the sketch below.
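For comparison, a minimal sketch of the two throttling approaches (the numbers are illustrative):

executor {
    // at most 50 jobs queued or running at once; short jobs are not delayed while slots are free
    queueSize = 50

    // alternative: throttle submissions to one job every 2 seconds,
    // which delays even trivial processes when many are ready at once
    // submitRateLimit = '2 sec'
}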

// cause an singularity error with exit code 255.
submitRateLimit = "2 sec"
// submitRateLimit = "2 sec"
queueSize = 50


OK for queueSize, but again I don't see how it could be beneficial for long runs?

@MahShaaban (Author)


See above.

@MahShaaban requested a review from msarkis-icr March 27, 2025 09:05
msarkis-icr pushed a commit that referenced this pull request Apr 22, 2025