Conversation

@MahShaaban

The following changes were made to work in a large sarek run

  • Remove global limits (defined in params). I suspect the intention is to limit resources per process only
  • Add an errorStrategy, otherwise maxRetries does not kick in. I've only encountered 137 and 255, which are related to memory. Some sarek processes would raise other codes, but these are not included. Since we do modify the memory assignment relative to what sarek wants, it makes sense to handle them.
  • Match processes to the time limit using the wildcard ".*"
  • Limit container launching using queueSize, not submitRateLimit. This seems to be the preferred solution across other configs to avoid penalizing simple processes.
  • In my test, singularity is still not pulling and I had to reuse local images. If this is the case, I think
    • pullTimeout wouldn't be needed
    • Hard-coding cacheDir is a bit restrictive without write permission to it
    • autoMounts handles binding

A combined sketch of these changes follows this list.
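A minimal sketch of how these points could be combined in an institutional config (the retry codes, time cap, and queue size mirror the values discussed below, but treat them as illustrative rather than the final merged settings):

process {
    // retry only on the out-of-memory related exit codes seen so far
    errorStrategy = { task.exitStatus in [137, 255] ? 'retry' : 'terminate' }
    maxRetries    = 3
    maxErrors     = '-1'

    // apply the walltime cap to every process via the ".*" wildcard
    withName: ".*" {
        time = 5.d
    }
}

executor {
    // cap the number of concurrent jobs instead of throttling the submission rate
    queueSize = 50
}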

rachelicr and others added 30 commits February 19, 2025 17:23
accepting review suggestion - thanks.

Co-authored-by: James A. Fellows Yates <[email protected]>
Updated the configuration file due to recent simplification of the seadragon clusters.
Correct the inconsistency between the maximum parameter and the queue limit.
Correct the inconsistency between the maximum parameter and the queue limit.
Consensus is to have cleanup off by default
…nto lvclark-patch1

Need to pull automated linting
Update Seattle Children's profile for new HPC
Fix the bug where Nextflow fails to retrieve resource values when they are not explicitly set in the task.
bumproo and others added 14 commits March 4, 2025 07:30
Update engaging.config by changing partition, updating resource limit…
Update roslin.config - Remove -l rl9=false option
Removed cacheDir as we do not have a common directory anymore
Updated unibe_ibu institutional profile
	- add error strategy
	- match process to max time limit
	- limit queue size, not submit rate
	- as not pulling, singularity does not need binding, a hard-coded cache, or a timeout
maxErrors = '-1'

errorStrategy = { task.exitStatus in [137,255] ? 'retry' : 'terminate' }
withName: ".*" { time = 5.d }

@msarkis-icr Mar 10, 2025


Setting a fixed time limit of 5 days for all processes could lead to inefficient resource usage. Some tasks might finish much quicker...
Additionally, processes that request more resources (time, memory, or CPUs) are given lower priority in the queue. This means that if the cluster is busy, you risk waiting a long time before getting a chance to run.

@MahShaaban (Author)


Agreed. There may be a better way to increase the time without hard-coding it to 5.d, for example by scaling it with task.attempt.
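Something along these lines could work (a minimal sketch; the 24.h base is a placeholder value, not something taken from this config):

process {
    // grow the walltime with each retry instead of fixing it at 5 days
    time = { 24.h * task.attempt }
}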

maxRetries = 3
maxErrors = '-1'

errorStrategy = { task.exitStatus in [137,255] ? 'retry' : 'terminate' }


before merging to nf-core, we had
errorStrategy = { task.exitStatus in [143,137,104,134,139,140,247,255] ? 'retry' : 'finish' }
The reviewer told us that we don't need to set this, as usually each pipeline has its own errorStrategy.

To be honest, I have run into the exit code 255 error, and retrying was a waste of resources.
The only solution was to increase the mem/cpu or time!

@MahShaaban (Author)


I see. The issue here is that the current config ties the allocated memory to the number of CPUs, which is not what some processes expect. This resulted in an error 255 in my case, and I was able to increase the mem/cpu by using a retry error strategy and task.attempt in a custom config, roughly as sketched below.
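A minimal sketch of that kind of custom config (the process name and the 8.GB base are hypothetical, not values taken from this PR):

process {
    withName: 'SOME_MEMORY_HUNGRY_PROCESS' {
        errorStrategy = { task.exitStatus in [137, 255] ? 'retry' : 'finish' }
        maxRetries    = 3
        // request more memory on each retry instead of tying it to the CPU count
        memory        = { 8.GB * task.attempt }
    }
}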

@MahShaaban (Author)


In addition, retry does request more resources when the process resource allocation is written as a base value multiplied by task.attempt, which is the case in sarek.

max_time = 5.d
// max_memory = 256.GB
// max_cpus = 30
// max_time = 5.d

@msarkis-icr Mar 10, 2025


These values specify resource limits for tasks running on the compute node. This is useful for setting upper bounds on resource usage, especially when using dynamic resource allocation.
Any process asking for excessive resources will fail!

Also, this allows for dynamic resource allocation within these limits.

@MahShaaban (Author)


My understanding is that the max_* values in params apply globally to the whole workflow. I could be wrong. Do we want to limit a run to these resources?
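For context, in nf-core DSL2 pipelines such as sarek these params are typically consumed per task through the pipeline's check_max helper in its base config, so they cap individual task requests rather than the run as a whole. A simplified sketch of that convention (the 6.GB / 4-cpu / 8.h bases are illustrative):

params {
    max_memory = 256.GB
    max_cpus   = 30
    max_time   = 5.d
}

// inside the pipeline's own base.config, each per-task request is capped against the params above, e.g.
// memory = { check_max( 6.GB * task.attempt, 'memory' ) }
// cpus   = { check_max( 4 * task.attempt, 'cpus' ) }
// time   = { check_max( 8.h * task.attempt, 'time' ) }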

pullTimeout = 2.h
cacheDir = '/data/scratch/shared/SINGULARITY-DOWNLOAD/nextflow/.singularity'
// pullTimeout = 2.h
// cacheDir = '/data/scratch/shared/SINGULARITY-DOWNLOAD/nextflow/.singularity'


I'm confused: why would you avoid using the cacheDir?
To my understanding, runOptions, pullTimeout, and cacheDir will only be taken into account when pulling an image.
In the case where there are no images to pull, nothing will happen.

Again, this is a generic config that is meant to work for most cases.

@MahShaaban (Author)


True. The existence of these run options gave the impression that I could just execute the run without pre-downloading the containers, which still fails. I needed to pre-download the containers to a separate cache and override the cacheDir setting in a custom config, along the lines of the sketch below.
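A minimal sketch of that kind of override (the path is a hypothetical example, not the shared location from this config):

singularity {
    enabled    = true
    autoMounts = true
    // point Nextflow at a user-writable cache holding the pre-downloaded images
    cacheDir   = '/path/to/my/own/singularity-cache'   // hypothetical path
}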

// singularity containers launching at once, they
// cause an singularity error with exit code 255.
submitRateLimit = "2 sec"
// submitRateLimit = "2 sec"

@msarkis-icr Mar 10, 2025


As the above comment states, Alma gets overwhelmed when many processes are fired simultaneously.
Submitting one job every 2 sec seems acceptable.

By unsetting it, we allow an "unlimited" number of jobs to be launched simultaneously, which is not good!

I'm curious why you would want to remove this, and how beneficial it could be for long runs?

@MahShaaban (Author)


I see the need to limit simultaneous launches. I just think "2 sec" penalizes smaller, quick processes. queueSize may be a suitable alternative, roughly as in the sketch below.
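For comparison, a minimal sketch of the two throttling approaches (the numbers are illustrative):

executor {
    // at most 50 jobs queued or running at once; short jobs are not delayed while slots are free
    queueSize = 50

    // alternative: throttle submissions to one job every 2 seconds,
    // which delays even trivial processes when many are ready at once
    // submitRateLimit = '2 sec'
}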

// cause an singularity error with exit code 255.
submitRateLimit = "2 sec"
// submitRateLimit = "2 sec"
queueSize = 50


OK for queueSize, but again I don't see how it could be beneficial for long runs?

@MahShaaban (Author)


See above.

@MahShaaban requested a review from msarkis-icr March 27, 2025 09:05
msarkis-icr pushed a commit that referenced this pull request Apr 22, 2025