I created a new ESMValTool environment today. When I ran recipes on our compute servers, they failed because the jobs exceeded the requested wall time. The cause was a flood of UCX error messages (about a million of them) like the following:
[1755878403.551204] [sla-cpu-r-36:1290789:0] ib_md.c:282 UCX ERROR ibv_reg_mr(address=0x7ff761bfa000, length=37773312, access=0x10000f) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1755878403.551314] [sla-cpu-r-36:1290789:0] mpool.c:269 UCX ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error
[1755878403.565116] [sla-cpu-r-36:1290789:0] ib_md.c:282 UCX ERROR ibv_reg_mr(address=0x7ff768602000, length=544768, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1755878403.565266] [sla-cpu-r-36:1290789:0] mpool.c:269 UCX ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
[1755878403.565566] [sla-cpu-r-36:1290789:0] ib_md.c:282 UCX ERROR ibv_reg_mr(address=0x7ff762dd9000, length=19034112, access=0x10000f) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1755878403.565593] [sla-cpu-r-36:1290789:0] mpool.c:269 UCX ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
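The messages point at the locked-memory limit (`ulimit -l`), reported here as 8192 kbytes. To check what limit a process actually sees on a given node, something like the following minimal Python sketch could be used. It relies only on the standard-library `resource` module (Unix only); the helper name `format_memlock` is my own, and the script only reports the limit, it does not change it.

```python
import resource


def format_memlock(limit_bytes):
    """Render an RLIMIT_MEMLOCK value in kbytes, as `ulimit -l` reports it."""
    if limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit_bytes // 1024} kbytes"


if __name__ == "__main__":
    # getrlimit returns the (soft, hard) limits in bytes.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("soft limit:", format_memlock(soft))
    print("hard limit:", format_memlock(hard))
```

Running this inside a batch job rather than on a login node would show whether the scheduler applies a different limit to jobs.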
I compared the new ESMValTool environment with the v2.12.0 environment we have at the MO, which runs recipes successfully on our compute servers. The new environment contains a package called ucx that the v2.12.0 environment does not, so I removed it:
removed specs:
- ucx
The following packages will be downloaded:
package | build
---------------------------|-----------------
mpich-4.2.3 | h670b19f_100 13.1 MB conda-forge
------------------------------------------------------------
Total: 13.1 MB
The following packages will be REMOVED:
ucx-1.18.0-hfd9a62f_3
The following packages will be DOWNGRADED:
mpich 4.3.0-h1a8bee6_100 --> 4.2.3-h670b19f_100
Proceed ([y]/n)? y
The recipes then ran on our compute servers using this new, updated environment.
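For anyone wanting to repeat the environment comparison described above, here is a rough Python sketch of one way to do it. It assumes each environment has been dumped as `conda list`-style text (package name in the first column, comment lines starting with `#`). The function names and the sample listings are hypothetical; in particular, the old environment's real contents are not shown in this issue.

```python
def package_names(conda_list_lines):
    """Extract package names from `conda list`-style lines (first column)."""
    names = set()
    for line in conda_list_lines:
        line = line.strip()
        # Skip blank lines and the '#' header lines that conda prints.
        if not line or line.startswith("#"):
            continue
        names.add(line.split()[0])
    return names


def only_in_new(old_lines, new_lines):
    """Return the sorted package names present only in the new environment."""
    return sorted(package_names(new_lines) - package_names(old_lines))


if __name__ == "__main__":
    # Illustrative listings only, not the real environment contents.
    old_env = [
        "# Name  Version  Build  Channel",
        "mpich   4.2.3    h670b19f_100  conda-forge",
    ]
    new_env = [
        "# Name  Version  Build  Channel",
        "mpich   4.3.0    h1a8bee6_100  conda-forge",
        "ucx     1.18.0   hfd9a62f_3    conda-forge",
    ]
    print(only_in_new(old_env, new_env))
```

The comparison is by name only, so version changes such as the mpich downgrade do not show up; those would need a separate check.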
@ESMValGroup/technical-lead-development-team do you foresee any issues if we exclude ucx from future ESMValTool environments? Would it be possible to do this before the release, please? @jlenh