
ucx package prevents recipes from running at the MO #4165

@ehogan


I created a new ESMValTool environment today. When I ran recipes on our compute servers, they failed because the jobs exceeded the requested time limit. The cause was roughly a million repetitions of the following messages:

[1755878403.551204] [sla-cpu-r-36:1290789:0]           ib_md.c:282  UCX  ERROR ibv_reg_mr(address=0x7ff761bfa000, length=37773312, access=0x10000f) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1755878403.551314] [sla-cpu-r-36:1290789:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=rc_recv_desc) chunk: Input/output error
[1755878403.565116] [sla-cpu-r-36:1290789:0]           ib_md.c:282  UCX  ERROR ibv_reg_mr(address=0x7ff768602000, length=544768, access=0xf) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1755878403.565266] [sla-cpu-r-36:1290789:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
[1755878403.565566] [sla-cpu-r-36:1290789:0]           ib_md.c:282  UCX  ERROR ibv_reg_mr(address=0x7ff762dd9000, length=19034112, access=0x10000f) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 8192 kbytes)
[1755878403.565593] [sla-cpu-r-36:1290789:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
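
The messages above point at the locked-memory limit (`ulimit -l`). As a rough way to confirm what a job actually sees, the limit can be read from Python's standard `resource` module. This is only a minimal sketch, assuming a Linux host; the helper names `format_limit` and `memlock_limits` are made up for illustration and are not part of ESMValTool or UCX.

```python
import resource


def format_limit(value: int) -> str:
    """Render an RLIMIT_MEMLOCK value (bytes) the way `ulimit -l` does (kbytes)."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kbytes"


def memlock_limits() -> tuple:
    """Return the (soft, hard) locked-memory limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"max locked memory: soft={soft}, hard={hard}")
```

Running this inside a job on the compute servers should show whether the soft limit matches the 8192 kbytes reported by UCX.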

I compared the new ESMValTool environment with the v2.12.0 environment we have at the MO (which can successfully run recipes on our compute servers) and found that the new environment contains a package called ucx. So I removed this package:

  removed specs:
    - ucx


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    mpich-4.2.3                |     h670b19f_100        13.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        13.1 MB

The following packages will be REMOVED:

  ucx-1.18.0-hfd9a62f_3

The following packages will be DOWNGRADED:

  mpich                                  4.3.0-h1a8bee6_100 --> 4.2.3-h670b19f_100 


Proceed ([y]/n)? y

After removing ucx, the recipes ran on our compute servers using the updated environment.
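
For anyone wanting to repeat the environment comparison described above, here is a minimal sketch that diffs two package lists in the `name=version=build` format written by `conda list --export`. The function names and the sample data are illustrative assumptions: the ucx and mpich versions are taken from the conda output above, but the contents of the old environment are made up for the example.

```python
def parse_export(text: str) -> dict:
    """Parse `conda list --export` style text into {package name: version}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        parts = line.split("=")
        packages[parts[0]] = parts[1] if len(parts) > 1 else ""
    return packages


def only_in_new(old: dict, new: dict) -> list:
    """Names of packages present in the new environment but not the old one."""
    return sorted(set(new) - set(old))


def changed_versions(old: dict, new: dict) -> dict:
    """Packages in both environments whose versions differ: {name: (old, new)}."""
    return {
        name: (old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    }


# Hypothetical sample data, not the real environment listings.
OLD_ENV = parse_export("# platform: linux-64\nmpich=4.2.3=h670b19f_100\n")
NEW_ENV = parse_export(
    "# platform: linux-64\n"
    "mpich=4.3.0=h1a8bee6_100\n"
    "ucx=1.18.0=hfd9a62f_3\n"
)

if __name__ == "__main__":
    print("only in new environment:", only_in_new(OLD_ENV, NEW_ENV))
    print("changed versions:", changed_versions(OLD_ENV, NEW_ENV))
```

With real listings, the text passed to `parse_export` would come from running `conda list --export` in each environment.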

@ESMValGroup/technical-lead-development-team do you foresee any issues if we exclude ucx from future ESMValTool environments? Would it be possible to do this before the release, please? @jlenh
