Avoid running out of shared memory when doing calculations in parallel. #110

@mdales

I was running a particularly large task on a server with 3TB of RAM and 768 cores, and let Yirgacheffe pick the default level of parallelism, which is one child process per core, so 768 child processes.

By default Ubuntu had set /dev/shm to be 1.5TB.

I was processing a raster that was 1M pixels wide, with float32 values, and Yirgacheffe's default chunk size is 512 rows. That means each child process would allocate a shared memory segment of around 2GB (1M pixels × 4 bytes × 512 rows), and indeed if I look in /dev/shm I can see close to that:

-rw-------  1 mwd24 mwd24 2.5G Dec  3 15:50 psm_fd7a2399
-rw-------  1 mwd24 mwd24 2.5G Dec  3 15:55 psm_fd99b152
-rw-------  1 mwd24 mwd24 2.5G Dec  3 15:50 psm_fdbde05b
-rw-------  1 mwd24 mwd24 2.5G Dec  3 15:50 psm_ffaba569

768 * 2.5GB is 1.9TB, which is more than the 1.5TB that /dev/shm allows, so Yirgacheffe failed.
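To make the arithmetic above easy to re-run for other raster sizes, here is a minimal sketch of the estimate. The function and constant names are made up for illustration and are not part of Yirgacheffe, and the 1M pixel width is the rounded figure from this issue, not an exact value.

```python
# Rough estimate of shared memory demand, using the rounded numbers from this issue.
# All names here are hypothetical; nothing below is Yirgacheffe API.

def chunk_bytes(width_px: int, rows_per_chunk: int, bytes_per_pixel: int) -> int:
    """Size in bytes of one chunk held in a shared memory segment."""
    return width_px * rows_per_chunk * bytes_per_pixel

WIDTH_PX = 1_000_000        # "1M pixels wide" (approximate)
ROWS_PER_CHUNK = 512        # default chunk size quoted in this issue
FLOAT32_BYTES = 4
WORKERS = 768               # one child process per core
SHM_LIMIT = 1_500_000_000_000  # "/dev/shm" limit of 1.5TB, taken as decimal TB

per_chunk = chunk_bytes(WIDTH_PX, ROWS_PER_CHUNK, FLOAT32_BYTES)
estimated_total = per_chunk * WORKERS

print(f"per chunk: {per_chunk / 1e9:.2f} GB")
print(f"all workers: {estimated_total / 1e12:.2f} TB")
print(f"over limit: {estimated_total > SHM_LIMIT}")
```

Even with the 2GB estimate, not the 2.5G seen in the listing, one segment per worker already goes past the 1.5TB limit under these assumptions.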

Unfortunately, the error you get is not caught cleanly by the Python runtime; instead the OS terminated the process with a SIGBUS:

'python3 ./prepare_layers/make_h…' terminated by signal SIGBUS (Misaligned address error)

This meant that the Python context manager's cleanup didn't run, so we leaked all the shared memory segments in /dev/shm.
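As a way to see what was left behind after a crash like this, here is a small sketch that lists leftover segments without deleting anything. It assumes the segments were created with Python's `multiprocessing.shared_memory`, which names them with a `psm_` prefix on POSIX systems (matching the listing above). The function name `find_leaked_segments` is hypothetical.

```python
import os
from pathlib import Path

def find_leaked_segments(shm_dir="/dev/shm", prefix="psm_", uid=None):
    """Return (name, size_in_bytes) for segments in shm_dir owned by uid.

    This only reports files; deciding whether a segment is really leaked
    (and not in use by a live process) is left to the person running it.
    """
    if uid is None:
        uid = os.getuid()
    found = []
    for entry in Path(shm_dir).glob(prefix + "*"):
        try:
            info = entry.stat()
        except FileNotFoundError:
            # The segment was unlinked between listing and stat.
            continue
        if info.st_uid == uid:
            found.append((entry.name, info.st_size))
    return sorted(found)

if __name__ == "__main__":
    for name, size in find_leaked_segments():
        print(f"{name}\t{size / 2**30:.2f} GiB")
```

It is deliberately read-only: removing a segment that another running job still uses would cause a different failure.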

Yirgacheffe needs to avoid pushing the limits here, as the consequences are quite troublesome, particularly given it's meant to be used by non-systems engineers.

Options:

  • Check how much shared memory space there is and throttle the number of concurrent processes
  • Adjust the chunk size and keep the same number of processes (one could argue the chunk size value was set for narrower rasters than this one, and so 512 is too big when your raster is 1M pixels wide)
  • Just check before the next allocation and fail gracefully, leaving it to the user to decide how best to solve this between those and other options
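The three options above can be sketched roughly as follows. This is not how Yirgacheffe does (or will do) it: every name below is hypothetical, the 80% headroom is an arbitrary assumption, and it relies on `shutil.disk_usage` reporting free space for the tmpfs mounted at /dev/shm.

```python
import shutil

class SharedMemoryBudgetError(RuntimeError):
    """Raised when not even one segment fits in the shared memory budget."""

def free_shared_memory(path="/dev/shm"):
    """Free bytes on the filesystem backing shared memory."""
    return shutil.disk_usage(path).free

def _budget(free_bytes, headroom_percent):
    # Integer maths so the result is exact for very large byte counts.
    return free_bytes * headroom_percent // 100

def throttle_workers(free_bytes, segment_bytes, requested_workers,
                     headroom_percent=80):
    """Option 1: keep the chunk size, reduce the number of processes.

    Option 3 is the failure path: raise a clear error before allocating.
    """
    fit = _budget(free_bytes, headroom_percent) // segment_bytes
    if fit < 1:
        raise SharedMemoryBudgetError(
            f"a {segment_bytes} byte segment does not fit in shared memory"
        )
    return min(requested_workers, fit)

def shrink_chunk_rows(free_bytes, row_bytes, workers, default_rows=512,
                      headroom_percent=80):
    """Option 2: keep the number of processes, reduce rows per chunk."""
    rows = _budget(free_bytes, headroom_percent) // (row_bytes * workers)
    if rows < 1:
        raise SharedMemoryBudgetError(
            f"not even one row per worker fits for {workers} workers"
        )
    return min(default_rows, rows)

if __name__ == "__main__":
    free = free_shared_memory()
    row_bytes = 1_000_000 * 4  # 1M float32 pixels per row, as in this issue
    print(throttle_workers(free, row_bytes * 512, requested_workers=768))
    print(shrink_chunk_rows(free, row_bytes, workers=768))
```

Free space can change between the check and the allocation if other jobs share the machine, so a check like this lowers the chance of a SIGBUS without ruling it out.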
