I was running a particularly large task on a server with 3TB of RAM and 768 cores, and let Yirgacheffe pick its default level of parallelism, which is one child process per core, so 768 child processes.
By default Ubuntu had set /dev/shm to be 1.5TB.
I was processing a raster that was 1M pixels wide in float32, and Yirgacheffe's default chunk size is 512 rows. That means each chunk is allocated a shared memory segment of around 2GB, and indeed if I look in /dev/shm I can see close to that:
```
-rw------- 1 mwd24 mwd24 2.5G Dec 3 15:50 psm_fd7a2399
-rw------- 1 mwd24 mwd24 2.5G Dec 3 15:55 psm_fd99b152
-rw------- 1 mwd24 mwd24 2.5G Dec 3 15:50 psm_fdbde05b
-rw------- 1 mwd24 mwd24 2.5G Dec 3 15:50 psm_ffaba569
```
768 * 2.5GB is 1.9TB, which is more than the 1.5TB /dev/shm allows, so Yirgacheffe failed.
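For reference, the arithmetic behind those numbers (the raster width, chunk size, and core count are from this report; the ~2.5GB per segment is read off the listing above, treating the `ls` sizes as decimal GB as the total in the previous sentence does):

```python
width_px = 1_000_000      # raster width in pixels
chunk_rows = 512          # Yirgacheffe's default chunk size
bytes_per_px = 4          # float32
workers = 768             # one child process per core

# Raw payload of one chunk's shared memory segment:
segment_bytes = width_px * chunk_rows * bytes_per_px   # 2,048,000,000, ~2GB

# The segments actually observed in /dev/shm were ~2.5GB each:
observed_segment = 2_500_000_000

total = workers * observed_segment                 # 1.92e12, the ~1.9TB above
shm_limit = int(1.5 * 1024**4)                     # 1.5TiB: tmpfs default of
                                                   # half the machine's 3TB RAM
over_budget = total > shm_limit                    # True: hence the failure
```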
Unfortunately, the error you get is not caught cleanly by the Python runtime; instead the OS terminated the process with a SIGBUS:
```
'python3 ./prepare_layers/make_h…' terminated by signal SIGBUS (Misaligned address error)
```
This meant that the Python context manager's cleanup didn't run, so we leaked all the shared memory segments in /dev/shm.
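As a stopgap, the leaked segments can at least be found by hand. A minimal sketch, assuming the segments follow CPython's default `SharedMemory` naming (the `psm_` prefix seen in the listing above); it only lists candidates, since deciding which files are truly leaked and safe to unlink is up to the operator:

```python
from pathlib import Path

def find_leaked_segments(shm_dir="/dev/shm", prefix="psm_"):
    """List shared-memory files matching CPython's default
    SharedMemory name prefix, as seen in /dev/shm above."""
    root = Path(shm_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.glob(prefix + "*") if p.is_file())

# Listing only -- removal is a deliberate, separate step:
for seg in find_leaked_segments():
    print(seg, seg.stat().st_size)
```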
Yirgacheffe needs to avoid pushing the limits here, as the consequences are quite troublesome, particularly given it's meant to be used by non-systems engineers.
Options:
- Check how much shared mem space there is and throttle the number of concurrent processes
- Adjust the chunk size and keep the same number of processes (one could argue the 512-row chunk size was chosen for much narrower rasters, and is too big when your raster is 1M pixels wide)
- Just check before the next allocation and fail gracefully, making it the user's job to decide how best to solve this from these and other options
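The first option could be sketched like this, assuming Yirgacheffe knows the per-chunk segment size up front. The function names and the 0.9 headroom factor are illustrative, not Yirgacheffe's API; the free-space check uses `os.statvfs` on the shm filesystem:

```python
import os

def workers_that_fit(free_bytes, segment_bytes, headroom=0.9, hard_cap=None):
    """How many workers' shared-memory segments fit in free_bytes.

    headroom leaves a safety margin below the reported free space.
    Always returns at least 1, so a single-chunk job is still
    attempted (it may still fail if even one segment won't fit).
    """
    budget = int(free_bytes * headroom)
    workers = max(1, budget // segment_bytes)
    if hard_cap is not None:
        workers = min(workers, hard_cap)
    return workers

def max_workers_for_shm(segment_bytes, shm_path="/dev/shm", hard_cap=None):
    """Read free space on the shm filesystem and throttle accordingly."""
    st = os.statvfs(shm_path)
    return workers_that_fit(st.f_bavail * st.f_frsize, segment_bytes,
                            hard_cap=hard_cap)
```

In the scenario above, `max_workers_for_shm(segment_size, hard_cap=os.cpu_count())` would have capped the pool well below 768 rather than letting the kernel deliver SIGBUS.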