Memory leak in daemon runners? #4603

@giovannipizzi

I have been running ~2000 (QE Relax) WorkChains with AiiDA 1.5.0.
Now everything has finished: verdi process list is empty, and there are also zero messages in RabbitMQ (as a double check, see below).

However, my 8 workers are still using a lot of memory:

$ verdi daemon status
Profile: dispero2020
Daemon is running as PID 5782 since 2020-11-29 17:54:56
Active workers [8]:
  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
 5786    2.965      0    2020-11-29 17:54:56
 5787    2.855      0    2020-11-29 17:54:56
 5788    2.907      0.2  2020-11-29 17:54:56
 5789    2.766      0    2020-11-29 17:54:56
 5790    2.908      0.1  2020-11-29 17:54:57
 5791    2.655      0    2020-11-29 17:54:57
 5792    2.749      0    2020-11-29 17:54:57
 5793    2.866      0    2020-11-29 17:54:57

Note that this means ~1.8GB RAM/worker, so a total of ~15GB used!
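
For reference, the arithmetic behind those numbers can be sketched as follows. The total RAM of the machine is not stated above; the 62 GB used here is an assumption back-computed from "~2.9 % is about 1.8GB", so the result is only a rough estimate.

```python
# MEM % values copied from the `verdi daemon status` output above.
mem_percent = [2.965, 2.855, 2.907, 2.766, 2.908, 2.655, 2.749, 2.866]

# ASSUMPTION: total RAM of the machine in GB. It is not given in the report;
# this value is inferred from "~1.8GB RAM/worker" at roughly 2.9 %.
ASSUMED_TOTAL_RAM_GB = 62.0


def percent_to_gb(percent, total_gb=ASSUMED_TOTAL_RAM_GB):
    """Convert a MEM % value into GB for a machine with `total_gb` of RAM."""
    return percent / 100.0 * total_gb


per_worker_gb = [percent_to_gb(p) for p in mem_percent]
total_gb = sum(per_worker_gb)

print(f"mean per worker: {total_gb / len(per_worker_gb):.2f} GB")
print(f"all workers:     {total_gb:.2f} GB")
```

With the assumed 62 GB this gives a bit under 1.8 GB per worker and roughly 14 GB in total, consistent with the rounded figures above.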

I initially reported this already in #4598 but I thought it was due to the overload described there.
Instead, this time everything went smoothly with no excepted jobs.

Therefore, I am assuming that this is a memory leak, with some resources not properly released.
The amount of memory used is similar to the size of the corresponding file repository. ArrayData should still have some in-memory 'caching' of its arrays, so maybe each ArrayData node keeps all of its arrays in memory, and this is the cause?
We might want to remove that caching, but I still think the daemon should explicitly delete or release nodes from memory once they are no longer used. Maybe they remain because they stay in some DB session? (This is Django.)
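
One possible way to test this hypothesis is to count, inside a worker process, how many objects of a given class are still alive once all processes have finished. The following is only a minimal sketch using the standard-library gc module; FakeArrayNode is a hypothetical stand-in for a node with an in-memory cache, not an AiiDA class, and count_live_objects is a helper name made up for illustration.

```python
import gc
from collections import Counter


def count_live_objects(type_names):
    """Count objects tracked by the garbage collector, grouped by class name."""
    gc.collect()  # free unreachable reference cycles first
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return {name: counts.get(name, 0) for name in type_names}


class FakeArrayNode:
    """Hypothetical stand-in for a node that caches its arrays in memory."""

    def __init__(self, n):
        self._cache = {"array": [0.0] * n}


# Simulate nodes that are still referenced after the work is done.
finished = [FakeArrayNode(1000) for _ in range(5)]
before = count_live_objects(["FakeArrayNode"])

# Dropping the last references lets Python release the objects.
finished.clear()
after = count_live_objects(["FakeArrayNode"])

print(before, after)
```

If the count for a node class stays high in an idle worker, something (a cache, a DB session, a lingering reference in the runner) is still holding on to those objects.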

It would be good if someone could investigate this (maybe the next task for @chrisjsewell, but also feedback from @muhrin @sphuber @unkcpz is appreciated)

For completeness: as expected, if I stop the daemon and then restart it with verdi daemon start 8, I get a low memory usage:

$ verdi daemon status
Profile: dispero2020
Daemon is running as PID 56574 since 2020-12-01 13:00:33
Active workers [8]:
  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
56578    0.145        0  2020-12-01 13:00:33
56579    0.144        0  2020-12-01 13:00:33
56580    0.144        0  2020-12-01 13:00:33
56581    0.144        0  2020-12-01 13:00:33
56582    0.145        0  2020-12-01 13:00:33
56583    0.144        0  2020-12-01 13:00:33
56584    0.144        0  2020-12-01 13:00:33
56585    0.145        0  2020-12-01 13:00:33

The RabbitMQ management interface shows that there are no more messages (this was also the case before restarting the daemon):
[Screenshot: rabbitmq-mgmt]
