Description
I am using timeexecution with a ThreadedBackend on top of an ElasticsearchBackend in a project at CERN. It is great software!
For reference, this is where timeexecution gets configured at the app's boot-up:
https://github.com/inspirehep/inspire-next/blob/899dabc588159dd9d45e7202692c675f073f0fe0/inspirehep/utils/ext.py#L76
The entire project runs in a Celery instance (which might live for days before getting restarted). Occasionally, it stops sending metrics to ES.
When this happens, the logs show that all metrics are discarded (log entry like: [2019-01-10 12:24:18,816: WARNING/ForkPoolWorker-16] Discard metric inspirehep.modules.records.receivers.push_to_orcid) because the queue is full and never gets consumed:
https://github.com/kpn-digital/py-timeexecution/blob/d1b005a337fb6705aa804da6c26b7d9d477b62fc/time_execution/backends/threaded.py#L43-L44
My guess is that the consumer thread started by the start_worker method has died (or hung).
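For context, this is roughly the pattern the ThreadedBackend uses, shown here as a simplified sketch (not the library's actual code; only `start_worker` and `write` mirror names mentioned above): `write()` is the producer that pushes metrics onto a bounded `queue.Queue` with `put_nowait`, and a single consumer thread drains the queue and forwards metrics to the wrapped backend. If that consumer thread is gone, nothing ever takes items off the queue, so once it fills up every subsequent `write()` hits `Full` and logs the "Discard metric" warning.

```python
import logging
import queue
import threading

logger = logging.getLogger(__name__)


class MiniThreadedBackend:
    """Simplified illustration of the producer/consumer pattern, not the real class."""

    def __init__(self, delegate, queue_maxsize=1000):
        self.delegate = delegate  # e.g. the Elasticsearch backend
        self.queue = queue.Queue(maxsize=queue_maxsize)
        self.thread = None
        self.start_worker()

    def start_worker(self):
        # Single consumer thread that drains the queue.
        self.thread = threading.Thread(target=self.worker, daemon=True)
        self.thread.start()

    def worker(self):
        while True:
            metric = self.queue.get()
            self.delegate.write(**metric)

    def write(self, name, **data):
        # Producer side: if the consumer thread is dead, the queue eventually
        # fills up and every metric ends up in the except branch below.
        try:
            self.queue.put_nowait(dict(name=name, **data))
        except queue.Full:
            logger.warning("Discard metric %s", name)
```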
Note that the same project also runs in a Gunicorn instance where the issue never happens.
The problem is solved when Celery is restarted.
INVESTIGATION
Celery has a MainProcess (main) and a number (2 in our case) of ForkPoolWorker-N (worker or child) processes. A worker process might be killed by the main process when it uses too much memory (log entry like: [2019-01-10 11:47:20,010: ERROR/ForkPoolWorker-1] child process exiting after exceeding memory limit (2760384KiB / 1370112KiB)).
When a worker process dies, it is replaced by a new one. I noticed that every time the issue happens, it is right after a worker process has been killed for using too much memory. I was able to reproduce the issue only a few times, and not in a deterministic way (even when setting a very low memory threshold, which triggers the kill very frequently).
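To trigger the kill on purpose, the per-child memory limit can be set very low in the Celery configuration. A minimal sketch, assuming Celery 4.x lowercase settings (the values are arbitrary, just low enough to recycle workers often):

```python
# celeryconfig.py (sketch): recycle worker processes aggressively so the
# "child process exiting after exceeding memory limit" path is hit frequently.
worker_concurrency = 2               # matches the 2 ForkPoolWorker-N processes
worker_max_memory_per_child = 50000  # in KiB, i.e. roughly 50 MB per child
```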
To complicate things: the consumer thread is owned by the MainProcess and the write method is executed in a ForkPoolWorker-N process (I added explicit logs to prove this).
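As a self-contained illustration of why this split matters (plain stdlib, Unix-only, unrelated to the Celery codebase): a `queue.Queue` and a thread started before `fork()` are not shared with the child process. The child gets its own copy of the queue and no running consumer thread, so whatever it puts there is never drained by the parent, and the child eventually hits `Full` on its own.

```python
import os
import queue
import threading
import time

q = queue.Queue(maxsize=5)

def consumer():
    while True:
        print("parent consumed:", q.get())

# Consumer thread started in the parent, like the thread owned by the MainProcess.
threading.Thread(target=consumer, daemon=True).start()

pid = os.fork()  # roughly what the prefork pool does for each ForkPoolWorker-N
if pid == 0:
    # Child process: it holds a *copy* of the queue, and the consumer thread
    # was not inherited across fork(), so these puts are never consumed.
    for i in range(10):
        try:
            q.put_nowait(i)
        except queue.Full:
            print("child: queue full, metric would be discarded")
    os._exit(0)

os.waitpid(pid, 0)
time.sleep(0.5)  # the parent's own queue stays empty: "parent consumed" is never printed
```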
POSSIBLE SOLUTION
```python
# In ThreadedBackend.write(), where the full queue is currently handled:
except Full:
    if not self.thread:  # and maybe: or not self.thread.is_alive()
        self.start_worker()
    logger.warning("Discard metric %s", name)
```
I am not going to make such a PR yet, as I was not able to reproduce the issue deterministically, but I want to keep track of it here.
UPDATE 15/01/2019
The possible solution mentioned above did not work, but I am now able to reproduce the issue systematically.
Celery has a main process and a number (depending on the configuration) of worker processes.
This is our case:
[htop screenshot: the main Celery process and its worker processes]
I added more logging statements (see the snippet after this list) and confirmed that:
- the consumer thread started here lives in the main Celery process. That is because the `ThreadedBackend` class is instantiated at the app's boot-up (thread with pid=32450 in the htop screenshot).
- the thread-safe queue is also created in the `__init__` (I guess it is the thread with pid=32447 in the htop screenshot).
- the producer code lives in the Celery worker processes (processes with pids=1017, 966 in the htop screenshot).
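The logging boils down to recording the pid and the thread name at the interesting call sites; an illustrative helper (not the exact statements that were added):

```python
import logging
import os
import threading

logger = logging.getLogger(__name__)

def log_location(where):
    # Records which process (by pid) and which thread a piece of code runs in;
    # the pid can then be matched against htop and the Celery log prefixes.
    logger.warning(
        "%s: pid=%s thread=%s",
        where,
        os.getpid(),
        threading.current_thread().name,
    )
```

Calling it from the producer code and from the consumer thread is enough to see that they end up in different processes.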
When a Celery worker process exceeds the memory limit set in the configuration, what happens is:
- the Celery worker process is killed and replaced by a new process.
- the producer code in the new Celery worker process keeps on queueing metrics until the queue gets full (a few seconds later).
- the consumer thread in the main Celery process sees an empty queue, and thus no metrics are sent.
I worked out a solution where the consumer thread, the queue and the producer code all live in the Celery worker processes. I have been testing it manually and on one canary production machine for a couple of days, and it works well (while all the other, non-canary machines are still affected by the issue). The solution is backward compatible. PR coming soon; a rough sketch of the idea follows.
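One way the idea can look, as a rough sketch rather than the actual PR (class, attribute and method names below are illustrative): on every `write()` the backend checks whether the queue and the consumer thread were created in the current process, and lazily recreates them after a fork, so each Celery worker process gets its own queue and its own consumer thread.

```python
import os
import queue
import threading


class PerProcessThreadedBackend:
    """Sketch: queue and consumer thread are (re)created in the writing process."""

    def __init__(self, delegate, queue_maxsize=1000):
        self.delegate = delegate
        self.queue_maxsize = queue_maxsize
        self._pid = None    # pid of the process that owns the current queue/thread
        self.queue = None
        self.thread = None

    def _ensure_worker(self):
        # After a fork the child inherits self.queue and self.thread from the
        # parent, but the thread is not running there: detect this via the pid
        # and rebuild both in the current process.
        if self._pid != os.getpid():
            self._pid = os.getpid()
            self.queue = queue.Queue(maxsize=self.queue_maxsize)
            self.thread = threading.Thread(target=self._worker, daemon=True)
            self.thread.start()

    def _worker(self):
        while True:
            metric = self.queue.get()
            self.delegate.write(**metric)

    def write(self, name, **data):
        self._ensure_worker()
        try:
            self.queue.put_nowait(dict(name=name, **data))
        except queue.Full:
            pass  # same "Discard metric" behaviour as before
```

With this, a metric written right after a worker process has been replaced simply causes a fresh queue and a fresh consumer thread to be created inside that worker, instead of filling a queue that nobody drains.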