Description
I am using timeexecution with a ThreadedBackend on top of an ElasticsearchBackend in a project at CERN. It is great software!
For reference, this is where timeexecution gets configured at the app's boot-up:
https://github.com/inspirehep/inspire-next/blob/899dabc588159dd9d45e7202692c675f073f0fe0/inspirehep/utils/ext.py#L76
The entire project runs in a Celery instance (which might live for days before getting restarted). Occasionally, it stops sending metrics to ES.
When this happens, the logs show that all metrics are discarded (log entry like: [2019-01-10 12:24:18,816: WARNING/ForkPoolWorker-16] Discard metric inspirehep.modules.records.receivers.push_to_orcid) because the queue is full and never gets consumed:
https://github.com/kpn-digital/py-timeexecution/blob/d1b005a337fb6705aa804da6c26b7d9d477b62fc/time_execution/backends/threaded.py#L43-L44
My guess is that the consumer thread started by the start_worker method has died (or hung).
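For context, this is roughly the pattern the ThreadedBackend uses, shown here as a simplified sketch (not the library's actual code; only `start_worker` and `write` mirror names mentioned above): `write()` is the producer that pushes metrics onto a bounded `queue.Queue` with `put_nowait`, and a single consumer thread drains the queue and forwards metrics to the wrapped backend. If that consumer thread is gone, nothing ever takes items off the queue, so once it fills up every subsequent `write()` hits `Full` and logs the "Discard metric" warning.

```python
import logging
import queue
import threading

logger = logging.getLogger(__name__)


class MiniThreadedBackend:
    """Simplified illustration of the producer/consumer pattern, not the real class."""

    def __init__(self, delegate, queue_maxsize=1000):
        self.delegate = delegate  # e.g. the Elasticsearch backend
        self.queue = queue.Queue(maxsize=queue_maxsize)
        self.thread = None
        self.start_worker()

    def start_worker(self):
        # Single consumer thread that drains the queue.
        self.thread = threading.Thread(target=self.worker, daemon=True)
        self.thread.start()

    def worker(self):
        while True:
            metric = self.queue.get()
            self.delegate.write(**metric)

    def write(self, name, **data):
        # Producer side: if the consumer thread is dead, the queue eventually
        # fills up and every metric ends up in the except branch below.
        try:
            self.queue.put_nowait(dict(name=name, **data))
        except queue.Full:
            logger.warning("Discard metric %s", name)
```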
Note that the same project also runs in a Gunicorn instance where the issue never happens.
The problem is solved when Celery is restarted.
INVESTIGATION
Celery has a MainProcess (main) and a number (2 in our case) of ForkPoolWorker-N (worker or child) processes. A worker process might be killed by the main process when it uses too much memory (log entry like: [2019-01-10 11:47:20,010: ERROR/ForkPoolWorker-1] child process exiting after exceeding memory limit (2760384KiB / 1370112KiB)).
When a worker process dies, it is replaced by a new one. I noticed that every time the issue happens, it is right after a worker process has been killed for using too much memory. I was able to reproduce the issue only a few times, and not in a deterministic way (even when setting a very low memory threshold, which triggers the kill very frequently).
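To trigger the kill on purpose, the per-child memory limit can be set very low in the Celery configuration. A minimal sketch, assuming Celery 4.x lowercase settings (the values are arbitrary, just low enough to recycle workers often):

```python
# celeryconfig.py (sketch): recycle worker processes aggressively so the
# "child process exiting after exceeding memory limit" path is hit frequently.
worker_concurrency = 2               # matches the 2 ForkPoolWorker-N processes
worker_max_memory_per_child = 50000  # in KiB, i.e. roughly 50 MB per child
```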
To complicate things: the consumer thread is owned by the MainProcess and the write method is executed in a ForkPoolWorker-N process (I added explicit logs to prove this).
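As a self-contained illustration of why this split matters (plain stdlib, Unix-only, unrelated to the Celery codebase): a `queue.Queue` and a thread started before `fork()` are not shared with the child process. The child gets its own copy of the queue and no running consumer thread, so whatever it puts there is never drained by the parent, and the child eventually hits `Full` on its own.

```python
import os
import queue
import threading
import time

q = queue.Queue(maxsize=5)

def consumer():
    while True:
        print("parent consumed:", q.get())

# Consumer thread started in the parent, like the thread owned by the MainProcess.
threading.Thread(target=consumer, daemon=True).start()

pid = os.fork()  # roughly what the prefork pool does for each ForkPoolWorker-N
if pid == 0:
    # Child process: it holds a *copy* of the queue, and the consumer thread
    # was not inherited across fork(), so these puts are never consumed.
    for i in range(10):
        try:
            q.put_nowait(i)
        except queue.Full:
            print("child: queue full, metric would be discarded")
    os._exit(0)

os.waitpid(pid, 0)
time.sleep(0.5)  # the parent's own queue stays empty: "parent consumed" is never printed
```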
POSSIBLE SOLUTION
```python
# In ThreadedBackend.write(), where the full queue is currently handled:
except Full:
    if not self.thread:  # and maybe: or not self.thread.is_alive()
        self.start_worker()
    logger.warning("Discard metric %s", name)
```
I am not going to make such a PR yet, as I was not able to reproduce the issue deterministically, but I want to keep track of it here.
UPDATE 15/01/2019
The possible solution mentioned above did not work, but I am now able to reproduce the issue systematically.
Celery has a main process and a number (depending on the configuration) of worker processes.
This is our case:
[htop screenshot: the main Celery process and its worker processes]
I added more logging statements (see the snippet after this list) and confirmed that:
- the consumer thread started here lives in the main Celery process. That is because the `ThreadedBackend` class is instantiated at the app's boot-up (thread with pid=32450 in the htop screenshot).
- the thread-safe queue is also created in the `__init__` (I guess it is the thread with pid=32447 in the htop screenshot).
- the producer code lives in the Celery worker processes (processes with pids=1017, 966 in the htop screenshot).
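The logging boils down to recording the pid and the thread name at the interesting call sites; an illustrative helper (not the exact statements that were added):

```python
import logging
import os
import threading

logger = logging.getLogger(__name__)

def log_location(where):
    # Records which process (by pid) and which thread a piece of code runs in;
    # the pid can then be matched against htop and the Celery log prefixes.
    logger.warning(
        "%s: pid=%s thread=%s",
        where,
        os.getpid(),
        threading.current_thread().name,
    )
```

Calling it from the producer code and from the consumer thread is enough to see that they end up in different processes.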
When a Celery worker process exceeds the memory limit set in the configuration, what happens is:
- the Celery worker process is killed and replaced by a new process.
- the producer code in the new Celery worker process keeps on queueing metrics until the queue gets full (a few seconds later).
- the consumer thread in the main Celery process sees an empty queue, and thus no metrics are sent.
I worked out a solution where the consumer thread, the queue and the producer code all live in the Celery worker processes. I have been testing it manually and on one canary production machine for a couple of days, and it works well (while all the other, non-canary machines are still affected by the issue). The solution is backward compatible. PR coming soon; a rough sketch of the idea follows.
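One way the idea can look, as a rough sketch rather than the actual PR (class, attribute and method names below are illustrative): on every `write()` the backend checks whether the queue and the consumer thread were created in the current process, and lazily recreates them after a fork, so each Celery worker process gets its own queue and its own consumer thread.

```python
import os
import queue
import threading


class PerProcessThreadedBackend:
    """Sketch: queue and consumer thread are (re)created in the writing process."""

    def __init__(self, delegate, queue_maxsize=1000):
        self.delegate = delegate
        self.queue_maxsize = queue_maxsize
        self._pid = None    # pid of the process that owns the current queue/thread
        self.queue = None
        self.thread = None

    def _ensure_worker(self):
        # After a fork the child inherits self.queue and self.thread from the
        # parent, but the thread is not running there: detect this via the pid
        # and rebuild both in the current process.
        if self._pid != os.getpid():
            self._pid = os.getpid()
            self.queue = queue.Queue(maxsize=self.queue_maxsize)
            self.thread = threading.Thread(target=self._worker, daemon=True)
            self.thread.start()

    def _worker(self):
        while True:
            metric = self.queue.get()
            self.delegate.write(**metric)

    def write(self, name, **data):
        self._ensure_worker()
        try:
            self.queue.put_nowait(dict(name=name, **data))
        except queue.Full:
            pass  # same "Discard metric" behaviour as before
```

With this, a metric written right after a worker process has been replaced simply causes a fresh queue and a fresh consumer thread to be created inside that worker, instead of filling a queue that nobody drains.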