
More gracefully handle situations with many events & running into rate limits #948

@PragTob

Description

Problem Statement

👋 Hi, and first of all thank you for maintaining a wonderful client library for Elixir 💜

TL;DR:

  • Many exceptions to report
  • We run over our quota --> reporting fails
  • Sentry keeps using ~8 GB of memory over days, trying to report them

(I decided to call it an improvement; depending on your point of view, some may call it a bug 😅)

We ran into an incident a few days ago that caused a lot of exception events. This pushed us over our Sentry quota (despite Spike Protection being enabled, but that's also fair!).

From what I'm seeing, all the events were retained in memory and the SDK continuously tried to submit them to Sentry. This kept failing because we were over the quota, so we also got a bunch of these log messages:

[warning] Failed to send Sentry event. Sentry failed to report event: Sentry responded with status 429 - Too Many Requests and the SDK exhausted the configured retries

The symptom on the system was a bunch of Sentry.Transport.Sender processes with very high reduction counts, significant memory consumption, and big message queues (as reported by :observer_cli):

|1  |<0.113277224| 111.0986 MiB |Elixir.IEx.Evaluator:init/5                  |       9821477| 0        |Elixir.IEx.Evaluator:loop/1      |
|2  |<0.3783.0>  |  20.5259 MiB |'Elixir.Sentry.Transport.Sender'             |     117164527| 558      |Elixir.Process:sleep/1           |
|3  |<0.3809.0>  |  18.3961 MiB |'Elixir.Sentry.Transport.Sender'             |     125430590| 1810     |Elixir.Process:sleep/1           |
|4  |<0.3827.0>  |  18.3959 MiB |'Elixir.Sentry.Transport.Sender'             |     102370638| 1821     |Elixir.Process:sleep/1           |
|5  |<0.3793.0>  |  18.3955 MiB |'Elixir.Sentry.Transport.Sender'             |     103264516| 1809     |Elixir.Process:sleep/1           |
|6  |<0.3772.0>  |  18.3935 MiB |'Elixir.Sentry.Transport.Sender'             |     102460005| 1776     |Elixir.Process:sleep/1           |
|7  |<0.3825.0>  |  18.3913 MiB |'Elixir.Sentry.Transport.Sender'             |     120544828| 1763     |Elixir.Process:sleep/1           |
|8  |<0.3800.0>  |  18.3869 MiB |'Elixir.Sentry.Transport.Sender'             |     105581947| 1702     |Elixir.Process:sleep/1           |
|9  |<0.3789.0>  |  18.3837 MiB |'Elixir.Sentry.Transport.Sender'             |     123818720| 1676     |Elixir.Process:sleep/1           |
|10 |<0.3778.0>  |  18.3829 MiB |'Elixir.Sentry.Transport.Sender'             |     118361900| 1655     |Elixir.Process:sleep/1           |
|11 |<0.3817.0>  |  18.3754 MiB |'Elixir.Sentry.Transport.Sender'             |     121602601| 1559     |Elixir.Process:sleep/1           |
|12 |<0.3769.0>  |  18.3735 MiB |'Elixir.Sentry.Transport.Sender'             |     112488249| 1551     |Elixir.Process:sleep/1           |
|13 |<0.3794.0>  |  15.7664 MiB |'Elixir.Sentry.Transport.Sender'             |     110978452| 1880     |Elixir.Process:sleep/1           |
|14 |<0.3802.0>  |  15.7609 MiB |'Elixir.Sentry.Transport.Sender'             |     116827056| 1824     |Elixir.Process:sleep/1           |
|15 |<0.3804.0>  |  15.7569 MiB |'Elixir.Sentry.Transport.Sender'             |     112633395| 1767     |Elixir.Process:sleep/1           |
|16 |<0.3767.0>  |  15.7569 MiB |'Elixir.Sentry.Transport.Sender'             |     108640217| 1778     |Elixir.Process:sleep/1           |
|17 |<0.3826.0>  |  15.7494 MiB |'Elixir.Sentry.Transport.Sender'             |     119471892| 1682     |Elixir.Process:sleep/1           |
|18 |<0.3812.0>  |  15.7478 MiB |'Elixir.Sentry.Transport.Sender'             |     110992238| 1657     |Elixir.Process:sleep/1           |
|19 |<0.3818.0>  |  15.7443 MiB |'Elixir.Sentry.Transport.Sender'             |     111055539| 1629     |Elixir.Process:sleep/1           |
|20 |<0.3797.0>  |  15.7374 MiB |'Elixir.Sentry.Transport.Sender'             |     105904350| 1527     |Elixir.Process:sleep/1           |
|21 |<0.3801.0>  |  15.7369 MiB |'Elixir.Sentry.Transport.Sender'             |     120051518| 1541     |Elixir.Process:sleep/1           |
|22 |<0.3819.0>  |  15.7360 MiB |'Elixir.Sentry.Transport.Sender'             |     107081411| 1530     |Elixir.Process:sleep/1           |

(This is only an excerpt from one of the less affected pods.)

At this point the incident was ~2.5 days in the past, but all these processes never seemed to succeed in submitting the events and hence kept them around. You can clearly see the impact in the pod that was affected the worst:

[Image: memory graph of the worst-affected pod]

I omitted the concrete time frame but it's ~2.5 days. The graphed metric is beam_memory_allocated_bytes.

You can see the memory going up, metric reporting malfunctioning due to the incident, and then the memory staying at a consistently high level until it suddenly comes crashing down. The point where it crashes down is when I ran Supervisor.terminate_child(Sentry.Supervisor, Sentry.Transport.SenderPool) on the system, so I'm 100% sure this memory increase is directly related to these Sentry processes.

Admittedly those were exceptional circumstances, but I wouldn't expect Sentry to potentially contribute to an incident/outage by hogging this much memory while unsuccessfully trying to resubmit exceptions 😅

Environment

elixir -v
Erlang/OTP 27 [erts-15.2.6] [source] [64-bit] [smp:14:14] [ds:14:14:10] [async-threads:1] [jit]

Elixir 1.18.3 (compiled with Erlang/OTP 27)

mix deps | grep sentry
* sentry 11.0.4 (Hex package) (mix)
  locked at 11.0.4 (sentry) feaafc28

Our Sentry config is nothing special, I believe:

  config :sentry,
    dsn: sentry_dsn,
    environment_name: environment,
    enable_source_code_context: true,
    root_source_code_paths: [File.cwd!()],
    integrations: [
      oban: [
        capture_errors: true,
        cron: [enabled: true]
      ]
    ]

(I did need to write custom code to make error reporting from gRPC work, but I don't think that's too relevant here; in the end it calls Sentry.capture_exception and is modeled after the plugs. A rough sketch of its shape is below.)
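To give an idea of that shape, here is a minimal, hypothetical sketch (module and function names are made up; the real version hooks into our gRPC error handling, but the Sentry-facing part boils down to this):

  defmodule MyApp.GRPCSentry do
    # Runs the given function and reports any exception to Sentry before
    # re-raising, similar in spirit to what Sentry.PlugCapture does for Plug.
    def with_error_reporting(fun) when is_function(fun, 0) do
      fun.()
    rescue
      exception ->
        Sentry.capture_exception(exception, stacktrace: __STACKTRACE__)
        reraise exception, __STACKTRACE__
    end
  end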

It's a Phoenix application with Ecto etc.; it does have gRPC, but otherwise nothing too fancy, I'd say.

Solution Brainstorm

I'm not too sure; my expectation was that eventually Sentry would give up and just throw away events once it had failed to submit them too many times, or once too many events were outstanding on the client side.

The best comparison I have is the protection the Erlang :logger handlers use to guard against overload. It tracks the message queue length (among other things) and enters drop or flush modes to get rid of extra messages / avoid getting overloaded.
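For illustration, these are the knobs the built-in :logger handlers expose for this (a sketch using what I believe are the documented defaults; tweaking the default handler here is just to show the mechanism, not a suggestion for Sentry's API):

  # Once the handler process's message queue grows past these thresholds it
  # switches to synchronous logging, then starts dropping messages, and
  # finally flushes the whole queue.
  :logger.update_handler_config(:default, :config, %{
    sync_mode_qlen: 10,
    drop_mode_qlen: 200,
    flush_qlen: 1000
  })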

Especially when the Sentry response is a 429 because the quota is exceeded, I think dropping some events is the right call. I'm not sure whether the client can know that the monthly quota was exceeded, but especially if it can, then I think it's probably good to drop events.

I know, not losing events is a noble goal, but as is it's a bit counterproductive.

As an aside, in case someone runs into this situation: a Sentry client function to help you flush events would be nice. I first tried to kill the Sentry.Supervisor, but that took the entire application down; whoopsie, my fault 😅 But yeah, it took a while to find the right child to terminate and do that. A helper may be nice; what I ended up running is sketched below.
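For anyone else stuck here, this is roughly what I ran in a remote IEx shell (use at your own risk: it throws away everything the sender pool is holding). I only ran the terminate during the incident; the restart should work since terminate_child keeps the child spec around, but treat that part as untested:

  # Drop all queued events by terminating the sender pool...
  Supervisor.terminate_child(Sentry.Supervisor, Sentry.Transport.SenderPool)
  # ...and bring it back so future events can still be reported.
  Supervisor.restart_child(Sentry.Supervisor, Sentry.Transport.SenderPool)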

I know I can be rambly, sorry. In conclusion:

  • a client-side limit for events being processed, similar to the Logger overload protections, may be good, to also limit the memory impact on applications (a stopgap sketch follows this list)
  • if quota exceeded is detected, dropping even more events instead of continuously trying to submit them may be good
  • a helper function to remove in-flight events may be neat for people live-debugging a system
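In the meantime, a rough application-side stopgap might be a before_send callback that drops events client-side once too many were captured in a time window. This is only a hedged sketch: MyApp.SentryLimiter and its thresholds are made up, and it assumes the :before_send option accepts a {module, function} tuple and that returning nil drops the event:

  defmodule MyApp.SentryLimiter do
    @max_per_window 100
    @window_ms 60_000

    # Call once at application start, e.g. from MyApp.Application.start/2.
    def create_table do
      :ets.new(__MODULE__, [:named_table, :public, write_concurrency: true])
    end

    # Drop events once more than @max_per_window were captured in the
    # current window. Old buckets are never pruned; it's only a sketch.
    def before_send(event) do
      bucket = div(System.system_time(:millisecond), @window_ms)
      count = :ets.update_counter(__MODULE__, bucket, 1, {bucket, 0})
      if count <= @max_per_window, do: event, else: nil
    end
  end

and in the config:

  config :sentry, before_send: {MyApp.SentryLimiter, :before_send}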
