
More gracefully handle situations with many events & running into rate limits #948

@PragTob

Description

Problem Statement

👋 Hi, and first of all thank you for maintaining a wonderful client library for Elixir 💜

TL;DR:

  • Many exceptions to report
  • We run over our quota --> reporting fails
  • Sentry keeps using ~8 GB of memory over days, trying to report them

(I decided to call it an improvement; depending on your point of view, some may call it a bug 😅)

We ran into an incident a few days ago that caused a lot of exception events. This pushed us over our Sentry quota (despite Spike Protection being enabled, but that's also fair!).

From what I'm seeing, all the events were retained in memory and the SDK continuously tried to submit them to Sentry. This kept failing because we were over the quota, so we also got a bunch of these log messages:

[warning] Failed to send Sentry event. Sentry failed to report event: Sentry responded with status 429 - Too Many Requests and the SDK exhausted the configured retries

The symptom on the system was a bunch of Sentry.Transport.Sender processes with very high reduction counts, significant memory consumption, and big message queues (as reported by :observer_cli):

|1  |<0.113277224| 111.0986 MiB |Elixir.IEx.Evaluator:init/5                  |       9821477| 0        |Elixir.IEx.Evaluator:loop/1      |
|2  |<0.3783.0>  |  20.5259 MiB |'Elixir.Sentry.Transport.Sender'             |     117164527| 558      |Elixir.Process:sleep/1           |
|3  |<0.3809.0>  |  18.3961 MiB |'Elixir.Sentry.Transport.Sender'             |     125430590| 1810     |Elixir.Process:sleep/1           |
|4  |<0.3827.0>  |  18.3959 MiB |'Elixir.Sentry.Transport.Sender'             |     102370638| 1821     |Elixir.Process:sleep/1           |
|5  |<0.3793.0>  |  18.3955 MiB |'Elixir.Sentry.Transport.Sender'             |     103264516| 1809     |Elixir.Process:sleep/1           |
|6  |<0.3772.0>  |  18.3935 MiB |'Elixir.Sentry.Transport.Sender'             |     102460005| 1776     |Elixir.Process:sleep/1           |
|7  |<0.3825.0>  |  18.3913 MiB |'Elixir.Sentry.Transport.Sender'             |     120544828| 1763     |Elixir.Process:sleep/1           |
|8  |<0.3800.0>  |  18.3869 MiB |'Elixir.Sentry.Transport.Sender'             |     105581947| 1702     |Elixir.Process:sleep/1           |
|9  |<0.3789.0>  |  18.3837 MiB |'Elixir.Sentry.Transport.Sender'             |     123818720| 1676     |Elixir.Process:sleep/1           |
|10 |<0.3778.0>  |  18.3829 MiB |'Elixir.Sentry.Transport.Sender'             |     118361900| 1655     |Elixir.Process:sleep/1           |
|11 |<0.3817.0>  |  18.3754 MiB |'Elixir.Sentry.Transport.Sender'             |     121602601| 1559     |Elixir.Process:sleep/1           |
|12 |<0.3769.0>  |  18.3735 MiB |'Elixir.Sentry.Transport.Sender'             |     112488249| 1551     |Elixir.Process:sleep/1           |
|13 |<0.3794.0>  |  15.7664 MiB |'Elixir.Sentry.Transport.Sender'             |     110978452| 1880     |Elixir.Process:sleep/1           |
|14 |<0.3802.0>  |  15.7609 MiB |'Elixir.Sentry.Transport.Sender'             |     116827056| 1824     |Elixir.Process:sleep/1           |
|15 |<0.3804.0>  |  15.7569 MiB |'Elixir.Sentry.Transport.Sender'             |     112633395| 1767     |Elixir.Process:sleep/1           |
|16 |<0.3767.0>  |  15.7569 MiB |'Elixir.Sentry.Transport.Sender'             |     108640217| 1778     |Elixir.Process:sleep/1           |
|17 |<0.3826.0>  |  15.7494 MiB |'Elixir.Sentry.Transport.Sender'             |     119471892| 1682     |Elixir.Process:sleep/1           |
|18 |<0.3812.0>  |  15.7478 MiB |'Elixir.Sentry.Transport.Sender'             |     110992238| 1657     |Elixir.Process:sleep/1           |
|19 |<0.3818.0>  |  15.7443 MiB |'Elixir.Sentry.Transport.Sender'             |     111055539| 1629     |Elixir.Process:sleep/1           |
|20 |<0.3797.0>  |  15.7374 MiB |'Elixir.Sentry.Transport.Sender'             |     105904350| 1527     |Elixir.Process:sleep/1           |
|21 |<0.3801.0>  |  15.7369 MiB |'Elixir.Sentry.Transport.Sender'             |     120051518| 1541     |Elixir.Process:sleep/1           |
|22 |<0.3819.0>  |  15.7360 MiB |'Elixir.Sentry.Transport.Sender'             |     107081411| 1530     |Elixir.Process:sleep/1           |

(This is only an excerpt from one of the less affected pods.)

At this point the incident was ~2.5 days in the past, but all these processes never seemed to succeed in submitting the events and hence kept them around. You can clearly see the impact in the pod that was affected the worst:

[Image: memory graph of the worst-affected pod]

I omitted the concrete time frame but it's ~2.5 days. The graphed metric is beam_memory_allocated_bytes.

You can see the memory going up, metric reporting malfunctioning due to the incident, and then the memory staying at a consistently high level until it suddenly comes crashing down. The point where it crashes down is when I ran Supervisor.terminate_child(Sentry.Supervisor, Sentry.Transport.SenderPool) on the system, so I'm 100% sure this memory increase is directly related to these Sentry processes.

Admittedly those were exceptional circumstances, but I wouldn't expect Sentry to potentially contribute to an incident/outage by hogging this much memory while unsuccessfully trying to resubmit exceptions 😅

Environment

elixir -v
Erlang/OTP 27 [erts-15.2.6] [source] [64-bit] [smp:14:14] [ds:14:14:10] [async-threads:1] [jit]

Elixir 1.18.3 (compiled with Erlang/OTP 27)

mix deps | grep sentry
* sentry 11.0.4 (Hex package) (mix)
  locked at 11.0.4 (sentry) feaafc28

Our Sentry config is nothing special, I believe:

  config :sentry,
    dsn: sentry_dsn,
    environment_name: environment,
    enable_source_code_context: true,
    root_source_code_paths: [File.cwd!()],
    integrations: [
      oban: [
        capture_errors: true,
        cron: [enabled: true]
      ]
    ]

(I did need to write custom code to make error reporting from gRPC work, but I don't think that's too relevant here; in the end it calls Sentry.capture_exception and is modeled after the plugs. A rough sketch of its shape is below.)
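To give an idea of that shape, here is a minimal, hypothetical sketch (module and function names are made up; the real version hooks into our gRPC error handling, but the Sentry-facing part boils down to this):

  defmodule MyApp.GRPCSentry do
    # Runs the given function and reports any exception to Sentry before
    # re-raising, similar in spirit to what Sentry.PlugCapture does for Plug.
    def with_error_reporting(fun) when is_function(fun, 0) do
      fun.()
    rescue
      exception ->
        Sentry.capture_exception(exception, stacktrace: __STACKTRACE__)
        reraise exception, __STACKTRACE__
    end
  end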

It's a Phoenix application with Ecto etc.; it does have gRPC, but otherwise nothing too fancy, I'd say.

Solution Brainstorm

I'm not too sure; my expectation was that eventually Sentry would give up and just throw away events once it had failed to submit them too many times, or once too many events were outstanding on the client side.

The best comparison I have is the protection the Erlang :logger handlers use to guard against overload. It tracks the message queue length (among other things) and enters drop or flush modes to get rid of extra messages / avoid getting overloaded.
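For illustration, these are the knobs the built-in :logger handlers expose for this (a sketch using what I believe are the documented defaults; tweaking the default handler here is just to show the mechanism, not a suggestion for Sentry's API):

  # Once the handler process's message queue grows past these thresholds it
  # switches to synchronous logging, then starts dropping messages, and
  # finally flushes the whole queue.
  :logger.update_handler_config(:default, :config, %{
    sync_mode_qlen: 10,
    drop_mode_qlen: 200,
    flush_qlen: 1000
  })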

Especially when the Sentry response is a 429 because the quota is exceeded, I think dropping some events is the right call. I'm not sure whether the client can know that the monthly quota was exceeded, but especially if it can, then I think it's probably good to drop events.

I know, not losing events is a noble goal, but as is it's a bit counterproductive.

As an aside, in case someone runs into this situation: a Sentry client function to help you flush events would be nice. I first tried to kill the Sentry.Supervisor, but that took the entire application down; whoopsie, my fault 😅 But yeah, it took a while to find the right child to terminate and do that. A helper may be nice; what I ended up running is sketched below.
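For anyone else stuck here, this is roughly what I ran in a remote IEx shell (use at your own risk: it throws away everything the sender pool is holding). I only ran the terminate during the incident; the restart should work since terminate_child keeps the child spec around, but treat that part as untested:

  # Drop all queued events by terminating the sender pool...
  Supervisor.terminate_child(Sentry.Supervisor, Sentry.Transport.SenderPool)
  # ...and bring it back so future events can still be reported.
  Supervisor.restart_child(Sentry.Supervisor, Sentry.Transport.SenderPool)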

I know I can be rambly, sorry. In conclusion:

  • a client-side limit for events being processed, similar to the Logger overload protections, may be good, to also limit the memory impact on applications (a stopgap sketch follows this list)
  • if quota exceeded is detected, dropping even more events instead of continuously trying to submit them may be good
  • a helper function to remove in-flight events may be neat for people live-debugging a system
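In the meantime, a rough application-side stopgap might be a before_send callback that drops events client-side once too many were captured in a time window. This is only a hedged sketch: MyApp.SentryLimiter and its thresholds are made up, and it assumes the :before_send option accepts a {module, function} tuple and that returning nil drops the event:

  defmodule MyApp.SentryLimiter do
    @max_per_window 100
    @window_ms 60_000

    # Call once at application start, e.g. from MyApp.Application.start/2.
    def create_table do
      :ets.new(__MODULE__, [:named_table, :public, write_concurrency: true])
    end

    # Drop events once more than @max_per_window were captured in the
    # current window. Old buckets are never pruned; it's only a sketch.
    def before_send(event) do
      bucket = div(System.system_time(:millisecond), @window_ms)
      count = :ets.update_counter(__MODULE__, bucket, 1, {bucket, 0})
      if count <= @max_per_window, do: event, else: nil
    end
  end

and in the config:

  config :sentry, before_send: {MyApp.SentryLimiter, :before_send}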
