Skip to content

Commit b2be987

Browse files
committed
Why did the Elixir application keep running despite dependency shutdown?
1 parent 233ccf0 commit b2be987

File tree

1 file changed

+66
-0
lines changed

1 file changed

+66
-0
lines changed
Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
---
2+
title: Why did the Elixir application keep running despite dependency shutdown?
3+
date: 2024-12-05
4+
tags: [elixir, erlang, applications, application-restart-strategy]
5+
description: Discover how we resolved a production incident in an Elixir-based RPC service caused by PgBouncer downtime and application restart strategies. Explore the impact of restart_type on applications.
6+
---
7+
8+
Recently, we've had an incident with one of our RPC services on production. The root cause was an issue with a PgBouncer instance going down. Although PgBouncer was eventually restored and became operational, our service remained partially functional until the pods were restarted. In this article, I’ll document the root cause of the issue and how we fixed it, effectively answering the question posed in the title.
9+
10+
## The incident
11+
12+
During the incident, our service appeared to be partially functional. While we were still receiving and responding to some RPC requests, others were failing due to errors like this:
13+
14+
```elixir
15+
** (RuntimeError) could not lookup Ecto repo Platform.Repo because it was not started or it does not exist
16+
(ecto 3.10.3) lib/ecto/repo/registry.ex:22: Ecto.Repo.Registry.lookup/1
17+
(ecto 3.10.3) lib/ecto/repo/supervisor.ex:160: Ecto.Repo.Supervisor.tuplet/2
18+
(platform 0.0.1) lib/repo.ex:6: Platform.Repo.all/2
19+
```
20+
21+
That means we could only respond to requests that didn't interact with `Platform.Repo`, however that error still didn't tell us much, except that the `Platform.Repo` we were looking for in the registry couldn't be found for some reason while handling an RPC call. But why was this happening?
22+
23+
## Root cause
24+
25+
Upon further investigation we've found logs below from that service's pods.
26+
27+
```elixir
28+
Application platform exited: shutdown
29+
```
30+
31+
That meant our application `platform`, one of our RPC service's dependencies, which manages `Platform.Repo` had shut down.
32+
But if `platform` was down, how was our service still running?
33+
The answer lay in how the application was started. Since this was one of our legacy apps, it was still using Mix (instead of releases) to start the server on production.
34+
Specifically we were using a task to start the application and all its dependencies like this:
35+
36+
```elixir
37+
Application.ensure_all_started(:our_legacy_rpc)
38+
```
39+
40+
Here is the key difference:
41+
In a `Mix` release, applications are started with a `restart_type` of `permanent`. However when using `Application.ensure_all_started/1`, the `restart_type` defaults to `temporary`. According to the documentation:
42+
43+
> :temporary - if app terminates, it is reported but no other applications are terminated (the default).
44+
45+
This explains why our service kept running even though one of its critical dependencies (`platform`) had shut down. Erlang does nothing about shutdown applications, which makes sense, since the top-level supervisor of the application will attempt to restart its children up to `max_retries` within `max_seconds` ([source](https://hexdocs.pm/elixir/Supervisor.html#module-supervisor-strategies-and-options)) before giving up and eventually shutting down the whole application.
46+
47+
## Resolution
48+
49+
To mitigate the issue we basically started all applications with a `restart_type` of `permanent` like so:
50+
51+
```elixir
52+
Application.ensure_all_start(:our_legacy_rpc, type: :permanent)
53+
```
54+
55+
This guarantees that, if any dependency of our application (or the application itself) goes down for any reason, everything will be shutdown which will trigger Kubernetes to restart the container because of failing liveness checks.
56+
57+
> :permanent - if app terminates, all other applications and the entire node are also terminated.
58+
59+
### Sources
60+
61+
Some links that helped me understand how Erlang manages applications and the role of `restart_type`.
62+
63+
- http://erlang.org/documentation/doc-5.2/doc/design_principles/applications.html
64+
- https://blog.plataformatec.com.br/2015/04/build-embedded-and-start-permanent-in-elixir-1-0-4/
65+
- https://stackoverflow.com/a/44665998/4796762
66+
- https://stackoverflow.com/a/39662718/4796762

0 commit comments

Comments
 (0)