Commit c1b3df2: Add "ADR-003: Notification Publishing"
Signed-off-by: nscuro <nscuro@protonmail.com>
1 parent 91a641b

2 files changed: +349 -0

architecture/decisions/003-notification-publishing.md
| Status   | Date       | Author(s)                            |
|:---------|:-----------|:-------------------------------------|
| Proposed | 2025-01-19 | [@nscuro](https://github.com/nscuro) |

## Context

By dropping the Kafka dependency ([ADR-001]), we are now missing a means to reliably dispatch
notifications. Users are building processes around the notifications we send, so we must ensure
that whatever we replace Kafka with offers the same or better delivery guarantees.

### Background

Users can configure multiple notification rules (known as *Alerts* in the UI). A rule semantically acts like a consumer:
it subscribes to one or more subjects (aka *Notification Groups*, e.g. `BOM_PROCESSED`, `NEW_VULNERABILITY`)
and publishes those notifications to a destination (e.g. email, Slack, Webhook).

Rules can further be limited to specific projects or tags, which act as an additional filter.

This means that a notification emitted by Dependency-Track "fans out" to zero or more rules:

```mermaid
---
title: Notification Publishing Process
---
flowchart LR
    A@{ shape: circle, label: "Start" }
    B["Emit Notification"]
    C["Route Notification"]
    D["Send email to foo\@example.com"]
    E["Send email to bar\@example.com"]
    F["Send Slack message to channel X"]
    G["Send Webhook to example.com"]
    I@{ shape: dbl-circ, label: "Stop" }
    A --> B
    B --> C
    C --> D
    C --> E
    C --> F
    C --> G
    D --> I
    E --> I
    F --> I
    G --> I
```

Because each rule has a separate destination, the publishing process for each rule can fail independently.
For example, the email server could be down, the Jira credentials could have expired, or Slack could enforce rate limiting.

In Dependency-Track v4, the routing of notifications according to the configured rules, as well as the
publishing according to those rules, is performed as a single unit of work. There are no retries.

So far in Hyades, we emit notifications to Kafka instead. The routing and publishing still happen in a single
unit of work, but in the separate `notification-publisher` service. That service uses [Confluent Parallel Consumer]
to implement retries. However, those retries can cause duplicates, as documented in [hyades/#771].

### Requirements

#### Atomicity

A limitation of both the in-memory and the Kafka-based notification mechanism is that the emission of notifications
can't happen atomically with the state change they inform about. Both approaches suffer from the [dual write problem].
Although not technically a hard requirement, we want atomic notification emission to be *possible*.

Note that notification *delivery* to external systems **can't** be atomic.
Exactly-once delivery is [impossible](https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/).

#### At-least-once delivery

Notifications must be delivered to destination systems *at least once*. Duplicate deliveries are acceptable;
loss of notifications is not.

#### Isolation of deliveries

Notification deliveries must be processed in isolation, separate from the routing itself.
Only then can deliveries be retried without causing the issue described in [hyades/#771].
More specifically, each delivery must be a separate unit of work in a queue.

### Constraints

#### Notification format

All solutions should treat notification contents as binary blobs. During our initial move to Kafka,
we adopted [Protobuf] to serialize notifications over the wire. We intend to keep it this way,
since [Protobuf] makes it easy to perform backward-compatible changes over time. It is also faster
to serialize and more compact than JSON.

The current [Protobuf] model is defined
[here](https://github.com/DependencyTrack/hyades/blob/589b58042c096865c48939c8db79ef950fd94d20/proto/src/main/proto/org/dependencytrack/notification/v1/notification.proto).

#### No new infrastructure

No new infrastructure may be introduced. The solution should leverage the workflow orchestration capabilities
described in [ADR-002], which take care of queueing, observability, and resiliency needs.

#### Payload size

As noted in [ADR-001], notifications can be large: `BOM_CONSUMED` and `BOM_PROCESSED` notifications include the entire
BOM in Base64-encoded form. `PROJECT_VULN_ANALYSIS_COMPLETE` notifications include all vulnerabilities of a project.
While Postgres *can* store large columns (see [TOAST]), doing so comes with penalties affecting performance, maintenance,
and backups. Potent compression (e.g. using `zstd`) will almost certainly be necessary for all solutions.
Postgres itself will compress large values, too, but using a less effective algorithm (`pglz`).

> Starting with Postgres v14, the default compression algorithm for [TOAST]-ed values can be changed to `lz4`.
> This should be added to our [database operations guide](../../operations/database.md).
> `lz4` [performs noticeably better](https://www.timescale.com/blog/optimizing-postgresql-performance-compression-pglz-vs-lz4)
> than `pglz`.

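The `lz4` switch mentioned above can be applied either instance-wide or per column. A sketch, where the `NOTIFICATION` table and `CONTENT` column are hypothetical names for illustration:

```sql
-- Instance-wide default for newly created columns (Postgres >= 14);
-- typically set in postgresql.conf rather than per session.
SET default_toast_compression = 'lz4';

-- Per-column, for an existing table (hypothetical names):
ALTER TABLE "NOTIFICATION" ALTER COLUMN "CONTENT" SET COMPRESSION lz4;
```

Note that changing the setting only affects newly written values; existing rows stay `pglz`-compressed until rewritten.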
### Possible Solutions

#### A: Use transactional outbox pattern

The [transactional outbox] pattern involves a separate *outbox* table, into which to-be-dispatched notifications
are inserted as part of a database transaction. A simple outbox table might look like this:

| Column Name | Column Type   |
|:------------|:--------------|
| `ID`        | `BIGINT`      |
| `TIMESTAMP` | `TIMESTAMPTZ` |
| `CONTENT`   | `BYTEA`       |

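The table above could be created with, for example (a sketch; the identity and not-null details are assumptions):

```sql
CREATE TABLE "OUTBOX" (
    "ID"        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    "TIMESTAMP" TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    "CONTENT"   BYTEA NOT NULL
);
```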
The pattern is mostly meant to deal with the [dual write problem], but it could also act as a work queue:
a pool of workers polls the table at regular intervals, and either deletes polled records or marks them as delivered:

```mermaid
---
title: Notification Publishing Process with Outbox Table
---
sequenceDiagram
    participant D as Domain
    participant P as Postgres
    participant N as Notification Router
    participant W as "Publish Notification"<br/>Workflow

    activate D
    D->>P: Begin TX
    D->>P: ...
    D->>P: INSERT INTO "OUTBOX"<br/>(NOW(), <content>)
    D->>-P: Commit TX
    loop continuously
        activate N
        N->>P: Begin TX
        N->>P: SELECT * FROM "OUTBOX" ...<br/>FOR UPDATE SKIP LOCKED<br/>LIMIT 1
        N->>N: Evaluate<br/>notification rules
        opt At least one rule matched
            N->>W: Schedule "Publish Notification" workflow<br/>Args: Rule names, notification
        end
        N->>P: DELETE FROM "OUTBOX"<br/>WHERE "ID" = ANY(...)
        N->>-P: Commit TX
    end
```

Outbox items are dequeued, processed, and marked as completed in a single database transaction,
using the `FOR UPDATE SKIP LOCKED` clause to allow for multiple concurrent pollers.

If workflow scheduling fails, the transaction is rolled back, and the respective record will be retried
during the next poll.

> The workflow engine may reside in a separate database, so the scheduling of workflows
> can't happen atomically with the polling of outbox records. It is possible that a
> workflow gets scheduled, but the committing of the transaction fails. In the worst case,
> multiple workflows get scheduled for the same notification. This should be rare,
> and even then it would still satisfy our at-least-once delivery goal.
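The poller transaction from the diagram above might look as follows in plain SQL (a sketch; the batch size and record IDs are illustrative):

```sql
BEGIN;

-- Claim a batch of pending records; rows already locked by
-- concurrent pollers are skipped instead of waited on.
SELECT "ID", "CONTENT"
  FROM "OUTBOX"
 ORDER BY "ID"
   FOR UPDATE SKIP LOCKED
 LIMIT 10;

-- ... evaluate notification rules, schedule workflows ...

-- Delete the claimed records. If workflow scheduling failed,
-- the whole transaction rolls back and the records are retried.
DELETE FROM "OUTBOX" WHERE "ID" = ANY(ARRAY[1, 2, 3]);

COMMIT;
```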

The actual delivery is then taken care of by a workflow:

```mermaid
---
title: Notification Publishing Workflow
---
sequenceDiagram
    participant W as "Publish Notification"<br/>Workflow
    participant A as "Publish Notification"<br/>Activity
    participant P as Postgres
    participant D as Destination

    activate W
    loop for each matched rule
        W->>+A: Invoke<br/>Args: Rule name, notification
        deactivate W
        A->>P: Retrieve publisher config
        P-->>A: Destination URL,<br/>credentials, template
        A->>-D: Publish
    end
```

**Pro**:

1. Allows for atomic emission of notifications.
2. Marking outbox records as processed, instead of deleting them, enables (targeted) replay of past notifications.
3. Partitioning the outbox table by timestamp enables cheap retention enforcement via `DROP TABLE`.
4. Allows for multiple concurrent routers.

**Con**:

1. More database overhead: `INSERT`s, `UPDATE`s / `DELETE`s, polling, vacuuming, retention, storage.
2. The overhead can't be delegated to a separate database without losing transactional guarantees.
3. Duplicates queueing logic we already have in the workflow orchestration system.
4. Partitioning by timestamp requires partition management, either manually or via `pg_partman`.
5. Multiple concurrent routers increase the chance of delivering notifications out of order.
6. Without partitioning, a separate retention enforcement mechanism is required.

#### B: Use Postgres logical replication messages

A way to sidestep the drawbacks of maintaining an outbox table is to emit and consume logical replication messages.
Here, notifications are written to Postgres' write-ahead log (WAL), but never materialized into an actual table.
This still provides transactional guarantees, but completely avoids the overhead of table maintenance.

The procedure is inspired by Gunnar Morling's
[*The Wonders of Postgres Logical Decoding Messages*](https://www.infoq.com/articles/wonders-of-postgres-logical-decoding-messages/)
article.

Conceptually, the process of publishing notifications remains mostly identical to
[option A](#a-use-transactional-outbox-pattern):

```mermaid
---
title: Notification Publishing Process with Logical Replication Messages
---
sequenceDiagram
    participant D as Domain
    participant P as Postgres
    participant N as Notification Router
    participant W as "Publish Notification"<br/>Workflow

    activate D
    D->>P: Begin TX
    D->>P: ...
    D->>P: pg_logical_emit_message<br/>(?, 'notification', <content>)
    D->>-P: Commit TX
    loop continuously
        activate N
        N->>P: Consume logical<br/>replication message
        N->>N: Evaluate<br/>notification rules
        opt At least one rule matched
            N->>W: Schedule "Publish Notification" workflow<br/>Args: Rule names, notification
        end
        N->>-P: Mark message LSN as<br/>applied / flushed
    end
```

Instead of polling a table, the router reads a stream of messages from a logical replication slot.
Each message has an associated log sequence number (LSN), which represents its position in the WAL.
For Postgres to know that a message was delivered successfully, the router acknowledges
the LSN of processed messages. This is similar to how Kafka uses offset commits to track progress
within a topic partition.

The scheduled *Publish Notification* workflow remains identical to [option A](#a-use-transactional-outbox-pattern).

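The router's LSN-acknowledgment progress described above can be observed via the `pg_replication_slots` system view. A sketch, assuming the slot is named `notification_router`:

```sql
-- confirmed_flush_lsn advances as the router acknowledges processed
-- messages; restart_lsn bounds how much WAL Postgres must retain.
SELECT slot_name, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
 WHERE slot_name = 'notification_router';
```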
**Pro**:

1. Allows for atomic emission of notifications.
2. Less pressure on the WAL. [Option A](#a-use-transactional-outbox-pattern) involves *at least* one `INSERT`
   and one `UPDATE` or `DELETE` per notification, each of which writes to the WAL, too.
3. No increase in storage requirements.
4. No retention logic necessary.
5. No additional overhead for autovacuum.

**Con**:

1. Only a single instance can consume from a replication slot at a time.
2. Logical replication requires a special kind of connection, and thus can't go through a connection pooler.
3. Requires Postgres v14 or later. This is the version in which the default `pgoutput` decoding plugin started
   to support consumption of logical replication messages.

#### C: Schedule workflows directly

This option is similar to [A](#a-use-transactional-outbox-pattern) and
[B](#b-use-postgres-logical-replication-messages), but skips the respective intermediary step.

Notifications are no longer emitted atomically with the domain's persistence operations,
but instead *after* the database transaction has committed successfully, effectively re-introducing the [dual write problem]:

```mermaid
sequenceDiagram
    participant D as Domain
    participant P as Postgres
    participant W as "Publish Notification"<br/>Workflow

    activate D
    D->>P: Begin TX
    D->>P: ...
    D->>P: Commit TX
    D->>-W: Schedule<br/>Args: Notification
```

The routing based on configured notification rules is performed as part of the *Publish Notification* workflow:

```mermaid
---
title: Notification Publishing Workflow with Routing
---
sequenceDiagram
    participant W as "Publish Notification"<br/>Workflow
    participant A1 as "Route Notification"<br/>Activity
    participant A2 as "Publish Notification"<br/>Activity
    participant P as Postgres
    participant D as Destination

    activate W
    W->>A1: Invoke<br/>Args: Notification
    A1-->>W: Matched rule names
    loop for each matched rule
        W->>+A2: Invoke<br/>Args: Rule name, notification
        deactivate W
        A2->>P: Retrieve publisher config
        P-->>A2: Destination URL,<br/>credentials, template
        A2->>-D: Publish
    end
```

**Pro**:

1. Fewer moving parts than options [A](#a-use-transactional-outbox-pattern) and
   [B](#b-use-postgres-logical-replication-messages).
2. All concerns related to publishing are neatly encapsulated in a workflow.
3. Rule resolution benefits from the same reliability guarantees as the publishing itself.

**Con**:

1. Atomic emission of notifications is impossible.
2. Workflows are scheduled even if no rule is configured for the notification at hand,
   increasing pressure on the workflow system.

## Decision

We propose to follow option [B](#b-use-postgres-logical-replication-messages), because:

1. Maintaining an outbox table as detailed in option [A](#a-use-transactional-outbox-pattern) comes with a lot of
   overhead that we prefer not to deal with.
2. Option [C](#c-schedule-workflows-directly) does not provide transactional guarantees.

TODO: Update with final decision.

## Consequences

* The minimum required Postgres version becomes 14. This version is over three years old and well supported
  across all managed offerings. We don't anticipate this to be a problem.
* It must be possible to configure separate database connection details for the notification router,
  in case a pooler like PgBouncer is used. Logical replication requires a direct connection.

[ADR-001]: 001-drop-kafka-dependency.md
[ADR-002]: 002-workflow-orchestration.md
[Confluent Parallel Consumer]: https://github.com/confluentinc/parallel-consumer
[dual write problem]: https://www.confluent.io/blog/dual-write-problem/
[hyades/#771]: https://github.com/DependencyTrack/hyades/issues/771
[Protobuf]: https://protobuf.dev/
[TOAST]: https://www.postgresql.org/docs/current/storage-toast.html
[transactional outbox]: https://microservices.io/patterns/data/transactional-outbox.html

mkdocs.yml (1 addition, 0 deletions)

```diff
@@ -91,6 +91,7 @@ nav:
     - Overview: architecture/decisions/000-index.md
     - "ADR-001: Drop Kafka Dependency": architecture/decisions/001-drop-kafka-dependency.md
     - "ADR-002: Workflow Orchestration": architecture/decisions/002-workflow-orchestration.md
+    - "ADR-003: Notification Publishing": architecture/decisions/003-notification-publishing.md
   - Design:
     - Workflow State Tracking: architecture/design/workflow-state-tracking.md
   - Operations:
```
