| Status   | Date       | Author(s)                            |
|:---------|:-----------|:-------------------------------------|
| Proposed | 2025-01-19 | [@nscuro](https://github.com/nscuro) |

## Context

By dropping the Kafka dependency ([ADR-001]), we are now missing a means to reliably dispatch
notifications. Users are building processes around the notifications we send, so we must ensure
that whatever we replace Kafka with offers the same or better delivery guarantees.

### Background

Users can configure multiple notification rules (known as *Alerts* in the UI). A rule semantically acts like a consumer
that subscribes to one or more subjects (aka *Notification Groups*, e.g. `BOM_PROCESSED`, `NEW_VULNERABILITY`)
and publishes those notifications to a destination (e.g. email, Slack, webhook).

Rules can further be limited to specific projects or tags, which act as additional filters.

This means that a notification emitted by Dependency-Track "fans out" to zero or more rules:

```mermaid
---
title: Notification Publishing Process
---
flowchart LR
    A@{ shape: circle, label: "Start" }
    B["Emit Notification"]
    C["Route Notification"]
    D["Send email to foo\@example.com"]
    E["Send email to bar\@example.com"]
    F["Send Slack message to channel X"]
    G["Send Webhook to example.com"]
    I@{ shape: dbl-circ, label: "Stop" }
    A --> B
    B --> C
    C --> D
    C --> E
    C --> F
    C --> G
    D --> I
    E --> I
    F --> I
    G --> I
```

Because each rule has a separate destination, publishing can fail independently for each rule.
For example, the email server could be down, the Jira credentials could have expired, or Slack could enforce rate limiting.

In Dependency-Track v4, the routing of notifications according to the configured rules, as well as the
publishing according to those rules, are performed as a single unit of work. There are no retries.

So far in Hyades, we emit notifications to Kafka instead. The routing and publishing still happen in a single
unit of work, but in the separate `notification-publisher` service. The service uses [Confluent Parallel Consumer]
to implement retries. However, these retries can cause duplicates, as documented in [hyades/#771].

### Requirements

#### Atomicity

A limitation of both the in-memory and the Kafka-based notification mechanisms is that the emission of notifications
can't happen atomically with the state change they inform about. Both approaches suffer from the [dual write problem].
Although not technically a hard requirement, we want atomic notification emission to be *possible*.

Note that notification *delivery* to external systems **can't** be atomic.
Exactly-once delivery is [impossible](https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/).

#### At-least-once delivery

Notifications must be delivered to destination systems *at least once*. Duplicate deliveries are acceptable;
loss of notifications is not.

#### Isolation of deliveries

Notification deliveries must be processed in isolation, separate from the routing itself.
Only then can deliveries be retried without causing the issue described in [hyades/#771].
More specifically, each delivery must be a separate unit of work in a queue.

### Constraints

#### Notification format

All solutions should treat notification contents as binary blobs. During our initial move to Kafka,
we adopted [Protobuf] to serialize notifications over the wire. We intend to keep it this way,
since [Protobuf] makes it easy to perform backward-compatible changes over time. It is also faster
to serialize and more compact than JSON.
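
As an illustration of such backward-compatible evolution (using a made-up message, not the actual model, which is linked below): adding a field under a fresh field number is safe, because old readers simply ignore fields they don't know:

```protobuf
syntax = "proto3";

// Made-up example message, not the actual Dependency-Track model.
message Notification {
  string group   = 1;
  bytes  subject = 2;
  // Added later under a fresh field number: old readers ignore it,
  // new readers see the type's default value when it's absent.
  string correlation_id = 3;
}
```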

The current [Protobuf] model is defined
[here](https://github.com/DependencyTrack/hyades/blob/589b58042c096865c48939c8db79ef950fd94d20/proto/src/main/proto/org/dependencytrack/notification/v1/notification.proto).

#### No new infrastructure

No new infrastructure may be introduced. The solution should leverage the workflow orchestration capabilities
described in [ADR-002], which takes care of queueing, observability, and resiliency needs.

#### Payload size

As noted in [ADR-001], notifications can be large: `BOM_CONSUMED` and `BOM_PROCESSED` notifications include the entire
BOM in Base64-encoded form. `PROJECT_VULN_ANALYSIS_COMPLETE` notifications include all vulnerabilities of a project.
While Postgres *can* store large columns (see [TOAST]), doing so incurs penalties in performance, maintenance,
and backups. Potent compression (e.g. using `zstd`) will almost certainly be necessary for all solutions.
Postgres itself compresses large values, too, but using a less effective algorithm (`pglz`).

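To illustrate why compression pays off here, the sketch below compresses a synthetic Base64-encoded BOM payload. It uses `zlib` from the Python standard library as a stand-in, since `zstd` requires a third-party binding; the payload shape is invented for illustration:

```python
import base64
import json
import zlib

# Hypothetical BOM payload, standing in for the document embedded
# in a BOM_PROCESSED notification.
bom = json.dumps({
    "components": [
        {"name": f"acme-lib-{i}", "version": "1.0.0"} for i in range(1_000)
    ],
}).encode("utf-8")

# Notifications carry the BOM in Base64-encoded form, which inflates
# the payload by roughly a third.
encoded = base64.b64encode(bom)

# zlib stands in for zstd; zstd typically compresses better and faster.
compressed = zlib.compress(encoded, 9)

print(f"encoded: {len(encoded)} bytes, compressed: {len(compressed)} bytes")
```

Repetitive SBOM structures compress very well, so even a modest algorithm recovers the Base64 overhead and then some.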
> Starting with Postgres v14, the default compression algorithm for [TOAST]-ed tables can be changed to `lz4`.
> This should be added to our [database operations guide](../../operations/database.md).
> `lz4` [performs noticeably better](https://www.timescale.com/blog/optimizing-postgresql-performance-compression-pglz-vs-lz4)
> than `pglz`.
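
A sketch of how the `lz4` setting could be applied, assuming Postgres v14+; the table and column names are illustrative only:

```sql
-- Switch TOAST compression for a single column (illustrative names):
ALTER TABLE "MY_TABLE" ALTER COLUMN "PAYLOAD" SET COMPRESSION lz4;

-- Or change the server-wide default for newly created columns:
ALTER SYSTEM SET default_toast_compression = 'lz4';
SELECT pg_reload_conf();
```
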
### Possible Solutions

#### A: Use transactional outbox pattern

The [transactional outbox] pattern involves a separate *outbox* table, into which to-be-dispatched notifications
are inserted as part of a database transaction. A simple outbox table might look like this:

| Column Name | Column Type   |
|:------------|:--------------|
| `ID`        | `BIGINT`      |
| `TIMESTAMP` | `TIMESTAMPTZ` |
| `CONTENT`   | `BYTEA`       |

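A minimal DDL sketch of such a table, assuming Postgres; the identity column and defaults are our own choices, not prescribed by the pattern:

```sql
CREATE TABLE "OUTBOX" (
    "ID"        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    "TIMESTAMP" TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    "CONTENT"   BYTEA NOT NULL  -- Protobuf-serialized notification
);
```
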
The pattern is mostly meant to deal with the [dual write problem], but it could also act as a work queue:
a pool of workers polls the table at regular intervals, and either deletes polled records or marks them as delivered:

```mermaid
---
title: Notification Publishing Process with Outbox Table
---
sequenceDiagram
    participant D as Domain
    participant P as Postgres
    participant N as Notification Router
    participant W as "Publish Notification"<br/>Workflow

    activate D
    D->>P: Begin TX
    D->>P: ...
    D->>P: INSERT INTO "OUTBOX"<br/>(NOW(), <content>)
    D->>-P: Commit TX
    loop continuously
        activate N
        N->>P: Begin TX
        N->>P: SELECT * FROM "OUTBOX" ...<br/>FOR UPDATE SKIP LOCKED<br/>LIMIT 1
        N->>N: Evaluate<br/>notification rules
        opt At least one rule matched
            N->>W: Schedule "Publish Notification" workflow<br/>Args: Rule names, notification
        end
        N->>P: DELETE FROM "OUTBOX"<br/>WHERE "ID" = ANY(...)
        N->>-P: Commit TX
    end
```

Outbox items are dequeued, processed, and marked as completed in a single database transaction,
using the `FOR UPDATE SKIP LOCKED` clause to allow for multiple concurrent pollers.
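
The polling transaction sketched in the diagram could look roughly like this in SQL, reusing the hypothetical schema from above:

```sql
BEGIN;

-- Claim up to one pending record; concurrent pollers skip rows
-- that are already locked instead of blocking on them.
SELECT "ID", "CONTENT"
  FROM "OUTBOX"
 ORDER BY "ID"
   FOR UPDATE SKIP LOCKED
 LIMIT 1;

-- ... evaluate notification rules, schedule workflow ...

DELETE FROM "OUTBOX" WHERE "ID" = ANY(ARRAY[1]);  -- IDs claimed above

COMMIT;
```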

If workflow scheduling fails, the transaction is rolled back, and the respective record will be retried
during the next poll.

> The workflow engine may reside on a separate database, so scheduling of workflows
> can't happen atomically with the polling of outbox records. It is possible that a
> workflow gets scheduled, but committing of the transaction fails. In the worst case,
> multiple workflows get scheduled for the same notification. This should be rare,
> but even then, it would still satisfy our at-least-once delivery goal.

The actual delivery is then taken care of by a workflow:

```mermaid
---
title: Notification Publishing Workflow
---
sequenceDiagram
    participant W as "Publish Notification"<br/>Workflow
    participant A as "Publish Notification"<br/>Activity
    participant P as Postgres
    participant D as Destination

    activate W
    loop for each matched rule
        W->>+A: Invoke<br/>Args: Rule name, notification
        deactivate W
        A->>P: Retrieve publisher config
        P-->>A: Destination URL,<br/>credentials, template
        A->>-D: Publish
    end
```

**Pro**:

1. Allows for atomic emission of notifications.
2. Marking outbox records as processed instead of deleting them enables (targeted) replay of past notifications.
3. Partitioning the outbox table by timestamp makes retention enforcement cheap via `DROP TABLE`.
4. Allows for multiple concurrent routers.

**Con**:

1. More database overhead: `INSERT`s, `UPDATE`s / `DELETE`s, polling, vacuuming, retention, storage.
2. The overhead can't be delegated to a separate database without losing transactional guarantees.
3. Duplicates queueing logic we already have in the workflow orchestration system.
4. Partitioning by timestamp requires partition management, either manually or via `pg_partman`.
5. Multiple concurrent routers increase the chance of delivering notifications out of order.
6. Without partitioning, a separate retention enforcement mechanism is required.

#### B: Use Postgres logical replication messages

A way to sidestep the drawbacks of maintaining an outbox table is to emit and consume logical replication messages.
Here, notifications are written to Postgres' write-ahead log (WAL), but never materialized into an actual table.
This still provides transactional guarantees, but completely avoids the overhead of table maintenance.

The procedure is inspired by Gunnar Morling's
[*The Wonders of Postgres Logical Decoding Messages*](https://www.infoq.com/articles/wonders-of-postgres-logical-decoding-messages/)
article.

Conceptually, the process of publishing notifications remains mostly identical to
[option A](#a-use-transactional-outbox-pattern):

```mermaid
---
title: Notification Publishing Process with Logical Replication Messages
---
sequenceDiagram
    participant D as Domain
    participant P as Postgres
    participant N as Notification Router
    participant W as "Publish Notification"<br/>Workflow

    activate D
    D->>P: Begin TX
    D->>P: ...
    D->>P: pg_logical_emit_message<br/>(?, 'notification', <content>)
    D->>-P: Commit TX
    loop continuously
        activate N
        N->>P: Consume logical<br/>replication message
        N->>N: Evaluate<br/>notification rules
        opt At least one rule matched
            N->>W: Schedule "Publish Notification" workflow<br/>Args: Rule names, notification
        end
        N->>-P: Mark message LSN as<br/>applied / flushed
    end
```

Instead of polling a table, the router reads a stream of messages from a logical replication slot.
Each message has an associated log sequence number (LSN), which represents its position in the WAL.
In order for Postgres to know that a message was delivered successfully, the router acknowledges
the LSN of processed messages. This is similar to how Kafka uses offset commits to track progress
within a topic partition.
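
A rough sketch of the moving parts at the SQL level; the slot and prefix names are illustrative, not final:

```sql
-- One-time setup: create a logical replication slot using the built-in
-- pgoutput decoding plugin (message consumption through pgoutput
-- requires Postgres v14+).
SELECT pg_create_logical_replication_slot('notifications', 'pgoutput');

-- Inside the domain's transaction: emit the notification as a
-- transactional WAL message instead of inserting it into a table.
-- Arguments: transactional flag, prefix, content.
SELECT pg_logical_emit_message(true, 'notification', 'serialized protobuf'::bytea);
```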

The *Publish Notification* workflow being scheduled remains identical to [option A](#a-use-transactional-outbox-pattern).

**Pro**:

1. Allows for atomic emission of notifications.
2. Less pressure on the WAL. [Option A](#a-use-transactional-outbox-pattern) involves *at least* one `INSERT`
   and one `UPDATE` or `DELETE` per notification, each of which writes to the WAL, too.
3. No increase in storage requirements.
4. No retention logic necessary.
5. No additional overhead for autovacuum.

**Con**:

1. Only a single instance can consume from a replication slot at a time.
2. Logical replication requires a special kind of connection and thus can't go through a connection pooler.
3. Requires Postgres v14 or later. This is when the default `pgoutput` decoding plugin started to support
   consumption of replication messages.

#### C: Schedule workflows directly

This option is similar to [A](#a-use-transactional-outbox-pattern) and
[B](#b-use-postgres-logical-replication-messages), but skips the respective intermediary step.

Notifications are no longer emitted atomically with the domain's persistence operations, but instead
*after* the database transaction has committed successfully, effectively re-introducing the [dual write problem]:

```mermaid
sequenceDiagram
    participant D as Domain
    participant P as Postgres
    participant W as "Publish Notification"<br/>Workflow

    activate D
    D->>P: Begin TX
    D->>P: ...
    D->>P: Commit TX
    D->>-W: Schedule<br/>Args: Notification
```

The routing based on configured notification rules is performed as part of the *Publish Notification* workflow:

```mermaid
---
title: Notification Publishing Workflow with Routing
---
sequenceDiagram
    participant W as "Publish Notification"<br/>Workflow
    participant A1 as "Route Notification"<br/>Activity
    participant A2 as "Publish Notification"<br/>Activity
    participant P as Postgres
    participant D as Destination

    activate W
    W->>A1: Invoke<br/>Args: Notification
    A1-->>W: Matched rule names
    loop for each matched rule
        W->>+A2: Invoke<br/>Args: Rule name, notification
        deactivate W
        A2->>P: Retrieve publisher config
        P-->>A2: Destination URL,<br/>credentials, template
        A2->>-D: Publish
    end
```

**Pro**:

1. Fewer moving parts than options [A](#a-use-transactional-outbox-pattern) and
   [B](#b-use-postgres-logical-replication-messages).
2. All concerns related to publishing are neatly encapsulated in a workflow.
3. Rule resolution benefits from the same reliability guarantees as the publishing itself.

**Con**:

1. Atomic emission of notifications is impossible.
2. Workflows are scheduled even if no rule is configured for the notification at hand,
   increasing pressure on the workflow system.

## Decision

We propose to follow option [B](#b-use-postgres-logical-replication-messages), because:

1. Maintaining an outbox table as detailed in option [A](#a-use-transactional-outbox-pattern) comes with a lot of
   overhead that we prefer not to deal with.
2. Option [C](#c-schedule-workflows-directly) does not provide transactional guarantees.

TODO: Update with final decision.

## Consequences

* The minimum required Postgres version becomes 14. This version is over three years old and well supported
  across all managed offerings. We don't anticipate this being a problem.
* It must be possible to configure separate database connection details for the notification router,
  in case a pooler like PgBouncer is used. Logical replication requires a direct connection.

[ADR-001]: 001-drop-kafka-dependency.md
[ADR-002]: 002-workflow-orchestration.md
[Confluent Parallel Consumer]: https://github.com/confluentinc/parallel-consumer
[dual write problem]: https://www.confluent.io/blog/dual-write-problem/
[hyades/#771]: https://github.com/DependencyTrack/hyades/issues/771
[Protobuf]: https://protobuf.dev/
[TOAST]: https://www.postgresql.org/docs/current/storage-toast.html
[transactional outbox]: https://microservices.io/patterns/data/transactional-outbox.html