| Status   | Date       | Author(s)                            |
|:---------|:-----------|:-------------------------------------|
| Proposed | 2025-01-07 | [@nscuro](https://github.com/nscuro) |

## Context

### How Kafka is currently used

As of hyades version 0.6.0, Kafka is used for the following purposes:

* **Notification dispatching**. Each notification type has its own Kafka topic. The `notification-publisher` service
  is responsible for consuming from those topics, and publishing notifications based on the configured rules,
  e.g. sending Slack messages or webhooks. Because Kafka does not delete messages after consumption, notifications
  can be consumed by multiple clients, and replayed if necessary. Given a consistent message key, Kafka can further
  guarantee message ordering.
* **Vulnerability mirroring**. Vulnerability records downloaded from the NVD and other sources are not immediately
  written to the database. Instead, they are sent to a [compacted Kafka topic], from where they are consumed and ingested.
  Kafka acts as a firehose that allows ingestion to be performed at a steady rate, without overloading the database.
* **Vulnerability analysis**. The API server publishes a record for each to-be-analyzed component to Kafka.
  The `vulnerability-analyzer` service consumes from this topic, scans each component with all configured scanners,
  and publishes the results to a separate Kafka topic. The results topic is consumed by the API server,
  which is responsible for ingesting them into the database. Analysis makes heavy use of stream processing techniques
  to improve performance. The full process is documented [here](https://github.com/DependencyTrack/hyades/blob/8f1dd4cb4e02c8b4b646217ddd006ef81490cdec/vulnerability-analyzer/README.md#how-it-works).
* **Repository metadata analysis**. Similar to vulnerability analysis, but for component metadata such as latest
  available versions, publish timestamps, and hashes. The process is documented [here](https://github.com/DependencyTrack/hyades/tree/8f1dd4cb4e02c8b4b646217ddd006ef81490cdec/repository-meta-analyzer#how-it-works).

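The per-key ordering guarantee mentioned above comes from Kafka's partitioning: a consistent message key always maps to the same partition, and ordering is only guaranteed within a partition. A minimal sketch of that idea (not hyades code; CRC32 stands in for Kafka's actual murmur2 partitioner):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition, in the spirit of Kafka's default
    partitioner (which hashes with murmur2; CRC32 is used here for brevity)."""
    return zlib.crc32(key) % num_partitions

# The same key always lands on the same partition, so all messages for one
# component are consumed in the order they were produced.
assert partition_for(b"component-123", 6) == partition_for(b"component-123", 6)
```

This is also why adding partitions later is disruptive: the key-to-partition mapping changes, and per-key ordering across the resize is lost.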
The Kafka ecosystem is huge, and there exist many managed offerings for it. In recent years, many more
projects and products were released that implement the Kafka protocol, giving users more choice.
Operating Kafka has gotten easier, both due to the many new implementations, and because Kafka has dropped
the previously mandatory ZooKeeper dependency.

Message throughput is fantastic. Durability and ordering guarantees are great.

### Issues and limitations

* **Retries**. Kafka does not yet support ACKs or NACKs of individual messages. It works with log offsets,
  making it difficult to implement fine-grained retry mechanisms for individual failed messages
  without complex client-side solutions. Hyades implements retries using [Kafka Streams state stores],
  or by leveraging [Confluent Parallel Consumer]. With [KIP-932], native ACKs of individual messages are on the horizon.
* **Prioritization**. Kafka is an append-only log, and as such does not support prioritization of messages.
  Prioritization is required to ensure that actions triggered by users or clients take precedence over scheduled ones.
  [Implementing prioritization on top of Kafka primitives](https://www.confluent.io/blog/prioritize-messages-in-kafka/)
  is complex and inflexible. [KIP-932] does not cover priorities. The lack of prioritization can *partly* be compensated
  for by ensuring high throughput of the system, which Kafka *does* support. But that is not a sustainable solution.
* **Message sizes**. Kafka has a default message size limit of `1 MiB`. We observed notifications and vulnerability
  analysis results growing larger than `1 MiB`, even when compressed. The size limit can be increased on a per-topic basis,
  but doing so comes with a performance penalty. We further found that some organizations disallow increasing size limits
  entirely, to limit the impact on other teams sharing the same brokers.
* **End-to-end observability**. Tracking message flow and workflow progress across multiple topics and services
  requires dedicated monitoring for logs, metrics, and traces. This raises the barrier to entry for operating
  Dependency-Track clusters, and complicates both debugging and development. Relying solely on a PubSub broker like
  Kafka or DT v4's internal message bus promotes [choreography] over [orchestration]. Choreography makes processes
  increasingly hard to understand and follow. The initial [workflow state tracking] implementation attempted to lessen
  the pain, but with the logic scattered across all choreography participants, it is of limited help.
* **Spotty support for advanced Kafka features**. Kafka comes with advanced features like transactions, compacted topics,
  and more. We found that support for these is very spotty across alternative implementations (in particular for transactions).
  Further, we received feedback that organizations operating shared Kafka clusters may prohibit the usage of compacted
  topics. With only bare-bones features left available, the argument for Kafka becomes a lot less compelling.
* **Topic management**. Partitions are what enables parallelism for Kafka consumers. The number of partitions must
  be decided before topics are created. Increasing partitions later is possible, decreasing is not. Adding partitions
  impacts ordering guarantees and can be tricky to coordinate. In order to leverage stream processing techniques,
  some topics must be [co-partitioned](https://www.confluent.io/blog/co-partitioning-in-kafka-streams/).
  Generic tooling around topic management, comparable to database migration tooling, is severely lacking,
  making it hard to maintain for a diverse user base. Vendor-specific tooling is available,
  such as [Strimzi's topic operator](https://strimzi.io/docs/operators/latest/overview#overview-concepts-topic-operator-str).
* **Community support**. Running an additional piece of infrastructure ourselves is one thing. Supporting a whole
  community in doing that correctly and efficiently is another. Unfortunately, there is no single deployment
  or configuration that works for everyone. We don't have dedicated support staff and need to be pragmatic about
  what we can realistically support. Requiring Kafka doesn't help.

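To make the prioritization limitation concrete, here is a small illustrative sketch (not hyades code; event names are made up) contrasting consumption from an append-only log with a priority queue:

```python
import heapq

# An append-only log (Kafka-like) is consumed strictly in arrival order;
# a priority queue reorders work on dequeue.
log = []  # consumed by ascending offset, i.e. insertion order
pq = []   # heapq: entry with the smallest priority value is dequeued first

for priority, event in [(5, "scheduled-scan"), (1, "user-triggered-scan")]:
    log.append(event)
    heapq.heappush(pq, (priority, event))

assert log[0] == "scheduled-scan"                      # arrival order wins
assert heapq.heappop(pq)[1] == "user-triggered-scan"   # priority order wins
```

With Kafka, the user-triggered scan can only overtake the scheduled one through workarounds such as separate topics per priority level.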
In summary, *Kafka on its own does not provide enough benefit for us to justify its usage*.

We hoped it would help in more areas, but ended up realizing that working around these issues required even more
overhead and infrastructure. That is not sustainable, given we already spent [innovation tokens]
on Kafka itself, and have limited team capacity.

### Possible Solutions

#### A: Replace Kafka with another message broker

We could replace Kafka with another, more lightweight broker, like [ActiveMQ], [RabbitMQ], [NATS], or [Redis].

[ActiveMQ] and [RabbitMQ] support [AMQP] and [JMS] as common messaging protocols.
Managed offerings are widely available, both for these specific brokers and for alternative [AMQP] implementations.

[NATS] can cater to the widest range of use cases, but managed offerings are mainly limited to
one vendor ([Synadia]). Realistically, users would need to maintain their own clusters.
[NATS JetStream] can provide Kafka-like semantics, but also work queues, key-value stores, and object stores.
While its protocols are public and well-documented, there are currently no alternative server implementations.

[Redis] provides data structures for classic queues (i.e. lists) and priority queues (i.e. sorted sets).
It can act as a publish-subscribe broker, although it only provides at-most-once delivery guarantees there.

*Pro*:

1. [AMQP]-compatible brokers come with built-in support for retries and prioritization.
2. [NATS] could also be used as blob storage.
3. [Redis] could also be used for caching.

*Con*:

1. Still requires an additional dependency.
2. Still inherits many of the issues we have with Kafka (e.g. topic / queue management, e2e observability).
3. We don't have expertise in configuring and operating any of these.
4. Fully managed offerings are more scarce, especially for [NATS].
5. Following a license change in 2024, the [Redis] ecosystem has become fragmented. [Redis] itself is no longer
   permissively licensed. Forks like [ValKey] exist, but the whole situation is concerning.

#### B: Use an in-memory data grid

In-memory data grids (IMDGs) are a popular option for various use cases in the JVM ecosystem,
including messaging. Prominent solutions in this space include [Hazelcast], [Ignite], and [Infinispan].

IMDGs could further be combined with frameworks such as Eclipse [Vert.x], which use them for clustering.

*Pro*:

1. Could also be used for caching.

*Con*:

1. Most IMDGs still require a central server and are thus not necessarily simpler than a normal message broker.
2. No managed offering in any major cloud(?).
3. We only have very limited experience in configuring and operating any of these.
4. Except for Hazelcast, there is only very limited support for advanced data structures like priority queues.
5. Upgrades are tricky to coordinate and require downtime. Rolling upgrades are a paid feature in Hazelcast.

#### C: Just use Postgres

We already decided to focus entirely on Postgres for our database. We dropped support for H2, MSSQL, and MySQL
as a result. This decision opens up a lot more possibilities when it comes to other parts of the stack.

Solutions like [JobRunr], [Hatchet], [Oban], [pgmq], [River], and [Solid Queue] demonstrate that building
queues or queue-like systems on an RDBMS, *and Postgres specifically*, is viable.

Running such workloads on a database does not necessarily mean that the database must be shared
with the core application. It *can* be done for smaller deployments to keep complexity low,
but larger deployments can simply leverage a separate database.

Both architecture and operations are simpler, even if more database servers were to be involved.

Database migrations are well understood, and easy to test and automate. With Liquibase, we already have
great tooling in our stack.

Organizations that are able to provision and support a Postgres database for the core application will
also have an easier time provisioning more instances if needed, versus having to procure another technology altogether.

Postgres is also a lot more common than any of the message brokers or IMDGs.

Performance-wise, messaging and queueing are not our main bottleneck. Since all asynchronous operations involve
access to a database or external service anyway, raw message throughput is not a primary performance driver for us.

Impact on database performance can be reduced by sizing units of work a little bigger: for example, processing
all components of a project in a single task, rather than each component individually. Fewer writes and fewer
transactions lead to more headroom.

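To illustrate how simple the queueing primitive can be, here is a hypothetical sketch of the claim-next-task pattern (table name, columns, and priority values are made up, not an actual hyades schema). SQLite stands in so the snippet is self-contained; on Postgres the `SELECT` would additionally use `FOR UPDATE SKIP LOCKED` so concurrent workers never claim the same row:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE task_queue (
        id       INTEGER PRIMARY KEY,
        priority INTEGER NOT NULL DEFAULT 0,  -- higher = more urgent
        payload  TEXT NOT NULL
    )
""")
db.executemany(
    "INSERT INTO task_queue (priority, payload) VALUES (?, ?)",
    [(0, "scheduled-analysis"), (10, "user-triggered-analysis")],
)

def claim_next_task(conn):
    # Postgres: ... ORDER BY priority DESC, id LIMIT 1 FOR UPDATE SKIP LOCKED
    row = conn.execute(
        "SELECT id, payload FROM task_queue ORDER BY priority DESC, id LIMIT 1"
    ).fetchone()
    if row is not None:
        conn.execute("DELETE FROM task_queue WHERE id = ?", (row[0],))
    return row

# The user-triggered task is claimed first, even though it arrived later.
assert claim_next_task(db)[1] == "user-triggered-analysis"
assert claim_next_task(db)[1] == "scheduled-analysis"
```

Note how prioritization and (with an `attempts` column) retries fall out of plain SQL, which is exactly what the Kafka setup had to work around.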
*Pro*:

1. Drastically simplified tech stack. Easier to develop with and to support.
2. We already have expertise in configuration and operation.
3. Abundance of managed offerings across a wide range of vendors.
4. Retries, priorities, and observability are easier to implement with a strongly consistent SQL database.
5. Migrations are simpler, and we already have tooling for them.
6. More flexible: If we have special needs for queueing, we can just build it ourselves,
   rather than adopting yet another technology, or implementing more workarounds.

*Con*:

1. Will not scale as far as Kafka or other dedicated brokers could.
2. We're binding ourselves more to one specific technology.

## Decision

We propose to follow solution **C**. Go all-in on Postgres.

TODO: Update with final decision.

## Consequences

* Functionalities that currently rely on Kafka will need to be re-architected for Postgres.
* Since we already have a few adopters of hyades, the transition will need to be gradual.
* We need a Kafka [notification publisher] to ensure that users relying on this functionality are not cut off.


[ActiveMQ]: https://activemq.apache.org/
[AMQP]: https://www.amqp.org/
[compacted Kafka topic]: https://docs.confluent.io/kafka/design/log_compaction.html
[choreography]: https://microservices.io/patterns/data/saga.html#example-choreography-based-saga
[Confluent Parallel Consumer]: https://github.com/confluentinc/parallel-consumer
[Hatchet]: https://hatchet.run/
[Hazelcast]: https://hazelcast.com/
[JMS]: https://en.wikipedia.org/wiki/Jakarta_Messaging
[JobRunr]: https://www.jobrunr.io/en/
[Ignite]: https://ignite.apache.org/use-cases/in-memory-data-grid.html
[Infinispan]: https://infinispan.org/
[innovation tokens]: https://boringtechnology.club/#17
[Kafka Streams state stores]: https://kafka.apache.org/39/documentation/streams/core-concepts#streams_state
[KIP-932]: https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka
[NATS]: https://nats.io/
[NATS JetStream]: https://docs.nats.io/nats-concepts/jetstream
[notification publisher]: https://github.com/DependencyTrack/hyades/blob/main/notification-publisher/src/main/java/org/dependencytrack/notification/publisher/Publisher.java
[Oban]: https://getoban.pro/
[orchestration]: https://microservices.io/patterns/data/saga.html#example-orchestration-based-saga
[pgmq]: https://github.com/tembo-io/pgmq
[RabbitMQ]: https://www.rabbitmq.com/
[Redis]: https://redis.io/
[River]: https://riverqueue.com/
[Solid Queue]: https://github.com/rails/solid_queue
[Synadia]: https://www.synadia.com/
[ValKey]: https://valkey.io/
[Vert.x]: https://vertx.io/
[workflow state tracking]: ../design/workflow-state-tracking.md