Commit 1afc1dd

Merge pull request #1619 from DependencyTrack/adr-drop-kafka
Here is where we keep our architecture decision records ([ADR]s).

## Template

Use the following template ([source]) when adding a new [ADR]:

```markdown linenums="1"
| Status | Date | Author(s) |
|:-----------------------------------------------------|:-----------|:-------------------------------------------------|
| Proposed, Accepted, Rejected, Deprecated, Superseded | yyyy-mm-dd | [@githubHandle](https://github.com/githubHandle) |

## Context

What is the issue that we're seeing that is motivating this decision or change?

## Decision

What is the change that we're proposing and/or doing?

## Consequences

What becomes easier or more difficult to do because of this change?
```

[ADR]: https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions
[source]: https://github.com/joelparkerhenderson/architecture-decision-record/blob/7d759443a20e8bafc3e5d19631bd7eda475b25d3/locales/en/templates/decision-record-template-by-michael-nygard/index.md
---
| Status   | Date       | Author(s)                            |
|:---------|:-----------|:-------------------------------------|
| Proposed | 2025-01-07 | [@nscuro](https://github.com/nscuro) |

## Context

### How Kafka is currently used

As of hyades version 0.6.0, Kafka is used for the following purposes:

* **Notification dispatching**. Each notification type has its own Kafka topic. The `notification-publisher` service
  is responsible for consuming from those topics and publishing notifications based on the configured rules,
  e.g. sending Slack messages or webhooks. Because Kafka does not delete messages after consumption, notifications
  can be consumed by multiple clients, and replayed if necessary. Given a consistent message key, Kafka can further
  guarantee message ordering.
* **Vulnerability mirroring**. Vulnerability records downloaded from the NVD and other sources are not immediately
  written to the database. Instead, they are sent to a [compacted Kafka topic], from where they are consumed and ingested.
  Kafka acts as a firehose that allows ingestion to be performed at a steady rate, without overloading the database.
* **Vulnerability analysis**. The API server publishes a record for each to-be-analyzed component to Kafka.
  The `vulnerability-analyzer` service consumes from this topic, scans each component with all configured scanners,
  and publishes the results to a separate Kafka topic. The results topic is consumed by the API server,
  which is responsible for ingesting them into the database. Analysis makes heavy use of stream processing techniques
  to improve performance. The full process is documented [here](https://github.com/DependencyTrack/hyades/blob/8f1dd4cb4e02c8b4b646217ddd006ef81490cdec/vulnerability-analyzer/README.md#how-it-works).
* **Repository metadata analysis**. Similar to vulnerability analysis, but for component metadata such as latest
  available versions, publish timestamps, and hashes. The process is documented [here](https://github.com/DependencyTrack/hyades/tree/8f1dd4cb4e02c8b4b646217ddd006ef81490cdec/repository-meta-analyzer#how-it-works).

The Kafka ecosystem is huge, and many managed offerings exist for it. In recent years, many more
projects and products that implement the Kafka protocol have been released, giving users more choice.
Operating Kafka has gotten easier, both due to the many new implementations, and because Kafka has dropped
the previously mandatory ZooKeeper dependency.

Message throughput is fantastic. Durability and ordering guarantees are great.

### Issues and limitations

* **Retries**. Kafka does not yet support ACKs or NACKs of individual messages. It works with log offsets,
  which makes it difficult to implement fine-grained retries for individual failed messages without complex
  client-side solutions. Hyades implements retries using [Kafka Streams state stores],
  or by leveraging [Confluent Parallel Consumer]. With [KIP-932], native ACKs of individual messages are on the horizon.
* **Prioritization**. Kafka is an append-only log, and as such does not support prioritization of messages.
  Prioritization is required to ensure that actions triggered by users or clients take precedence over scheduled ones.
  [Implementing prioritization on top of Kafka primitives](https://www.confluent.io/blog/prioritize-messages-in-kafka/)
  is complex and inflexible. [KIP-932] does not cover priorities. The lack of prioritization can *partly* be compensated
  for by ensuring high throughput of the system, which Kafka *does* support. But it's not a sustainable solution.
* **Message sizes**. Kafka has a default message size limit of `1 MiB`. We observed notifications and vulnerability
  analysis results growing larger than `1 MiB`, even when compressed. The size limit can be increased on a per-topic basis,
  but doing so comes with a performance penalty. We further found that some organizations disallow increasing size limits
  entirely, to limit the impact on other teams sharing the same brokers.
* **End-to-end observability**. Tracking message flow and workflow progress across multiple topics and services
  requires dedicated monitoring for logs, metrics, and traces. This raises the barrier to entry for operating
  Dependency-Track clusters, and complicates both debugging and development. Relying solely on a PubSub broker like
  Kafka or DT v4's internal message bus promotes [choreography] over [orchestration]. Choreography makes processes
  increasingly hard to understand and follow. The initial [workflow state tracking] implementation attempted to lessen
  the pain, but with its logic scattered across all choreography participants, it did not help much.
* **Spotty support for advanced Kafka features**. Kafka comes with advanced features like transactions, compacted topics,
  and more. We found that support for these is very spotty across alternative implementations (transactions in particular).
  Further, we received feedback that organizations operating shared Kafka clusters may prohibit the use of compacted
  topics. With only bare-bones features left available, the argument for Kafka becomes a lot less compelling.
* **Topic management**. Partitions are what enables parallelism for Kafka consumers. The number of partitions must
  be decided before topics are created. Increasing partitions later is possible; decreasing is not. Adding partitions
  impacts ordering guarantees and can be tricky to coordinate. In order to leverage stream processing techniques,
  some topics must be [co-partitioned](https://www.confluent.io/blog/co-partitioning-in-kafka-streams/).
  Generic tooling around topic management, comparable to database migration tooling, is severely lacking,
  making topics hard to maintain for a diverse user base. Vendor-specific tooling is available,
  such as [Strimzi's topic operator](https://strimzi.io/docs/operators/latest/overview#overview-concepts-topic-operator-str).
* **Community support**. Running an additional piece of infrastructure ourselves is one thing. Supporting a whole
  community in doing that correctly and efficiently is another. Unfortunately, there is no single deployment
  or configuration that works for everyone. We don't have dedicated support staff and need to be pragmatic about
  what we can realistically support. Requiring Kafka doesn't help.

In summary, *Kafka on its own does not provide enough benefit for us to justify its usage*.

We hoped it would help in more areas, but ended up realizing that working around these issues would require even more
overhead and infrastructure. That is not sustainable, given that we already spent [innovation tokens]
on Kafka itself, and have limited team capacity.

### Possible Solutions

#### A: Replace Kafka with another message broker

We could replace Kafka with another, more lightweight broker, like [ActiveMQ], [RabbitMQ], [NATS], or [Redis].

[ActiveMQ] and [RabbitMQ] support [AMQP] and [JMS] as common messaging protocols.
Managed offerings are widely available, both for these specific brokers and for alternative [AMQP] implementations.

[NATS] can cater to the widest range of use cases, but managed offerings are mainly limited to
one vendor ([Synadia]). Realistically, users would need to maintain their own clusters.
[NATS JetStream] can provide Kafka-like semantics, but also work queues, key-value stores, and object stores.
While its protocols are public and well-documented, there are currently no alternative server implementations.

[Redis] provides data structures for classic queues (i.e. lists) and priority queues (i.e. sorted sets).
It can act as a publish-subscribe broker, although it only provides at-most-once delivery guarantees in that role.

*Pro*:

1. [AMQP]-compatible brokers come with built-in support for retries and prioritization.
2. [NATS] could also be used as blob storage.
3. [Redis] could also be used for caching.

*Con*:

1. Still requires an additional dependency.
2. Still inherits many of the issues we have with Kafka (e.g. topic / queue management, end-to-end observability).
3. We don't have expertise in configuring and operating any of these.
4. Fully managed offerings are more scarce, especially for [NATS].
5. Following a license change in 2024, the [Redis] ecosystem has become fragmented. [Redis] itself is no longer
   permissively licensed. Forks like [ValKey] exist, but the whole situation is concerning.

#### B: Use an in-memory data grid

In-memory data grids (IMDGs) are a popular option for various use cases in the JVM ecosystem,
including messaging. Prominent solutions in this space include [Hazelcast], [Ignite], and [Infinispan].

IMDGs could further be combined with frameworks such as Eclipse [Vert.x], which use them for clustering.

*Pro*:

1. Could also be used for caching.

*Con*:

1. Most IMDGs still require a central server and are thus not necessarily simpler than a regular message broker.
2. No managed offering in any major cloud(?).
3. We only have very limited experience in configuring and operating any of these.
4. Except for Hazelcast, there is only very limited support for advanced data structures like priority queues.
5. Upgrades are tricky to coordinate and require downtime. Rolling upgrades are a paid feature in Hazelcast.

#### C: Just use Postgres

We already decided to focus entirely on Postgres for our database. We dropped support for H2, MSSQL, and MySQL
as a result. This decision opens up a lot more possibilities when it comes to other parts of the stack.

Solutions like [JobRunr], [Hatchet], [Oban], [pgmq], [River], and [Solid Queue] demonstrate that building
queues or queue-like systems on an RDBMS, *and on Postgres specifically*, is viable.
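
A common building block of such Postgres-based queues is the `FOR UPDATE SKIP LOCKED` locking clause,
which lets competing workers claim tasks without blocking one another. A minimal sketch of the pattern
(the `task_queue` table and its columns are illustrative, not an actual hyades schema):

```sql
-- Illustrative task queue table; names are made up for this sketch.
CREATE TABLE task_queue (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    kind       TEXT        NOT NULL,
    payload    JSONB,
    priority   INT         NOT NULL DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- A worker claims the highest-priority, oldest task. Rows locked by
-- concurrent workers are skipped instead of blocked on, so many
-- workers can poll the same table safely.
DELETE FROM task_queue
 WHERE id = (
     SELECT id
       FROM task_queue
      ORDER BY priority DESC, created_at
      LIMIT 1
      FOR UPDATE SKIP LOCKED
 )
RETURNING id, kind, payload;
```

Production-grade solutions layer visibility timeouts and completion tracking on top of this,
but the core remains a handful of plain SQL statements.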

Running such workloads on a database does not necessarily mean that the database must be shared
with the core application. Sharing *can* be done for smaller deployments to keep complexity low,
while larger deployments can simply leverage a separate database.

Both architecture and operations are simpler, even if more database servers were to be involved.

Database migrations are well understood, and easy to test and automate. With Liquibase, we already have
great tooling in our stack.

Organizations that are able to provision and support a Postgres database for the core application will
also have an easier time provisioning more instances if needed, versus having to procure another technology altogether.

Postgres is also a lot more common than any of the message brokers or IMDGs.

Performance-wise, messaging and queueing is not our main bottleneck. Since all asynchronous operations involve
access to a database or external service anyway, raw message throughput is not a primary performance driver for us.

The impact on database performance can be reduced by sizing units of work a little bigger, for example by processing
all components of a project in a single task, rather than each component individually. Fewer writes and fewer
transactions lead to more headroom.
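
To illustrate the coarser sizing (`vuln_analysis_task`, `project`, and `component` are hypothetical names
for this sketch, not the real schema): one task row per project replaces potentially thousands of
per-component rows:

```sql
-- One analysis task per project instead of one per component:
-- a single INSERT, and far fewer rows for the queue to manage.
INSERT INTO vuln_analysis_task (project_id)
SELECT id
  FROM project
 WHERE last_analyzed_at < NOW() - INTERVAL '24 hours';

-- The worker that claims a task expands it back into components
-- within one transaction, amortizing per-transaction overhead.
SELECT id, purl
  FROM component
 WHERE project_id = $1; -- project_id of the claimed task
```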
156+
157+
*Pro*:
158+
159+
1. Drastically simplified tech stack. Easier to develop with and to support.
160+
2. We already have expertise in configuration and operation.
161+
3. Abundance of managed offerings across wide range of vendors.
162+
4. Retries, priorities, observability are easier to implement with a strongly consistent SQL database.
163+
5. Migrations are simpler, we already have tooling for it.
164+
6. More flexible: If we have special needs for queueing, we can just build it ourselves,
165+
rather than adopting yet another technology, or implementing more workarounds.
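
Point 4 above can be made concrete with a sketch (table and column names are illustrative):
a retry with exponential backoff reduces to a single `UPDATE`, and basic queue observability
to a plain aggregate query:

```sql
-- Illustrative retry handling: reschedule a failed task with
-- exponential backoff by updating a few columns in one statement.
UPDATE task_queue
   SET attempts   = attempts + 1,
       not_before = NOW() + INTERVAL '30 seconds' * POWER(2, attempts),
       last_error = $2
 WHERE id = $1;

-- Observability comes almost for free: pending work per task kind
-- is a single aggregate query against the same table.
SELECT kind, COUNT(*) AS pending
  FROM task_queue
 GROUP BY kind;
```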

*Con*:

1. Will not scale as far as Kafka or other dedicated brokers could.
2. We're binding ourselves more to one specific technology.

## Decision

We propose to follow solution **C**. Go all-in on Postgres.

TODO: Update with final decision.

## Consequences

* Functionalities that currently rely on Kafka will need to be re-architected for Postgres.
* Since we already have a few adopters of hyades, the transition will need to be gradual.
* We need a Kafka [notification publisher] to ensure that users relying on this functionality are not cut off.

[ActiveMQ]: https://activemq.apache.org/
[AMQP]: https://www.amqp.org/
[compacted Kafka topic]: https://docs.confluent.io/kafka/design/log_compaction.html
[choreography]: https://microservices.io/patterns/data/saga.html#example-choreography-based-saga
[Confluent Parallel Consumer]: https://github.com/confluentinc/parallel-consumer
[Hatchet]: https://hatchet.run/
[Hazelcast]: https://hazelcast.com/
[JMS]: https://en.wikipedia.org/wiki/Jakarta_Messaging
[JobRunr]: https://www.jobrunr.io/en/
[Ignite]: https://ignite.apache.org/use-cases/in-memory-data-grid.html
[Infinispan]: https://infinispan.org/
[innovation tokens]: https://boringtechnology.club/#17
[Kafka Streams state stores]: https://kafka.apache.org/39/documentation/streams/core-concepts#streams_state
[KIP-932]: https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka
[NATS]: https://nats.io/
[NATS JetStream]: https://docs.nats.io/nats-concepts/jetstream
[notification publisher]: https://github.com/DependencyTrack/hyades/blob/main/notification-publisher/src/main/java/org/dependencytrack/notification/publisher/Publisher.java
[Oban]: https://getoban.pro/
[orchestration]: https://microservices.io/patterns/data/saga.html#example-orchestration-based-saga
[pgmq]: https://github.com/tembo-io/pgmq
[RabbitMQ]: https://www.rabbitmq.com/
[Redis]: https://redis.io/
[River]: https://riverqueue.com/
[Solid Queue]: https://github.com/rails/solid_queue
[Synadia]: https://www.synadia.com/
[ValKey]: https://valkey.io/
[Vert.x]: https://vertx.io/
[workflow state tracking]: ../design/workflow-state-tracking.md
