Commit e64e916

Author: moxious
Architectural Guidance for neo4j-streams (Antora docs) (#337)
* carry-over of architectural guidance doc for neo4j-streams
1 parent 9297f66 commit e64e916

16 files changed: +329 -1 lines changed
5 binary image files added (85.8 KB, 66.9 KB, 48.5 KB, 63.4 KB, 45.4 KB); previews not shown.

doc/docs/modules/ROOT/nav.adoc

Lines changed: 10 additions & 0 deletions
@@ -54,6 +54,16 @@

 * xref::cloud.adoc[Confluent Cloud]

+* xref::architecture.adoc[Architectural Guidance]
+** xref::architecture/kafkatopics.adoc[Kafka Topics Overview]
+** xref::architecture/pluginvsconnect.adoc[Kafka Connect vs. Plugin]
+** xref::architecture/sinkconsume.adoc[Sink: Consume Records]
+** xref::architecture/sourceproduce.adoc[Source: Produce Records]
+** xref::architecture/retries.adoc[Retries & Error Handling]
+** xref::architecture/throughput.adoc[Factors which Affect Throughput]
+** xref::architecture/transformations.adoc[Streams Transformations]
+** xref::architecture/optimize.adoc[Optimizing Kafka]
+
 * xref::examples.adoc[Examples with Confluent Platform and Kafka Connect Datagen]
 ** xref::examples.adoc#examples_binary_format[Confluent and Neo4j in binary format]
 ** xref::examples.adoc#confluent_docker_example[Confluent with Docker, Neo4j in binary format]

architecture.adoc

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
[[architecture]]
== Architectural Guidance

The purpose of this section is to:

* Describe how this integration works
* Discuss the architectural issues that are at play
* Provide a basis for advice on how to build systems with Neo4j & Kafka

[NOTE]
Customers should use the information in this document to design the best possible pipelines to connect graphs and streams.

== The Challenge: Graph ETL

Data coming from Kafka isn't a graph. When we load JSON or Avro documents into a graph, what we're actually doing is extracting the relevant bits of data from an input message. We're then transforming those relevant bits into a "graph snippet" and loading it into Neo4j.

image::graph-etl.png[align="center"]

Using neo4j-streams is a form of graph ETL. It pays to separate these two pieces (the extraction and the transformation) and handle them individually if we want the result to perform well and remain easy to maintain. If we are producing records from Neo4j back out to Kafka, the challenge is the same, just in the opposite direction.

https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/[Streaming ETL is nothing new for Kafka] -- it is one of the platform's core use cases. The big complicating factor for neo4j-streams is that few people had applied streaming ETL to graphs before it existed.
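
As a concrete illustration of this extract-and-transform step, here is a minimal sketch of a sink configuration in neo4j.conf. The topic name, message fields, and Cypher are hypothetical; each incoming message is exposed to the Cypher template as the `event` map.

[source,properties]
----
# Hypothetical topic "purchases" carrying JSON such as:
#   {"userId": 1, "userName": "Alice", "productId": 42, "productName": "Widget"}
# The Cypher template extracts the relevant fields from the message (the `event` map)
# and transforms them into a small graph snippet before loading it into Neo4j.
streams.sink.enabled=true
streams.sink.topic.cypher.purchases=MERGE (u:User {id: event.userId}) SET u.name = event.userName MERGE (p:Product {id: event.productId}) SET p.name = event.productName MERGE (u)-[:PURCHASED]->(p)
----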

architecture/kafkatopics.adoc

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
= Kafka Topics Overview

image::kafka-partitions.png[align="center"]

Each topic (message queue) is broken into any number of partitions, which can accept independent reads and writes. Producers push messages onto the end of the queue, and readers advance through the partition offsets.

== Partitions and Offsets

Partitions in Kafka are generally read by a client in "round robin" fashion. Partition offsets describe where a given reader is within a given partition. In a simple example with a single partition, Bob may currently be reading at offset 3 while Sarah is at offset 11. Either reader can selectively "rewind" and "replay" any part of the message queue if they like.
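
For example, either reader can replay a partition from a chosen offset. A minimal sketch with the Kafka console consumer (broker address and topic name are hypothetical):

[source,bash]
----
# Re-read partition 0 of the hypothetical topic "my-topic" starting at offset 3,
# replaying everything the broker still retains from that point forward.
kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic my-topic \
  --partition 0 \
  --offset 3
----
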
== Compaction & Retention

Kafka topics can be configured with various "compaction" and "retention" settings that control how long the Kafka cluster keeps which parts of the topic. If all history on the topic is retained, then in theory you can reconstruct the entire state by replaying the topic from the beginning.
== Database Replication

In Kafka land, all databases can be thought of as generic data buckets that can emit messages (producers) or consume messages (consumers). A common technique is to set up one database to publish its ongoing flow of changes (change data capture, or CDC) to a Kafka topic. Generally the Debezium format is used for this, but not exclusively.

**This is an important concept for database replication**, which is a common use case. If you are ingesting from a Kafka topic and your setup performs decently, you may be able to completely recreate a database by "replaying the topic" into a new Neo4j database.
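
For Neo4j-to-Neo4j replication specifically, a minimal sketch: the neo4j-streams sink can apply CDC events produced by another Neo4j instance's streams source, using the `sourceId` strategy (topic name hypothetical):

[source,properties]
----
# Apply CDC-style change events (creates, updates, deletes) from the hypothetical
# topic "neo4j-changes" to this instance, matching nodes by their source id.
streams.sink.enabled=true
streams.sink.topic.cdc.sourceId=neo4j-changes
----
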
== Polyglot Persistence

Given all of this, an important architectural pattern to be aware of is having a single "source of truth" database (such as Oracle) which publishes all of its changes to Kafka, feeding multiple downstream "helper systems" such as Elasticsearch and Neo4j. In this way, copies of the data live in three different places. The helper systems probably don't accept writes; they just add new query capabilities on top of the existing data. High availability may be less of a concern for the helper systems, as they can always be recreated from the topic.

architecture/optimize.adoc

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
= Optimizing Kafka

Neo4j can't ingest quickly if Kafka isn't set up correctly. While this isn't a common source of problems, it has come up. Confluent has https://www.confluent.io/blog/optimizing-apache-kafka-deployment/[good overall documentation] on optimizing Kafka that is worth being familiar with.

The main trade-offs are the following, and they have to make sense at the Kafka layer before they can make sense for Neo4j; a sketch of the corresponding producer settings follows the list.

* Do you want to optimize for high throughput, which is the rate that data is moved from producers to brokers or from brokers to consumers?
* Do you want to optimize for low latency, which is the elapsed time moving messages end-to-end (from producers to brokers to consumers)?
* Do you want to optimize for high durability, which guarantees that messages that have been committed will not be lost?
* Do you want to optimize for high availability, which minimizes downtime in case of unexpected failures? Kafka is a distributed system, and it is designed to tolerate failures.
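
A rough sketch of how these trade-offs surface as Kafka producer settings (values are illustrative, not recommendations):

[source,properties]
----
# Throughput: batch more records per request and compress them.
batch.size=131072
linger.ms=20
compression.type=lz4

# Latency: the opposite choice; send as soon as possible (linger.ms=0).

# Durability: wait for all in-sync replicas to acknowledge each write.
acks=all
enable.idempotence=true

# Availability: handled mostly on the topic/broker side, e.g. a replication
# factor of 3 with min.insync.replicas=2 to tolerate a broker failure.
----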

architecture/pluginvsconnect.adoc

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
= When to use Kafka Connect vs. Neo4j Streams as a Plugin

[abstract]
This section covers how to decide whether to run as a Kafka Connect worker, or as a Neo4j Plugin.

== Kafka Connect

=== Pros

* Processing happens outside of Neo4j, so its memory & CPU usage doesn't impact Neo4j. You don't need to size the database with Kafka utilization in mind.
* Much easier for Kafka pros to manage; they benefit from the Confluent ecosystem, such as using the REST API to manipulate connectors and Control Center to administer & monitor them.
* By restarting the worker, you can restart your sink strategy without any downtime for Neo4j.
* You can upgrade Neo4j-Streams without restarting the Neo4j cluster.
* Strictly an external Bolt client, so better overall security management of its actions.


=== Cons

* You can't use TransactionEventHandlers from outside of the database, so you can only sink to Neo4j; you can't produce from it.
* If you're using Confluent Cloud, you can't host the connector in the cloud (yet), so this requires a third piece of architecture: Confluent Cloud, Neo4j, and the Connect worker (usually a separate VM).
* Possibly worse throughput due to Bolt latency & overhead and a separate network hop.

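For reference, when running under Kafka Connect the sink is configured on the Connect worker rather than in neo4j.conf. A minimal sketch of a connector properties file for a standalone worker, assuming the kafka-connect-neo4j sink connector; connection details, topic, and Cypher are hypothetical:

[source,properties]
----
# Hypothetical connector configuration for connect-standalone.
name=my-neo4j-sink
connector.class=streams.kafka.connect.sink.Neo4jSinkConnector
topics=purchases
neo4j.server.uri=bolt://neo4j-host:7687
neo4j.authentication.basic.username=neo4j
neo4j.authentication.basic.password=secret
# Same extract-and-transform idea as the plugin: one Cypher template per topic.
neo4j.topic.cypher.purchases=MERGE (u:User {id: event.userId}) MERGE (p:Product {id: event.productId}) MERGE (u)-[:PURCHASED]->(p)
----
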
== Neo4j-Streams Plugin

=== Pros

* Much easier for Neo4j pros to manage.
* You can produce records back out to Kafka.
* You can use Neo4j procedures, so "custom produce" (see the later section) becomes an option.
* Possibly better throughput, because you don't have Bolt latency / overhead.


=== Cons

* Memory & CPU consumption on your Neo4j server.
* Requires restarting Neo4j in order to update your configuration and/or Cypher.
* Upgrading the plugin requires a cluster restart.
* You need to keep the configuration identical across all members of the cluster.
* Lesser ability to manage the plugin, because it runs inside of the database and not under a particular user account.
