Commit 7842c72

docs(self-hosted): Kafka troubleshooting from review comments

1 parent 1d300d4 · commit 7842c72

File tree

1 file changed (+10 −2 lines)

  • develop-docs/self-hosted/troubleshooting

develop-docs/self-hosted/troubleshooting/kafka.mdx

Lines changed: 10 additions & 2 deletions
@@ -10,9 +10,9 @@ This section is aimed for those who have Kafka problems, but are not yet familia

 On the inside, when a message enters a topic, it will be written to a certain partition. You can think of a partition as a physical box that stores messages for a specific topic. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine.

-When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. **Important to note: the number of consumers cannot exceed the number of partitions**. If you have more consumers than partitions, the extra consumers will receive no messages.
+When a producer sends a message to a topic, it will either stick to a certain partition number based on the partition key (example: partition 1, partition 2, etc.) or it will choose a partition in a round-robin manner. A consumer will then subscribe to a topic and will automatically be assigned one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. **Important to note: the number of consumers cannot exceed the number of partitions**. If you have more consumers than partitions, the extra consumers will receive no messages.

-Each message in a topic will have an "offset" (number). You can think of this like an "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the time, we want "lag" to be as low as possible. The easiest solution to lagging is adding more partitions and increasing the number of consumers.
+Each message in a topic will have an "offset" (a number). You can think of this like an "index" in an array. The offset is used by the consumer to track where it is in the log and which message it consumed last. Offsets are scoped to a partition, so two different partitions in the same topic can contain messages with the same offset numbers. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the time, we want "lag" to be as low as possible. The easiest solution to lagging is adding more partitions and increasing the number of consumers.

 The difference from other types of queues or brokers like RabbitMQ or Redis is that Kafka has a concept called "retention time". Messages that are stored on Kafka and consumed by consumers won't be deleted immediately. Instead, they will be stored for a certain period of time. By default, self-hosted Sentry uses Kafka with a retention time of 24 hours. This means that messages older than 24 hours will be deleted. If you want to change the retention time, you can do so by modifying the `KAFKA_LOG_RETENTION_HOURS` environment variable in the `kafka` service.

@@ -75,6 +75,10 @@ This happens where Kafka and the consumers get out of sync. Possible reasons are

 Ideally, you want to have zero lag for all consumer groups. If a consumer group has a lot of lag, you need to investigate whether it's caused by a disconnected consumer (e.g., a Sentry/Snuba container that's disconnected from Kafka) or a consumer that's stuck processing a certain message. If it's a disconnected consumer, you can either restart the container or reset the Kafka offset to 'earliest.' Otherwise, you can reset the Kafka offset to 'latest.'

+<Alert level="info" title="Tip">
+Choose "earliest" if you want to start re-processing events from the beginning. Choose "latest" if you are okay with losing old events and want to start processing from the newest events.
+</Alert>
+
 ### Recovery

 <Alert level="warning" title="Warning">
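The effect of the two reset strategies in the tip above can be sketched conceptually. This is not the Kafka CLI (the actual reset is done with the `kafka-consumer-groups` tool and its `--reset-offsets --to-earliest`/`--to-latest` flags); the helper below is a hypothetical model of what each choice means for the group's committed offset.

```python
# A partition's retained log still holds messages 0..9; the consumer
# group's committed offset points at the next message it would read.
log = [f"event-{i}" for i in range(10)]

def reset_offset(strategy: str) -> int:
    """Return the new committed offset for the given reset strategy."""
    if strategy == "earliest":
        return 0          # re-process everything still retained
    if strategy == "latest":
        return len(log)   # skip everything, only consume new messages
    raise ValueError(strategy)

def remaining(offset: int) -> list[str]:
    return log[offset:]

len(remaining(reset_offset("earliest")))  # 10: all retained events re-processed
len(remaining(reset_offset("latest")))    # 0: old events are skipped
```

Either way, messages older than the retention window are gone regardless of strategy, so "earliest" only re-processes what Kafka still retains.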
@@ -176,6 +180,10 @@ If you notice a very slow ingestion speed and consumers are lagging behind, it's
 ```
 5. Observe the logs of `events-consumer`; you should not see any consumer errors. Let it run for a while (usually a few minutes to a few hours) and observe the Kafka topic lags.

+<Alert level="info" title="Tip">
+The definition of "normal lag" varies depending on your system resources. If you are running a small instance, you can expect a normal lag of around hundreds of messages. If you are running a large instance, you can expect a normal lag of around thousands of messages.
+</Alert>
+
 ## Reducing disk usage

 If you want to reduce the disk space used by Kafka, you'll need to carefully calculate how much data you are ingesting, how much data loss you can tolerate, and then follow the recommendations on [this awesome StackOverflow post](https://stackoverflow.com/a/52970982/90297) or [this post on our community forum](https://forum.sentry.io/t/sentry-disk-cleanup-kafka/11337/2?u=byk).
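Lag itself is a simple per-partition quantity: the distance between the partition's newest offset (the log end offset) and the group's committed offset. A minimal sketch, using hypothetical numbers that mirror the `CURRENT-OFFSET`, `LOG-END-OFFSET`, and `LAG` columns printed by `kafka-consumer-groups --describe`:

```python
# LAG = LOG-END-OFFSET - CURRENT-OFFSET, computed per partition.
# The numbers below are made up for illustration.
partitions = {
    # partition: (current_offset, log_end_offset)
    0: (1_500, 1_520),
    1: (980, 2_000),
    2: (3_000, 3_000),
}

def lag(current: int, log_end: int) -> int:
    return log_end - current

per_partition = {p: lag(cur, end) for p, (cur, end) in partitions.items()}
total_lag = sum(per_partition.values())
# Partition 1 dominates here (lag 1020) while the others are near zero --
# a hint that one consumer is stuck or disconnected, rather than the
# whole group simply being too slow.
```

When one partition's lag dwarfs the others like this, restarting or reset-and-restarting that consumer is usually more effective than adding capacity.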
