
Commit ffca9d0

aldy505, LordSimal, jjbayer, sfanahata
authored
docs(self-hosted): provide more insights on troubleshooting kafka (#15131)
Turns out most of our self-hosted users have never touched Kafka before, so it's a good idea to introduce them to how Kafka works. Also added how to increase consumer replicas if they're lagging behind. --------- Co-authored-by: Kevin Pfeifer <[email protected]> Co-authored-by: Joris Bayer <[email protected]> Co-authored-by: Shannon Anahata <[email protected]>
1 parent 7474095 commit ffca9d0

File tree

1 file changed: +66 −15 lines
  • develop-docs/self-hosted/troubleshooting

develop-docs/self-hosted/troubleshooting/kafka.mdx

Lines changed: 66 additions & 15 deletions
@@ -4,19 +4,19 @@ sidebar_title: Kafka
 sidebar_order: 2
 ---

-## Offset Out Of Range Error
+## How Kafka Works

-```log
-Exception: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}
-```
+This section is aimed at those who have Kafka problems but are not yet familiar with Kafka. At a high level, Kafka is a message broker that stores messages in a log, similar in shape to an array. It receives messages from producers that write to a specific topic, then delivers them to consumers subscribed to that topic. The consumers can then process the messages.

-This happens where Kafka and the consumers get out of sync. Possible reasons are:
+Internally, when a message enters a topic, it is written to a certain partition. You can think of a partition as a physical box that stores messages for a specific topic. In a distributed Kafka setup, each partition might live on a different machine/node, but if you only have a single Kafka instance, all partitions are stored on the same machine.

-1. Running out of disk space or memory
-2. Having a sustained event spike that causes very long processing times, causing Kafka to drop messages as they go past the retention time
-3. Date/time out of sync issues due to a restart or suspend/resume cycle
+When a producer sends a message to a topic, it either sticks to a certain partition number derived from the partition key (for example partition 1, partition 2, and so on) or picks a partition in a round-robin manner. A consumer subscribes to a topic and is automatically assigned one or more partitions by Kafka, then starts receiving messages from those partitions. **Important to note: the number of consumers in a group cannot usefully exceed the number of partitions.** If you have more consumers than partitions, the extra consumers will receive no messages.
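The routing rules above can be sketched as a toy Python model. This is an editor's illustration, not Kafka's actual partitioner: all function names are hypothetical, and real clients use a murmur2 hash rather than CRC32.

```python
import itertools
import zlib

NUM_PARTITIONS = 3
_round_robin = itertools.count()  # shared counter for unkeyed messages

def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Toy partitioner: a keyed message always hashes to the same
    partition; an unkeyed message rotates round-robin."""
    if key is not None:
        return zlib.crc32(key.encode()) % num_partitions
    return next(_round_robin) % num_partitions

def assign(partitions, consumers):
    """Toy group assignment: spread partitions over consumers; with more
    consumers than partitions, the extras get an empty assignment."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

For example, `assign([0, 1, 2], ["c1", "c2", "c3", "c4"])` leaves `c4` with no partitions, which is why adding consumers beyond the partition count does not help.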
+
+Each message in a topic has an "offset" (a number). You can think of this like an index into an array. The consumer uses the offset to track where it is in the log and which message it last consumed. Offsets are scoped to a partition, so different partitions in the same topic can carry the same offset numbers. If a consumer cannot keep up with the producer, it starts to lag behind. Most of the time, you want lag to be as low as possible. The easiest remedy for lagging is adding more partitions and increasing the number of consumers.
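Lag per partition is simply the latest offset in the log minus the consumer's committed offset. A minimal sketch (hypothetical numbers for a 3-partition topic):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = log-end offset minus the consumer's committed
    offset; a partition with no commit yet counts from offset 0."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

# Hypothetical offsets: the consumer is caught up on partition 0,
# 5 messages behind on partition 1, and 50 behind on partition 2.
lag = consumer_lag({0: 120, 1: 95, 2: 200}, {0: 120, 1: 90, 2: 150})
total_lag = sum(lag.values())  # 0 + 5 + 50 = 55
```

This is the same number the `kafka-consumer-groups --describe` command shown later reports in its `LAG` column.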
+
+The difference from other queues or brokers such as RabbitMQ or Redis is that Kafka has a concept called "retention time". Messages stored in Kafka and consumed by consumers are not deleted immediately; instead, they are kept for a certain period. By default, self-hosted Sentry runs Kafka with a retention time of 24 hours, which means messages older than 24 hours are deleted. If you want to change the retention time, you can do so by modifying the `KAFKA_LOG_RETENTION_HOURS` environment variable in the `kafka` service.
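The retention rule can be illustrated with a toy sketch (an editor's model of the policy, not how Kafka actually deletes log segments):

```python
from datetime import datetime, timedelta

RETENTION = timedelta(hours=24)  # self-hosted Sentry's default

def expired(message_timestamps, now):
    """Return the message timestamps that fall outside the retention
    window and would therefore be eligible for deletion."""
    cutoff = now - RETENTION
    return [ts for ts in message_timestamps if ts < cutoff]

now = datetime(2024, 1, 2, 12, 0)
old_msg = datetime(2024, 1, 1, 11, 0)  # 25h old: past retention
new_msg = datetime(2024, 1, 2, 9, 0)   # 3h old: still retained
```

Note that deletion is driven purely by age, not by whether a consumer has read the message: a consumer lagging by more than the retention window silently loses data.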

-### Visualize
+### Visualize Kafka

 You can visualize the Kafka consumers and their offsets by bringing an additional container, such as [Kafka UI](https://github.com/provectus/kafka-ui) or [Redpanda Console](https://github.com/redpanda-data/console) into your Docker Compose.


@@ -59,8 +59,26 @@ redpanda-console:
     - kafka
 ```
+It's recommended to put this in `docker-compose.override.yml` rather than modifying your `docker-compose.yml` directly. The UI can then be accessed via `http://localhost:8080/` (or `http://<your-ip>:8080/` if you're using a reverse proxy).
+
+## Offset Out Of Range Error
+
+```log
+Exception: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}
+```
+
+This happens when Kafka and the consumers get out of sync. Possible reasons are:
+
+1. Running out of disk space or memory
+2. Having a sustained event spike that causes very long processing times, causing Kafka to drop messages once they pass the retention time
+3. Date/time out of sync issues due to a restart or suspend/resume cycle
+
Ideally, you want to have zero lag for all consumer groups. If a consumer group has a lot of lag, you need to investigate whether it's caused by a disconnected consumer (e.g., a Sentry/Snuba container that's disconnected from Kafka) or a consumer that's stuck processing a certain message. If it's a disconnected consumer, you can either restart the container or reset the Kafka offset to 'earliest.' Otherwise, you can reset the Kafka offset to 'latest.'

+<Alert level="info" title="Tip">
+Choose "earliest" if you want to start re-processing events from the beginning. Choose "latest" if you are okay with losing old events and want to start processing from the newest events.
+</Alert>
+
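The effect of the two reset strategies can be sketched in a few lines (hypothetical offsets; "earliest" is bounded by retention, since older messages are already deleted):

```python
def next_offset_after_reset(strategy, earliest, latest):
    """Where a consumer resumes after an offset reset: 'earliest' replays
    the whole retained backlog, 'latest' skips straight to new messages."""
    return earliest if strategy == "earliest" else latest

# A partition currently retains offsets 40..99; offsets below 40 were
# deleted by the 24h retention policy.
replay_from = next_offset_after_reset("earliest", 40, 100)  # re-process backlog
skip_to = next_offset_after_reset("latest", 40, 100)        # only new messages
```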
### Recovery

<Alert level="warning" title="Warning">
@@ -77,19 +95,19 @@ The _proper_ solution is as follows ([reported](https://github.com/getsentry/sel
 ```
 2. Receive consumers list:
 ```shell
-docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
+docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
 ```
 3. Get group info:
 ```shell
-docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --describe
+docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --describe
 ```
 4. Watch what will happen to the offset by using a dry run (optional):
 ```shell
-docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --dry-run
+docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --dry-run
 ```
 5. Set offset to latest and execute:
 ```shell
-docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --execute
+docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --execute
 ```
 6. Start the previously stopped Sentry/Snuba containers:
 ```shell
@@ -107,14 +125,16 @@ This option is as follows ([reported](https://github.com/getsentry/self-hosted/i

 1. Set offset to latest and execute:
 ```shell
-docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --all-groups --all-topics --reset-offsets --to-latest --execute
+docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --all-groups --all-topics --reset-offsets --to-latest --execute
 ```

 Unlike the proper solution, this involves resetting the offsets of all consumer groups and all topics.

 #### Nuclear option

-The _nuclear option_ is removing all Kafka-related volumes and recreating them which _will_ cause data loss. Any data that was pending there will be gone upon deleting these volumes.
+<Alert level="warning" title="Warning">
+The _nuclear option_ is removing all Kafka-related volumes and recreating them, which _will_ cause data loss. Any data that was pending there will be gone upon deleting these volumes.
+</Alert>

 1. Stop the instance:
 ```shell
@@ -133,6 +153,37 @@ The _nuclear option_ is removing all Kafka-related volumes and recreating them w
 ```shell
 docker compose up --wait
 ```
+
+## Consumers Lagging Behind
+
+If you notice very slow ingestion and consumer groups are lagging behind, the consumers are likely unable to keep up with the rate at which messages are being produced. To fix this, increase the number of partitions on the affected topic and then increase the number of consumers.
+
+1. For example, if you see that the `ingest-consumer` consumer group has a lot of lag and is subscribed to the `ingest-events` topic, first increase the number of partitions for that topic.
+   ```bash
+   docker compose exec kafka kafka-topics --bootstrap-server kafka:9092 --alter --partitions 3 --topic ingest-events
+   ```
+2. Validate that the number of partitions for the topic is now 3.
+   ```bash
+   docker compose exec kafka kafka-topics --bootstrap-server kafka:9092 --describe --topic ingest-events
+   ```
+3. Then, increase the number of consumers in the consumer group. In `docker-compose.yml` you can see that the container consuming the `ingest-events` topic with the `ingest-consumer` consumer group is the `events-consumer` container. Rather than modifying `docker-compose.yml` directly, create a new file called `docker-compose.override.yml` and add the following:
+   ```yaml
+   services:
+     events-consumer:
+       deploy:
+         replicas: 3
+   ```
+   This increases the number of consumers in the `ingest-consumer` consumer group to 3.
+4. Finally, refresh the `events-consumer` container by running:
+   ```bash
+   docker compose up -d --wait events-consumer
+   ```
+5. Observe the logs of `events-consumer`; you should not see any consumer errors. Let it run for a while (usually a few minutes up to a few hours) and observe the Kafka topic lag.
+
+<Alert level="info" title="Tip">
+The definition of "normal lag" varies depending on your system resources. On a small instance you can expect a normal lag of around hundreds of messages; on a large instance, around thousands of messages.
+</Alert>
+

 ## Reducing disk usage

 If you want to reduce the disk space used by Kafka, you'll need to carefully calculate how much data you are ingesting, how much data loss you can tolerate and then follow the recommendations on [this awesome StackOverflow post](https://stackoverflow.com/a/52970982/90297) or [this post on our community forum](https://forum.sentry.io/t/sentry-disk-cleanup-kafka/11337/2?u=byk).
