Streams: Losing messages during fail-over #9458

spjoe · 2023-09-19T11:42:14Z

spjoe
Sep 19, 2023

Describe the bug

We noticed that messages are lost when we run fail-over tests against streams.

Note: The same fail-over test with quorum queues loses no messages.

Reproduction steps

ps-rabbitmq-stream-client.zip
A small client application was written to reproduce this issue. The application consists of a consumer and a producer. The producer sends messages in the following format:

sequenceNumber timestamp randomPayload

The consumer application uses the timestamp to calculate a latency. The sequence number is used to identify gaps and duplication of messages.
The producer prints every confirmed message to stdout.
For every message it receives, the consumer prints the following:

sequenceNumber latency randomPayload
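
The gap and duplicate check described above can be sketched as follows. This is a minimal illustration, not code from the attached client: `SequenceChecker` and `Report` are hypothetical names, and the only assumption about the log is that the first whitespace-separated field of each line is the sequence number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper (not part of the attached client): checks logged sequence numbers. */
final class SequenceChecker {

    /** Sequence numbers that never appeared, and ones that appeared more than once. */
    static final class Report {
        final List<Long> missing = new ArrayList<>();
        final List<Long> duplicates = new ArrayList<>();
    }

    static Report check(List<String> logLines) {
        Report report = new Report();
        TreeSet<Long> seen = new TreeSet<>();
        for (String line : logLines) {
            // The first whitespace-separated field is the sequence number.
            String first = line.trim().split("\\s+", 2)[0];
            long seq;
            try {
                seq = Long.parseLong(first);
            } catch (NumberFormatException e) {
                continue; // skip blank lines and markers such as "...."
            }
            if (!seen.add(seq)) {
                report.duplicates.add(seq);
            }
        }
        if (seen.isEmpty()) {
            return report;
        }
        // Every number between the smallest and largest seen value should be present.
        for (long s = seen.first(); s < seen.last(); s++) {
            if (!seen.contains(s)) {
                report.missing.add(s);
            }
        }
        return report;
    }
}
```

Because the check is set-based, late or reordered deliveries are not reported as gaps. It should be run on the complete consumer.log; on an abbreviated excerpt, elided lines would show up as missing.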

Set up a 3-node RabbitMQ cluster using the Operator Helm chart 3.4.2 (https://artifacthub.io/packages/helm/bitnami/rabbitmq-cluster-operator/3.4.2). In this example the cluster runs in the namespace message-broker.
The RabbitMQ cluster runs on an OpenShift cluster, and its configuration is similar to the "Production Ready" example defined at https://github.com/rabbitmq/cluster-operator/blob/main/docs/examples/production-ready/rabbitmq.yaml
Enable the rabbitmq_stream plugin, and set heartbeat to 10 and net_ticktime to 10.
Expose the RabbitMQ management UI port so that we can see which RabbitMQ node is the leader of any given stream.
Start one pod with a producer:

java -cp ps-rabbitmq-stream-client.jar com.frequentis.ps.spike.Producer --name testStream --rate 100 --uri rabbitmq-stream://userinfo@message-broker-server-0.message-broker-nodes.message-broker:5552/%2f --uri rabbitmq-stream://userinfo@message-broker-server-1.message-broker-nodes.message-broker:5552/%2f --uri rabbitmq-stream://userinfo@message-broker-server-2.message-broker-nodes.message-broker:5552/%2f | tee producer.log

Start another pod with a consumer:

java -cp ps-rabbitmq-stream-client.jar com.frequentis.ps.spike.Consumer --name testStream --uri rabbitmq-stream://userinfo@message-broker-server-0.message-broker-nodes.message-broker:5552/%2f --uri rabbitmq-stream://userinfo@message-broker-server-1.message-broker-nodes.message-broker:5552/%2f --uri rabbitmq-stream://userinfo@message-broker-server-2.message-broker-nodes.message-broker:5552/%2f | tee consumer.log

Use the RabbitMQ management UI to identify which pod is the leader of testStream.
Delete the leader pod, or kill the k8s node where the leader pod is running.
Observe a gap in the sequence numbers in the consumer output.

An example output we have on the consumer side is the following:

5197 2 zTlM6yAid9AJbTdWPIU3CKtjw3RHJ3hWTXDto2LKCVrmktKjgd3fFDg13MgUjWpyrDl4kr8rFM2SKNq7Qz3BkgjBIGEwhIe7cJhd
5198 2 k37e1Fq0HdsSwtXL5pkhhHJh8zRGk4SDVMf0PQPH7QykWhl2RDvgEw3IMjZESg9cJEyDMKIkeGuyTCUQs2ymbaqbhI7C5uEYT6bM
5199 40961 kb96NArpag8eeEAdazZ5mWNPynGvmt8G07gXDFj5O8q3Vei4EjbNE8CD3M7krvGLtbFfXPERbpDcim9MTrZy3UNLy9mBDR0gUYZR
5200 40952 Ur38X2FwmyHlFPImrq5sDoJpyJp1SGJ8Zzu4T1JuC4LF9BG2Gh3Q6RYkmnLC5uYOhO9fAZqWbIYxPsoreUDn2Q0xMkWg8hw0b2i8
....
5997 32998 6iPXxv9qEHkY8mBzXJUFTfZJoqfJYDaoaQTM3fcaouWviRyvpxedXbWbSswxmSTb1gdrZYvRovu1HJJ1xoNJJcy7oYi2SwuYXL1E
5998 32988 g7wfYo0cPmOrT5j3OBWqhOZUftjCId2Ck881SoETbp2GsphZt4TGvGLp0Zy19DaIuP8AFU5tbR3jyszQklWwL5S8c1msymEyiZwN
9296 7 8KlJnhFKNUcQXP3BsnLE7Pr3MJocWbHRpbLNqmkZc2W3MklxHMjaI9EiN49JET2aVToyufUkmjwDFhnGgENrmasbQjZNmovwZ67w
9297 1 0aPNYFq6R0QLXCbATqEj1Owq4YSUUp1oUlfsvmvZIcFSXP2b3VxbhzIsI2uL53JiLdeNQmzOqNaS3jNgpEx80T7obVZbqzg53V2f

This example shows that messages 5199 through 5998 first arrive with a delay of up to ~40 seconds, and then there is a gap: sequence numbers 5999 through 9295 (3297 messages) never arrive. With the given publish rate of 100 per second, these are around 33 seconds worth of messages lost.
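
The arithmetic behind that estimate can be checked with a short sketch. `GapEstimate` and its method names are hypothetical; the inputs are the sequence numbers on either side of the jump in the consumer output (5998 and 9296) and the producer's `--rate 100`. The two sequence numbers differ by 3298, which leaves 3297 sequence numbers strictly between them.

```java
/** Hypothetical helper: sizes a gap in the sequence numbers and converts it to time. */
final class GapEstimate {

    /** Count of sequence numbers strictly between the last one seen before the gap and the first one after it. */
    static long missingCount(long lastBeforeGap, long firstAfterGap) {
        return firstAfterGap - lastBeforeGap - 1;
    }

    /** How many seconds of publishing the given number of messages represents at a fixed rate. */
    static double secondsOfTraffic(long messages, double ratePerSecond) {
        return messages / ratePerSecond;
    }

    public static void main(String[] args) {
        long missing = missingCount(5998, 9296);
        // 3297 missing messages at 100 msg/s is just under 33 seconds.
        System.out.println(missing + " messages, " + secondsOfTraffic(missing, 100.0) + " s");
    }
}
```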

The corresponding producer output:

5197 1694693591860 zTlM6yAid9AJbTdWPIU3CKtjw3RHJ3hWTXDto2LKCVrmktKjgd3fFDg13MgUjWpyrDl4kr8rFM2SKNq7Qz3BkgjBIGEwhIe7cJhd
5198 1694693591870 k37e1Fq0HdsSwtXL5pkhhHJh8zRGk4SDVMf0PQPH7QykWhl2RDvgEw3IMjZESg9cJEyDMKIkeGuyTCUQs2ymbaqbhI7C5uEYT6bM
5999 1694693599881 dNB0yCcoRLKmx74FLcU0RMwx9KURGFsv7qKzy57y3JJBXESoqu4mULg7vrgRLndMpE5wfjwvBEHxHDlDhTR6hE7L37Ksr8t8Q1EF
6000 1694693599890 8X0loGIwERMpX6sB56F5I8bTQLCm0GD5y7D3LJk199Z2cw81zIaTLvZjvtQ9TIcrRiXC0NUUSkhJQZKWPBNDx1NDWXXQQhJokYQE
....
9294 1694693632831 rkDfbQuWaZd0f5tGYqLXnI3ONxW0KIZ0pwCblsMKui02VDCkf6ZSexf1ry0Ue57fBz5YR0u6WjKVSQNOuIrBR1ujeX9f2GWgm6ee
9295 1694693632841 gRGSCscu0T9eyIV2OEW48aZO0BOQ7wNyEjad6WR25KQqVA6Jd8Ho5kbuPEtqX6GclttPNVgdQtnhQJ7PiKIEZxVd27OhUM7yXI5Q
5199 1694693591881 kb96NArpag8eeEAdazZ5mWNPynGvmt8G07gXDFj5O8q3Vei4EjbNE8CD3M7krvGLtbFfXPERbpDcim9MTrZy3UNLy9mBDR0gUYZR
5200 1694693591891 Ur38X2FwmyHlFPImrq5sDoJpyJp1SGJ8Zzu4T1JuC4LF9BG2Gh3Q6RYkmnLC5uYOhO9fAZqWbIYxPsoreUDn2Q0xMkWg8hw0b2i8

This shows that all messages are confirmed but the confirms are out of order.
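
One way to spot such reordering mechanically is to flag every sequence number that is logged after a higher one has already appeared. The sketch below is illustrative only; `ConfirmOrderChecker` is a hypothetical name, and it works on whatever the producer logged, without knowing whether each entry was actually confirmed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds entries in the producer log that appear out of order. */
final class ConfirmOrderChecker {

    /** Returns the sequence numbers logged after a higher sequence number had already been logged. */
    static List<Long> lateEntries(List<Long> loggedSequence) {
        List<Long> late = new ArrayList<>();
        long highest = Long.MIN_VALUE;
        for (long seq : loggedSequence) {
            if (seq < highest) {
                late.add(seq); // arrived after a newer message was already logged
            } else {
                highest = seq;
            }
        }
        return late;
    }
}
```

Applied to the sequence numbers visible in the excerpt above (5197, 5198, 5999, 6000, 9294, 9295, 5199, 5200), it flags 5199 and 5200.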

Expected behavior

No message is lost.

Additional context

No response

acogoluegnes · 2023-09-19T11:59:15Z

acogoluegnes
Sep 19, 2023
Maintainer

@spjoe Can you provide the code to reproduce the issue as Maven or Gradle project? Thanks.


kjnilsson · 2023-09-19T12:00:23Z

kjnilsson
Sep 19, 2023
Maintainer

And full broker logs at debug level from all nodes. Cheers.


michaelklishin · 2023-09-19T13:12:30Z

michaelklishin
Sep 19, 2023
Maintainer

Converting to a discussion because we don't have all the information needed to investigate a report like this.


michaelklishin · 2023-09-19T13:15:37Z

michaelklishin
Sep 19, 2023
Maintainer

> these are around 33 seconds worth of messages lost

These messages may or may not have made it to any of the nodes. We have seen node failure detection on Kubernetes taking 30-60 seconds, so this could be something very similar to #9209 but in a different place.


acogoluegnes · 2023-09-20T13:52:25Z

acogoluegnes
Sep 20, 2023
Maintainer

@spjoe I noticed the publisher code does not check the confirmation status. The client library calls the confirmation callback after 30 seconds if a message has not been confirmed. The application can choose to republish the message in such a case.

You should either print the payload only if the message is actually confirmed, or set the confirm timeout to something much longer. You can set it to Duration.ZERO; this way the producer will never fail messages and will keep retrying to publish them. See the "Why setting confirmTimeout to 0 when using deduplication?" block below the "Setting the Name of a Producer" section of the client documentation (this is not related to deduplication, though).
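
The first option, printing only what is actually confirmed, could look roughly like the sketch below. It deliberately uses stand-in types rather than the stream client's API: `ConfirmAwareLogger` and `Status` are hypothetical, and `Status.confirmed` stands for whatever the client's confirmation callback reports (in the Java stream client this is assumed to be the status object's `isConfirmed()` method).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: log a message only when its confirmation status says it was confirmed. */
final class ConfirmAwareLogger {

    /** Stand-in for the status object passed to a confirmation callback (not the client's real type). */
    static final class Status {
        final String payload;
        final boolean confirmed;

        Status(String payload, boolean confirmed) {
            this.payload = payload;
            this.confirmed = confirmed;
        }
    }

    final List<String> printed = new ArrayList<>();
    final List<String> toRepublish = new ArrayList<>();

    /** Callback body: print confirmed messages, keep the others for another publish attempt. */
    void onConfirmation(Status status) {
        if (status.confirmed) {
            printed.add(status.payload);
            System.out.println(status.payload);
        } else {
            // For example, the callback fired because the confirm timeout elapsed.
            toRepublish.add(status.payload);
        }
    }
}
```

With this shape, a message whose callback fires only because of the timeout never reaches producer.log as if it were confirmed, and the application can decide whether to republish it.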

0 replies

acogoluegnes · 2023-09-20T13:56:32Z

acogoluegnes
Sep 20, 2023
Maintainer

BTW, 100 KB is a rather small value for stream segments; 10 MB would be more reasonable for a stream with a 100 MB max length.
