Osiris can't start replica with certain configurations #5747
-
We managed to reproduce this issue on the latest RabbitMQ (3.11.2, 3.11.3, and master). On top of that, we had similar issues with deleting and re-adding a replica on a high-throughput stream without the extreme segment size. This was part of a routine maintenance test that failed with the exact same error and symptoms as your gist, so I will not repeat them here. After several detours trying to patch our way out of this problem, we narrowed down the root cause to the retention frequency and how long it takes the reader to read all the information it needs from the segment.

Here is my theory of what the problem is. Your reproduction sets the stream length to exactly one segment (1234 bytes). This means that when the segment fills up, it is deleted. If all the replicas are up to date at this point, there is no crash. But if one replica is lagging behind, the segment gets deleted and the next chunk transfer fails with the same error as in your gist.

As I mentioned, we arrived at the same problem by deleting a replica from a high-throughput stream: when we re-added it, it could not catch up, because it starts reading from position 0 in the first (i.e. oldest) segment. By the time the reader is initialised, the segment has been rotated out, so it goes into the same death spiral.

I tried to tune the segment file size (with more realistic sizes, in megabytes) to the point where the time it takes to sync the data from the oldest segment is much smaller than the time it takes to fill up a segment. In my testing, if you rotate about 10 times a second or faster, the issue comes up naturally whenever one follower lags behind, without any further action. If each segment takes (say) 3 seconds to fill, it is almost impossible to reproduce the problem on a local setup: the data transfer rate is much higher and the reader catches up easily. I was able to reproduce the same problem with the retention value as well.

One could argue that very frequent retention is not a good design, and I tend to agree. But the edge case can happen with any kind of ("sane") retention if a new replica is added to the stream and the sync takes longer than the time until the next retention run. E.g. with the default segment size of 500MB, if the sync starts moments before the segment is deleted, the replica will crash into a death spiral. The same goes for a new stream consumer that attaches to the beginning of a stream that is about to hit retention. This happened to us with a stream limited to 100GB, with the default segment size, that filled at about 100MB/s.

It would be nice to add some kind of lock on segments when a reader is about to attach. What do you think @michaelklishin and @mkuratczyk? (Special thanks to @luos for the investigation.)
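To make the race concrete, here is a minimal simulation sketch (plain Python, not Osiris code; the function name, attempt count, and timings are all illustrative assumptions): the replica repeatedly starts reading the oldest segment from offset 0, and if retention rotates that segment out before the read completes, the read fails and the replica restarts from the new oldest segment.

```python
# Illustrative sketch only (not Osiris code): the race between segment
# rotation/deletion and a lagging replica's catch-up read.

def replica_catches_up(rotation_interval_s: float,
                       segment_sync_time_s: float,
                       max_attempts: int = 10) -> bool:
    """Each attempt, the replica reads the oldest segment from offset 0.
    If the read takes longer than the time until that segment is rotated
    out, it fails (offset out of range) and restarts from the new oldest
    segment -- the death spiral described above."""
    for _ in range(max_attempts):
        if segment_sync_time_s < rotation_interval_s:
            return True  # the read outruns retention; the replica catches up
        # segment deleted mid-read: crash, retry from the new oldest segment
    return False

# Rotating ~10x/second while a segment takes 0.5 s to sync never converges:
print(replica_catches_up(rotation_interval_s=0.1, segment_sync_time_s=0.5))  # False
# Filling a segment only every 3 s leaves plenty of headroom:
print(replica_catches_up(rotation_interval_s=3.0, segment_sync_time_s=0.5))  # True
```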
-
I don't think such retention intervals are practical. An enforced minimum of, say, 30 or 60 seconds is what I'd do instead of introducing locks on segment operations.
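A minimal sketch of what such an enforced minimum could look like (hypothetical Python, not RabbitMQ's actual validation code; the 30-second floor is just the lower of the two values suggested above):

```python
# Hypothetical sketch: clamp the retention interval to an enforced floor
# instead of locking segments during reads. Not actual RabbitMQ code.
MIN_RETENTION_INTERVAL_S = 30  # suggested enforced minimum (30 or 60 s)

def effective_retention_interval(requested_s: float) -> float:
    """Quietly raise sub-minimum retention intervals to the floor."""
    return max(requested_s, MIN_RETENTION_INTERVAL_S)

assert effective_retention_interval(1) == 30     # a 1 s request is clamped
assert effective_retention_interval(120) == 120  # sane values pass through
```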
-
@johanrhodin @olikasg future RabbitMQ versions will quietly default to a 30 or 60s retention period if a lower value is used. Single-digit-second retention is impractical, and instead of introducing locking on a sensitive code path we'd rather solve this with validation or an enforced minimum.
-
The main thing here, and this is something we completely missed, is to not allow extremely small segment sizes. There are certain bits of data that need to fit into the segment in addition to user data (messages). For example, when using AMQP legacy (as Johan is above), we always use stream publisher deduplication, which means there is already tracking data that needs to be prepended to every new segment.

Whenever a new segment is opened, all tracking data (writer deduplication and reader offset tracking) is snapshotted into the first chunk of the segment. This allows us to recover the tracking data by scanning only the first segment. Each tracking record has a max size of ~260 bytes. By default we only write 255 writer deduplication records into the snapshot. Reader offset tracking is unbounded and is only cleaned up when reader records (that aren't explicitly deleted) are staler than the first offset in the stream. But let's say there are rarely more than 255 readers that store offsets as well. This means we'd need 133KB+ ((255 + 255) records × ~260 bytes) just to store a worst-case snapshot. That is before any message data or additional tracking data is stored at all!

So, to leave plenty of margin, I'd say 4MB is the absolute minimum viable segment size. For me even that feels too small, but see if you can still create the issue with a 4MB segment size and we'll ship a change that ensures this is the minimum effective size.
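The worst-case arithmetic above, as a quick sketch (the constants come from the comment itself, not from the Osiris source):

```python
# Worked sketch of the worst-case tracking-snapshot size described above.
TRACKING_RECORD_MAX_BYTES = 260  # ~max size of one tracking record
WRITER_DEDUP_RECORDS = 255       # default writer dedup records per snapshot
READER_OFFSET_RECORDS = 255      # "rarely more than 255" readers assumed

snapshot_bytes = (WRITER_DEDUP_RECORDS + READER_OFFSET_RECORDS) * TRACKING_RECORD_MAX_BYTES
print(f"~{snapshot_bytes / 1000:.0f} KB")  # ~133 KB before any message data

# A 1234-byte segment cannot even hold this snapshot chunk; the proposed
# 4 MB minimum leaves plenty of margin above it.
assert 1234 < snapshot_bytes < 4 * 1024 * 1024
```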
-
RabbitMQ 3.10.7, Erlang 25.0.0.4, 3-node cluster.

Create a stream `s1` with the arguments `x-max-length-bytes` = `1234` and `x-stream-max-segment-size-bytes` = `1234` (a declaration sketch follows after these steps).
Publish some messages with PerfTest:

```
bin/runjava com.rabbitmq.perf.PerfTest -y0 -p -u "s1" -s 1000 --id "streamish" -ad false -f persistent -h $TARGET -pmessages 10000
```
The following log is generated, and Osiris keeps crashing with an `offset_out_of_range` exception on all three nodes: https://gist.github.com/johanrhodin/2115dabd39de8b650b8b0beef879b371
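For reference, a minimal sketch of declaring a stream with these arguments (assuming the Python `pika` client and a broker on localhost; the original report did not specify how the stream was created):

```python
# Sketch: declare a stream with the extreme arguments from the repro.
# Assumes the Python pika client and a broker on localhost.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(
    queue="s1",
    durable=True,  # streams must be declared durable
    arguments={
        "x-queue-type": "stream",
        "x-max-length-bytes": 1234,               # retention: ~one segment
        "x-stream-max-segment-size-bytes": 1234,  # extreme segment size
    },
)
conn.close()
```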