Osiris can't start replica with certain configurations #5747
-
We managed to reproduce this issue on the latest RabbitMQ (3.11.2, 3.11.3, and master). On top of that, we had similar issues with deleting and re-adding a replica on a high-throughput stream without the extreme segment size. This was part of a routine maintenance test that failed with the exact same error and symptoms as your gist, so I will not repeat them here. After several detours trying to patch our way out of this problem, we narrowed down the root cause to the retention frequency and how long it takes the reader to read all the information it needs from the segment.

Here is my theory of what the problem is. Your reproduction sets the stream length to exactly one segment (1234 bytes). This means that when the segment fills up, it is deleted. If all the replicas are up to date at this point, there is no crash. But if one replica is lagging behind, the segment gets deleted and the next chunk transfer fails with the same error as in your gist.

As I mentioned, we arrived at the same problem by deleting a replica from a high-throughput stream: when we re-added it, it could not catch up, because it starts reading from position 0 in the first (i.e. oldest) segment. By the time the reader is initialised, the segment has been rotated out, so it goes into the same death spiral.

I tried to tune the segment file size (with more realistic sizes, in megabytes) to the point where the time it takes to sync the data from the oldest segment is much smaller than the time it takes to fill up a segment. In my testing, if you rotate about 10 times a second or faster, the issue comes up naturally whenever one follower lags behind, without any further action. If each segment takes (say) 3 seconds to fill, it is almost impossible to reproduce the problem on a local setup: the data transfer rate is much higher and the reader catches up easily. I was able to reproduce the same problem with the retention value as well.

One could argue that very frequent retention is not a good design, and I tend to agree. But the edge case can happen with any kind of ("sane") retention if a new replica is added to the stream and the sync takes longer than the time until the next retention run. E.g. with the default segment size of 500MB, if the sync starts moments before the segment is deleted, the replica will crash into a death spiral. The same goes for a new stream consumer that attaches to the beginning of a stream that is about to hit retention. This happened to us with a stream limited to 100GB, with the default segment size, that filled at about 100MB/s.

It would be nice to add some kind of lock on segments when a reader is about to attach. What do you think @michaelklishin and @mkuratczyk? (Special thanks to @luos for the investigation.)
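To make the race concrete, here is a minimal simulation sketch (plain Python, not Osiris code; the function name, attempt count, and timings are all illustrative assumptions): the replica repeatedly starts reading the oldest segment from offset 0, and if retention rotates that segment out before the read completes, the read fails and the replica restarts from the new oldest segment.

```python
# Illustrative sketch only (not Osiris code): the race between segment
# rotation/deletion and a lagging replica's catch-up read.

def replica_catches_up(rotation_interval_s: float,
                       segment_sync_time_s: float,
                       max_attempts: int = 10) -> bool:
    """Each attempt, the replica reads the oldest segment from offset 0.
    If the read takes longer than the time until that segment is rotated
    out, it fails (offset out of range) and restarts from the new oldest
    segment -- the death spiral described above."""
    for _ in range(max_attempts):
        if segment_sync_time_s < rotation_interval_s:
            return True  # the read outruns retention; the replica catches up
        # segment deleted mid-read: crash, retry from the new oldest segment
    return False

# Rotating ~10x/second while a segment takes 0.5 s to sync never converges:
print(replica_catches_up(rotation_interval_s=0.1, segment_sync_time_s=0.5))  # False
# Filling a segment only every 3 s leaves plenty of headroom:
print(replica_catches_up(rotation_interval_s=3.0, segment_sync_time_s=0.5))  # True
```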
-
I don't think such retention intervals are practical. An enforced minimum of, say, 30 or 60 seconds is what I'd do instead of introducing locks on segment operations.
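A minimal sketch of what such an enforced minimum could look like (hypothetical Python, not RabbitMQ's actual validation code; the 30-second floor is just the lower of the two values suggested above):

```python
# Hypothetical sketch: clamp the retention interval to an enforced floor
# instead of locking segments during reads. Not actual RabbitMQ code.
MIN_RETENTION_INTERVAL_S = 30  # suggested enforced minimum (30 or 60 s)

def effective_retention_interval(requested_s: float) -> float:
    """Quietly raise sub-minimum retention intervals to the floor."""
    return max(requested_s, MIN_RETENTION_INTERVAL_S)

assert effective_retention_interval(1) == 30     # a 1 s request is clamped
assert effective_retention_interval(120) == 120  # sane values pass through
```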
-
@johanrhodin @olikasg future RabbitMQ versions will quietly default to a 30 or 60s retention period if a lower value is used. Single-digit-second retention is impractical, and instead of introducing locking on a sensitive code path we'd rather solve this with validation or an enforced minimum.
-
The main thing here, and this is something we completely missed, is to not allow extremely small segment sizes. There are certain bits of data that need to fit into the segment in addition to user data (messages). For example, when using AMQP legacy (as Johan is above), we always use stream publisher deduplication, which means there is already tracking data that needs to be prepended to every new segment.

Whenever a new segment is opened, all tracking data (writer deduplication and reader offset tracking) is snapshotted into the first chunk of the segment. This allows us to recover the tracking data by scanning only the first segment. Each tracking record has a max size of ~260 bytes. By default we only write 255 writer deduplication records into the snapshot. Reader offset tracking is unbounded and is only cleaned up when reader records (that aren't explicitly deleted) are staler than the first offset in the stream. But let's say there are rarely more than 255 readers that store offsets as well. This means we'd need 133KB+ ((255 + 255) records × ~260 bytes) just to store a worst-case snapshot. That is before any message data or additional tracking data is stored at all!

So, to leave plenty of margin, I'd say 4MB is the absolute minimum viable segment size. For me even that feels too small, but see if you can still create the issue with a 4MB segment size and we'll ship a change that ensures this is the minimum effective size.
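The worst-case arithmetic above, as a quick sketch (the constants come from the comment itself, not from the Osiris source):

```python
# Worked sketch of the worst-case tracking-snapshot size described above.
TRACKING_RECORD_MAX_BYTES = 260  # ~max size of one tracking record
WRITER_DEDUP_RECORDS = 255       # default writer dedup records per snapshot
READER_OFFSET_RECORDS = 255      # "rarely more than 255" readers assumed

snapshot_bytes = (WRITER_DEDUP_RECORDS + READER_OFFSET_RECORDS) * TRACKING_RECORD_MAX_BYTES
print(f"~{snapshot_bytes / 1000:.0f} KB")  # ~133 KB before any message data

# A 1234-byte segment cannot even hold this snapshot chunk; the proposed
# 4 MB minimum leaves plenty of margin above it.
assert 1234 < snapshot_bytes < 4 * 1024 * 1024
```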
-
RabbitMQ 3.10.7, Erlang 25.0.0.4, 3-node cluster.

Create a stream `s1` with the arguments `x-max-length-bytes` = `1234` and `x-stream-max-segment-size-bytes` = `1234` (a declaration sketch follows after these steps).
Publish some messages with PerfTest:

```
bin/runjava com.rabbitmq.perf.PerfTest -y0 -p -u "s1" -s 1000 --id "streamish" -ad false -f persistent -h $TARGET -pmessages 10000
```
The following log is generated, and Osiris keeps crashing with an `offset_out_of_range` exception on all three nodes: https://gist.github.com/johanrhodin/2115dabd39de8b650b8b0beef879b371
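For reference, a minimal sketch of declaring a stream with these arguments (assuming the Python `pika` client and a broker on localhost; the original report did not specify how the stream was created):

```python
# Sketch: declare a stream with the extreme arguments from the repro.
# Assumes the Python pika client and a broker on localhost.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(
    queue="s1",
    durable=True,  # streams must be declared durable
    arguments={
        "x-queue-type": "stream",
        "x-max-length-bytes": 1234,               # retention: ~one segment
        "x-stream-max-segment-size-bytes": 1234,  # extreme segment size
    },
)
conn.close()
```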