Streams running with RAM Disk (tmpfs) #8948
-
Hi, we are very keen on using RMQ Streams and like all of its features as well as its performance, but we encountered some hiccups during host failures of the underlying storage system that apparently affect RMQ Stream latency drastically (15-20 seconds of delay). Thank you
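A rough way to observe this from the client side is to publish with publisher confirms and time each confirm. Below is a minimal sketch using pika; the broker address, credentials and the stream name `test-stream` are assumptions, not details of our actual setup.

```python
# Minimal sketch: time publisher confirms against a stream queue.
# Storage stalls on the broker show up here as slow confirms.
import time

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Streams are declared as durable queues of type "stream".
channel.queue_declare(
    queue="test-stream",
    durable=True,
    arguments={"x-queue-type": "stream"},
)

# With confirms enabled, basic_publish blocks until the broker confirms.
channel.confirm_delivery()

for i in range(1000):
    start = time.monotonic()
    channel.basic_publish(
        exchange="",
        routing_key="test-stream",
        body=f"message-{i}".encode(),
        properties=pika.BasicProperties(delivery_mode=2),
    )
    elapsed = time.monotonic() - start
    if elapsed > 1.0:
        print(f"confirm for message {i} took {elapsed:.1f}s")

connection.close()
```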
-
The underlying Raft library is Ra (https://github.com/rabbitmq/ra), which persists its data to disk. Using a RAM disk is a good option for an "in-memory" Streams solution.
-
RabbitMQ 4.0 will remove in-memory-only entities because, in our 16+ years of experience, they cause way more problems than they solve.
-
Thanks for the answers and the link to the Ra library. So, if I got that right, when a client writes a message to an RMQ Stream, every node executes the following:
So let's say the disk is unresponsive for ~5 seconds or more; I would expect the following behavior:
If this only happens on one out of three nodes, it should not affect the writing client, right? And a side question: are there any limits for this cache? Say the disk is unresponsive for ~30 seconds during peak load (worst case?)
-
Thanks again for your replies, I think I got some more insights. To answer your cache question: it's documented here: https://github.com/rabbitmq/ra/blob/main/docs/internals/INTERNALS.md#wals-ets-tables
To give you some more details about our environment (I should have added that in the first place): we will conduct some more failure tests with vSAN and RAM disks, and I will come back with the results if you are interested.
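For those failure tests, one simple post-mortem check (just a sketch, not something we have run yet) is to publish sequence-numbered messages during the fault injection and afterwards read the whole stream back from the first offset, looking for gaps. The stream name `failure-test-stream` and the plain-integer message bodies below are assumptions.

```python
# Minimal sketch: after a failure test, replay the stream from its first
# offset and check the published sequence numbers for gaps.
import pika

EXPECTED = 10_000  # how many sequence-numbered messages were published

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Stream consumers require a consumer prefetch (QoS) setting.
channel.basic_qos(prefetch_count=100)

seen = set()
# "x-stream-offset": "first" starts reading at the beginning of the stream.
for method, properties, body in channel.consume(
    "failure-test-stream",
    arguments={"x-stream-offset": "first"},
    inactivity_timeout=10,
):
    if method is None:  # no message for 10s: assume we reached the end
        break
    seen.add(int(body))
    channel.basic_ack(delivery_tag=method.delivery_tag)

missing = sorted(set(range(EXPECTED)) - seen)
print(f"received {len(seen)} of {EXPECTED}; first missing: {missing[:20]}")

connection.close()
```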
-
Modern RabbitMQ 3.x features, most notably quorum queues and streams, and as of RabbitMQ 4.0, virtually every subsystem, are not designed with transient storage in mind.
Data safety features of modern queue and stream types, node restarts, the upgrade process and tooling: none of these are designed to tolerate a node suddenly losing all of its data and all prior knowledge about the rest of the cluster.
Upgrades and even node restarts will fail with RAM-based storage (at some point whatever provides the RAM-backed filesystem volume also has to be restarted, right?).
Both the Cluster Formation and Clustering documentation guides describe a scenario where a node is reset or restored after failure, and…