Connections terminate after running into a file_handle_cache exception (which terminates them) #8776
-
Describe the bug: Several times, when trying to open a WebSocket connection over the Web MQTT plugin, I get this error (the authentication backend used is rabbitmq_auth_backend_http, with rabbitmq_auth_backend_cache active). After some of these errors I need to restart RabbitMQ because of 500 errors.
Reproduction steps:
Expected behavior: The connection should not crash.
Additional context: The JS library used is Paho.
-
What runs into an exception is a function that maintains file handle metrics. It tries to update a counter that does not exist. So this is not something MQTT-specific: any connection to this node will fail. What version of RabbitMQ is used? Can you share full logs? There may be an error logged earlier that would provide a clue as to what the root cause is.
-
Also, I'm not sure how steps 1 and 3 are different. Can you please share an executable way to reproduce (a public repository on GitHub, or at least an archive with some code)? There is an example WebSockets-over-MQTT page shipped with the examples plugin. Does said example connect successfully?
-
Also, can this be RabbitMQ 3.11.x running on Erlang 26? That is not a supported combination and the symptoms are very similar: all client connections run into exceptions that do not seem related.
-
I confirmed that http://localhost:15670/web-mqtt-examples/bunny.html works as expected on
-
I am sending the log of the other cluster node. That node is not accessible via the web management UI.
-
According to the logs on
and then a warning (could be entirely unrelated) when the node is asked to shut down:
This churn happens for a couple of days and then all connections start failing with
My best guess from just these logs is that the file handle cache somehow does not deal well with constant high connection churn, in particular the kind of churn where clients do not cleanly close connections. Note the average connection lifespan:
-
On node
-
@fernandomacho I do not understand what may be happening with the metric update. That code path can be made more defensive so that it does not throw (the metric is non-essential), but my hypothesis is that it is caused by excessive connection churn from your clients. Please eliminate the churn and use long-lived connections or this proxy. When/if I find a way to inspect the relevant metrics store, I will share a
-
@fernandomacho can you please run
and share the output every hour or so when the cluster accepts connections, and once or twice (say, every few minutes) when it does not?
-
I have had to reboot two cluster nodes. The output of the command you asked for, on the node that I have not had to restart (innova2), is: and on the two restarted nodes, ovh-innova: and innova3:
-
seems to be the most relevant. I am not sure if this is due to node termination; likely not.
-
OK, I will set the log level to info and try to reproduce.
-
So yeah, I have enough evidence that this scenario can only be triggered by high connection churn. Here is what happens:
So far so good. Now, concurrently with that, another connection is opened and
In other words, for this scenario to happen you need one very short-lived connection and another very short-lived connection to get "assigned" the same Erlang process ("green thread") ID; then two independent metric table updates can step over one another. Getting rid of high connection churn should help. Thank you for providing the logs, @fernandomacho.
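To make the failure mode concrete, here is a minimal Erlang sketch, assuming an ETS-backed per-connection counter keyed by pid; the module and table names are made up for illustration and this is not the actual file_handle_cache code:

```erlang
-module(fhc_counter_sketch).
-export([demo/0]).

%% Per-connection metrics live in an ETS row keyed by the connection's pid.
%% If a second, short-lived connection ends up with a reused pid, its cleanup
%% can delete the row while the first connection still expects it to exist.
demo() ->
    T = ets:new(conn_metrics, [set, public]),
    Key = self(),
    true = ets:insert(T, {Key, 0}),
    %% Normal case: the row exists, so the counter update succeeds.
    1 = ets:update_counter(T, Key, {2, 1}),
    %% Simulated race: the "other" connection with the same pid removes the row.
    true = ets:delete(T, Key),
    %% The next unguarded update throws badarg and crashes the calling process,
    %% which is what terminates the connection in the reported scenario.
    ets:update_counter(T, Key, {2, 1}).
```

In the real system the delete and the update come from two different connections rather than one function, but the effect on the unguarded ets:update_counter/3 call is the same.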
-
Do you think it could be a problem related to migrating queues to quorum queues, rather than to using WebSockets?
-
@lhoguin another thing to investigate would be this: assuming that the FHC process fails and is restarted, what would its peak restart rate be? With connection churn above a certain level its
Alternatively we can consider dropping
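For context, and purely as a generic OTP illustration (not RabbitMQ's actual supervision tree; all names here are made up), restart intensity is what bounds that peak restart rate: a supervisor tolerates only a limited number of child restarts within a time window before it gives up itself:

```erlang
-module(restart_rate_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1, start_worker/0]).

%% Hypothetical stand-in for the FHC process: a worker that just waits.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% An OTP supervisor tolerates at most `intensity` child restarts within
%% `period` seconds; if a child crashed faster than that under heavy churn,
%% the supervisor itself would exit and the failure would propagate upward.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Child = #{id => fhc_like_worker,
              start => {?MODULE, start_worker, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.
```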
-
In the meantime, here is what I have: #8790.
-
I will publish a GitHub release shortly; it will include a .deb package that can be downloaded. In it, you can disable FHC for the Web MQTT plugin:
web_mqtt.enable_file_handle_cache = false
and all relevant FHC operations that update metric counters are now exception-safe.
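As a rough illustration of what "exception-safe" could look like here (a hypothetical helper, not the actual change shipped in #8790), the counter update can be wrapped so that a missing row is skipped instead of crashing the connection process:

```erlang
-module(fhc_safe_counter_sketch).
-export([safe_update_counter/3]).

%% Hypothetical defensive wrapper: if the metrics row is already gone (for
%% example, deleted by a concurrent connection with a reused pid), swallow the
%% badarg instead of letting it take the caller down.
safe_update_counter(Table, Key, UpdateOp) ->
    try
        ets:update_counter(Table, Key, UpdateOp)
    catch
        error:badarg -> ok
    end.
```

Swallowing badarg is acceptable only because the counter is purely informational; anything load-bearing would need to repair or re-create the row instead.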
-
We are not getting any closer to having a way to reproduce, so I am out of ideas as to what the root cause may be and what else would help. Without a way to describe (e.g. with a traffic capture) the workload and relevant information collected from all nodes, I conclude that this is a bag of different aspects:
-
3.12.2-beta.1 is up on GitHub. Here is a direct link to the .deb file in case you'd prefer to install it via dpkg -i (temporarily), in addition to the Cloudsmith repo for preview releases.
-
In addition to #8790
Node configuration and state
Much of this is provided by this script that VMware support and the RabbitMQ core team use,
Workload description
More importantly now, we need to understand what your clients do, so any code or a reasonably
Ideally, if it is enough to use only AMQP 0-9-1 clients, a set of PerfTest flags that would roughly simulate your
For MQTT or Web MQTT connections, there are tools from Mosquitto and EMQX that help simulate workloads.
Alternatively you can take a traffic capture with
How to send collected data privately
This information can be sensitive, e.g. virtual host names, queue and stream names, logs, etc. Feel free to send this as a single archive to the address our team uses for security disclosures and private communication.
-
Hello, I will install the version you have released first thing tomorrow morning. I will try to explain the use:
2.- I am migrating part of the API functionality to WebSockets, which is when the problem appeared. The clients (browsers), using the Paho library for JavaScript, open a WebSocket connection to RabbitMQ (beforehand there is a process that creates a username and password which is then used to connect). RabbitMQ validates the user through the rabbitmq_auth_backend_http plugin, using the authentication cache provided by the rabbitmq_auth_backend_cache plugin.
3.- Consumers of messages: these processes open the connection only once and in principle it remains open, since the message "wait" process itself prevents the socket from being closed.
Right now each server (there are three) can have between 3000 and 4000 open connections. I understand that is a lot, but it should not be a problem by far. Right now everything, both the API and the consumers, is configured to use amqproxy. I am sending you some pictures with stats. Note that despite using amqproxy the number of opened and closed connections (churn) is very similar.
The amqproxy configuration is:
[listen]
and the RabbitMQ configuration is:
mqtt.vhost = websocks
auth_backends.1 = internal
log.default.level = critical
-
@fernandomacho we have put together a workload simulation that has this connection churn rate with QQs. No issues so far. One thing that stands out to us is that on your chart, the closed connection rate is 0. Can you tell us (or share some code) about how the connections are closed? If they are never closed,
-
Hi, first of all I would like to thank you for how you have handled this problem and how you have helped me. The truth is that I was getting a bit desperate about this issue. Right now it has been running smoothly over WebSockets since this morning (in a user-controlled environment); tomorrow I'll add more users over WebSockets and give you feedback. I will now generate the files for the three servers and send them by mail to the address you gave me, along with the definitions.
-
Without any extra feedback, I will assume that #8790 did address the issue as first reported. We also have some data to dig into, although our initial attempts to reproduce using comparable churn rates failed to make the exception manifest itself.
-
Thanks!
On 13 Jul 2023, at 22:41, Michael Klishin wrote:
We will ship 3.12.2 next Monday. Thanks for confirming!