-
Exchange and queue names must be no longer than 255 bytes per the spec. I don't see how they would be connected to publisher confirms. Publisher confirms are first sent by queues, then propagated by channels. Classic mirrored queues are deprecated and have been for some time. I don't think our team would be interested in spending a lot of time on CMQs beyond things that affect upgrades, such as #5931.
-
Thank you for the reproduction steps. I think we will see if it can be reproduced with the upcoming
-
Yes, I'm well aware of this fact. If you look at my repro code, you'll see that the exchange names that are long enough to cause this are nowhere near this length.
The problem with that is that if you have existing systems that are already mirrored classic queues, you need a way to migrate a production system to quorum queues. For example, you have to follow a multi-step process like what we've documented here. Our customer was in the process of testing this migration process when they hit this problem, so it's hard to see how you can just dismiss this as something not worth fixing. We have to be able to provide them a way to migrate existing production systems without data loss and full system downtime.
-
Classic mirrored queues haven’t seen any investment in some good three years (unlike classic unmirrored queues, which have been worked on non-stop). We have developed two replicated alternatives that are superior in every way. CMQs are scheduled to be removed entirely in 4.0 next year; if that’s not a strong enough message to avoid CMQs, I don’t know what is. If non-mirrored classic queues do not exhibit the same behavior, disabling mirroring before upgrading sounds like a reasonable workaround. With policies it can be done at any moment. Our current hypothesis is that this has the same root cause as #5931. When 3.10.9 ships in the next 48 hours we will have a way to test it thanks to your efforts.
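For illustration, here is a minimal sketch of how removing a mirroring policy might look through the management HTTP API, which exposes policies under `/api/policies/{vhost}/{name}`. The policy name `ha-all`, the `guest` credentials, and the `localhost:15672` address are assumptions for the example, not values taken from this thread. The helper name is hypothetical, and the code only builds the request without sending it.

```python
import base64
from urllib.parse import quote
from urllib.request import Request


def build_delete_policy_request(base_url, vhost, policy, user, password):
    """Build (but do not send) a DELETE request for a RabbitMQ policy.

    The vhost and policy name are percent-encoded, so the default
    vhost "/" becomes "%2F" in the URL path.
    """
    url = f"{base_url}/api/policies/{quote(vhost, safe='')}/{quote(policy, safe='')}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(url, method="DELETE", headers={"Authorization": f"Basic {token}"})


# Hypothetical values: adjust host, vhost, policy name, and credentials.
req = build_delete_policy_request(
    "http://localhost:15672", "/", "ha-all", "guest", "guest"
)
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen` would actually issue the call. Note that deleting a policy removes every key it defines, so if the policy carries settings other than mirroring, re-declaring it without the `ha-*` keys would be the safer route.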
-
Looking at the issues and PRs for #5931, it looks like that might require 3.10.8 or newer? If so, then I don't think it's related, because this problem can trigger on 3.10.0 or newer. I would also expect to see some sort of error in the server logs when the crash in #5931 happens, but when running my repro, I do not see any crashes/errors reported.
-
I'm not sure that asking someone to turn off all HA in their production environment is going to be a viable approach. I understand classic queue mirroring is on its way out, but we've needed the features introduced in 3.10 to be able to start supporting quorum queues in NServiceBus, and now that we do support them, we have customers hitting bugs like this during the migration process. I also understand not wanting to add new features to them, but bugs that got introduced in recent versions should still be fixed, especially when they block migrating to quorum queues in the first place.
-
We have a candidate fix in #6126. @bording what kind of package would you need to test that PR? Would an OCI image suffice or should we produce a Windows installer? Let us know.
-
It appears that a bug was introduced in 3.10.0 that causes the broker to not send publisher confirms when messages are published to exchanges bound to mirrored classic queues and the names of the exchanges involved exceed some (unknown) length.
This problem was reported to us by a customer who ran into it while testing upgrading to 3.10.7 in their environment. I've been able to create a repro that just uses the .NET RabbitMQ client, without NServiceBus being involved:
https://github.com/bording/exchange-repro
I have included a docker compose file that creates a 3-node cluster and sets up a classic queue mirroring policy.
When the repro is run, it creates the necessary exchanges, queues, and bindings, fills the `receiver` queue with 10,000 messages, and then consumes those messages. For each message consumed, it attempts to publish some messages that get routed through the exchanges, and then waits for the publisher confirms.
Very quickly, you'll observe that the consumer stops processing messages, and you'll see that the channel used to publish the messages has unconfirmed messages on it, for which the broker never sends confirms.
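To make the symptom concrete, here is a minimal, broker-free sketch of the bookkeeping a client does for publisher confirms: each publish on a confirm-enabled channel gets the next sequence number (starting at 1), and a broker ack clears either that one number or, with the `multiple` flag set, every number up to and including it. The `ConfirmTracker` class is hypothetical and is not taken from the repro or the .NET client; it only illustrates what "unconfirmed messages on the channel" means.

```python
class ConfirmTracker:
    """Tracks publish sequence numbers that still await a broker confirm."""

    def __init__(self):
        self.next_seq = 1          # confirm sequence numbers start at 1
        self.outstanding = set()   # published but not yet confirmed

    def on_publish(self):
        """Record a publish and return its sequence number."""
        seq = self.next_seq
        self.outstanding.add(seq)
        self.next_seq += 1
        return seq

    def on_ack(self, delivery_tag, multiple=False):
        """Handle a broker ack; `multiple` confirms everything <= delivery_tag."""
        if multiple:
            self.outstanding = {s for s in self.outstanding if s > delivery_tag}
        else:
            self.outstanding.discard(delivery_tag)


tracker = ConfirmTracker()
for _ in range(5):
    tracker.on_publish()

# The broker confirms messages 1 through 3 in one ack.
tracker.on_ack(3, multiple=True)
print(sorted(tracker.outstanding))  # [4, 5]
```

In the failure described here, the equivalent of `on_ack` is never invoked for some publishes, so the outstanding set never drains and anything waiting on it blocks.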
It's not clear to me what specifically is required to trigger the problem, but here are the factors I do know about:
1. Length of the names of the exchanges
If I shorten the names of the exchanges being used, then the problem is less likely to occur.
2. Mirrored classic queues
If the queues aren't mirrored, then this also doesn't seem to be a problem, regardless of exchange name length. It also doesn't trigger if quorum queues are used instead of classic queues. You can uncomment the line in the repro to observe this.
3. 3.10.0 or newer
When the docker compose file is changed to use 3.9.22 or older, the problem does not occur.