Description
Before Creating the Enhancement Request
- I have confirmed that this should be classified as an enhancement rather than a bug/feature.
Summary
Pop long-polling is not awakened for V1 retry messages, causing significant consumption delay
Motivation
Description
In Pop consumption mode, when a message is not acknowledged (ACKed) within its invisible time, it is moved to a retry topic for later consumption. The system is expected to wake up any waiting long-polling requests for the original topic so the message can be re-consumed promptly.
Currently, this wake-up mechanism works correctly for V2 retry topics (%RETRY%group+topic) because the original topic and group can be reliably parsed.
However, for V1 retry topics (%RETRY%group_topic), the broker fails to parse the original topic name from the retry topic. As a result, the notifyMessageArrivingWithRetryTopic method cannot identify and awaken the correct long-polling request. This forces the consumer's long-polling request to wait until it times out (controlled by BROKER_SUSPEND_MAX_TIME_MILLIS, typically 15 seconds), introducing a significant delay in message retries.
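The parsing problem can be illustrated with a minimal, self-contained sketch. It is not broker code: the class and the helpers `buildV1`, `buildV2`, and `candidateSplitsV1` are hypothetical and only mirror the two name formats quoted above. It also assumes that `+` cannot appear in a group or topic name while `_` can, which is what makes the V1 form ambiguous.

```java
import java.util.ArrayList;
import java.util.List;

class RetryTopicNaming {
    static final String RETRY_PREFIX = "%RETRY%";

    // V1 format as quoted in this issue: %RETRY%group_topic
    static String buildV1(String topic, String group) {
        return RETRY_PREFIX + group + "_" + topic;
    }

    // V2 format as quoted in this issue: %RETRY%group+topic
    static String buildV2(String topic, String group) {
        return RETRY_PREFIX + group + "+" + topic;
    }

    // Hypothetical helper: list every (group, topic) pair that a V1 name could stand for,
    // by splitting at each underscore that leaves both sides non-empty.
    static List<String[]> candidateSplitsV1(String retryTopic) {
        String body = retryTopic.substring(RETRY_PREFIX.length());
        List<String[]> splits = new ArrayList<>();
        for (int i = body.indexOf('_'); i >= 0; i = body.indexOf('_', i + 1)) {
            if (i > 0 && i < body.length() - 1) {
                splits.add(new String[] {body.substring(0, i), body.substring(i + 1)});
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        String v1 = buildV1("order_topic", "pay_group");
        System.out.println(v1);
        for (String[] split : candidateSplitsV1(v1)) {
            System.out.println("  group=" + split[0] + " topic=" + split[1]);
        }
    }
}
```

With underscores on both sides of the separator, the V1 name alone does not say where the group ends and the topic begins, so the original topic cannot be recovered by string parsing.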
Steps to Reproduce
- Configure and start the broker with the following settings:
  - enableRetryTopicV2 = false (to use the V1 retry topic format)
  - popConsumerKVServiceEnable = true (or popConsumerFSServiceInit = true)
- Producer: Send a batch of messages (e.g., 32 messages) to a normal topic, let's call it TopicA.
- Consumer: Use a Pop-based consumer (e.g., PushConsumer) to subscribe to TopicA.
- Simulate Failure: In the message listener, do not return CONSUME_SUCCESS for the received messages. For example, return RECONSUME_LATER or simply don't ACK them, causing them to expire and be sent to the retry topic (%RETRY%YourConsumerGroup_TopicA).
- Observe: Do not send any new messages to TopicA. Monitor the consumer logs.
Expected Behavior
When the invisible time for a message expires and it's moved to the V1 retry topic, the long-polling request waiting for messages on TopicA should be awakened immediately. The consumer should receive the retry message promptly (e.g., within 1-2 seconds after the invisible time + retry delay).
Actual Behavior
The long-polling request is not awakened by the arrival of the retry message. It remains suspended until the long-poll timeout is reached (approx. 15 seconds). Only after the timeout does the client re-initiate the poll request and finally fetch the message from the retry queue.
Log Evidence:
- Message first nack'd time: 2025-08-22 16:05:21,414
- Message revived and written to retry topic (from mqadmin topicStatus): 2025-08-22 16:05:33,497
- Consumer receives the retry message: 2025-08-22 16:05:36,385
- Observed Delay: ~15 seconds from the initial NACK, matching the long-poll timeout.
This is further confirmed by a second experiment: if new normal messages are continuously sent to TopicA, the retry messages are consumed much faster. This indicates that the arrival of new normal messages is what wakes up the long-poll, which then picks up the waiting retry messages. The retry message itself does not trigger the wake-up.
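The intervals behind the three timestamps in the log evidence can be checked with a short calculation. This is only a sketch for the arithmetic: the class name and the `millisBetween` helper are illustrative, and the timestamp pattern is assumed from the format shown in the logs.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class RetryDelayCheck {
    // Pattern assumed from the log lines above, e.g. "2025-08-22 16:05:21,414"
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Milliseconds elapsed between two log timestamps
    static long millisBetween(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, FMT), LocalDateTime.parse(to, FMT)).toMillis();
    }

    public static void main(String[] args) {
        String nack = "2025-08-22 16:05:21,414";
        String revived = "2025-08-22 16:05:33,497";
        String received = "2025-08-22 16:05:36,385";

        System.out.println("NACK -> written to retry topic: " + millisBetween(nack, revived) + " ms");
        System.out.println("written -> received by consumer: " + millisBetween(revived, received) + " ms");
        System.out.println("NACK -> received by consumer: " + millisBetween(nack, received) + " ms");
    }
}
```

The total from the NACK to the redelivery is 14,971 ms, which is the roughly 15 seconds reported above. The message sat in the retry topic for about 2.9 seconds of that before the poll returned.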
Describe the Solution You'd Like
The issue lies in PopLongPollingService.notifyMessageArrivingWithRetryTopic. For V1 retry topics, it cannot resolve the original topic.
We can enhance this method by implementing a reverse-lookup mechanism using the topicCidMap, which stores active (topic, consumer_group) mappings.
Logic:
- When a message arrives in a V1 retry topic, iterate through the entries in PopLongPollingService.topicCidMap.
- For each (topic, cid) pair, reconstruct the potential V1 retry topic name using KeyBuilder.buildPopRetryTopicV1(topic, cid).
- Compare this reconstructed name with the incoming retry topic name.
- If a unique match is found, we have successfully identified the original topic. Use this original topic to notify the long-polling service.
- If multiple matches are found, or no match is found, fall back to the current behavior (using the retry topic name) to avoid incorrect notifications.
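The logic above could be sketched roughly as follows. This is a standalone illustration, not a patch against the broker. The `Map<String, Set<String>>` parameter is a simplified stand-in for topicCidMap (its real type may differ), the local `buildPopRetryTopicV1` only mimics the V1 name format described in this issue, and `resolveOriginalTopic` and `topicToNotify` are hypothetical names.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class V1RetryTopicResolver {
    // Stand-in for KeyBuilder.buildPopRetryTopicV1, assuming the %RETRY%group_topic format
    static String buildPopRetryTopicV1(String topic, String cid) {
        return "%RETRY%" + cid + "_" + topic;
    }

    // Reverse lookup: find the single original topic whose V1 retry name equals retryTopic.
    // Returns empty when there is no match or when more than one topic matches.
    static Optional<String> resolveOriginalTopic(String retryTopic, Map<String, Set<String>> topicCidMap) {
        String match = null;
        for (Map.Entry<String, Set<String>> entry : topicCidMap.entrySet()) {
            String topic = entry.getKey();
            for (String cid : entry.getValue()) {
                if (buildPopRetryTopicV1(topic, cid).equals(retryTopic)) {
                    if (match != null && !match.equals(topic)) {
                        return Optional.empty(); // ambiguous: two different topics match
                    }
                    match = topic;
                }
            }
        }
        return Optional.ofNullable(match);
    }

    // Topic to use for the wake-up: the resolved original topic, else the retry topic itself
    // (the current behavior, kept as the fallback).
    static String topicToNotify(String retryTopic, Map<String, Set<String>> topicCidMap) {
        return resolveOriginalTopic(retryTopic, topicCidMap).orElse(retryTopic);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> active = Map.of("TopicA", Set.of("YourConsumerGroup"));
        System.out.println(topicToNotify("%RETRY%YourConsumerGroup_TopicA", active));
    }
}
```

The cost of this approach is a scan over all active (topic, cid) pairs for each V1 retry message arrival. Whether that is acceptable depends on how large topicCidMap gets in practice, and a cache from retry topic name to original topic might be worth considering.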
Describe Alternatives You've Considered
popKV may be an alternative
Additional Context
No response