fix(tmq): client does not poll data in a long time if there are some exceptions in channel#34781
fix(tmq): client does not poll data in a long time if there are some exceptions in channel#34781wangmm0220 wants to merge 1 commit into3.3.8from
Conversation
…exceptions in channel
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where the TMQ client could stop polling data for prolonged periods due to channel exceptions. The fix involves adjusting the client's idle polling duration and implementing a safeguard to reset the virtual group's status if it becomes stuck. Concurrently, the PR refactors the TMQ system by completely removing the push-based data transfer mechanism, simplifying the overall architecture and focusing on a more robust polling model. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request addresses a bug where the TMQ client could get stuck and stop polling for data. The fix involves adding a timeout based on the number of skipped polls for a vgroup, which resets its status and allows polling to resume. The idle duration for polling after receiving an empty block has also been increased. A significant portion of this pull request is dedicated to refactoring, which removes the TMQ push model and its associated code. My review includes suggestions to improve a debug log message for clarity and to replace a magic number with a named constant to enhance maintainability.
| tqDebugC("consumer:0x%" PRIx64 " epoch %d, vgId:%d idle for 10ms before start next poll", tmq->consumerId, | ||
| tmq->epoch, pVg->vgId); |
There was a problem hiding this comment.
The debug message here is misleading. The idle duration was changed to 100ms (EMPTY_BLOCK_POLL_IDLE_DURATION), but the log message still refers to '10ms'. It would be better to make this message reflect the actual value to avoid confusion during debugging.
tqDebugC("consumer:0x%" PRIx64 " epoch %d, vgId:%d idle for %dms before start next poll", tmq->consumerId,
tmq->epoch, pVg->vgId, EMPTY_BLOCK_POLL_IDLE_DURATION);| } | ||
|
|
||
| // set status = idle if no response from vnode in a long time to avoid not polling data from vnode | ||
| if (atomic_load_32(&pVg->vgSkipCnt) == 100000) { |
There was a problem hiding this comment.
There was a problem hiding this comment.
Pull request overview
This PR removes the TMQ “push-to-consumer” mechanism on the vnode side (push manager + internal CONSUME_PUSH message path) and adjusts the TMQ client polling behavior to rely on polling/backoff instead.
Changes:
- Removed vnode/TQ push manager infrastructure (pPushMgr), push message dispatch, and related message handler paths.
- Simplified TQ polling flow by removing push-handle registration logic when WAL reaches end.
- Tuned client-side polling backoff (empty-block idle duration) and added a “stuck WAIT” recovery path.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| source/dnode/vnode/src/vnd/vnodeSvr.c | Removes write-time push trigger and query handling for CONSUME_PUSH. |
| source/dnode/vnode/src/tq/tqUtil.c | Removes push-handle registration when no more WAL data is available. |
| source/dnode/vnode/src/tq/tqPush.c | Deletes the push-manager implementation. |
| source/dnode/vnode/src/tq/tq.c | Removes push-manager init/cleanup and deletes poll-push handler; leaves commented-out unregister calls. |
| source/dnode/vnode/src/inc/vnodeInt.h | Drops tqPushMsg/tqProcessPollPush declarations; still declares push-handle APIs. |
| source/dnode/vnode/src/inc/tq.h | Removes pPushMgr from STQ. |
| source/dnode/vnode/CMakeLists.txt | Removes tqPush.c from build sources. |
| source/dnode/mgmt/mgmt_vnode/src/vmHandle.c | Stops registering TDMT_VND_TMQ_CONSUME_PUSH message handler. |
| source/client/src/clientTmq.c | Increases empty-block polling idle duration and adds a “stuck vgStatus WAIT” reset. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| tqDebugC("consumer:0x%" PRIx64 " epoch %d, vgId:%d idle for 10ms before start next poll", tmq->consumerId, | ||
| tmq->epoch, pVg->vgId); |
| // set status = idle if no response from vnode in a long time to avoid not polling data from vnode | ||
| if (atomic_load_32(&pVg->vgSkipCnt) == 100000) { | ||
| atomic_store_32(&pVg->vgStatus, TMQ_VG_STATUS__IDLE); | ||
| } |
| // if consumer register to push manager, push empty to consumer to change vg status from TMQ_VG_STATUS__WAIT to | ||
| // TMQ_VG_STATUS__IDLE, otherwise poll data failed after seek. | ||
| tqUnregisterPushHandle(pTq, pHandle); | ||
| // tqUnregisterPushHandle(pTq, pHandle); | ||
| taosWUnLockLatch(&pTq->lock); |
| continue; | ||
| } | ||
| tqUnregisterPushHandle(pTq, pHandle); | ||
| // tqUnregisterPushHandle(pTq, pHandle); |
| atomic_store_64(&pHandle->consumerId, req.newConsumerId); | ||
| atomic_store_32(&pHandle->epoch, 0); | ||
| tqUnregisterPushHandle(pTq, pHandle); | ||
| // tqUnregisterPushHandle(pTq, pHandle); |
| int tqRegisterPushHandle(STQ* pTq, void* handle, SRpcMsg* pMsg); | ||
| void tqUnregisterPushHandle(STQ* pTq, void* pHandle); |
| taosWUnLockLatch(&pTq->lock); | ||
| } | ||
|
|
||
| // reqOffset represents the current date offset, may be changed if wal not exists |
…exceptions in channel
Description
Issue(s)
Checklist
Please check the items in the checklist if applicable.