in_tail: reconcile files after missed inotify events#11750
Conversation
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughCentralizes resumable-offset persistence and truncate-reset into new helpers in ChangesTail file state & persistence
Inotify reconciliation and progress checks
Sequence Diagram(s)sequenceDiagram
participant Inotify
participant TailFS as "tail_fs_inotify"
participant Reconciler as "reconcile_file_state"
participant TailFile as "tail_file"
participant SQLDB
Inotify->>TailFS: file event (modify/move/unlink/IN_Q_OVERFLOW)
alt IN_Q_OVERFLOW
TailFS->>TailFS: log overflow
TailFS->>Reconciler: reconcile_all_files()
TailFS->>TailFS: flb_tail_scan() for full path list
else normal event
TailFS->>Reconciler: reconcile_file_state(file)
Reconciler->>TailFile: fstat(), compare offsets/sizes
alt truncated or offset mismatch
Reconciler->>TailFile: flb_tail_file_reset_on_truncate(size_delta, caller)
TailFile->>SQLDB: update_resumable_offset_state() (if enabled)
else rotation/delete/pending
Reconciler->>TailFS: mark removed/rotated or pending_bytes > 0
end
TailFS->>TailFS: schedule read if pending
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4ebe842a79
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
🧹 Nitpick comments (3)
plugins/in_tail/tail_fs_inotify.c (3)
364-383: IN_MOVE_SELF rotation handling now duplicatesreconcile_file_state.
reconcile_file_statealready callsflb_tail_file_is_rotated→flb_tail_file_rotated→flb_tail_fs_remove→flb_tail_fs_add_rotated(lines 260-279), andflb_tail_file_is_rotatedcorrectly detects the inode/name divergence after aMOVE_SELFrename. After this explicit block runs,file->rotated != 0makesflb_tail_file_is_rotatedshort-circuit toFLB_FALSE, so the second pass is a no-op — i.e., the only thing this block buys is doing the rotation work slightly earlier in the same callback.Consider dropping the explicit block and relying on
reconcile_file_stateforMOVE_SELFtoo; this also keeps all rotation re-registration failure handling in one place. Optional, since the current code is correct.♻️ Suggested simplification
- /* Check file rotation (only if it has not been rotated before) */ - if (ev.mask & IN_MOVE_SELF && file->rotated == 0) { - flb_plg_debug(ins, "inode=%"PRIu64" rotated IN_MOVE SELF '%s'", - file->inode, file->name); - - /* A rotated file must be re-registered */ - ret = flb_tail_file_rotated(file); - if (ret == -1) { - return -1; - } - - ret = flb_tail_fs_remove(ctx, file); - if (ret == -1) { - return -1; - } - - ret = flb_tail_fs_add_rotated(file); - if (ret == -1) { - return -1; - } - } - ret = reconcile_file_state(ctx, file, "tail_fs_event", &pending_data_detected);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_tail/tail_fs_inotify.c` around lines 364 - 383, The IN_MOVE_SELF handling block duplicates rotation logic already performed by reconcile_file_state; remove the explicit ev.mask & IN_MOVE_SELF branch (the block that calls flb_tail_file_rotated, flb_tail_fs_remove, flb_tail_fs_add_rotated) and rely on reconcile_file_state which uses flb_tail_file_is_rotated → flb_tail_file_rotated → flb_tail_fs_remove → flb_tail_fs_add_rotated to detect and re-register rotated files, keeping all rotation and failure handling centralized in reconcile_file_state.
167-197: Truncate-reset semantics now diverge between this helper andadjust_countersintail_file.c.
reset_file_on_truncateresetsstream_offset,last_processed_bytes, andpending_bytesafter a truncate, but the truncate branch inadjust_counters(plugins/in_tail/tail_file.clines 1707-1727) only resetsoffset/buf_lenand callsupdate_resumable_offset_state. Records emitted after a truncate detected viaadjust_counterswill keep the pre-truncatestream_offset, producing wrongoffset_keyvalues, while ones detected via inotify reconcile will be correct. Worth unifying both paths through a single shared helper.Two smaller things to consider in this helper:
file->pending_bytes = file->size;at Line 188 is immediately overwritten byreconcile_file_stateat lines 232-237 — it can be dropped.- For the non-SQLDB case, the in-memory marker (
file->db_offset_marker/db_offset_marker_size) isn't refreshed here; it works out today becauseflb_tail_file_offset_marker_matchesearly-returns whendb_offset <= 0, but a defensiveflb_tail_file_update_offset_marker(file)(or reuse ofupdate_resumable_offset_state) would make the post-truncate state explicit.♻️ Sketch for unifying truncate state reset
static int reset_file_on_truncate(struct flb_tail_config *ctx, struct flb_tail_file *file, int64_t size_delta, const char *caller) { int64_t offset; offset = lseek(file->fd, 0, SEEK_SET); if (offset == -1) { flb_errno(); return -1; } flb_plg_debug(ctx->ins, "%s: inode=%"PRIu64" file truncated %s (diff: %"PRId64" bytes)", caller, file->inode, file->name, size_delta); file->offset = offset; file->stream_offset = offset; file->last_processed_bytes = 0; file->buf_len = 0; - file->pending_bytes = file->size; - -#ifdef FLB_HAVE_SQLDB - if (ctx->db) { - flb_tail_db_file_offset(file, ctx); - } -#endif + + /* Persist offset and refresh the in-memory marker for both SQLDB + * and non-SQLDB code paths (mirrors update_resumable_offset_state + * in tail_file.c). */ +#ifdef FLB_HAVE_SQLDB + if (ctx->db) { + flb_tail_db_file_offset(file, ctx); + } + else +#endif + { + flb_tail_file_update_offset_marker(file); + } return 0; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_tail/tail_fs_inotify.c` around lines 167 - 197, The truncate reset in reset_file_on_truncate currently diverges from adjust_counters: make both paths share the same state-reset semantics by updating reset_file_on_truncate to reset the same fields used by adjust_counters (clear file->offset and file->buf_len) and also refresh resumable-offset state (call the existing helper flb_tail_file_update_offset_marker(file) or reuse update_resumable_offset_state) so stream_offset/last_processed_bytes and the in-memory db offset marker are consistent after a truncate; remove the redundant file->pending_bytes = file->size assignment (it is overwritten by reconcile_file_state) and keep the FLB_HAVE_SQLDB branch to call flb_tail_db_file_offset(file, ctx) so persistent state is updated as before.
385-388: Minor:pending_data_detectedintail_fs_eventis written but never read.
reconcile_file_staterequires the out param, but the caller at Line 385 ignores it; the unconditionaltail_signal_pending(ctx)in the success path further down already handles the “there might be more data” case. Consider either using the value to skipin_tail_collect_eventwhen nothing is pending, or making the out param optional (NULL-able) so callers that don't care don't need a placeholder local.Returning
0onreconcile_file_state == -1is fine for the “file was removed” case but does swallow watch re-registration failures from the rotation block insidereconcile_file_state; given the periodicin_tail_progress_check_callbackwill reconcile again, recovery is still possible.The
in_tail_progress_check_callbackrewrite (Line 416-446) is clean — delegating per-file work toreconcile_file_stateremoves the previous duplicated size/pending math, and the(void) ins;is the right way to silence the unused-parameter warning.Also applies to: 416-449
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_tail/tail_fs_inotify.c` around lines 385 - 388, The call in tail_fs_event to reconcile_file_state currently passes a local pending_data_detected that is never read; either use that out-param to avoid unnecessary work (i.e., after calling reconcile_file_state check pending_data_detected and skip calling in_tail_collect_event when false, while still calling tail_signal_pending only when needed), or make reconcile_file_state's out parameter optional (allow NULL) and update callers (tail_fs_event and in_tail_progress_check_callback) to pass NULL when they don't care; adjust tail_fs_event to stop creating the unused local if you go the NULL route. Keep the existing return behavior for reconcile_file_state == -1 (return 0 from tail_fs_event) so rotation/watch re-registration failures continue to be handled by periodic in_tail_progress_check_callback. Ensure references to reconcile_file_state, tail_fs_event, in_tail_collect_event, tail_signal_pending, and in_tail_progress_check_callback are updated consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@plugins/in_tail/tail_fs_inotify.c`:
- Around line 364-383: The IN_MOVE_SELF handling block duplicates rotation logic
already performed by reconcile_file_state; remove the explicit ev.mask &
IN_MOVE_SELF branch (the block that calls flb_tail_file_rotated,
flb_tail_fs_remove, flb_tail_fs_add_rotated) and rely on reconcile_file_state
which uses flb_tail_file_is_rotated → flb_tail_file_rotated → flb_tail_fs_remove
→ flb_tail_fs_add_rotated to detect and re-register rotated files, keeping all
rotation and failure handling centralized in reconcile_file_state.
- Around line 167-197: The truncate reset in reset_file_on_truncate currently
diverges from adjust_counters: make both paths share the same state-reset
semantics by updating reset_file_on_truncate to reset the same fields used by
adjust_counters (clear file->offset and file->buf_len) and also refresh
resumable-offset state (call the existing helper
flb_tail_file_update_offset_marker(file) or reuse update_resumable_offset_state)
so stream_offset/last_processed_bytes and the in-memory db offset marker are
consistent after a truncate; remove the redundant file->pending_bytes =
file->size assignment (it is overwritten by reconcile_file_state) and keep the
FLB_HAVE_SQLDB branch to call flb_tail_db_file_offset(file, ctx) so persistent
state is updated as before.
- Around line 385-388: The call in tail_fs_event to reconcile_file_state
currently passes a local pending_data_detected that is never read; either use
that out-param to avoid unnecessary work (i.e., after calling
reconcile_file_state check pending_data_detected and skip calling
in_tail_collect_event when false, while still calling tail_signal_pending only
when needed), or make reconcile_file_state's out parameter optional (allow NULL)
and update callers (tail_fs_event and in_tail_progress_check_callback) to pass
NULL when they don't care; adjust tail_fs_event to stop creating the unused
local if you go the NULL route. Keep the existing return behavior for
reconcile_file_state == -1 (return 0 from tail_fs_event) so rotation/watch
re-registration failures continue to be handled by periodic
in_tail_progress_check_callback. Ensure references to reconcile_file_state,
tail_fs_event, in_tail_collect_event, tail_signal_pending, and
in_tail_progress_check_callback are updated consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a8aac234-f0a4-47a4-ac71-1734ecd362ac
📒 Files selected for processing (2)
plugins/in_tail/tail_file.cplugins/in_tail/tail_fs_inotify.c
|
Thanks for the thorough review @coderabbitai — all three nitpicks are addressed in 68a562d. 364-383 (IN_MOVE_SELF redundancy): Yep, dropped the explicit block. 167-197 (truncate-reset divergence): Good call on this one — that was a real bug. The 385-388 (unused All five targeted in_tail integration tests still pass under both regular and |
|
Tip For best results, initiate chat on the files or code changes.
Green on 26/27 integration tests under Valgrind strict mode is a strong signal. All three concerns are well addressed. 🎉 ✏️ Learnings added
🧠 Learnings usedIf you found this review helpful, would you consider giving us a shout-out on X? Thank you for using CodeRabbit! |
- retry inotify_add_watch on next reconcile when a previously rotated file ended up unwatched (e.g. ENOSPC on the first add) - drop the IN_MOVE_SELF block in tail_fs_event; reconcile_file_state already detects inode/name divergence and re-registers the watch via the same retry path - share the truncate reset between adjust_counters and the inotify reconcile path through a new flb_tail_file_reset_on_truncate helper so post-truncate offset_key values stay consistent across both paths - make reconcile_file_state's pending_data_detected out param NULL-able and pass NULL from tail_fs_event where the value was unused Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
68a562d to
dc50818
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/in_tail/tail_file.c`:
- Around line 172-182: The function update_resumable_offset_state currently
swallows errors from the persistence calls; change it to return an int status
and propagate errors from flb_tail_db_file_offset and
flb_tail_file_update_offset_marker (i.e., return their failure codes instead of
ignoring them), keeping the FLB_HAVE_SQLDB conditional logic intact; update all
callers to check the returned status and handle/report persistence failures so
truncate recovery and post-read updates can act on real persistence errors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f5b3f23f-3a7a-4ec9-9191-2a8d19e4fbce
📒 Files selected for processing (3)
plugins/in_tail/tail_file.cplugins/in_tail/tail_file.hplugins/in_tail/tail_fs_inotify.c
There was a problem hiding this comment.
♻️ Duplicate comments (1)
plugins/in_tail/tail_file.c (1)
172-182:⚠️ Potential issue | 🟠 MajorPropagate resumable-offset persistence failures from the shared helper.
flb_tail_db_file_offset()andflb_tail_file_update_offset_marker()can fail, butupdate_resumable_offset_state()drops that status. As a result, truncate recovery and the post-read updates at Line 1795 and Line 1979 can still report success with stale resumable state.Also applies to: 206-207, 1795-1795, 1979-1979
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_tail/tail_file.c` around lines 172 - 182, update_resumable_offset_state currently ignores failures from flb_tail_db_file_offset and flb_tail_file_update_offset_marker; change update_resumable_offset_state to return an int status (0 on success, non-zero on failure), propagate and return the error from flb_tail_db_file_offset and flb_tail_file_update_offset_marker instead of dropping it, and update all callers (including the truncate-recovery and post-read update sites that invoke update_resumable_offset_state) to check the returned status and handle/propagate failures accordingly so resumable-offset persistence errors are not treated as success.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@plugins/in_tail/tail_file.c`:
- Around line 172-182: update_resumable_offset_state currently ignores failures
from flb_tail_db_file_offset and flb_tail_file_update_offset_marker; change
update_resumable_offset_state to return an int status (0 on success, non-zero on
failure), propagate and return the error from flb_tail_db_file_offset and
flb_tail_file_update_offset_marker instead of dropping it, and update all
callers (including the truncate-recovery and post-read update sites that invoke
update_resumable_offset_state) to check the returned status and handle/propagate
failures accordingly so resumable-offset persistence errors are not treated as
success.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ff41faaf-e067-4b74-8019-d5f63343acf2
📒 Files selected for processing (3)
plugins/in_tail/tail_file.cplugins/in_tail/tail_file.hplugins/in_tail/tail_fs_inotify.c
🚧 Files skipped from review as they are similar to previous changes (1)
- plugins/in_tail/tail_file.h
cosmo0920
left a comment
There was a problem hiding this comment.
I reviewed this patch and I noticed a nitpick for unintended indentation.
Other part of this patch looks fine so we need to address an indentation glitch.
Bring the tail_signal_pending() call back to its original indentation; the whitespace tweak slipped in with the reconcile refactor and was unrelated to the reconcile work. Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@cosmo0920 |
This makes the inotify tail backend recover from missed events instead of waiting for the next modify event to happen to the same file. That matters when the inotify queue overflows, or when a truncate/copytruncate/delete/rotation event is missed while a file still has readable data.
What changed:
IN_Q_OVERFLOWby reconciling tracked files and rescanning pathsIN_MODIFYTesting:
cmake -S . -B /build -DFLB_TESTS_RUNTIME=On -DFLB_TESTS_INTERNAL=On -DFLB_EXAMPLES=Off && cmake --build /build --target fluent-bit-bin -j"$(nproc)"FLUENT_BIT_BINARY=/build/bin/fluent-bit /tmp/flb-it-venv/bin/python -m pytest tests/integration/scenarios/in_tail/tests/test_in_tail_001.py -q -k "copytruncate_with_stale_writer or follows_rename_rotation or handles_multiple_rename_rotations or multi_file_rapid_rotation or rotate_wait_keeps_old_inode_then_purges_it"FLUENT_BIT_BINARY=/build/bin/fluent-bit VALGRIND=1 VALGRIND_STRICT=1 /tmp/flb-it-venv/bin/python -m pytest tests/integration/scenarios/in_tail/tests/test_in_tail_001.py -q -k "copytruncate_with_stale_writer or follows_rename_rotation or handles_multiple_rename_rotations or multi_file_rapid_rotation or rotate_wait_keeps_old_inode_then_purges_it"Summary by CodeRabbit
Refactor
Bug Fixes