feat(core): Add destination-controlled LSN flushing #626
flak153 wants to merge 1 commit into supabase:main
Conversation
Add `confirmed_flush_lsn()` method to the `Destination` trait, enabling asynchronous destinations to control when the flush LSN is advanced to PostgreSQL. This prevents WAL buildup for destinations that need to durably process data before confirming (e.g., checkpointing systems).

Changes:
- Add `confirmed_flush_lsn(&self) -> Option<(PgLsn, bool)>` to the `Destination` trait with strong contract docs (must return `CommitEvent.end_lsn` values only)
- Add `poll_destination_flush()` helper on `ApplyLoop` with `debug_assert` validation and a min-cap safety net for release builds
- Move `effective_flush_lsn()` from `ApplyLoopState` to `ApplyLoop` to access the destination — it now considers in-flight write status when idle
- Modify `process_syncing_tables_after_batch_flush()` to skip auto-advance when the destination controls flushing
- Poll the destination in the keepalive handler and on graceful shutdown for reactive progress
- Delegate `confirmed_flush_lsn()` in `TestDestinationWrapper`
- Add 30 unit tests covering all new logic, edge cases, and interactions

Closes supabase#621
📝 Walkthrough

This PR introduces destination-controlled flush LSN tracking to etl's replication apply loop.

Sequence diagram:

```mermaid
sequenceDiagram
    participant ApplyLoop
    participant Destination
    participant PostgreSQL
    loop Replication Apply Loop
        ApplyLoop->>Destination: write_events(batched_data)
        Destination-->>ApplyLoop: Ok(written)
        ApplyLoop->>ApplyLoop: poll_destination_flush()
        ApplyLoop->>Destination: confirmed_flush_lsn()
        Destination-->>ApplyLoop: Some((confirmed_lsn, inflight))
        ApplyLoop->>ApplyLoop: advance last_flush_lsn to confirmed_lsn
        ApplyLoop->>PostgreSQL: standby_status_update(confirmed_flush_lsn)
        PostgreSQL-->>ApplyLoop: ack
    end
```
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@etl/src/replication/apply.rs`:
- Around line 1449-1458: After calling poll_destination_flush(), do not
unconditionally early-return when self.state.last_commit_end_lsn.take() is None;
instead determine an effective flush LSN to pass into
process_syncing_tables_after_batch_flush() by using the taken
last_commit_end_lsn if present or falling back to the destination-advanced flush
position (e.g. self.state.last_flush_lsn or whatever field
poll_destination_flush() updates) when last_commit_end_lsn is None, then call
process_syncing_tables_after_batch_flush(effective_lsn) before returning; update
the logic around poll_destination_flush(),
self.state.last_commit_end_lsn.take(), and the ApplyLoopAction::Continue path so
mid-transaction flushes that advance last_flush_lsn still trigger post-flush
sync processing.
📒 Files selected for processing (3)
- etl/src/destination/base.rs
- etl/src/replication/apply.rs
- etl/src/test_utils/test_destination_wrapper.rs
```rust
// Always poll destination first — even for mid-transaction flushes where
// last_commit_end_lsn is None, the destination may have confirmed previously
// sent transactions.
self.poll_destination_flush();

// Take the last commit end LSN, which is the highest LSN from any commit message in this
// batch. Batches can flush mid-transaction, so this may refer to the previous transaction.
let Some(last_commit_end_lsn) = self.state.last_commit_end_lsn.take() else {
    return Ok(ApplyLoopAction::Continue);
};
```
Don't skip post-flush sync processing when destination polling advanced last_flush_lsn.
After poll_destination_flush() runs, a controlled destination may already have advanced last_flush_lsn for an earlier committed transaction. The last_commit_end_lsn.take() early return on Line 1456 means mid-transaction flushes never pass that new flush position into process_syncing_tables_after_batch_flush(). In practice, a table-sync worker can stay stuck in Catchup/SyncDone until some later commit or idle period, even though the destination has already made the relevant commit durable.
Suggested fix:

```diff
-let Some(last_commit_end_lsn) = self.state.last_commit_end_lsn.take() else {
-    return Ok(ApplyLoopAction::Continue);
-};
-
-if self.destination.confirmed_flush_lsn().is_none() {
+let last_commit_end_lsn = self.state.last_commit_end_lsn.take();
+let destination_controls_flush = self.destination.confirmed_flush_lsn().is_some();
+
+if !destination_controls_flush {
+    let Some(last_commit_end_lsn) = last_commit_end_lsn else {
+        return Ok(ApplyLoopAction::Continue);
+    };
     // Legacy: no destination control, auto-advance immediately after write_events().
     // Only reports progress up to the last completed transaction, which may cause
     // duplicates on restart for partial transactions. Destinations must handle
     // at-least-once delivery semantics.
     self.state
         .replication_progress
         .update_last_flush_lsn(last_commit_end_lsn);
 }
+// else: destination-controlled — poll_destination_flush() above already
+// advanced to the confirmed position. Don't auto-advance past it.
```

As per coding guidelines, "Comment only when the issue must be resolved before merge" and "Highlight only issues that could cause runtime errors, data loss, or severe maintainability issues."
Also applies to: 1460-1470
This is a false positive — the early return on last_commit_end_lsn.take() being None is the pre-existing behavior and is correct.
When last_commit_end_lsn is None, it means no commit was in this batch (mid-transaction flush). There is no new commit boundary to process table syncs against. The poll_destination_flush() call at line 1452 still runs and advances last_flush_lsn — that state change is preserved regardless of the early return.
Table sync processing will pick up the advanced last_flush_lsn at the next opportunity:
- `process_syncing_tables_after_commit_event()` — on the next commit event
- `process_syncing_tables_when_idle()` — during idle periods (which already uses `effective_flush_lsn()`)
The suggested fix would call process_syncing_tables_after_batch_flush() without a meaningful commit boundary, which is not useful — the worker dispatch functions need a commit position to determine if syncing tables have caught up.
Thanks for the PR! Over the weekend I thought about this problem and have a better solution in mind, which is also generic, meaning it changes how the write method acks progress to the apply worker. I might try to put up a PR for that and link it here, but I am pretty busy, so if you want, you can also experiment with the idea.

The higher-level design idea: instead of having a hybrid approach, whenever we send a request to write events, the apply loop will use an LSN + oneshot channel combination, store the response oneshot channel together with the end LSN, and on the `select!` in the loop wait for the response from the destination. This way, we can have many oneshot channels, each of them tied to an end LSN that defines the progress to track. Destinations then don't have to track any LSN; they can just say, "Oh, this batch was inserted, now let's confirm it." It's basically a callback mechanism that is much more ergonomic and easier to reason about.

The logic for idle progress tracking then becomes: if there is no active transaction, no data in the batch, and no response channels waiting, we use the `write_lsn` as the `effective_flush_lsn`.

Let me know if you need any clarifications!
Thanks for the feedback! I really like this direction — the callback approach is fundamentally better than what we have. Let me walk through how I understand it and flag one design question.

**How I understand the proposal**

Instead of the destination polling model (`confirmed_flush_lsn()`), the apply loop attaches a oneshot confirmation channel to each batch it writes. The key win is inversion of control: the destination doesn't track LSNs at all. It just receives a callback and fires it when the batch is durably processed. The apply loop owns all LSN bookkeeping.

**Trait change**

The `write_events()` signature gains a confirmation channel:

```rust
fn write_events(
    &self,
    events: Vec<Event>,
    confirm: oneshot::Sender<()>,
) -> impl Future<Output = EtlResult<()>> + Send;
```

Legacy destinations confirm immediately:

```rust
async fn write_events(&self, events: Vec<Event>, confirm: oneshot::Sender<()>) -> EtlResult<()> {
    // ... process events ...
    let _ = confirm.send(()); // confirm inline
    Ok(())
}
```

Async destinations (like Feldera) defer:

```rust
async fn write_events(&self, events: Vec<Event>, confirm: oneshot::Sender<()>) -> EtlResult<()> {
    self.work_queue.send((events, confirm)).await?; // hand off to background worker
    Ok(()) // return immediately
}
```

**Apply loop changes**

```rust
// New state in ApplyLoop:
pending_confirmations: BTreeMap<PgLsn, oneshot::Receiver<()>>,

// In send_batch_to_destination:
let (tx, rx) = oneshot::channel();
let end_lsn = self.state.last_commit_end_lsn.unwrap_or(current_lsn);
self.destination.write_events(events_batch, tx).await?;
self.pending_confirmations.insert(end_lsn, rx);

// New select! branch:
Some((confirmed_lsn, _)) = poll_next_confirmation(&mut self.pending_confirmations) => {
    self.state.replication_progress.update_last_flush_lsn(confirmed_lsn);
}
```

**Idle progress tracking replacement**

`effective_flush_lsn()` becomes:

```rust
fn effective_flush_lsn(&self) -> PgLsn {
    let is_idle = !self.state.handling_transaction()
        && self.state.events_batch.is_empty();
    if is_idle && self.pending_confirmations.is_empty() {
        self.state.replication_progress.last_received_lsn
    } else {
        self.state.replication_progress.last_flush_lsn
    }
}
```

This is cleaner — the legacy vs controlled distinction disappears entirely. All destinations use the same code path; the only difference is when they fire the oneshot.

**What this also solves**

The current code has this TODO at apply.rs:933:

```rust
// TODO: in the future we want to investigate how to perform the writing asynchronously
// to avoid stalling the apply loop.
self.destination.write_events(events_batch).await?;
```

The callback approach naturally enables this: if the destination returns from `write_events()` immediately and confirms later, the apply loop no longer stalls on slow writes.

**One design question**

Shutdown semantics: on graceful shutdown with pending confirmation channels — do we wait for them to resolve (the destination is still processing), or report only the last confirmed position? I'd lean toward waiting on graceful shutdown since the destination is still alive, and accepting replay on abrupt shutdown (channels get dropped naturally). But I wanted to check if you see it differently.

The rest follows naturally from keeping it simple.

**Next steps**

Happy to implement this redesign — I have the full codebase context and working integration tests that should transfer to the new approach with minimal changes. The integration tests test the observable behavior (slot position, data convergence), so they stay valid regardless of the internal mechanism. Let me know if my understanding is correct and I can start prototyping.
Hi, I spent a day doing some experiments with asynchronous flushing. For your specific case, I recommend implementing a blocking check in `write_events()`. The asynchronous flushing implementation could theoretically improve performance, but it dramatically increases the complexity of the apply loop state machine. For this reason, I will do some benchmarks, but I am not 100% sure I will keep it.
If you are curious about my experimental PR: #628 |
Summary

Adds `confirmed_flush_lsn()` to the `Destination` trait so that asynchronous destinations can control exactly when `flush_lsn` is reported to PostgreSQL.

This is needed for destinations that cannot confirm data as durably processed at the moment `write_events()` returns — for example, a destination that buffers writes and only confirms after an external checkpoint. Without this, the apply loop auto-advances `flush_lsn` immediately after `write_events()`, which tells PostgreSQL it's safe to discard WAL that the destination hasn't actually persisted yet.

Problem

The current apply loop has a single LSN advancement path: after every batch flush, `last_flush_lsn` is unconditionally advanced to `last_commit_end_lsn`. This works for synchronous destinations, where `write_events()` returning means the data is durable. But for asynchronous destinations:

- PostgreSQL may discard WAL for data the destination has not yet persisted, because `flush_lsn` was advanced.
- The idle path advances `flush_lsn` to `last_received_lsn` regardless, which is incorrect when the destination has in-flight writes.
No tracking needed
The value chain is structurally guaranteed to be a commit boundary:
CommitEvent.end_lsnflows throughwrite_events()→ destination records it →confirmed_flush_lsn()returns it. The value is never transformed, so there's nothing to validate against a tracking set. Tracking pending commit LSNs would also be expensive at high throughput — at 100K TPS with lag, it could hold millions of entries.Instead, the contract is enforced through:
CommitEvent.end_lsnpreviously delivered viawrite_events().debug_assert!: In development builds,poll_destination_flush()assertsconfirmed_lsn <= last_received_lsn. This catches violations during testing.std::cmp::mincap: In release builds, confirmed LSN is capped atlast_received_lsnas a safety net, preventing any possible over-advance.API design
API design

- Returns `None` by default → zero behavioral change for existing destinations (BigQuery, Iceberg, etc.)
- `PgLsn`: the confirmed flush position. Must be a `CommitEvent.end_lsn` value.
- `bool` (`has_inflight_writes`): distinguishes "idle, nothing pending" (false) from "idle, still processing" (true). This matters for `effective_flush_lsn()` — when idle with no in-flight writes, it's safe to advance `flush_lsn` to `last_received_lsn` (preventing WAL buildup). When idle with in-flight writes, `flush_lsn` stays at the last confirmed position.
poll_destination_flush()is calledThree call sites, each serving a different purpose:
process_syncing_tables_after_batch_flush()— After every batch of events is written to the destination. This is the primary advancement path during active streaming.last_flush_lsnwould stall even though the destination may have confirmed new progress. This also feeds into the existingprocess_syncing_tables_when_idle()call at line 892 ofhandle_replication_message_and_flush(), making table state transitions reactive during idle.initiate_graceful_shutdown()— Final poll before shutdown to capture any last confirmed progress, ensuring the shutdown status update reports the most accurate flush position.effective_flush_lsn()moved fromApplyLoopStatetoApplyLoopPreviously lived on
ApplyLoopStateand had no destination access — it just returnedlast_received_lsnwhen idle. Now it needs to checkself.destination.confirmed_flush_lsn()to decide whether it's safe to advance, so it was moved toApplyLoop<S, D>. The old method onApplyLoopStatewas removed (it was the only dead code warning).Conditional auto-advance in
process_syncing_tables_after_batch_flush()For legacy destinations (
confirmed_flush_lsn()returnsNone), the existing behavior is preserved:last_flush_lsnis auto-advanced tolast_commit_end_lsnafter each batch flush.For controlled destinations (
Some), auto-advance is skipped.poll_destination_flush()(called at the top of the method) handles advancement to whatever position the destination has confirmed. This meanslast_flush_lsnmay lag behindlast_commit_end_lsn, which is exactly the point — the destination hasn't confirmed that far yet.Changes
Changes

- `etl/src/destination/base.rs` — add `confirmed_flush_lsn()` with a default `None` impl and contract documentation
- `etl/src/replication/apply.rs` — add `poll_destination_flush()`, move `effective_flush_lsn()` to `ApplyLoop`, modify the batch flush / keepalive / shutdown paths, add 30 unit tests
- `etl/src/test_utils/test_destination_wrapper.rs` — delegate `confirmed_flush_lsn()` to the wrapped destination
30 unit tests added directly in
apply.rs(first unit tests in this file), covering:effective_flush_lsn(9 tests): All combinations of legacy/controlled × idle/in-transaction/non-empty-batch × inflight/no-inflight, all-zeros startup, poll-then-query interactionpoll_destination_flush(8 tests): Legacy no-op, advances to confirmed, caps at received, no regression on lower confirmed, multiple sequential advances, exact boundary, confirmed=0 startup, inflight bool ignoredprocess_syncing_tables_after_batch_flush(7 tests): Legacy auto-advance, controlled skip auto-advance, no commit LSN, polls-before-decision ordering,.take()consumes commit LSN, confirmed > commit LSN, legacy commit < current flush monotonic guaranteedebug_assertvalidation (1 test):#[should_panic]when confirmed exceeds receivedAll existing tests continue to pass (161 total including the 30 new ones). Integration tests are unaffected (they require
TESTS_DATABASE_HOST).Closes #621