experimental(core): Implement asynchronous flushing #628
Conversation
I am re-opening it to experiment with some ideas.
Pull request overview
This PR updates ETL’s streaming destination contract to support asynchronous durable flush acknowledgements, allowing write_events() to return after dispatch while replication progress only advances once a separate flush result completes.
Changes:
- Extend `Destination::write_events` to accept a `BatchFlushResult<()>` and introduce flush-result and task-management primitives (`flush_result`, `DestinationTaskSet`).
- Refactor the replication apply loop to explicitly track a single in-flight flush result, pause intake when another flush is needed, and add proactive periodic keepalive behavior based on `wal_sender_timeout`.
- Update destinations, benchmarks, and tests to the new async-flush semantics, and add a replication client helper to read `wal_sender_timeout`.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| etl/tests/replication.rs | Adds coverage for reading wal_sender_timeout from PostgreSQL. |
| etl/tests/pipeline.rs | Adjusts pipeline test synchronization to align with post-flush state transitions. |
| etl/src/workers/pool.rs | Minor rename in join error handling for clarity. |
| etl/src/workers/apply.rs | Adds an invariant check for apply-loop completion (but currently does not fail hard). |
| etl/src/test_utils/test_destination_wrapper.rs | Updates test destination wrapper to forward async flush completion via a spawned task. |
| etl/src/test_utils/memory_destination.rs | Updates in-memory destination to send flush completion via BatchFlushResult. |
| etl/src/replication/stream.rs | Adds PeriodicKeepAlive status update type for proactive heartbeats. |
| etl/src/replication/client.rs | Adds get_wal_sender_timeout() parsing from pg_settings. |
| etl/src/replication/apply.rs | Major apply-loop refactor: explicit flush-result tracking, exit intent, and proactive keepalives. |
| etl/src/pipeline.rs | Tweaks log levels/messages around shutdown/error collection. |
| etl/src/lib.rs | Updates crate docs/example to the new write_events(..., flush_result) signature. |
| etl/src/destination/task_set.rs | New helper to track, reap, and abort destination-owned background tasks. |
| etl/src/destination/mod.rs | Exposes new destination submodules (flush_result, task_set). |
| etl/src/destination/flush_result.rs | New async flush result channel + metrics plumbing. |
| etl/src/destination/base.rs | Updates the Destination trait docs + signature for async flush reporting. |
| etl-destinations/tests/bigquery_pipeline.rs | Increases a sleep to accommodate async background writes. |
| etl-destinations/src/iceberg/core.rs | Adopts async flush result contract and task tracking for streaming writes. |
| etl-destinations/src/bigquery/test_utils.rs | Updates generic bounds to support spawned tasks ('static, Clone). |
| etl-destinations/src/bigquery/core.rs | Adopts async flush result contract and task tracking for streaming writes. |
| etl-benchmarks/benches/table_copies.rs | Updates benchmark destinations to satisfy the new write_events signature. |
Comments suppressed due to low confidence (1)
etl/src/workers/apply.rs:318
- If the apply loop ever returns `Completed` here, the worker logs an error but still returns `Ok(())`, which will make the pipeline think streaming finished successfully and stop retrying. Since this state is declared impossible for the apply worker, it should be treated as a hard error (e.g., return an `EtlError` / panic) so the failure is surfaced and the worker can restart or the pipeline can fail fast.
```rust
// The apply loop, when used via the apply worker, should never complete since
// it's always streaming indefinitely.
debug_assert!(!matches!(apply_loop_result, ApplyLoopResult::Completed));
match apply_loop_result {
    ApplyLoopResult::Completed => {
        error!("apply worker apply loop completed, but it should never complete");
    }
    ApplyLoopResult::Paused => {
        info!("apply worker apply loop paused for shutdown");
    }
}
Ok(())
```
bnjjj left a comment
Just the comment on the Rust version, otherwise LGTM.
189ce90 to 036d9c5
Summary
This PR introduces asynchronous progress tracking for streaming destination writes.
The main goal is to let destinations accept a batch synchronously, finish the actual write asynchronously, and report completion back to the apply loop only when the batch is durably flushed. That lets ETL preserve its replication guarantees without forcing every destination to block `write_events()` until all downstream work is done.

What Changed
Destination API
`Destination::write_events` now takes a `BatchFlushResult<()>` in addition to the event batch. This splits destination write handling into two phases:
- Synchronous dispatch of the batch via `write_events()`.
- Asynchronous durable completion, reported later through the `BatchFlushResult`.

To support that, this PR adds:
- `BatchFlushResult`
- `PendingBatchFlushResult`
- `CompletedBatchFlushResult`
- `BatchFlushMetrics`

These types carry both the final result and metadata needed by the apply loop, including dispatch timing and the last commit LSN associated with the batch.
A new
`DestinationTaskSet` helper was also added so destinations can safely manage spawned background tasks and clean them up during shutdown.

Apply Loop
The apply loop now tracks in-flight destination flushes explicitly instead of assuming a batch is complete as soon as
`write_events()` returns.

Key behavioral changes:
Exit / Lifecycle Handling
The previous loop-control flow was refactored around an internal
`ExitIntent` (`Pause` or `Complete`). That makes it easier to merge exit requests coming from different places.
A worker can now request that the current invocation eventually pause or complete, while the apply loop still drains any required flush/shutdown barriers before returning.
Keepalive And Shutdown Behavior
The replication client now reads PostgreSQL
`wal_sender_timeout` from `pg_settings`. The apply loop uses that value to compute a proactive keepalive deadline and sends periodic reply-requesting heartbeats when needed. This helps in cases where the loop is healthy but temporarily stalled on async flush completion, or where the source is quiet.
Shutdown was also tightened up: instead of a complex deferred mechanism, the loop now initiates graceful shutdown immediately, even if work is pending. This is acceptable because the system is designed around at-least-once delivery semantics, so some repeated data is fine.
Technical Decisions
Why use an explicit flush result channel?
Because dispatch success and durable completion are different events.
Returning only
`Result<()>` from `write_events()` made ETL treat "destination accepted the work" and "destination finished the work" as the same thing. That is fine for fully synchronous destinations, but incorrect for destinations that queue, fan out, or flush in background tasks. The new channel-based result keeps the trait generic while making that distinction explicit.
Why keep only one in-flight flush result?
To keep ordering and progress accounting simple and safe.
The apply loop still allows destinations to do internal async work, but ETL itself only advances replication state once the currently tracked batch has completed. That preserves ordered per-table streaming semantics without introducing a more complicated multi-batch acknowledgment model.
Why tie state transitions to post-flush completion?
Because transitions like catchup completion or
`sync_done -> ready` should only happen after the relevant destination work is durably finished. If those transitions happened earlier, ETL could advertise progress or unblock workers based on writes that were only queued, not actually flushed yet.