perf: parallel partition writers via per-stream JoinSet #4193
rtyler merged 8 commits into delta-io:main
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #4193      +/- ##
==========================================
+ Coverage   76.74%   76.81%   +0.07%
==========================================
  Files         166      166
  Lines       47830    48209     +379
==========================================
+ Hits        36706    37031     +325
- Misses       9287     9320      +33
- Partials     1837     1858      +21
```
Force-pushed 84312ec to 739d8e5
ion-elgreco
left a comment
This is great! But I have some concerns/theories, which I would like to see validated :)
- Before, we had one writer, but we pushed multiple partition streams into a single channel. Those partition streams were not defined by the partition columns, so from a Delta partition-columns perspective the data could arrive unsorted. This means we open multiple partition writers inside the Delta writer. Even though we stream to disk, several writers are open at once, so in theory memory could also grow with the number of partitions. Could you check what sorting the DataFusion plan by the partition columns does, while still keeping the old behavior of having one writer?
- Now we have a writer task per actual Delta partition stream, but repartitioning in DataFusion should also come at a cost, I assume. I can believe their repartitioning is less memory-intensive than what we do in our writer, but I'd still like to see this in a benchmark. So can you rerun the benchmarks, but also with a memory profile graph and peak RSS?
- Since DataFusion drives the partition stream according to our Delta Lake partition columns, we might be able to increase performance by opening a PartitionWriter directly instead of a DeltaWriter. Can you also test this?
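The first point can be illustrated with a dependency-free sketch (plain tuples stand in for record batches; `writers_needed_unsorted` and `runs_sorted` are hypothetical illustrations, not delta-rs code): unsorted input forces one open writer per distinct partition value, while input sorted by the partition columns turns each partition into one contiguous run that can be written and closed before the next begins.

```rust
use std::collections::BTreeMap;

// Old behavior with unsorted input: bucket rows by partition value; every
// bucket corresponds to a PartitionWriter held open until the stream ends.
fn writers_needed_unsorted(rows: &[(&str, i64)]) -> usize {
    let mut buckets: BTreeMap<&str, Vec<i64>> = BTreeMap::new();
    for (part, v) in rows {
        buckets.entry(*part).or_default().push(*v);
    }
    buckets.len()
}

// With sorted input, each partition value is one contiguous run, so only
// one writer needs to be alive at any time: (value, run length) pairs.
fn runs_sorted<'a>(rows: &[(&'a str, i64)]) -> Vec<(&'a str, usize)> {
    let mut runs: Vec<(&'a str, usize)> = Vec::new();
    for (part, _) in rows {
        match runs.last_mut() {
            Some((p, n)) if *p == *part => *n += 1,
            _ => runs.push((*part, 1)),
        }
    }
    runs
}

fn main() {
    let unsorted = [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)];
    let mut sorted = unsorted;
    sorted.sort_by(|x, y| x.0.cmp(y.0));

    // Unsorted: one open writer per distinct partition value.
    println!("open writers, unsorted input: {}", writers_needed_unsorted(&unsorted));
    // Sorted: one run per value, processed strictly one at a time.
    println!("runs on sorted input: {:?}", runs_sorted(&sorted));
}
```

This is only the memory argument, not a claim about total throughput, which is what the requested benchmark would measure.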
ion-elgreco
left a comment
Thanks for the benchmarks! Just curious: in your examples, are the partitions evenly distributed? Can you have a look at larger evenly distributed and skewed distributions?
Curious what happens there, because for skewed source data it might make sense to keep the old behavior.
Looking at our DeltaWriter, I think we can actually just call DeltaWriter::write_partition and manually pass the correct partition values inside the writer task. I believe this will yield the same performance while avoiding divide_by_partition_values!
I ran some additional benchmarks covering both when the data is evenly distributed and when it's not. Results:
DataFusion's repartitioning is very efficient when the data is evenly spread. When it's not, memory usage increases, which might come from DataFusion's repartition buffers.
Hmm, but what are the chances of a hash collision? Hey @alamb! In terms of repartitioning in DataFusion, how likely is it that a partition stream contains parts from another stream?
The docs and code imply that this doesn't happen, and that each partition stream has exclusively data from that partition. I posted the question on Discord as well for extra verification, so let's see. If that's true, we can use write_partition directly.
I think there is no strong guarantee using […]. And if we have 8 cores on a machine, then we have […]. In any case, it is worth checking if we require […].
@ion-elgreco - I did some research into this and here's my understanding. Hash repartitioning gives bucket exclusivity, not partition-value exclusivity. All rows for a given partition key land in the same stream, but a stream can contain multiple partition values when distinct keys > N (a normal bucket collision from % N). I don't think we can peek at the first row and assume the whole stream is one partition; we still need divide_by_partition_values. The value here is that no two streams can write the same partition key concurrently. Let's see what responses the DF Discord gets too.
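A quick way to see this bucket-vs-value distinction (std's `DefaultHasher` stands in for DataFusion's actual hash function, and `bucket` is a hypothetical helper, not delta-rs or DataFusion API): with more distinct keys than output streams, `hash(key) % N` must place several keys on the same stream by the pigeonhole principle, yet every row for a given key always lands on the same stream.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Assign a partition key to one of `n_streams` output streams.
fn bucket(key: &str, n_streams: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % n_streams
}

fn main() {
    let n_streams = 4u64;
    // 10 distinct date partitions, only 4 output streams.
    let keys: Vec<String> = (1..=10).map(|d| format!("2024-01-{d:02}")).collect();

    let mut per_stream: HashMap<u64, Vec<&str>> = HashMap::new();
    for k in &keys {
        per_stream.entry(bucket(k, n_streams)).or_default().push(k);
    }

    // Pigeonhole: at least one stream must carry >= 3 distinct partition
    // values, so a stream cannot be assumed to hold a single partition.
    let max_keys = per_stream.values().map(|v| v.len()).max().unwrap();
    println!("busiest stream carries {max_keys} distinct partition values");

    // Bucket exclusivity: the same key always hashes to the same stream,
    // so two streams never write the same partition concurrently.
    assert_eq!(bucket("2024-01-01", n_streams), bucket("2024-01-01", n_streams));
}
```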
```rust
        )?)
    }
};
writer.write(&part.record_batch).await?;
```
As best as I can tell this is also a potential serialization point. Is there a reason why the JoinSet approach wasn't applied here as well? 🤔
The parallelism is set at the stream level (one task per stream via JoinSet), while within the stream, writes are ordered (sequential) to keep each partition writer's state consistent. If we wanted parallelism here, it would require synchronization (like a lock) that would hurt performance. And the gain would be minimal because, after hash repartitioning, each batch typically contains just a small number of partition fragments, so the loop is very short. I can add a comment if you want!
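The concurrency shape being described can be sketched with std threads (the PR itself uses tokio's `JoinSet`; `write_streams` and the tuple "batches" are hypothetical stand-ins for streams of record batches): one worker per stream runs in parallel, while writes within a stream stay strictly sequential, so no lock is needed around any single partition writer.

```rust
use std::thread;

// One worker per stream (parallel across streams); each worker consumes
// its own stream in order (sequential within the stream), so a partition
// writer's state is only ever touched by one task.
fn write_streams(streams: Vec<Vec<(&'static str, i64)>>) -> Vec<(&'static str, i64)> {
    let handles: Vec<_> = streams
        .into_iter()
        .map(|stream| {
            thread::spawn(move || {
                let mut written = Vec::new();
                for batch in stream {
                    // Stand-in for `writer.write(&batch).await`: ordered
                    // within the stream, no synchronization required.
                    written.push(batch);
                }
                written
            })
        })
        .collect();

    // Fan-in: merge each worker's results (analogous to merging add actions).
    let mut all = Vec::new();
    for h in handles {
        all.extend(h.join().expect("writer task panicked"));
    }
    all
}

fn main() {
    // Three repartitioned streams holding disjoint partition keys.
    let streams = vec![
        vec![("a", 1), ("a", 2)],
        vec![("b", 3)],
        vec![("c", 4), ("c", 5)],
    ];
    println!("{} rows written", write_streams(streams).len());
}
```

The design trade-off is visible here: parallelizing inside a worker's loop would need shared-state locking, while the per-stream split gets parallelism for free from disjoint ownership.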
```rust
// Keep the previous single-writer fan-in path for unpartitioned tables.
if partition_columns.is_empty() {
```
These functions are quite lengthy; I don't see a compelling reason to keep this divergent branch around, even if the section below is only going to end up with a JoinSet containing one task.
Was there severely different performance that would justify the pretty big deviation between these two code paths?
You're right; I split the paths intentionally between partitioned and unpartitioned. That said, we can refactor the duplicated code into helpers!
```rust
// Keep the previous single-writer fan-in path for unpartitioned tables.
if partition_columns.is_empty() {
```
Same concern about fairly redundant but different branches between writers with partitions and those without
```rust
while let Some(mut normal_batch) = normal_stream.try_next().await? {
    let mut idx: Option<usize> = None;
    for (i_field, field) in normal_batch.schema_ref().fields().iter().enumerate() {
        if field.name() == CDC_COLUMN_NAME {
            idx = Some(i_field);
            break;
        }
    }
    normal_batch.remove_column(idx.ok_or(DeltaTableError::generic(
        "idx of _change_type col not found. This shouldn't have happened.",
    ))?);

    txn.send(normal_batch).await.map_err(|_| {
        DeltaTableError::Generic("normal writer closed unexpectedly".to_string())
    })?;
}
```
I'm not a DataFusion wizard. 🪄 Is there a strong reason why this couldn't/shouldn't be operations on the normal_df that get handled by DataFusion in a way that adds some potential performance or parallelism gains?
I don't have a great mental model of how the normal_df plan would be executed by execute_stream, but I would assume it does something (🤞) in parallel, which would mean there's another potential serialization point here where the stream gets collapsed to a single task only to fill up txn one by one 🤔
I wrapped up this code (with JoinSet) to ensure that we follow the same path as before (it already existed).
We have two options if we want to improve this:
- Don't rely on DataFusion and use Arrow compute: replace the per-batch DataFusion plan with a direct arrow::compute::filter_record_batch using a boolean mask on the _change_type column.
- Rely more on DataFusion plan restructuring: instead of the upstream UNION ALL that merges normal + CDC rows with a _change_type tag, keep them as two separate execution plans from the start to avoid downstream splitting.
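The mechanics of the first option can be sketched without Arrow (in the real code this would be `arrow::compute::filter_record_batch` on a RecordBatch with a BooleanArray mask; the tuples, the `"insert"` tag choice, and `split_by_change_type` are hypothetical stand-ins): build one boolean mask from the `_change_type` column, then filter with the mask for the normal stream and with its complement for the CDC stream, with no per-batch DataFusion plan involved.

```rust
// Split rows into (normal, cdc) payloads using a boolean mask over the
// `_change_type` tag. `"insert"` marks "normal" rows purely for this
// illustration; real CDC tags include several change types.
fn split_by_change_type(rows: &[(&str, i64)]) -> (Vec<i64>, Vec<i64>) {
    // One mask, computed once per batch.
    let mask: Vec<bool> = rows.iter().map(|(ct, _)| *ct == "insert").collect();

    // Mask selects the normal rows...
    let normal: Vec<i64> = rows
        .iter()
        .zip(&mask)
        .filter(|(_, keep)| **keep)
        .map(|((_, v), _)| *v)
        .collect();

    // ...and its complement selects the CDC rows.
    let cdc: Vec<i64> = rows
        .iter()
        .zip(&mask)
        .filter(|(_, keep)| !**keep)
        .map(|((_, v), _)| *v)
        .collect();

    (normal, cdc)
}

fn main() {
    let rows = [("insert", 1), ("update_preimage", 2), ("insert", 3)];
    let (normal, cdc) = split_by_change_type(&rows);
    println!("normal rows: {normal:?}, cdc rows: {cdc:?}");
}
```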
If that's the case, and num_partitions is always set to the default available CPU cores, then you should see roughly the same performance without hash repartitioning, right? Do you mind running the same benchmarks without the repartitioning step?
Force-pushed 02e5046 to 776f696
- Replace the single writer (N partition streams > mpsc channel > 1 writer) with per-partition-stream concurrent writers using JoinSet
- Hash repartitioning by partition columns ensures each stream writes to disjoint Delta partitions, avoiding duplicate small files
- Unpartitioned tables coalesce to a single stream, preserving file count
- Abort remaining tasks on error via JoinSet::abort_all()

Signed-off-by: Florian Valeye <florian.valeye@gmail.com>
Force-pushed 776f696 to 2689355
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Force-pushed 9611956 to 9d5b2d3
ion said in Slack that this was cool once the refactoring was done.
To the moon!!


Summary
Introducing parallelized partitioned writes in the DataFusion write path:
spawn one writer task per partition stream (via JoinSet), then merge the produced add actions. Tried to keep behavior stable while improving partitioned writes.
Benchmark
Local benchmark on tables to see the improvements
Partitioned writes are faster in these runs.
Unpartitioned path is functionally unchanged.
Notes
- On error, remaining tasks are aborted (JoinSet::abort_all()), but files could already have been written (they will be vacuumed later)