
perf: parallel partition writers via per-stream JoinSet#4193

Merged
rtyler merged 8 commits into delta-io:main from fvaleye:performance/parallel-partition-writers
Feb 17, 2026

Conversation

@fvaleye
Collaborator

@fvaleye fvaleye commented Feb 11, 2026

Summary

Introducing parallelized partitioned writes in the DataFusion write path:

  • For partitioned tables, we hash-repartition by partition columns and run one writer task per stream (JoinSet), then merge produced add actions.
  • For unpartitioned tables, we keep the existing single-writer fan-in path unchanged.

I tried to keep behavior stable while improving partitioned write performance.

Benchmark

Local benchmarks measuring the improvement:

| Scenario | main | branch | speedup vs main |
|---|---|---|---|
| partitioned 1M / 10 | 89.132 | 22.305 | +74.98% |
| partitioned 1M / 100 | 146.300 | 48.960 | +66.53% |
| partitioned 5M / 10 | 371.270 | 76.867 | +79.30% |
| unpartitioned 1M | 30.491 | 32.490 | -6.56% |

Partitioned writes are faster in these runs.
Unpartitioned path is functionally unchanged.

Notes

  • I tried not to introduce public API changes.
  • On writer task failure, remaining tasks are aborted (abort_all()), but some files may already have been written (they will be removed by a later vacuum)

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Feb 11, 2026
@codecov

codecov bot commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 83.95303% with 82 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.81%. Comparing base (e684ce9) to head (c7fc3fc).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/operations/write/execution.rs 75.47% 42 Missing and 35 partials ⚠️
crates/core/src/operations/write/mod.rs 97.46% 0 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4193      +/-   ##
==========================================
+ Coverage   76.74%   76.81%   +0.07%     
==========================================
  Files         166      166              
  Lines       47830    48209     +379     
  Branches    47830    48209     +379     
==========================================
+ Hits        36706    37031     +325     
- Misses       9287     9320      +33     
- Partials     1837     1858      +21     


@ion-elgreco ion-elgreco self-assigned this Feb 11, 2026
Collaborator

@ion-elgreco ion-elgreco left a comment


This is great! But I have some concerns/theories, which I would like to see validated :)

  1. Before, we had one writer, but we would push multiple partition streams into a single channel. Partition streams were not defined by partition columns, so the data could arrive unsorted from a Delta partition-columns perspective. This means we open multiple partition_writers inside the Delta writer. Even though we stream to disk, we still have multiple writers open, so in theory memory could also grow as the number of partitions grows. Could you check what sorting the datafusion plan by the partition columns does, while still keeping the old behavior of having one writer?
  2. Now we have a writer task per actual Delta partition stream, but repartitioning in datafusion should also come at a cost, I assume. I can believe their repartitioning is less memory intensive than what we do in our writer, but I'd still like to see this in a benchmark. So can you rerun the benchmarks, but also with a memory profile graph and peak RSS?
  3. Since datafusion drives the partition stream according to our delta lake partition columns, we might be able to increase performance by opening a PartitionWriter directly instead of a DeltaWriter. Can you also test this?

@fvaleye
Collaborator Author

fvaleye commented Feb 13, 2026

This is great! But I have some concerns/theories, which I would like to see validated :)

Sure, and feedback is always appreciated!

  1. Before we have one writer, but we would push multiple partition streams to a single channel. Partition streams were not defined by partition columns so the data could come in an unsorted order if you look at it from a Delta partition columns perspective. This means we open multiple partition_writers inside Delta writer. Even though we stream to disk, we still have multiple open so in theory this could also grow in memory if the amount of partitions grows. Could you check what sorting the datafusion plan by the partition columns while still keeping the old behavior of having one writer?

After checking it locally:
With one DeltaWriter receiving unsorted rows, multiple internal PartitionWriters can remain open, which could increase memory usage as partition cardinality grows.

I validated this with the one-writer behavior, comparing unsorted vs sorted-by-partition input.
Using the previous single-writer system, sorting input by partition columns improves performance, but only modestly (1M rows / 100 partitions):

  • unsorted: 1.58s
  • sorted by partition column: 1.43s
    So sorting helps (~9.6%) with identical output cardinality, but it's not comparable to the performance gains of multiple writers.
  2. Now we have a writer task per actual Delta partition stream, but repartitioning in datafusion should also come at a cost I assume. I can believe their repartitioning is less memory intensive than what we do in our writer but I still like to see this in a benchmark. So can you rerun the benchmarks but also with a memory profile graph, and peak rss?

Sure!

I took the time to run the benchmarks sequentially and locally.

Throughput (criterion, 10 samples):

| Scenario | Main | Branch | Speedup |
|---|---|---|---|
| 1M rows / 10 partitions | 159.82 ms | 65.95 ms | 2.42x |
| 1M rows / 100 partitions | 227.25 ms | 113.74 ms | 2.00x |
| 5M rows / 10 partitions | 491.09 ms | 142.23 ms | 3.45x |
| 1M rows unpartitioned | 60.30 ms | 53.83 ms | ~1.12x |
| 5M rows unpartitioned | 204.39 ms | 200.47 ms | ~1.02x |

Peak RSS (/usr/bin/time -l):

| Scenario | Main | Branch | Delta |
|---|---|---|---|
| 1M rows / 10 partitions | 607 MB | 626 MB | +3.1% |
| 1M rows / 100 partitions | 656 MB | 666 MB | +1.4% |
| 5M rows / 10 partitions | 712 MB | 979 MB | +37.5% |
| 1M rows unpartitioned | 448 MB | 445 MB | -0.6% |
| 5M rows unpartitioned | 700 MB | 480 MB | -31.4% |

Charts generated with Python/matplotlib

[charts: rss_timeseries, peak_rss_and_throughput]
  3. Since datafusion drives the partition stream according to our delta lake partition columns, we might be able to increase performance by opening a PartitionWriter directly instead of a DeltaWriter. Can you also test this?

Good idea, it’s definitely better in a quick local run (same work, 2 adds); the numbers were:

  • median DeltaWriter: ~654.9 ms
  • median with PartitionWriter: ~564.5 ms
    So, I implemented it!

The bump in memory might come from DataFusion's repartition buffers (shuffle) before writing.

Collaborator

@ion-elgreco ion-elgreco left a comment


Thanks for the benchmarks! Just curious: in your examples, are the partitions evenly distributed? Can you have a look at larger evenly distributed and skewed distributions?

Curious what it does there, because for skewed source data it might make sense to keep the old behavior.

@ion-elgreco
Collaborator

Looking at our DeltaWriter, I think we can actually just call DeltaWriter::write_partition, and you manually give the correct values here inside the writer task. I believe this will yield same performance while avoiding the divide_by_partition_values!

@fvaleye
Collaborator Author

fvaleye commented Feb 14, 2026

Thanks for the benchmarks! Just curious in your examples the partitions are evenly distributed? Can you have a look at larger evenly distributed and skewed distributions?

Curious what it does there because maybe for skewed source data it might make sense to have the old behavior

I ran some additional benchmarks covering both evenly distributed data and a 90% skewed distribution.

Results:

| Scenario | Main (time) | Branch (time) | Speedup | Main (RSS) | Branch (RSS) | RSS Delta |
|---|---|---|---|---|---|---|
| 10M rows / 100 partitions (even) | 1247.6 ms | 151.9 ms | 8.21x | 2641 MB | 2327 MB | -11.9% |
| 10M rows / 1000 partitions (even) | 3238.4 ms | 620.1 ms | 5.22x | 3451 MB | 2676 MB | -22.5% |
| 1M rows / 10 partitions (90% skew) | 52.8 ms | 38.5 ms | 1.37x | 1398 MB | 1527 MB | +9.3% |
| 1M rows / 100 partitions (90% skew) | 92.2 ms | 52.1 ms | 1.77x | 1519 MB | 2118 MB | +39.5% |
| 5M rows / 10 partitions (90% skew) | 252.4 ms | 171.3 ms | 1.47x | 1890 MB | 2245 MB | +18.8% |
| 10M rows / 100 partitions (90% skew) | 549.5 ms | 343.8 ms | 1.60x | 1890 MB | 2467 MB | +30.6% |
| 10M rows / 1000 partitions (90% skew) | 1103.2 ms | 492.8 ms | 2.24x | 2095 MB | 2997 MB | +43.1% |

DataFusion's repartitioning is very efficient when the data is evenly spread. When it's not, memory usage increases, likely from DataFusion's repartition buffers.
Even in the worst-case skew scenario, it's still 1.37x faster.

@fvaleye
Collaborator Author

fvaleye commented Feb 14, 2026

Looking at our DeltaWriter, I think we can actually just call DeltaWriter::write_partition, and you manually give the correct values here inside the writer task. I believe this will yield same performance while avoiding the divide_by_partition_values!

After RepartitionExec by partition columns, hash collisions can cause a stream to contain multiple partition values, so we can't reliably peek at the first row or skip partitioning. divide_by_partition_values is still needed to split batches that contain multiple partitions (hash collisions), but the sort cost per batch is tiny compared to the actual writes.
We could check whether a batch is single-partition and then call write_partition directly, falling back to divide_by_partition_values when it isn't. But I don't know if it's worth it.

@ion-elgreco
Collaborator

Looking at our DeltaWriter, I think we can actually just call DeltaWriter::write_partition, and you manually give the correct values here inside the writer task. I believe this will yield same performance while avoiding the divide_by_partition_values!

After RepartitionExec by partition columns, hash collisions can cause a stream to contain multiple partition values, so we can't reliably peek at the first row or skip partitioning. divide_by_partition_values is still needed to split batches that contain multiple partitions (hash collisions), but the sort cost per batch is tiny compared to the actual writes.
We can have a check that the batch is single-partition, then call write_partition directly, falling back to divide_by_partition_values when it's not. But I don't know if it's worth it.

Hmm, but what are the chances of a hash collision?

Hey! @alamb, in terms of repartitioning in datafusion, how likely is it that a partition stream contains parts from another stream?

@rtyler rtyler self-assigned this Feb 16, 2026
@ion-elgreco
Copy link
Collaborator

Looking at our DeltaWriter, I think we can actually just call DeltaWriter::write_partition, and you manually give the correct values here inside the writer task. I believe this will yield same performance while avoiding the divide_by_partition_values!

After RepartitionExec by partition columns, hash collisions can cause a stream to contain multiple partition values, so we can't reliably peek at the first row or skip partitioning. divide_by_partition_values is still needed to split batches that contain multiple partitions (hash collisions), but the sort cost per batch is tiny compared to the actual writes. We can have a check that the batch is single-partition, then call write_partition directly, falling back to divide_by_partition_values when it's not. But I don't know if it's worth it.

The docs and code imply that this doesn't happen, and that each partition stream exclusively contains data from that partition. I posted the question on Discord as well for extra verification, so let's see. If that's true, we can use write_partition directly

@fvaleye
Collaborator Author

fvaleye commented Feb 16, 2026

Looking at our DeltaWriter, I think we can actually just call DeltaWriter::write_partition, and you manually give the correct values here inside the writer task. I believe this will yield same performance while avoiding the divide_by_partition_values!

After RepartitionExec by partition columns, hash collisions can cause a stream to contain multiple partition values, so we can't reliably peek at the first row or skip partitioning. divide_by_partition_values is still needed to split batches that contain multiple partitions (hash collisions), but the sort cost per batch is tiny compared to the actual writes. We can have a check that the batch is single-partition, then call write_partition directly, falling back to divide_by_partition_values when it's not. But I don't know if it's worth it.

The docs and code imply that this doesn't happen, and that each partition stream has exclusively data from that partition. I posted the question as well on discord for extra verification, so lets see. If that's true we can use write_partition directly

I don't think HashRepartition gives a strong guarantee of one partition value per stream.
The concern I have is the following: RepartitionExec assigns rows via partition_index = hash(partition_columns) % num_partitions

And num_partitions comes from DataFusion's target_partitions configuration, which defaults to the number of available CPU cores.

If we have 8 cores on a machine, then num_partitions equals 8, and if we have 100 distinct partition values in a Delta table... we only have 8 buckets for 100 values.

Example:
- hash(year=2000) % 8 = 4
- hash(year=2001) % 8 = 4 -> same stream, different partition

In any case, it is worth checking if we require divide_by_partition_values, let me know 👍

@ethan-tyler
Collaborator

ethan-tyler commented Feb 16, 2026

@ion-elgreco - I did some research into this and here's my understanding. Hash repartitioning gives bucket exclusivity, not partition value exclusivity. All rows for a given partition key land in the same stream, but a stream can contain multiple partition values when distinct keys > N (normal bucket collision from % N).

I don't think we can peek at the first row and assume the whole stream is one partition; we still need divide_by_partition_values. The value here is that no two streams can write the same partition key concurrently.

Let's see what responses the DF discord gets too.

Member

@rtyler rtyler left a comment


Looks cool @fvaleye !

I have some questions inline; if they are just a gap in my own understanding, no need to explain in the PR, but code comments to remind us later would be appreciated

)?)
}
};
writer.write(&part.record_batch).await?;
Member


As best as I can tell this is also a potential serialization point. Is there a reason why the JoinSet approach wasn't applied here as well? 🤔

Collaborator Author


The parallelism is set at the stream level (one task per stream via JoinSet), while within a stream, writes are sequential to keep each partition writer's state consistent. If we wanted parallelism here, it would require synchronization (like a lock) that would hurt performance. And the gain would be minimal: after hash repartitioning, each batch typically contains just a small number of partition fragments, so the loop is very short. I can add a comment if you want!

Comment on lines +472 to +473
// Keep the previous single-writer fan-in path for unpartitioned tables.
if partition_columns.is_empty() {
Member


These functions are quite lengthy; I don't see a compelling reason to keep this divergent branch around, even if the section below is only going to end up with a JoinSet containing one task.

Was there severely different performance that would justify the pretty big deviation between these two code paths?

Collaborator Author


You're right, I wanted to split the paths intentionally between partition / no partition. That said, we can refactor duplicated code into helpers!

Comment on lines +661 to +662
// Keep the previous single-writer fan-in path for unpartitioned tables.
if partition_columns.is_empty() {
Member


Same concern about fairly redundant but different branches between writers with partitions and those without

Collaborator Author


Agreed!

Comment on lines +728 to +747
while let Some(mut normal_batch) = normal_stream.try_next().await? {
let mut idx: Option<usize> = None;
for (i_field, field) in
normal_batch.schema_ref().fields().iter().enumerate()
{
if field.name() == CDC_COLUMN_NAME {
idx = Some(i_field);
break;
}
}
normal_batch.remove_column(idx.ok_or(DeltaTableError::generic(
"idx of _change_type col not found. This shouldn't have happened.",
))?);

txn.send(normal_batch).await.map_err(|_| {
DeltaTableError::Generic(
"normal writer closed unexpectedly".to_string(),
)
})?;
}
Member


I'm not a datafusion wizard. 🪄 Is there a strong reason why this couldn't/shouldn't be operations on the normal_df that get handled by Datafusion in a way that adds some potential performance or parallelism gains?

I don't have a great mental model of how the normal_df plan would be executed by execute_stream but I would assume it does something (🤞) in parallel, which would mean that there's another potential serialization point here where the stream gets collapsed to a single task only to fill up txn one by one 🤔

Collaborator Author


I wrapped up this code (with JoinSet) to ensure that we follow the same path as before (it already existed).
We have two options if we want to improve this:

  1. Don't rely on DataFusion and use Arrow compute: replace the per-batch DataFusion filter with a direct arrow::compute::filter_record_batch using a boolean mask on the _change_type column.
  2. Rely more on DataFusion plan restructuring: instead of the upstream UNION ALL that merges normal + CDC rows with a _change_type tag, keep them as two separate execution plans from the start to avoid downstream splitting.

@ion-elgreco
Collaborator

ion-elgreco commented Feb 17, 2026

Looking at our DeltaWriter, I think we can actually just call DeltaWriter::write_partition, and you manually give the correct values here inside the writer task. I believe this will yield same performance while avoiding the divide_by_partition_values!

After RepartitionExec by partition columns, hash collisions can cause a stream to contain multiple partition values, so we can't reliably peek at the first row or skip partitioning. divide_by_partition_values is still needed to split batches that contain multiple partitions (hash collisions), but the sort cost per batch is tiny compared to the actual writes. We can have a check that the batch is single-partition, then call write_partition directly, falling back to divide_by_partition_values when it's not. But I don't know if it's worth it.

The docs and code imply that this doesn't happen, and that each partition stream has exclusively data from that partition. I posted the question as well on discord for extra verification, so lets see. If that's true we can use write_partition directly

I think there is no strong guarantee using HashRepartition of one partition value per stream. The concern I have is the following: RepartitionExec assigns rows via partition_index = hash(partition_columns) % num_partitions

And, nums_partitions comes from DataFusion's target_partitions configuration, which is by default the number of available CPU cores.

If we have 8 cores on a machine, then we have num_partitions equal to 8, and if we have 100 distinct partition values in a Delta Table... we only have 8 buckets for 100 values.

Example:
- hash(year=2000) % 8 = 4
- hash(year=2001) % 8 = 4 -> same stream, different partition

In any case, it is worth checking if we require divide_by_partition_values, let me know 👍

If that's the case, and num_partitions is always set by the default available CPU cores, then you should see roughly the same performance without hash repartitioning, right? Do you mind running the same benchmarks without the repartitioning step?

@fvaleye fvaleye force-pushed the performance/parallel-partition-writers branch from 02e5046 to 776f696 Compare February 17, 2026 18:36
@github-actions github-actions bot added the binding/python Issues for the Python package label Feb 17, 2026
- Replace the single-writer (N partition streams > mpsc channel > 1 writer) with per-partition-stream concurrent writers using JoinSet
- Hash repartition by partition columns ensures each stream writes to disjoint Delta partitions, avoiding duplicate small files
- Unpartitioned tables coalesce to a single stream, preserving file count
- Abort remaining tasks on error via JoinSet::abort_all()

Signed-off-by: Florian Valeye <florian.valeye@gmail.com>
@fvaleye fvaleye force-pushed the performance/parallel-partition-writers branch from 776f696 to 2689355 Compare February 17, 2026 18:40
@github-actions github-actions bot removed the binding/python Issues for the Python package label Feb 17, 2026
@rtyler rtyler enabled auto-merge (rebase) February 17, 2026 18:50
@rtyler rtyler disabled auto-merge February 17, 2026 18:51
@rtyler rtyler enabled auto-merge (rebase) February 17, 2026 18:51
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
@fvaleye fvaleye force-pushed the performance/parallel-partition-writers branch from 9611956 to 9d5b2d3 Compare February 17, 2026 20:41
Member

@rtyler rtyler left a comment


yolo

@rtyler rtyler dismissed ion-elgreco’s stale review February 17, 2026 21:46

ion said in slack that this was cool once the refactoring was done

@rtyler rtyler merged commit b5bd6ac into delta-io:main Feb 17, 2026
29 of 30 checks passed
@ethan-tyler
Collaborator

yolo

To the moon!!


Labels

binding/rust Issues for the Rust crate

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants