Commit 12d5ee2

Site/gene.bordegaray/2025/12/consecutive repartitions blog post title (#129)
* initial blog post
* better images and formatting
* realigned some images
* added links for Nga and Andrew's github
* added links for Nga and Andrew's github
* fixed to DataFusion and some word selection
* reformatted some images for clarity and minor changes to punctuation
* Update file name to match publish date
* updated images
* fix title

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
1 parent f10cb93 commit 12d5ee2

1 file changed: +2 -2 lines changed

content/blog/2025-12-15-avoid-consecutive-repartitions.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 ---
 layout: post
-title: Optimizing Repartitions in DataFusion: How I Went From Database Nood to Core Contribution
+title: Optimizing Repartitions in DataFusion: How I Went From Database Noob to Core Contribution
 date: 2025-12-15
 author: Gene Bordegaray
 categories: [tutorial]
@@ -198,7 +198,7 @@ SELECT a, SUM(b) FROM data.parquet GROUP BY a;
 
 Repartitions would appear back-to-back in query plans, specifically a round-robin followed by a hash repartition.
 
-Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and unecessary.
+Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and unnecessary.
 
 <div class="text-center">
 <img
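
The paragraph touched by the second hunk argues that a round-robin repartition immediately followed by a hash repartition does redundant work. The following is a minimal Python sketch of that claim, not DataFusion's implementation: the function names `round_robin` and `hash_repartition`, the tuple-based rows, and the use of `key % n` as a stand-in hash are all assumptions made for illustration. It only demonstrates that the final row placement is decided by the hash step alone, and that the round-robin step hands over batch references rather than copying rows.

```python
def round_robin(batches, n):
    """Deal whole batches across n partitions in turn (hypothetical helper)."""
    parts = [[] for _ in range(n)]
    for i, batch in enumerate(batches):
        parts[i % n].append(batch)  # same list object is passed along, rows are not copied
    return parts


def hash_repartition(partitions, n):
    """Route each row by its key (first field); key % n is a toy stand-in for a hash."""
    out = [[] for _ in range(n)]
    for part in partitions:
        for batch in part:
            for row in batch:
                out[row[0] % n].append(row)
    return out


# Three input batches of (a, b) rows, as in SELECT a, SUM(b) ... GROUP BY a.
batches = [[(1, 10), (2, 20)], [(1, 5), (3, 7)], [(2, 1)]]

rr = round_robin(batches, 2)
direct = hash_repartition([batches], 2)  # hash only
via_rr = hash_repartition(rr, 2)         # round-robin, then hash

# Compare partition contents while ignoring row order within a partition.
same_placement = [sorted(p) for p in direct] == [sorted(p) for p in via_rr]
print(rr[0][0] is batches[0], same_placement)
```

In this sketch every row ends up in the same output partition whether or not the round-robin step runs first, which is the sense in which the first repartition is described as pointless.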
