feat: in-flight, streaming PartitionWriter
#3857
Conversation
```rust
    let upload_future = multi_part_upload.put_part(PutPayload::from(part));

    // wait until one spot frees up before spawning new task
    if tasks.len() >= max_concurrent_tasks {
        tasks.join_next().await;
    }
    tasks.spawn(upload_future);
    offset = end;
}
```
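For reference, here is a minimal, self-contained version of this bounded-concurrency loop, assuming an in-memory `object_store` and illustrative values for `part_size`, `max_concurrent_tasks`, and the payload; this is a sketch of the same `JoinSet` pattern in isolation, not the PR's actual code:

```rust
use object_store::{memory::InMemory, path::Path, ObjectStore, PutPayload};
use tokio::task::JoinSet;

#[tokio::main]
async fn main() -> object_store::Result<()> {
    let store = InMemory::new();
    let location = Path::from("example/file.parquet");
    let data = vec![0u8; 64 * 1024 * 1024]; // stand-in payload
    let part_size = 16 * 1024 * 1024; // illustrative part size
    let max_concurrent_tasks = 4; // illustrative concurrency cap

    let mut upload = store.put_multipart(&location).await?;
    let mut tasks: JoinSet<object_store::Result<()>> = JoinSet::new();
    let mut offset = 0;

    while offset < data.len() {
        let end = (offset + part_size).min(data.len());
        let upload_future = upload.put_part(PutPayload::from(data[offset..end].to_vec()));

        // Wait until one slot frees up before spawning a new upload task.
        if tasks.len() >= max_concurrent_tasks {
            tasks.join_next().await;
        }
        tasks.spawn(upload_future);
        offset = end;
    }

    // Drain the remaining in-flight parts, then finalize the multipart upload.
    while let Some(joined) = tasks.join_next().await {
        joined.expect("upload task panicked")?;
    }
    upload.complete().await?;
    Ok(())
}
```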
I have talked about similar approaches with @ion-elgreco in the past. My belief, based on what I have seen, is that parallelizing I/O operations in this way doesn't buy much: the upload of one part/file typically goes at line speed, so splitting it into parts just divides that line speed across the parts.
My hunch is that if we're able to parallelize the parquet encoding, then we'd see some good improvements.
Do you have some measurements to share to help inform the discussion?
So this code was copied from the original implementation; I don't have any opinion on whether it helps or not. I know `ParquetObjectWriter` does something similar internally.
The bigger benefit of this PR is that rather than yielding on writing the new parquet file to the store before starting to write the second one, we can kick off the write job for the first file in the background and immediately start writing to the second one.
As for benchmarks, I'm writing a Python-based one. But I might just give this PR a whirl in my dev environment anyway, for my own curiosity, with AWS :)
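To make that point concrete, here is a toy sketch of the pattern (not the PR's code): the close of a full file is spawned onto a `JoinSet` so the loop can move straight to the next file. `FinishedFile` and `finish` are hypothetical stand-ins for the real writer-close path.

```rust
use tokio::task::JoinSet;

// Hypothetical stand-in for the real "flush remaining bytes and complete
// the multipart upload" result in the writer.
struct FinishedFile {
    path: String,
}

async fn finish(file_id: usize) -> FinishedFile {
    // Pretend this flushes buffered parquet bytes and completes the upload.
    FinishedFile { path: format!("part-{file_id}.parquet") }
}

#[tokio::main]
async fn main() {
    let mut in_flight: JoinSet<FinishedFile> = JoinSet::new();

    for file_id in 0..3 {
        // ... write batches to file `file_id` until it reaches the target size ...

        // Kick off the close in the background and immediately start the next
        // file, instead of awaiting the flush here.
        in_flight.spawn(finish(file_id));
    }

    // Collect the finished-file metadata once all background closes are done.
    while let Some(done) = in_flight.join_next().await {
        println!("finished {}", done.expect("close task panicked").path);
    }
}
```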
I'm seeing about a 5-10% increase in write performance.
```rust
async fn write_batch(&mut self, batch: &RecordBatch) -> DeltaResult<()> {
    match self {
        LazyArrowWriter::Initialized(path, object_store, config) => {
            let writer = ParquetObjectWriter::from_buf_writer(
```
Nice, I didn't even know the Parquet crate had this streamed-write functionality.
Me neither, it's pretty new and I only found it after starting this PR
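For anyone else seeing it for the first time, here is a minimal sketch of that streamed-write path, composing `BufWriter`, `ParquetObjectWriter::from_buf_writer`, and `AsyncArrowWriter` against an in-memory store; the 5 MiB buffer capacity and the schema are illustrative, not what this PR configures:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use object_store::{buffered::BufWriter, memory::InMemory, path::Path};
use parquet::arrow::{async_writer::ParquetObjectWriter, AsyncArrowWriter};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let store = Arc::new(InMemory::new());
    let path = Path::from("example/part-0.parquet");

    // BufWriter keeps up to `capacity` bytes in memory, then transparently
    // switches to a multipart upload once the threshold is crossed.
    let buf_writer = BufWriter::with_capacity(store, path, 5 * 1024 * 1024);
    let object_writer = ParquetObjectWriter::from_buf_writer(buf_writer);

    let batch = RecordBatch::try_from_iter(vec![(
        "id",
        Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef,
    )])?;

    // Encode batches to parquet and stream them out through the BufWriter.
    let mut writer = AsyncArrowWriter::try_new(object_writer, batch.schema(), None)?;
    writer.write(&batch).await?;
    writer.close().await?;
    Ok(())
}
```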
@abhiaagarwal we might still suffer from smaller files: we create a writer for each datafusion partition output and then call close at the end of the execution of each inner plan. Depending on how datafusion partitioned this, each file might still be below the target file size even though combined they're above it.
Great stuff @abhiaagarwal ^^, this should definitely also improve memory pressure.
Yeah, my suspicion is that when the datafusion plan is created it uses …
I think we can solve it if we have a single global writer here (delta-rs/crates/core/src/operations/write/execution.rs, lines 283 to 308 at e6d9a2a), instead of a new writer for each inner plan.
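A toy sketch of that single-global-writer idea, with every partition stream appending through one shared writer so file-size decisions see the combined output; `PartitionWriter` here is a bare stand-in, not the delta-rs type:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Bare stand-in for the delta-rs writer; it exists only to show the sharing.
struct PartitionWriter;

impl PartitionWriter {
    async fn write(&mut self, batch: &str) {
        // Pretend this appends the batch and rolls to a new file when the
        // combined bytes cross the target file size.
        let _ = batch;
    }
    async fn close(self) {
        // Pretend this flushes whatever is still buffered.
    }
}

#[tokio::main]
async fn main() {
    let writer = Arc::new(Mutex::new(PartitionWriter));

    // One task per datafusion output partition, all feeding the same writer.
    let mut handles = Vec::new();
    for partition in 0..4 {
        let writer = Arc::clone(&writer);
        handles.push(tokio::spawn(async move {
            for i in 0..10 {
                let batch = format!("partition {partition} batch {i}");
                writer.lock().await.write(&batch).await;
            }
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }

    // Close once at the end of the whole plan, not once per inner plan.
    let writer = Arc::try_unwrap(writer).ok().unwrap().into_inner();
    writer.close().await;
}
```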
Description
This was based on me taking a closer look at the `PartitionWriter` code and realizing it could be made a bit more efficient by deferring the parquet writing to a background task and removing the async locking associated with `AsyncShareableBuffer`.
Instead, it switches the underlying writer to use a `ParquetObjectWriter`, which internally uses a `BufWriter` that keeps a certain capacity of bytes in memory until reaching a threshold, at which point it starts a multipart write. Additionally, once a writer has enough bytes in progress/about to be flushed, we finish its writing on a background tokio task instead of yielding while putting new writes on the new file.
Related Issue(s)
Probably closes #3578, as it still uses a small buffer internally through `ParquetObjectWriter` but adaptively flushes via a multipart upload using `BufWriter` for streaming writes. I'm also hoping this solves the issue I faced in #3855, though I'll aim to test this in production.
- closes #3675 (superseded)
Documentation
- [`ParquetObjectWriter`](https://docs.rs/parquet/56.2.0/parquet/arrow/async_writer/struct.ParquetObjectWriter.html)
- [`BufWriter`](https://docs.rs/object_store/0.12.4/object_store/buffered/struct.BufWriter.html)