
Commit f7688c4

fix: use single writer for all partition streams (#3870)
# Description

Collecting the streams concurrently and sending them over a sync channel lets us keep a single writer instance that we write through. This prevents the edge case of writing many small files when DataFusion decides to produce lots of partitioned streams that are each relatively small. Since we now maintain one writer, we keep writing from all partition streams into it.

I guess this stacks well with your PR @abhiaagarwal? Wdyt?

- closes #3871

With the same code as in the issue above, we now create only one file instead of 3 small files:

```json
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"0d1846c0-71e6-4c6d-aba1-5b9f8d752937","name":null,"description":null,"format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"foo\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"createdTime":1760889081860,"configuration":{}}}
{"add":{"path":"part-00000-352305f0-eff1-44d3-993e-6bd31cdd118c-c000.snappy.parquet","partitionValues":{},"size":510,"modificationTime":1760889081877,"dataChange":true,"stats":"{\"numRecords\":9,\"minValues\":{\"foo\":1},\"maxValues\":{\"foo\":3},\"nullCount\":{\"foo\":0}}","tags":null,"baseRowId":null,"defaultRowCommitVersion":null,"clusteringProvider":null}}
{"commitInfo":{"timestamp":1760889081878,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists"},"engineInfo":"delta-rs:py-1.2.0","clientVersion":"delta-rs.py-1.2.0","operationMetrics":{"execution_time_ms":19,"num_added_files":1,"num_added_rows":9,"num_partitions":0,"num_removed_files":0}}}
```

---------

Signed-off-by: Ion Koutsouris <[email protected]>
Signed-off-by: Ion Koutsouris
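The description boils down to a fan-in pattern: many partition streams feed one writer through a channel. Below is a minimal, self-contained sketch of that pattern, not the delta-rs implementation; it assumes the `tokio` and `futures` crates and uses plain `Vec<i64>` batches in place of Arrow `RecordBatch` streams.

```rust
use futures::{stream, StreamExt};
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Each "partition stream" here just yields batches of i64s; in delta-rs
    // these would be the RecordBatch streams produced by DataFusion partitions.
    let partition_streams: Vec<_> = (0..3i64)
        .map(|p| stream::iter(vec![vec![p, p + 1], vec![p + 2]]))
        .collect();

    // Shared channel: every partition sends its batches here.
    let (tx, mut rx) = mpsc::channel::<Vec<i64>>(16);

    // Drain all partition streams concurrently, forwarding their batches
    // into the shared channel instead of giving each stream its own writer.
    let producers: Vec<_> = partition_streams
        .into_iter()
        .map(|mut s| {
            let tx = tx.clone();
            tokio::spawn(async move {
                while let Some(batch) = s.next().await {
                    tx.send(batch).await.expect("writer task dropped");
                }
            })
        })
        .collect();
    drop(tx); // close the channel once all producers are done

    // The single writer: consumes batches from all partitions, so many small
    // per-partition outputs coalesce into one file instead of several.
    let writer = tokio::spawn(async move {
        let mut total_rows = 0usize;
        while let Some(batch) = rx.recv().await {
            total_rows += batch.len();
        }
        println!("single writer received {total_rows} rows");
    });

    for p in producers {
        p.await.unwrap();
    }
    writer.await.unwrap();
}
```

The design point is the same as in the commit: the number of output files is decoupled from the number of input partitions, because only the single consumer on the receiving end of the channel ever writes.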
1 parent d0ae849 commit f7688c4

File tree

2 files changed: +241 −168 lines


0 commit comments
