Description
Environment
Delta-rs version: 1.1.4
Bug
What happened:
When I tried to optimize a large Delta table (~10 GB of parquet), I got the following error on my M4 Mac:
_internal.DeltaError: Failed to parse parquet: Parquet error: Z-order failed while scanning data: ResourcesExhausted("The used disk space during the spilling process has exceeded the allowable limit of 100.0 GB. Try increasing the max_temp_directory_size in the disk manager configuration.")
Spelunking through the code, I'm pretty sure this is a DataFusion limit. Is there any way to increase it? I tried tweaking all the settings I could find, to no avail:
#!/usr/bin/env python
import sys
from deltalake import DeltaTable, WriterProperties
SPILL_SIZE = 40 * 1024 * 1024 * 1024  # 40 GiB
agg = DeltaTable(sys.argv[1], storage_options={"AWS_DEFAULT_REGION": "us-west-2"})
wp = WriterProperties(compression="ZSTD", compression_level=3)
print("Optimizing...")
# agg.optimize.compact()
# agg.vacuum(retention_hours=0, enforce_retention_duration=False, dry_run=False)
agg.optimize.z_order(
    columns=["object_id"],
    writer_properties=wp,
    max_concurrent_tasks=1,
    max_spill_size=SPILL_SIZE,
)
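One thing I haven't verified yet: if the table were partitioned, it looks like the work could be chunked per partition via the z_order partition_filters argument, so each pass spills less. A rough sketch (the "date" partition column here is hypothetical; my real table may not have a usable partition):
# Hypothetical: z-order one partition at a time to bound spill per pass.
# Assumes the table is partitioned by a "date" column (made up here).
for day in ["2024-01-01", "2024-01-02"]:
    agg.optimize.z_order(
        columns=["object_id"],
        partition_filters=[("date", "=", day)],
        writer_properties=wp,
        max_concurrent_tasks=1,
        max_spill_size=SPILL_SIZE,
    )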
What you expected to happen:
To be able to either raise the limit or produce less spillage.
How to reproduce it:
Not sure; the dataset is proprietary. I could probably create a reproduction case with safe data if that would help, though.
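In the meantime, here is a rough sketch of the shape a synthetic repro could take. The path, row counts, and column names are all made up, and this volume is far smaller than my real ~10 GB table, so it would need scaling up to actually trigger the 100 GB spill limit:
#!/usr/bin/env python
# Sketch of a synthetic repro; all names and sizes here are made up
# and would need scaling up to actually hit the spill limit.
import numpy as np
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

PATH = "/tmp/zorder_repro"
rng = np.random.default_rng(42)

# Append many small commits so optimize has plenty of fragments to rewrite.
for _ in range(20):
    batch = pa.table({
        "object_id": rng.integers(0, 1_000_000, size=500_000),
        "payload": rng.random(size=500_000),
    })
    write_deltalake(PATH, batch, mode="append")

DeltaTable(PATH).optimize.z_order(columns=["object_id"])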
More details: