
Asked to increase max_temp_directory_size in the disk manager configuration when optimizing large table #3833


Description

@wlach

Environment

Delta-rs version: 1.1.4

Bug

What happened:

When I tried to optimize a large Delta table (~10 GB of parquet), I got the following error on my M4 Mac:

_internal.DeltaError: Failed to parse parquet: Parquet error: Z-order failed while scanning data: ResourcesExhausted("The used disk space during the spilling process has exceeded the allowable limit of 100.0 GB. Try increasing the max_temp_directory_size in the disk manager configuration.")

After some spelunking, I'm pretty sure this is a DataFusion limit. Is there any way to increase it? I tried tweaking all the settings I could find, to no avail:

#!/usr/bin/env python

import sys
from deltalake import WriterProperties
from deltalake import DeltaTable

SPILL_SIZE = 40 * 1024 * 1024 * 1024  # 40 GiB

agg = DeltaTable(sys.argv[1], storage_options={"AWS_DEFAULT_REGION": "us-west-2"})
wp = WriterProperties(compression="ZSTD", compression_level=3)
print("Optimizing...")
# agg.optimize.compact()
# agg.vacuum(retention_hours=0, enforce_retention_duration=False, dry_run=False)

# None of the settings below helped; z-order still hit the 100 GB disk-manager limit.
agg.optimize.z_order(
    columns=["object_id"],
    writer_properties=wp,
    max_concurrent_tasks=1,
    max_spill_size=SPILL_SIZE,
)

What you expected to happen:

To be able to either increase the limit or have the operation spill less.
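For illustration, this is the sort of knob I was hoping for. The max_temp_directory_size keyword below is hypothetical and does not exist in the current deltalake Python API; it just mirrors the setting named in the error message:

agg.optimize.z_order(
    columns=["object_id"],
    writer_properties=wp,
    max_spill_size=SPILL_SIZE,
    # hypothetical parameter, not part of deltalake today; would map to
    # DataFusion's disk-manager max_temp_directory_size
    max_temp_directory_size=200 * 1024 * 1024 * 1024,
)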

How to reproduce it:

Not sure yet; the dataset is proprietary. I could probably create a reproduction case with synthetic data if that helps (a sketch of what that might look like is below).
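A synthetic repro would probably look something like this sketch: the payload column and all the sizes are made up, and only object_id comes from my real schema.

#!/usr/bin/env python
# Sketch of a synthetic repro: append random batches until the table is
# roughly the same size as mine (~10 GB of parquet), then z-order it.
import numpy as np
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

rng = np.random.default_rng(0)
path = "/tmp/zorder_repro"

for _ in range(200):  # bump the batch count until the table reaches ~10 GB
    batch = pa.table(
        {
            "object_id": rng.integers(0, 10_000_000, size=2_000_000),
            "payload": rng.random(size=2_000_000),
        }
    )
    write_deltalake(path, batch, mode="append")

DeltaTable(path).optimize.z_order(columns=["object_id"])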

More details:

Metadata

Labels

bug (Something isn't working)
