Replies: 1 comment
Hi @erandagan, I'm sorry for the late response. Some thoughts on your points:
Table formats (Lance/Iceberg), slicing: We could split the dataset into multiple partitions (essentially the chunks you mentioned) at file granularity, and the split would be deterministic and reproducible for the same dataset. To make it deterministic, we may need to list and sort all file paths in the dataset before partitioning. Filtering and skipping then happen at this partition level. In most cases, even a dataset with trillions of rows consists of only tens of millions of files. Grouping these files yields a manageable number of partitions (e.g., hundreds of thousands), which lets us check and skip already-processed data efficiently at the partition level. We could even inject this skipping logic directly into each UDF operator.
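To make the idea concrete, here is a minimal sketch of the deterministic partitioning and skip logic described above, in plain Python. The function names (`make_partitions`, `pending_partitions`) are hypothetical illustrations, not Daft APIs; in practice the file listing would come from the table format's manifest and the completed-partition set from a checkpoint record.

```python
def make_partitions(file_paths, files_per_partition):
    """Sort paths first so the split is reproducible for the same dataset,
    then group them into fixed-size partitions."""
    ordered = sorted(file_paths)
    return [
        ordered[i:i + files_per_partition]
        for i in range(0, len(ordered), files_per_partition)
    ]

def pending_partitions(partitions, completed_ids):
    """Skip partitions whose index is already recorded as processed."""
    return [(i, part) for i, part in enumerate(partitions) if i not in completed_ids]
```

Because the sort order is stable, a restarted job rebuilds the exact same partition list and can resume by consulting only the set of completed partition indices.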
Closely related to #5931, #5868
I’m looking for the recommended "best practice" for processing large datasets incrementally in Daft, specifically regarding robustness against failures like OOMs or VM preemptions. My goal is to be able to resume from a checkpoint rather than restarting a multi-hour job from scratch.
While I see there is active work on native checkpointing (#5931), it currently lacks support for the Native Runner and Lance. I was surprised this wasn't more prominent in the documentation, as stop/resume functionality is a standard requirement for large-scale ETLs.
In the interim, I’ve considered the following approaches but found significant drawbacks:
Manual Slicing:
Description: Manually slice the DataFrame into chunks, then process and save each chunk.
Issue: This loses much of the efficiency gain from Daft's parallelization and causes schema friction (e.g., adding a column to a Lance table requires a full reload after saving before you can continue). Window functions and other more 'sophisticated' operations can also behave unexpectedly when the DataFrame is sliced.
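For reference, the manual-slicing pattern above can be sketched as deterministic row ranges plus a resume index. This is plain Python for illustration only; the actual read/process/write of each chunk would go through Daft, and `resume_point` assumes chunks are committed in order.

```python
def chunk_ranges(total_rows, chunk_size):
    """Row ranges [start, end) covering the dataset, in a fixed order,
    so a rerun produces the same chunks."""
    return [
        (start, min(start + chunk_size, total_rows))
        for start in range(0, total_rows, chunk_size)
    ]

def resume_point(completed_chunks):
    """Index of the first chunk that still needs processing,
    given the set of chunk indices already saved."""
    return max(completed_chunks, default=-1) + 1
```

The drawback noted above still applies: each chunk is an independent, mostly serial unit of work, so the engine cannot parallelize across chunk boundaries.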
External Caching:
Description: Have each UDF maintain a standalone cache store.
Issue: Effective for expensive UDFs, but doubles storage requirements and requires caching each UDF individually. Additionally, I'm unsure how the Daft engine handles a dramatic throughput "cliff" (e.g., dropping from 1000 rows/s to 1 row/s as it hits the end of the cache).
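As a sketch of the external-caching approach, here is a wrapper that consults a key/value store before running the expensive UDF body. The `CachedUDF` class and the dict-backed store are illustrative assumptions; a real deployment would use a persistent store (e.g. on object storage), which is where the doubled storage cost comes from.

```python
class CachedUDF:
    def __init__(self, fn, store):
        self.fn = fn          # the expensive per-row function
        self.store = store    # mapping: row key -> cached result
        self.misses = 0       # how often we actually had to compute

    def __call__(self, key, value):
        # Return the cached result if this row was processed in a prior run.
        if key in self.store:
            return self.store[key]
        self.misses += 1
        result = self.fn(value)
        self.store[key] = result
        return result
```

On a resumed run, rows already in the store return in near-zero time while new rows pay the full UDF cost, which is exactly the throughput "cliff" mentioned above.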
Is there a recommended pattern for "manual" checkpointing that doesn't tank performance?