Replies: 1 comment
Hi @erandagan, I'm sorry for the late response. Some thoughts on your points:
Table formats (Lance/Iceberg), slicing: We could split the dataset into multiple partitions (essentially the chunks you mentioned) at file granularity, and the split would be deterministic and reproducible for the same dataset. To make it deterministic, we may need to list and sort all file paths in the dataset before partitioning. Filtering and skipping then happen at this partition level. In most cases, even a dataset with trillions of rows consists of only tens of millions of files. Grouping these files yields a manageable number of partitions (e.g., hundreds of thousands), which lets us check and skip already-processed data efficiently at the partition level. We could even inject this skipping logic directly into each UDF operator.
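To make the idea concrete, here is a minimal sketch of the deterministic partitioning and skip logic described above, in plain Python. The function names (`make_partitions`, `pending_partitions`) are hypothetical illustrations, not Daft APIs; in practice the file listing would come from the table format's manifest and the completed-partition set from a checkpoint record.

```python
def make_partitions(file_paths, files_per_partition):
    """Sort paths first so the split is reproducible for the same dataset,
    then group them into fixed-size partitions."""
    ordered = sorted(file_paths)
    return [
        ordered[i:i + files_per_partition]
        for i in range(0, len(ordered), files_per_partition)
    ]

def pending_partitions(partitions, completed_ids):
    """Skip partitions whose index is already recorded as processed."""
    return [(i, part) for i, part in enumerate(partitions) if i not in completed_ids]
```

Because the sort order is stable, a restarted job rebuilds the exact same partition list and can resume by consulting only the set of completed partition indices.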
Closely related to #5931, #5868
I’m looking for the recommended "best practice" for processing large datasets incrementally in Daft, specifically regarding robustness against failures like OOMs or VM preemptions. My goal is to be able to resume from a checkpoint rather than restarting a multi-hour job from scratch.
While I see there is active work on native checkpointing (#5931), it currently lacks support for the Native Runner and Lance. I was surprised this wasn't more prominent in the documentation, as stop/resume functionality is a standard requirement for large-scale ETLs.
In the interim, I’ve considered the following approaches but found significant drawbacks:
Manual Slicing:
Description: Manually slice the DataFrame into chunks, then process and save each chunk.
Issue: This loses much of the efficiency gain from Daft's parallelization and causes schema friction (e.g., adding a column to a Lance table requires a full reload after saving before you can continue). Window functions and other more 'sophisticated' operations can also behave unexpectedly when the DataFrame is sliced.
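For reference, the manual-slicing pattern above can be sketched as deterministic row ranges plus a resume index. This is plain Python for illustration only; the actual read/process/write of each chunk would go through Daft, and `resume_point` assumes chunks are committed in order.

```python
def chunk_ranges(total_rows, chunk_size):
    """Row ranges [start, end) covering the dataset, in a fixed order,
    so a rerun produces the same chunks."""
    return [
        (start, min(start + chunk_size, total_rows))
        for start in range(0, total_rows, chunk_size)
    ]

def resume_point(completed_chunks):
    """Index of the first chunk that still needs processing,
    given the set of chunk indices already saved."""
    return max(completed_chunks, default=-1) + 1
```

The drawback noted above still applies: each chunk is an independent, mostly serial unit of work, so the engine cannot parallelize across chunk boundaries.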
External Caching:
Description: Have each UDF maintain a standalone cache store.
Issue: Effective for expensive UDFs, but doubles storage requirements and requires caching each UDF individually. Additionally, I'm unsure how the Daft engine handles a dramatic throughput "cliff" (e.g., dropping from 1000 rows/s to 1 row/s as it hits the end of the cache).
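As a sketch of the external-caching approach, here is a wrapper that consults a key/value store before running the expensive UDF body. The `CachedUDF` class and the dict-backed store are illustrative assumptions; a real deployment would use a persistent store (e.g. on object storage), which is where the doubled storage cost comes from.

```python
class CachedUDF:
    def __init__(self, fn, store):
        self.fn = fn          # the expensive per-row function
        self.store = store    # mapping: row key -> cached result
        self.misses = 0       # how often we actually had to compute

    def __call__(self, key, value):
        # Return the cached result if this row was processed in a prior run.
        if key in self.store:
            return self.store[key]
        self.misses += 1
        result = self.fn(value)
        self.store[key] = result
        return result
```

On a resumed run, rows already in the store return in near-zero time while new rows pay the full UDF cost, which is exactly the throughput "cliff" mentioned above.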
Is there a recommended pattern for "manual" checkpointing that doesn't tank performance?