Daft checkpoint design #5868
Replies: 9 comments 13 replies
-
Looking forward to discussing our checkpoint implementation with the community! We’ve built v1 of the checkpoint and would love to contribute!
-
Great work on the checkpoint proposal! This is a solid step forward for Daft's robustness. Excited to see this feature take shape! I have a few questions about this:
-
Is there a mistake here? Should it be
-
Why is it necessary to add some
-
If needed later, I would be glad to help build this together.
-
Thank you for the clear design and thorough documentation! I have a question about how the current checkpointing mechanism handles “row splits”. For example, in a RAG pipeline using Daft, a single PDF file (identified originally by its file_path) might be split into multiple text chunks during processing. As a result, the primary key for each output row changes from just file_path to a composite key like file_path + chunk_id. In this scenario, it becomes challenging for the sink to determine when all chunks corresponding to a given file_path have been fully processed and can safely be included in a checkpoint. How does Daft’s checkpointing design address this case?

A related issue arises when a PDF file is filtered out midway through the pipeline (e.g., due to quality or relevance criteria) and never reaches the sink. In that case, since no output rows are emitted for that file, it won’t be recorded in the checkpoint at all. Consequently, if the pipeline restarts, the same file would be reprocessed, even though it was already handled (and legitimately discarded) in a prior run. How does Daft’s checkpointing system address these cases to ensure exactly-once processing semantics and avoid redundant work?
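To make the “row split” concern concrete, here is a tiny plain-Python illustration (hypothetical data, not Daft's API): once rows fan out into chunks, filtering by the composite key resumes the missing chunks, but file_path alone cannot tell whether a file is complete.

```python
# Hypothetical illustration of the row-split concern: one input file fans
# out to several chunks, and only some chunks were checkpointed before a
# crash. The composite key resumes the missing chunks, but the file_path
# by itself is ambiguous: it appears both "done" and "not done".
chunks = [("doc.pdf", 0), ("doc.pdf", 1), ("doc.pdf", 2)]  # (file_path, chunk_id)
written = {("doc.pdf", 0)}  # composite keys checkpointed before the crash

# Restart: keep only chunks whose composite key was not yet written.
remaining = [c for c in chunks if c not in written]

# A file can be simultaneously "partially written" and "partially pending",
# so a checkpoint keyed on file_path alone cannot mark it complete.
files_partially_done = {fp for fp, _ in written} & {fp for fp, _ in remaining}
```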
-
I believe this proposal would benefit significantly from PR #5924.
-
#5931
-
I want to revisit the benchmark comparison from the discussion. The benchmark compares the actor-based filter against a broadcast anti-join and shows OOM at 189M rows. But Daft also supports hash join, which repartitions both sides by key hash so that no single node holds the full key set. Could you re-run the benchmark with strategy="hash"? I'm curious to see how the numbers compare.
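For context, a plain-Python sketch (not Daft internals) of the hash-partitioning idea behind strategy="hash": both sides are repartitioned by the hash of the join key, so matching keys land in the same partition and each partition only needs its own slice of the key set for a local anti-join.

```python
# Sketch of why a hash join bounds per-node memory: both sides are
# repartitioned by hash(key) % N, so each partition performs a local
# anti-join against only its own slice of the key set.
NUM_PARTITIONS = 4

def partition(rows, key_fn, n=NUM_PARTITIONS):
    """Assign each row to a partition by hashing its join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key_fn(row)) % n].append(row)
    return parts

def hash_anti_join(left, right_keys):
    """Keep left rows whose key does NOT appear in right_keys."""
    left_parts = partition(left, key_fn=lambda r: r["id"])
    right_parts = partition(right_keys, key_fn=lambda k: k)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        local_keys = set(rp)  # only this partition's keys are materialized
        out.extend(r for r in lp if r["id"] not in local_keys)
    return out

new_rows = [{"id": i} for i in range(10)]
already_written = [0, 3, 7]
remaining = hash_anti_join(new_rows, already_written)
```

A broadcast join, by contrast, ships the full right-side key set to every node, which is the likely source of the OOM at 189M rows.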

-
Context
A long-running job may fail and terminate for various reasons (resource limits, an unstable environment, code bugs, etc.). When a failure occurs partway through, restarting often means the entire workflow runs from the beginning, re-executing data that was already processed. This redundant computation is a huge waste of resources and time.
We therefore propose a checkpoint design that enables "incremental processing": if a previous run terminates after processing and writing part of the data, subsequent runs skip the already-processed data and only complete the missing part.
Design
The checkpoint in Daft enables incremental processing. Its core principle is to use a primary key (or composite primary key) to filter out rows that have already been processed, ensuring that only new data is processed and appended to the target path.
This is achieved by injecting a filter predicate into the logical plan, immediately after the source node. When a write operation is initiated with a checkpoint_config, Daft first reads the primary keys from the existing data at the destination. This set of primary keys is then loaded into memory and distributed across a pool of checkpoint actors. During execution, the injected filter (actually a UDF actor) consults these actors to efficiently discard rows whose primary keys already exist. The DataFrame.write_* APIs are extended to accept checkpoint_config as a parameter, which controls this behavior.
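As a rough illustration of the scheme described above, here is a minimal plain-Python sketch (not Daft's actual implementation) of sharding the existing primary keys across a pool of filter "actors" and consulting them per batch:

```python
# Minimal sketch of the checkpoint filter described above: existing primary
# keys are sharded by hash across a pool of "actors", and each incoming
# batch asks the right shard whether a key was already written.
class CheckpointShard:
    """Holds one shard of the already-written primary-key set."""
    def __init__(self):
        self.keys = set()

    def add(self, key):
        self.keys.add(key)

    def contains(self, key):
        return key in self.keys

class CheckpointFilter:
    """Routes each key to a shard by hash, mimicking the actor pool."""
    def __init__(self, existing_keys, num_buckets=4):
        self.shards = [CheckpointShard() for _ in range(num_buckets)]
        for key in existing_keys:
            self._shard(key).add(key)

    def _shard(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def filter_batch(self, rows, key_fn):
        """Drop rows whose primary key already exists at the destination."""
        out = []
        for row in rows:
            key = key_fn(row)
            if not self._shard(key).contains(key):
                out.append(row)
        return out

# Keys already written by a previous (partial) run:
ckpt = CheckpointFilter(existing_keys=["a", "b"])
batch = [{"pk": "a"}, {"pk": "c"}, {"pk": "d"}]
fresh = ckpt.filter_batch(batch, key_fn=lambda r: r["pk"])
```

Because each row is routed to a single shard by key hash, no actor ever holds the full key set, which is what lets this scale without a global shuffle.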
Planning
Milestone 1: Checkpointing for major and basic scenarios
Status: ✅ Completed
Tasks:
checkpoint_config parameter: must be a dictionary containing:
- key_column: The name of the column(s) to use as the primary key / composite primary key.
- num_buckets (optional): The number of checkpoint actors to create for sharding the primary-key set.
- num_cpus (optional): The number of CPUs to allocate to each checkpoint actor.
- batch_size (optional): The batch size of the checkpoint filter operation.
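Putting the parameters together, a hypothetical checkpoint_config might look like the following (the keys come from the list above; the values are illustrative, and the write call shown in the comment assumes the proposed, not-yet-released API):

```python
# Hypothetical checkpoint_config using the parameter names listed above.
checkpoint_config = {
    "key_column": ["file_path", "chunk_id"],  # composite primary key
    "num_buckets": 8,      # shard the key set across 8 checkpoint actors
    "num_cpus": 1,         # CPUs allocated to each checkpoint actor
    "batch_size": 10_000,  # rows per checkpoint filter batch
}

# Under the proposed API this would be passed to a write, e.g.:
#   df.write_parquet("s3://bucket/output/", checkpoint_config=checkpoint_config)
```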
Limits in Milestone 1:
Milestone 2: Checkpoint Enhancement.
Status: ⌛️ In Progress
Tasks:
Limits in Milestone 2:
Milestone 3: Checkpoint available for all formats, like Flink CDC
Today: an actor-based filter delivers incremental processing without external state.
Long term: we could consider a stateful checkpointing mode inspired by Flink CDC / Flink checkpoints.
Tasks:
Limits in Milestone 3:
Benchmark
We conducted a test: reading data from Parquet files and deduplicating all of the data. We tested two methods:
We observed that the Milestone 1 checkpoint is more stable than anti-join-based deduplication and can support larger-scale datasets without triggering OOM. This advantage stems from the fact that actor-based checkpointing eliminates the need for costly data shuffling operations.
Note about the above dedup benchmark: