Conversation

@Lucas61000
Contributor

Changes Made

  • Add support for writing a _SUCCESS file upon successful completion of parquet write operations.
  • Add parameter to existing write_parquet method.
    df.write_parquet("/path/to/output", write_success_file=True)
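The semantics of the new flag can be sketched with a small stdlib-only helper (hypothetical illustration, not Daft's actual internals): data files are written first, and an empty `_SUCCESS` marker is created only after every write has completed, so downstream readers can treat its presence as "this directory is complete".

```python
from pathlib import Path


def write_files_with_marker(
    root_dir: str,
    files: dict[str, bytes],
    write_success_file: bool = False,
) -> None:
    """Hypothetical sketch of the write-then-mark pattern:
    write every data file, then create an empty _SUCCESS
    marker only once all writes have succeeded."""
    root = Path(root_dir)
    root.mkdir(parents=True, exist_ok=True)
    for name, data in files.items():
        (root / name).write_bytes(data)
    if write_success_file:
        # Written last, so its presence implies a complete write.
        (root / "_SUCCESS").write_bytes(b"")
```

This mirrors the Hadoop/Spark convention, where jobs poll for `_SUCCESS` before consuming an output directory.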

Related Issues

#4085

@greptile-apps
Contributor

greptile-apps bot commented Jan 27, 2026

Greptile Overview

Greptile Summary

Added support for writing _SUCCESS marker files in Parquet operations by introducing a write_success_file parameter to the write_parquet method (defaults to False). The implementation correctly handles all write modes (append, overwrite, overwrite-partitions) and creates the marker file after successful completion. The feature is well-tested with comprehensive coverage across different write modes and partition configurations.

Key Changes:

  • Python API: Added write_success_file parameter to DataFrame.write_parquet()
  • Rust implementation: Threaded the parameter through logical plan builder and sink infrastructure
  • File creation happens after overwrite cleanup operations, ensuring proper behavior in all scenarios
  • Graceful error handling with warning logs if _SUCCESS file creation fails
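The ordering and error-handling described above can be sketched in plain Python (a hypothetical `commit_write` helper, not the actual Rust sink): in overwrite mode, stale files are removed first; the `_SUCCESS` marker is created only afterwards; and a marker failure is logged as a warning rather than raised, since the data write itself already succeeded.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def commit_write(
    root_dir: str,
    mode: str,
    new_files: set[str],
    write_success_file: bool,
) -> None:
    """Hypothetical commit step mirroring the described behavior:
    1) in overwrite mode, delete stale files from previous writes;
    2) only then create the _SUCCESS marker;
    3) treat a marker failure as best-effort (warn, don't raise)."""
    root = Path(root_dir)
    if mode == "overwrite":
        for path in root.iterdir():
            if path.is_file() and path.name not in new_files:
                path.unlink()  # stale file from a previous write
    if write_success_file:
        try:
            (root / "_SUCCESS").write_bytes(b"")
        except OSError as exc:
            # Data files are already committed; the marker is advisory.
            logger.warning("Failed to write _SUCCESS file: %s", exc)
```

Creating the marker after cleanup matters: if it were written first, an overwrite pass could delete it, or a reader could see `_SUCCESS` alongside stale files.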

Minor Issue:

  • Missing documentation for the new write_success_file parameter in the method's docstring

Confidence Score: 4/5

  • This PR is safe to merge with one minor documentation fix needed
  • The implementation is sound with proper parameter threading from Python to Rust, correct placement of _SUCCESS file creation after overwrite cleanup, comprehensive test coverage, and graceful error handling. The only issue is missing parameter documentation in the docstring, which should be added before merging.
  • daft/dataframe/dataframe.py needs the write_success_file parameter documented in the docstring

Important Files Changed

| Filename | Overview |
| --- | --- |
| daft/dataframe/dataframe.py | Added write_success_file parameter to write_parquet method; missing docstring documentation |
| src/daft-logical-plan/src/sink_info.rs | Added write_success_file field to OutputFileInfo struct with proper serialization support |
| src/daft-local-execution/src/sinks/commit_write.rs | Implemented _SUCCESS file creation after successful write operations, with proper error handling |
| tests/io/test_parquet.py | Comprehensive test coverage for _SUCCESS file creation across different write modes and configurations |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant DataFrame
    participant LogicalPlanBuilder
    participant CommitWriteSink
    participant IOClient
    participant Storage

    User->>DataFrame: write_parquet(root_dir, write_success_file=True)
    DataFrame->>LogicalPlanBuilder: write_tabular(write_success_file=True)
    LogicalPlanBuilder->>LogicalPlanBuilder: table_write() creates OutputFileInfo
    Note over LogicalPlanBuilder: OutputFileInfo stores write_success_file flag

    DataFrame->>DataFrame: collect() - executes plan
    DataFrame->>CommitWriteSink: finalize(states)

    alt Overwrite Mode
        CommitWriteSink->>IOClient: glob files in root_dir
        IOClient->>Storage: list existing files
        Storage-->>IOClient: file list
        IOClient-->>CommitWriteSink: existing files
        CommitWriteSink->>Storage: delete old files
    end

    alt write_success_file is true
        CommitWriteSink->>IOClient: get_source(root_uri)
        IOClient-->>CommitWriteSink: source
        CommitWriteSink->>Storage: put("/_SUCCESS", empty bytes)
        alt Success
            Storage-->>CommitWriteSink: ok
        else Failure
            CommitWriteSink->>CommitWriteSink: log::warn(error)
        end
    end

    CommitWriteSink-->>DataFrame: written file paths
    DataFrame-->>User: DataFrame with file paths
```


@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


@greptile-apps
Contributor

greptile-apps bot commented Jan 27, 2026

Additional Comments (1)

daft/dataframe/dataframe.py
Missing write_success_file parameter in the docstring. The parameter should be documented.

            write_mode (str, optional): Operation mode of the write. `append` will add new data, `overwrite` will replace the contents of the root directory with new data. `overwrite-partitions` will replace only the contents in the partitions that are being written to. Defaults to "append".
            write_success_file (bool, optional): Whether to write a `_SUCCESS` file upon successful completion. Defaults to False.
Path: daft/dataframe/dataframe.py, lines 794-795.

@Lucas61000 Lucas61000 changed the title Add support for writing _SUCCESS files in parquet operations feat: add support for writing _SUCCESS files in parquet operations Jan 27, 2026
@github-actions github-actions bot added the feat label Jan 27, 2026
@codecov

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 0% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.41%. Comparing base (aa8add2) to head (06d895c).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/daft-local-execution/src/sinks/commit_write.rs | 0.00% | 12 Missing ⚠️ |
| src/daft-logical-plan/src/sink_info.rs | 0.00% | 3 Missing ⚠️ |
Additional details and impacted files

Impacted file tree graph

```
@@             Coverage Diff             @@
##             main    #6090       +/-   ##
===========================================
- Coverage   72.91%   43.41%   -29.51%
===========================================
  Files         973      909       -64
  Lines      126196   112753    -13443
===========================================
- Hits        92016    48948    -43068
- Misses      34180    63805    +29625
```
| Files with missing lines | Coverage Δ |
| --- | --- |
| daft/dataframe/dataframe.py | 77.38% <ø> (ø) |
| daft/logical/builder.py | 92.92% <ø> (ø) |
| src/daft-logical-plan/src/builder/mod.rs | 40.79% <ø> (-37.64%) ⬇️ |
| src/daft-logical-plan/src/sink_info.rs | 0.00% <0.00%> (-44.45%) ⬇️ |
| src/daft-local-execution/src/sinks/commit_write.rs | 0.00% <0.00%> (-87.44%) ⬇️ |

... and 650 files with indirect coverage changes

