
feat: Export Query Results to CSV Format File #33

Open
shirly121 wants to merge 6 commits into main from arrow_csv_export

Conversation

@shirly121 (Collaborator) commented Mar 12, 2026

What do these changes do?

Related issue number

Fixes #38

Greptile Summary

This PR refactors the CSV export pipeline (COPY (...) TO '...') by replacing the old multi-callback ExportFunction with a simpler write_exec_func_t approach, moving the actual write logic into a new writer::ArrowCsvExportWriter/CSVStringFormatBuffer stack, and updating the query planner and proto to use FileSchema/EntrySchema instead of raw file_path + property_mappings. A comprehensive Python integration test suite is added and the HEADER default is changed from false to true.

Key concerns found during review:

  • Output stream never closed: ArrowCsvExportWriter::writeTable opens an Arrow OutputStream but never calls stream->Close(), which can silently truncate the output file before all buffered data is written to disk.
  • EXPORT DATABASE completely removed: planExportDatabase now unconditionally throws NOT_SUPPORTED_EXCEPTION, removing an existing feature with no deprecation notice or migration path.
  • Overly restrictive child-operator type check: convertCopyTo rejects any COPY TO whose direct child is not a PROJECTION or AGGREGATE, breaking valid queries that use ORDER BY, LIMIT, SKIP, or DISTINCT as the outermost clause.
  • Header not written for empty result sets: the header row is emitted only when addValue(0, 0) is called; if the query returns zero rows, the header is never written despite HEADER = true.
  • Buffer capacity overflow: DEFAULT_CAPACITY * batchSize * arrays_size() is computed without overflow protection; a large user-supplied BATCH_SIZE can silently wrap size_t and create an undersized allocation.
  • Raw pointer to catalog function: DataExportOpr holds a raw function::ExportFunction* from the catalog with no lifetime guarantee.

Confidence Score: 2/5

  • Not safe to merge — the output stream is never closed, which can silently corrupt or truncate exported files.
  • Multiple logic-level bugs are present: the most critical is that the Arrow output stream is never explicitly closed, meaning export files may be silently incomplete. Additionally, EXPORT DATABASE is removed as a breaking change, the child-operator type whitelist breaks valid ORDER BY/LIMIT queries, the header is not written for empty result sets, and the buffer capacity calculation can overflow. These issues need resolution before the feature can be relied upon in production.
  • Primary: src/utils/writer/writer.cc (stream not closed, capacity overflow, header-on-empty). Secondary: src/compiler/gopt/g_query_converter.cpp (restrictive type check), src/compiler/planner/plan/plan_port_db.cpp (EXPORT DATABASE removed), src/execution/execute/ops/batch/data_export.cc (raw function pointer).

Important Files Changed

  • src/utils/writer/writer.cc: Core CSV writer implementation (376 lines added). Contains three critical issues: (1) output stream never closed after writing, risking data loss; (2) buffer capacity multiplication can overflow size_t with large BATCH_SIZE; (3) header row is not written for empty result sets even when HEADER=true.
  • include/neug/utils/writer/writer.h: New header defining the ExportWriter class hierarchy (ExportWriter → ArrowExportWriter → ArrowCsvExportWriter), the WriteOptions struct, CSVStringFormatBuffer, and BinaryData. Structure is clean; no issues found in the header itself.
  • src/compiler/planner/plan/plan_port_db.cpp: Completely removes the EXPORT DATABASE implementation, replacing it with an unconditional NOT_SUPPORTED_EXCEPTION — a breaking change with no deprecation notice or migration path.
  • src/compiler/gopt/g_query_converter.cpp: Refactors CopyTo conversion to use the new ExportFunction/FileSchema approach. Adds a plan-reorder hack to swap sink/data_export if in the wrong order. Adds an overly-restrictive child-operator type check that rejects ORDER BY, LIMIT, and other valid wrapper operators.
  • src/execution/execute/ops/batch/data_export.cc: Replaces the old multi-field DataExportOpr constructor with a simpler schema/entry_schema/exportFunction pattern. Stores a raw pointer to a catalog function entry without lifetime guarantees.
  • src/compiler/function/csv_export_function.cpp: New replacement for export_csv_function.cpp. Registers writeExecFunc as the ExportCSVFunction's execFunc. DELIMITER validation logic is sound; LocalFileSystemProvider.provide(schema, false) is used correctly to skip glob expansion for write paths.
  • tools/python_bind/tests/test_export.py: Comprehensive Python integration test suite covering nodes, edges, paths, column selection, filtering, and delimiter options. Missing a test for COPY TO on a query that returns zero rows (empty result set), which is where the header-not-written bug would surface.
  • include/neug/compiler/function/export/export_function.h: Simplifies ExportFunction by removing the old sink/combine/finalize callbacks and replacing them with a single write_exec_func_t. The ExportCSVBindData and ExportParquetFunction structs are removed. ExportFuncBindData now carries the raw options map.
  • src/compiler/binder/bind/copy/bind_copy_to.cpp: Removes special-case CSV handling in getExportFunction and getExportFuncBindData; now delegates to the catalog for all file types uniformly. A clean simplification with no issues found.

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Binder
    participant Planner
    participant GQueryConvertor
    participant DataExportOpr
    participant ArrowCsvExportWriter

    User->>Binder: COPY (query) TO 'file.csv' (OPTIONS)
    Binder->>Binder: getExportFunction() → ExportFunction (catalog lookup)
    Binder->>Binder: getExportFuncBindData() → ExportFuncBindData {columnNames, fileName, options}
    Binder-->>Planner: BoundCopyTo {bindData, exportFunc}

    Planner->>Planner: planCopyTo() → LogicalCopyTo
    Planner-->>GQueryConvertor: LogicalCopyTo

    GQueryConvertor->>GQueryConvertor: convertFileSchema(bindData) → FileSchema proto
    GQueryConvertor->>GQueryConvertor: convertEntrySchema(bindData) → EntrySchema proto
    GQueryConvertor->>GQueryConvertor: DataExport proto {file_schema, entry_schema}
    GQueryConvertor->>GQueryConvertor: Swap sink ↔ data_export if needed

    DataExportOpr->>DataExportOpr: Eval() - look up ExportFunction* from catalog
    DataExportOpr->>DataExportOpr: exportFunction_->execFunc(ctx, schema, entry_schema, graph)

    Note over DataExportOpr,ArrowCsvExportWriter: writeExecFunc() in csv_export_function.cpp
    DataExportOpr->>ArrowCsvExportWriter: ArrowCsvExportWriter::write(ctx, graph)
    ArrowCsvExportWriter->>ArrowCsvExportWriter: Sink::sink_results → QueryResponse
    ArrowCsvExportWriter->>ArrowCsvExportWriter: writeTable(response)
    ArrowCsvExportWriter->>ArrowCsvExportWriter: OpenOutputStream(path)
    loop per batch
        ArrowCsvExportWriter->>ArrowCsvExportWriter: CSVStringFormatBuffer::addValue(row, col)
        ArrowCsvExportWriter->>ArrowCsvExportWriter: flush(stream)
    end
    Note over ArrowCsvExportWriter: ⚠️ stream->Close() never called
```

Comments Outside Diff (14)

  1. tests/storage/test_export.cc, line 2142 (link)

    Incorrect path in file-existence check

    The condition checks for a file literally named "c" instead of "/tmp/rel_a.csv". Because "c" almost never exists, the cleanup guard is effectively dead code and the stale test artifact is never removed before the test runs. This can cause the subsequent EXPECT_TRUE to pass spuriously when running tests in a dirty environment.
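    The intended guard can be sketched with std::filesystem (a standalone illustration with a hypothetical helper name, not the test's actual fixture code):

```cpp
#include <filesystem>

// Standalone sketch of the intended cleanup guard: check and remove the
// full artifact path rather than a truncated literal such as "c".
inline bool removeStaleArtifact(const std::filesystem::path& artifact) {
    if (std::filesystem::exists(artifact)) {
        std::filesystem::remove(artifact);
        return true;  // a stale file was present and has been removed
    }
    return false;  // nothing to clean up
}
```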

  2. src/compiler/planner/plan/plan_port_db.cpp, line 1381 (link)

    planExportDatabase unconditionally returns nullptr

    The entire EXPORT DATABASE implementation was removed and replaced with return nullptr. Any caller that does not guard against a null LogicalPlan* will dereference null and crash at runtime. Even if EXPORT DATABASE is intentionally being disabled as part of this refactor, returning nullptr silently is dangerous. At minimum, the function should throw an appropriate "not implemented" or "not supported" exception so the failure is explicit and diagnosable.

  3. include/neug/utils/writer/writer.h, line 652 (link)

    Default for has_header contradicts documentation

    has_header defaults to true in WriteOptions, but the documentation in doc/source/data_io/export_data.md explicitly documents the default as false:

    |`HEADER`|Whether to output a header row.|`false`|
    

    This discrepancy means a user who omits the HEADER option will get a header row in the output despite the docs saying otherwise. One of the two must be corrected.

  4. src/utils/writer/writer.cc, line 2109 (link)

    Potential null dereference of entry_schema_

    ArrowCsvExportWriter inherits from ArrowExportWriter, whose constructor accepts entry_schema = nullptr as a default. Here, *entry_schema_ is dereferenced unconditionally — if entry_schema_ is a null shared_ptr, this will crash.

    A null-check (or an assertion) should be added before dereferencing:

    if (!entry_schema_) {
      return Status(StatusCode::ERR_INVALID_ARGUMENT, "entry_schema is null");
    }
    auto csvBuffer = CSVStringFormatBuffer(table, schema_, *entry_schema_);

  5. src/compiler/function/csv_export_function.cpp, line 885 (link)

    DELIMITER silently ignored when DELIM is also set

    std::map::insert (and its case-insensitive equivalent) does not overwrite an existing key. If a user specifies both DELIM and DELIMITER in their options (unusual but not impossible), the value from DELIMITER is silently dropped and DELIM takes precedence without any warning or error. Consider using insert_or_assign so the later option wins, or detecting the duplicate key and logging a warning when both are supplied.
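    The difference is easy to see in isolation (a minimal sketch with an illustrative helper, not the project's option-parsing code):

```cpp
#include <map>
#include <string>

// std::map::insert keeps the existing value when the key is already present,
// so a later DELIMITER entry is silently dropped if DELIM was inserted first.
// insert_or_assign overwrites, letting the later option win.
inline std::string resolveDelimiter(bool overwriteLater) {
    std::map<std::string, std::string> options;
    options.insert({"delimiter", "|"});                  // from DELIM
    if (overwriteLater) {
        options.insert_or_assign("delimiter", ";");      // from DELIMITER: overwrites
    } else {
        options.insert({"delimiter", ";"});              // from DELIMITER: silent no-op
    }
    return options.at("delimiter");
}
```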

  6. tests/storage/test_export.cc, line 2142 (link)

    Truncated path in existence check

    The file path "/tmp/rel_a.csv" was accidentally truncated to just "c" in the guard condition. This means std::filesystem::remove("/tmp/rel_a.csv") will never be called to clean up the stale file before the test runs — potentially causing the subsequent EXPECT_TRUE to operate on leftover data from a previous run.

  7. src/compiler/planner/plan/plan_port_db.cpp, line 1346-1347 (link)

    planExportDatabase unconditionally returns nullptr

    The entire implementation of planExportDatabase was removed and replaced with return nullptr. Any caller that invokes this path (e.g., an EXPORT DATABASE statement) will receive a null LogicalPlan* and almost certainly crash with a null-pointer dereference downstream.

    If export-database support is intentionally being dropped or deferred, the function should at minimum throw an informative "not implemented" exception rather than silently returning null.

  8. tools/python_bind/tests/test_export.py, line 2227-2236 (link)

    Test name / query mismatch — queries movies, not person

    test_export_person_without_header is named as if it exports person nodes, but both the count query and the COPY TO statement target movies. This is almost certainly a copy-paste error and will silently test the wrong label under the wrong name.

  9. src/utils/writer/writer.cc, line 2010-2072 (link)

    WriteOptions reconstructed on every cell write

    A new WriteOptions object is constructed and schema_.options is looked up (via get()) for every single cell written — once for has_header, once for delimiter, once for ignore_errors, once for escape_char, and once for quote_char. For large result sets this creates significant per-cell overhead.

    Consider constructing WriteOptions once (e.g., in the CSVStringFormatBuffer constructor) and caching the resolved option values as fields, rather than re-resolving them in every addValue call.
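    A possible shape for the caching, with hypothetical type and member names (the real WriteOptions and CSVStringFormatBuffer differ):

```cpp
#include <cstddef>

// Sketch: resolve the option values once, at construction time, instead of
// re-reading the options map for every cell written.
struct CachedCsvOptions {
    char delimiter;
    char quoteChar;
    bool hasHeader;
};

class FormatBufferSketch {
public:
    explicit FormatBufferSketch(const CachedCsvOptions& resolved)
        : opts_(resolved) {}  // resolved exactly once, reused for every addValue

    char delimiterFor(std::size_t /*col*/) const { return opts_.delimiter; }
    bool hasHeader() const { return opts_.hasHeader; }

private:
    CachedCsvOptions opts_;  // cached fields: no per-cell map lookups
};
```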

  10. src/utils/writer/writer.cc, line 1910-1924 (link)

    kStructArray case missing null-validity check

    Every other typed-array case in formatValueToStr calls validateProtoValue(...) before reading the value, but the kStructArray branch proceeds without any validity check:

    case neug::Array::TypedArrayCase::kStructArray: {
        auto struct_arr = arr.struct_array();
        std::string list_val;
        list_val.append("[");
        for (int i = 0; i < struct_arr.fields_size(); ++i) {
            const auto& field = struct_arr.fields(i);
            ARROW_ASSIGN_OR_RAISE(auto elem, formatValueToStr(field, rowIdx));

    If struct_arr has a validity bitmap, this can silently produce output for null rows rather than returning an error or an empty string. Add the same validity check used by the other branches.
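    The shape of the missing check, in a generic standalone form (illustrative types rather than the project's proto classes):

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Sketch: consult the validity bitmap before formatting, returning nullopt
// for null rows instead of silently producing output for them.
inline std::optional<std::string> formatWithValidity(
    const std::vector<std::string>& values,
    const std::vector<bool>& validity,
    std::size_t rowIdx) {
    if (rowIdx < validity.size() && !validity[rowIdx]) {
        return std::nullopt;  // null row: caller emits an empty field or an error
    }
    return values.at(rowIdx);
}
```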

  11. src/compiler/function/csv_export_function.cpp, line 88 (link)

    Missing newline at end of file

    The file ends without a trailing newline (the diff shows \ No newline at end of file). This can cause compiler warnings and may confuse some tooling.

    Add a newline after the closing } of the namespace.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  12. src/utils/writer/writer.cc, line 317-336 (link)

    Output stream never closed after writing

    ArrowCsvExportWriter::writeTable opens an arrow::io::OutputStream and writes all CSV data to it, but never calls stream->Close() before the function returns. Arrow's LocalFileSystem output stream uses buffered I/O, and without an explicit close the kernel buffers may not be fully flushed to disk. This means exported CSV files can be silently truncated or empty on many file systems.
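    The fix pattern, sketched with the standard library (the real code would call arrow::io::OutputStream::Close() and check the returned arrow::Status; the helper name here is illustrative):

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Sketch: close the stream explicitly on the success path and surface any
// flush failure, so a truncated file cannot pass silently.
inline void writeAndClose(const std::string& path, const std::string& data) {
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path);
    out << data;
    out.close();  // explicit close: flushes buffers, sets failbit on error
    if (out.fail()) throw std::runtime_error("flush/close failed for " + path);
}
```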

  13. src/compiler/planner/plan/plan_port_db.cpp, line 23-29 (link)

    EXPORT DATABASE feature completely removed — breaking change

    The entire planExportDatabase implementation has been replaced with an unconditional THROW_NOT_SUPPORTED_EXCEPTION. Any user who relied on the EXPORT DATABASE Cypher statement will receive a runtime error after this PR merges, with no migration path or deprecation notice. The previous implementation exported all tables to files and was apparently functional (it was tested via test_load.py).

    If this removal is intentional (e.g. because it depended on the old ExportCSVBindData approach), it should at minimum be documented and the related parser/binder code for EXPORT DATABASE should also be removed to avoid confusing compiler errors, rather than silently failing at runtime.

  14. src/compiler/gopt/g_query_converter.cpp, line 1989-1997 (link)

    Overly-restrictive child operator type check breaks valid query shapes

    The new guard rejects any COPY TO whose direct child is not a PROJECTION or AGGREGATE, but valid Cypher queries often use ORDER BY (SORT), LIMIT, DISTINCT, or other operators as the outermost node:

    COPY (MATCH (v:person) RETURN v ORDER BY v.ID LIMIT 10) TO 'out.csv'

    These will now throw a compile-time exception. The previous code made no such restriction — it only required at least one child. If the intent is to guarantee that column names are available, checking for the schema expression list (as the old code did) is a safer approach than an operator-type whitelist.

  15. src/utils/writer/writer.cc, line 65-73 (link)

    Buffer capacity calculation can overflow size_t

    capacity_ = DEFAULT_CAPACITY * batchSize * response->arrays_size();

    DEFAULT_CAPACITY is size_t (64), batchSize is size_t (default 1024), and arrays_size() returns int. With the default values this is 64 * 1024 * N which is safe for small column counts, but a user could set BATCH_SIZE to a very large number (e.g., 1 << 30) and cause a silent size_t wrap-around, resulting in a wildly undersized allocation and subsequent heap corruption via the memcpy in write().

    Consider saturating/clamping the capacity or using a checked multiplication:
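    One possible shape for the checked multiplication (illustrative helper name; __builtin_mul_overflow is a GCC/Clang builtin, so portable code would need an equivalent guard):

```cpp
#include <cstddef>
#include <limits>

// Sketch: compute capacity with overflow detection, returning false instead
// of silently wrapping size_t and producing an undersized allocation.
inline bool checkedCapacity(std::size_t defaultCapacity, std::size_t batchSize,
                            std::size_t arraysSize, std::size_t* out) {
    std::size_t tmp = 0;
    if (__builtin_mul_overflow(defaultCapacity, batchSize, &tmp)) return false;
    if (__builtin_mul_overflow(tmp, arraysSize, out)) return false;
    return true;
}
```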

  16. src/utils/writer/writer.cc, line 238-250 (link)

    Header not emitted for empty result sets

    The header row is written only inside addValue when rowIdx == 0 && colIdx == 0. If the query produces zero rows, addValue is never called and the output file contains no header line — even when the user explicitly passed HEADER = true. Many downstream tools (Pandas, Spark, etc.) expect a header-only file to be valid CSV.

    The header should be written once during initialisation (or at the start of writeTable) rather than lazily on the first cell.
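    The eager-header approach can be sketched as follows (a simplified standalone writer, not the project's ArrowCsvExportWriter):

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Sketch: emit the header once, up front, so a zero-row result still
// produces a valid header-only CSV file.
inline void writeCsv(std::ostream& out,
                     const std::vector<std::string>& columnNames,
                     const std::vector<std::vector<std::string>>& rows,
                     bool hasHeader, char delimiter) {
    if (hasHeader) {  // written unconditionally, before any row is visited
        for (std::size_t i = 0; i < columnNames.size(); ++i) {
            if (i > 0) out << delimiter;
            out << columnNames[i];
        }
        out << '\n';
    }
    for (const auto& row : rows) {
        for (std::size_t i = 0; i < row.size(); ++i) {
            if (i > 0) out << delimiter;
            out << row[i];
        }
        out << '\n';
    }
}
```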

  17. src/execution/execute/ops/batch/data_export.cc, line 77-83 (link)

    Raw pointer to catalog function — lifetime not guaranteed

    writeFunc is a raw pointer obtained from the in-memory catalog:

    auto func = gCatalog->getFunctionWithSignature(signatureName);
    auto writeFunc = func->ptrCast<function::ExportFunction>();

    DataExportOpr stores this as function::ExportFunction* exportFunction_. If the catalog invalidates or reloads function entries (e.g. on LOAD EXTENSION, schema reload, or multi-tenant scenarios), the operator will hold a dangling pointer. The pointer is then dereferenced in Eval, which runs later.

    Consider either storing the function by value (it's a small struct) or using a shared_ptr/unique_ptr copy so the lifetime is explicit.
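    The shared-ownership variant, sketched with hypothetical stand-in types (the real ExportFunction and DataExportOpr have more members):

```cpp
#include <memory>
#include <string>
#include <utility>

// Stand-in for the catalog's function entry.
struct ExportFunctionLike {
    std::string name;
};

// Sketch: hold the entry via shared_ptr so the operator keeps the function
// alive even if the catalog reloads or invalidates its entries later.
class DataExportOprSketch {
public:
    explicit DataExportOprSketch(std::shared_ptr<ExportFunctionLike> fn)
        : exportFunction_(std::move(fn)) {}  // shared ownership, not a raw pointer

    const std::string& functionName() const { return exportFunction_->name; }

private:
    std::shared_ptr<ExportFunctionLike> exportFunction_;
};
```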

  18. src/utils/writer/writer.cc, line 195-215 (link)

    addEscapes uses sizeof(escape) as a skip distance — misleading and fragile

    found = val.find(toEscape, found + sizeof(escape));

    sizeof(escape) for a char is always 1, so the behaviour happens to be correct: after inserting the escape prefix, the search advances past the current toEscape character so it isn't double-escaped. But the expression reads as though it is skipping over the newly inserted escape character, which is not what it does, and a future maintainer who changes escape to a multi-character type (e.g. std::string) would get a compile error or a silently wrong skip distance. Spelling the skip distance out (a literal offset, or escape.size() if escape becomes a string) states the intent directly.
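    A self-contained variant of the loop with the skip distance made explicit (an illustrative rewrite, not the project's exact addEscapes):

```cpp
#include <string>

// Sketch: prefix every occurrence of toEscape with the escape character.
// After inserting at position `found`, the original character sits at
// found + 1, so the next search must start at found + 2 to avoid
// re-escaping it. The offset is written out literally rather than hidden
// behind sizeof(escape).
inline std::string addEscapes(std::string val, char toEscape, char escape) {
    std::string::size_type found = val.find(toEscape);
    while (found != std::string::npos) {
        val.insert(found, 1, escape);
        found = val.find(toEscape, found + 2);  // skip escape + escaped char
    }
    return val;
}
```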

Last reviewed commit: 91376f5

Committed-by: Xiaoli Zhou from Dev container
Committed-by: Xiaoli Zhou from Dev container
@shirly121 (Collaborator, Author) commented:

@Greptile please review this PR for potential bugs and performance issues.

Committed-by: Xiaoli Zhou from Dev container
@shirly121 (Collaborator, Author) commented:

@Greptile fixed in 91376f5

Committed-by: Xiaoli Zhou from Dev container
Committed-by: Xiaoli Zhou from Dev container
Committed-by: Xiaoli Zhou from Dev container


Development

Successfully merging this pull request may close these issues.

[Export] Export query results to CSV files
