Committed-by: Xiaoli Zhou from Dev container
Author (Collaborator):

@Greptile please review this PR for potential bugs and performance issues.
Committed-by: Xiaoli Zhou from Dev container
Author (Collaborator):

@Greptile fixed in 91376f5
Committed-by: Xiaoli Zhou from Dev container
What do these changes do?
Related issue number
Fixes #38
Greptile Summary
This PR refactors the CSV export pipeline (`COPY (...) TO '...'`) by replacing the old multi-callback `ExportFunction` with a simpler `write_exec_func_t` approach, moving the actual write logic into a new `writer::ArrowCsvExportWriter` / `CSVStringFormatBuffer` stack, and updating the query planner and proto to use `FileSchema` / `EntrySchema` instead of raw `file_path` + `property_mappings`. A comprehensive Python integration test suite is added, and the `HEADER` default is changed from `false` to `true`.

Key concerns found during review:

- `ArrowCsvExportWriter::writeTable` opens an Arrow `OutputStream` but never calls `stream->Close()`, which can silently truncate the output file before all buffered data is written to disk.
- `EXPORT DATABASE` is completely removed — `planExportDatabase` now unconditionally throws `NOT_SUPPORTED_EXCEPTION`, removing an existing feature with no deprecation notice or migration path.
- `convertCopyTo` rejects any `COPY TO` whose direct child is not a `PROJECTION` or `AGGREGATE`, breaking valid queries that use `ORDER BY`, `LIMIT`, `SKIP`, or `DISTINCT` as the outermost clause.
- The header row is written only when `addValue(0, 0)` is called; if the query returns zero rows, the header is never written despite `HEADER = true`.
- `DEFAULT_CAPACITY * batchSize * arrays_size()` is computed without overflow protection; a large user-supplied `BATCH_SIZE` can silently wrap `size_t` and create an undersized allocation.
- `DataExportOpr` holds a raw `function::ExportFunction*` from the catalog with no lifetime guarantee.

Confidence Score: 2/5
Primary: `src/utils/writer/writer.cc` (stream not closed, capacity overflow, header-on-empty). Secondary: `src/compiler/gopt/g_query_converter.cpp` (restrictive type check), `src/compiler/planner/plan/plan_port_db.cpp` (EXPORT DATABASE removed), `src/execution/execute/ops/batch/data_export.cc` (raw function pointer).

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Binder
    participant Planner
    participant GQueryConvertor
    participant DataExportOpr
    participant ArrowCsvExportWriter
    User->>Binder: COPY (query) TO 'file.csv' (OPTIONS)
    Binder->>Binder: getExportFunction() → ExportFunction (catalog lookup)
    Binder->>Binder: getExportFuncBindData() → ExportFuncBindData {columnNames, fileName, options}
    Binder-->>Planner: BoundCopyTo {bindData, exportFunc}
    Planner->>Planner: planCopyTo() → LogicalCopyTo
    Planner-->>GQueryConvertor: LogicalCopyTo
    GQueryConvertor->>GQueryConvertor: convertFileSchema(bindData) → FileSchema proto
    GQueryConvertor->>GQueryConvertor: convertEntrySchema(bindData) → EntrySchema proto
    GQueryConvertor->>GQueryConvertor: DataExport proto {file_schema, entry_schema}
    GQueryConvertor->>GQueryConvertor: Swap sink ↔ data_export if needed
    DataExportOpr->>DataExportOpr: Eval() - look up ExportFunction* from catalog
    DataExportOpr->>DataExportOpr: exportFunction_->execFunc(ctx, schema, entry_schema, graph)
    Note over DataExportOpr,ArrowCsvExportWriter: writeExecFunc() in csv_export_function.cpp
    DataExportOpr->>ArrowCsvExportWriter: ArrowCsvExportWriter::write(ctx, graph)
    ArrowCsvExportWriter->>ArrowCsvExportWriter: Sink::sink_results → QueryResponse
    ArrowCsvExportWriter->>ArrowCsvExportWriter: writeTable(response)
    ArrowCsvExportWriter->>ArrowCsvExportWriter: OpenOutputStream(path)
    loop per batch
        ArrowCsvExportWriter->>ArrowCsvExportWriter: CSVStringFormatBuffer::addValue(row, col)
        ArrowCsvExportWriter->>ArrowCsvExportWriter: flush(stream)
    end
    Note over ArrowCsvExportWriter: ⚠️ stream->Close() never called
```

Comments Outside Diff (14)
`tests/storage/test_export.cc`, line 2142 (link): Incorrect path in file-existence check

The condition checks for a file literally named `"c"` instead of `"/tmp/rel_a.csv"`. Because `"c"` almost never exists, the cleanup guard is effectively dead code and the stale test artifact is never removed before the test runs. This can cause the subsequent `EXPECT_TRUE` to pass spuriously when running tests in a dirty environment.

`src/compiler/planner/plan/plan_port_db.cpp`, line 1381 (link): `planExportDatabase` unconditionally returns `nullptr`

The entire `EXPORT DATABASE` implementation was removed and replaced with `return nullptr`. Any caller that does not guard against a null `LogicalPlan*` will dereference null and crash at runtime. Even if `EXPORT DATABASE` is intentionally being disabled as part of this refactor, returning `nullptr` silently is dangerous. At minimum, the function should throw an appropriate "not implemented" or "not supported" exception so the failure is explicit and diagnosable.

`include/neug/utils/writer/writer.h`, line 652 (link): Default for `has_header` contradicts documentation

`has_header` defaults to `true` in `WriteOptions`, but the documentation in `doc/source/data_io/export_data.md` explicitly documents the default as `false`. This discrepancy means a user who omits the `HEADER` option will get a header row in the output despite the docs saying otherwise. One of the two must be corrected.

`src/utils/writer/writer.cc`, line 2109 (link): Potential null dereference of `entry_schema_`

`ArrowCsvExportWriter` inherits from `ArrowExportWriter`, whose constructor accepts `entry_schema = nullptr` as a default. Here, `*entry_schema_` is dereferenced unconditionally — if `entry_schema_` is a null `shared_ptr`, this will crash. A null-check (or an assertion) should be added before dereferencing.
`src/compiler/function/csv_export_function.cpp`, line 885 (link): `DELIMITER` silently ignored when `DELIM` is also set

`std::map::insert` (and its case-insensitive equivalent) does not overwrite an existing key. If a user specifies both `DELIM` and `DELIMITER` in their options (which may be unusual but not impossible), the value from `DELIMITER` is silently dropped and `DELIM` takes precedence without any warning or error. Consider using `operator[]`, using `try_emplace` and checking its result, or logging a warning when both are supplied.

`tests/storage/test_export.cc`, line 2142 (link): Truncated path in existence check

The file path `"/tmp/rel_a.csv"` was accidentally truncated to just `"c"` in the guard condition. This means `std::filesystem::remove("/tmp/rel_a.csv")` will never be called to clean up the stale file before the test runs — potentially causing the subsequent `EXPECT_TRUE` to operate on leftover data from a previous run.

`src/compiler/planner/plan/plan_port_db.cpp`, line 1346-1347 (link): `planExportDatabase` unconditionally returns `nullptr`

The entire implementation of `planExportDatabase` was removed and replaced with `return nullptr`. Any caller that invokes this path (e.g., an `EXPORT DATABASE` statement) will receive a null `LogicalPlan*` and almost certainly crash with a null-pointer dereference downstream. If export-database support is intentionally being dropped or deferred, the function should at minimum throw an informative "not implemented" exception rather than silently returning null.
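The `std::map::insert` behaviour behind the `DELIM`/`DELIMITER` comment above can be demonstrated in isolation. This sketch uses a plain option map, not the project's real structures:

```cpp
#include <map>
#include <string>
#include <utility>

// Insert the same normalised key twice: insert() keeps the first mapped value,
// while try_emplace() reports whether the second value was actually stored,
// so the conflict can be detected and warned about.
std::pair<std::string, bool> applyDelimiterTwice(const std::string& first,
                                                 const std::string& second) {
    std::map<std::string, std::string> options;
    options.insert({"DELIM", first});                       // e.g. from DELIM
    auto [it, inserted] = options.try_emplace("DELIM", second);  // e.g. from DELIMITER
    // inserted == false means the second value was dropped and the caller
    // can log a warning instead of staying silent.
    return {it->second, inserted};
}
```

With both keys supplied, the first value wins and `inserted` comes back `false` — exactly the silent-drop scenario the review flags.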
`tools/python_bind/tests/test_export.py`, line 2227-2236 (link): Test name / query mismatch — queries `movies`, not `person`

`test_export_person_without_header` is named as if it exports `person` nodes, but both the count query and the `COPY TO` statement target `movies`. This is almost certainly a copy-paste error and will silently test the wrong label under the wrong name.

`src/utils/writer/writer.cc`, line 2010-2072 (link): `WriteOptions` reconstructed on every cell write

A new `WriteOptions` object is constructed and `schema_.options` is looked up (via `get()`) for every single cell written — once for `has_header`, once for `delimiter`, once for `ignore_errors`, once for `escape_char`, and once for `quote_char`. For large result sets this creates significant per-cell overhead. Consider constructing `WriteOptions` once (e.g., in the `CSVStringFormatBuffer` constructor) and caching the resolved option values as fields, rather than re-resolving them in every `addValue` call.

`src/utils/writer/writer.cc`, line 1910-1924 (link): `kStructArray` case missing null-validity check

Every other typed-array case in `formatValueToStr` calls `validateProtoValue(...)` before reading the value, but the `kStructArray` branch proceeds without any validity check. If `struct_arr` has a validity bitmap, this can silently produce output for null rows rather than returning an error or an empty string. Add the same validity check used by the other branches.

`src/compiler/function/csv_export_function.cpp`, line 88 (link): Missing newline at end of file

The file ends without a trailing newline (the diff shows `\ No newline at end of file`). This can cause compiler warnings and may confuse some tooling. Add a newline after the closing `}` of the namespace.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
`src/utils/writer/writer.cc`, line 317-336 (link): Output stream never closed after writing

`ArrowCsvExportWriter::writeTable` opens an `arrow::io::OutputStream` and writes all CSV data to it, but never calls `stream->Close()` before the function returns. Arrow's `LocalFileSystem` output stream uses buffered I/O, and without an explicit close the kernel buffers may not be fully flushed to disk. This means exported CSV files can be silently truncated or empty on many file systems.

`src/compiler/planner/plan/plan_port_db.cpp`, line 23-29 (link): `EXPORT DATABASE` feature completely removed — breaking change

The entire `planExportDatabase` implementation has been replaced with an unconditional `THROW_NOT_SUPPORTED_EXCEPTION`. Any user who relied on the `EXPORT DATABASE` Cypher statement will receive a runtime error after this PR merges, with no migration path or deprecation notice. The previous implementation exported all tables to files and was apparently functional (it was tested via `test_load.py`). If this removal is intentional (e.g. because it depended on the old `ExportCSVBindData` approach), it should at minimum be documented, and the related parser/binder code for `EXPORT DATABASE` should also be removed to avoid confusing compiler errors, rather than silently failing at runtime.

`src/compiler/gopt/g_query_converter.cpp`, line 1989-1997 (link): Overly-restrictive child operator type check breaks valid query shapes

The new guard rejects any `COPY TO` whose direct child is not a `PROJECTION` or `AGGREGATE`, but valid Cypher queries often use `ORDER BY` (`SORT`), `LIMIT`, `DISTINCT`, or other operators as the outermost node. These will now throw a compile-time exception. The previous code made no such restriction — it only required at least one child. If the intent is to guarantee that column names are available, checking for the schema expression list (as the old code did) is a safer approach than an operator-type whitelist.
`src/utils/writer/writer.cc`, line 65-73 (link): Buffer capacity calculation can overflow

`size_t capacity_ = DEFAULT_CAPACITY * batchSize * response->arrays_size();` — `DEFAULT_CAPACITY` is `size_t(64)`, `batchSize` is `size_t` (default 1024), and `arrays_size()` returns `int`. With the default values this is `64 * 1024 * N`, which is safe for small column counts, but a user could set `BATCH_SIZE` to a very large number (e.g., `1 << 30`) and cause a silent `size_t` wrap-around, resulting in a wildly undersized allocation and subsequent heap corruption via the `memcpy` in `write()`. Consider saturating/clamping the capacity or using a checked multiplication.
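One possible shape for such a checked multiplication, as a sketch: the argument names follow the review's description, and the exception type is a placeholder for whatever the project's error mechanism is.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Multiply three size_t factors, throwing instead of silently wrapping.
std::size_t checkedCapacity(std::size_t defaultCapacity, std::size_t batchSize,
                            std::size_t numArrays) {
    auto checkedMul = [](std::size_t a, std::size_t b) -> std::size_t {
        // a * b overflows iff b > SIZE_MAX / a (for a != 0).
        if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
            throw std::overflow_error("CSV export buffer capacity overflows size_t");
        }
        return a * b;
    };
    return checkedMul(checkedMul(defaultCapacity, batchSize), numArrays);
}
```

On compilers that provide it, `__builtin_mul_overflow` achieves the same check with less code; clamping to a sane maximum `BATCH_SIZE` at option-parse time would also prevent the wrap-around earlier.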
`src/utils/writer/writer.cc`, line 238-250 (link): Header not emitted for empty result sets

The header row is written only inside `addValue`, when `rowIdx == 0 && colIdx == 0`. If the query produces zero rows, `addValue` is never called and the output file contains no header line — even when the user explicitly passed `HEADER = true`. Many downstream tools (Pandas, Spark, etc.) expect a header-only file to be valid CSV. The header should be written once during initialisation (or at the start of `writeTable`) rather than lazily on the first cell.

`src/execution/execute/ops/batch/data_export.cc`, line 77-83 (link): Raw pointer to catalog function — lifetime not guaranteed

`writeFunc` is a raw pointer obtained from the in-memory catalog, and `DataExportOpr` stores it as `function::ExportFunction* exportFunction_`. If the catalog invalidates or reloads function entries (e.g. on `LOAD EXTENSION`, schema reload, or multi-tenant scenarios), the operator will hold a dangling pointer. The pointer is then dereferenced in `Eval`, which runs later. Consider either storing the function by value (it's a small struct) or using a `shared_ptr`/`unique_ptr` copy so the lifetime is explicit.

`src/utils/writer/writer.cc`, line 195-215 (link): `addEscapes` uses `sizeof(escape)` as a skip distance — misleading and fragile

`found = val.find(toEscape, found + sizeof(escape));` — `sizeof(escape)` for a `char` is always `1`, so the behaviour is correct: after inserting the escape prefix, the search advances past the current `toEscape` character so it isn't double-escaped. However, this reads as if the intent is to skip the newly inserted escape character in the output string, which is conceptually incorrect. A future maintainer who changes `escape` to a multi-character type (e.g. `std::string`) would get a compile error or silent breakage.

Last reviewed commit: 91376f5
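As a postscript to the `addEscapes` comment: a formulation that makes the skip distance explicit might look like this. The signature is assumed from the snippet quoted in the review, not taken from the actual source.

```cpp
#include <string>

// Prefix every occurrence of toEscape with the escape character. The skip
// distance is written as an explicit 2: past the inserted escape character
// AND the escaped character itself, so the same character is never
// double-escaped (even when escape == toEscape, as with doubled quotes).
std::string addEscapes(std::string val, char toEscape, char escape) {
    std::string::size_type found = val.find(toEscape);
    while (found != std::string::npos) {
        val.insert(found, 1, escape);            // insert escape before toEscape
        found = val.find(toEscape, found + 2);   // resume after escape + toEscape
    }
    return val;
}
```

Spelling the offset as `+ 2` (or `escape_len + 1` if `escape` ever becomes a string) documents the intent directly, where `sizeof(escape)` only happens to equal the right value for a `char`.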