Segfault when issuing multiple partitioned writes #127

@mx781

What happens?

Hi there DuckDB team, I took some time to narrow down a segfault I was encountering in a data processing pipeline that extends a Hive-partitioned Parquet dataset.

To Reproduce

Here is a minimal repro (the necessary file is attached):

from pathlib import Path
import shutil
import duckdb
import pandas as pd

table_dir = Path("./reprotmp")
shutil.rmtree(table_dir, ignore_errors=True)
table_dir.mkdir()

conn = duckdb.connect()
df = pd.read_csv("overwrite.csv")
for i in range(100):
    print(i)
    conn.sql(
        f"""
        COPY df TO '{table_dir.as_posix()}'
        (FORMAT parquet, PARTITION_BY (symbol, year, month), OVERWRITE, CODEC 'SNAPPY')
        """
    )

overwrite.csv

# mkdir repro; copy over repro.py and overwrite.csv, then:
uv init --bare --python 3.12.11 && uv add duckdb==1.4.1 pandas==2.3.3 numpy==2.3.3 && source .venv/bin/activate && python repro.py 

The issue seems to be related to NaN/None values in one of the partition columns (originally a bug on my side), but it no longer reproduced once I narrowed the dataset down to just a few rows. The error also disappears if duckdb.connect is moved inside the loop, so it seems some sort of atomicity guarantee breaks down when the connection is kept open.
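For anyone wanting to check whether their own input has missing values in the partition columns before writing, here is a minimal sketch using only the standard library. The helper name `count_missing`, the marker set, and the sample rows are assumptions made up for illustration; they are not part of DuckDB or pandas, and pandas' `read_csv` recognises a longer default list of NA markers than the one used here.

```python
import csv
import io

# Assumption: strings treated as "missing". pandas.read_csv uses a longer default list.
MISSING_MARKERS = {"", "nan", "none", "null", "na"}


def count_missing(csv_file, partition_cols):
    """Count missing values per partition column in an open CSV file object."""
    counts = {col: 0 for col in partition_cols}
    for row in csv.DictReader(csv_file):
        for col in partition_cols:
            value = row.get(col)
            # DictReader yields None for fields absent from a short row.
            if value is None or value.strip().lower() in MISSING_MARKERS:
                counts[col] += 1
    return counts


# Hypothetical sample data standing in for overwrite.csv.
sample = io.StringIO(
    "symbol,year,month,price\n"
    "AAA,2024,1,10.0\n"
    ",2024,2,11.0\n"
    "BBB,NaN,3,12.0\n"
)
print(count_missing(sample, ["symbol", "year", "month"]))
```

Against the real file this would be called as `with open("overwrite.csv", newline="") as f: count_missing(f, ["symbol", "year", "month"])`. A non-zero count only confirms that missing partition values are present; it does not by itself explain the crash.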

Here is the stack trace from gdb:

(gdb) bt
#0  __memmove_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:317
#1  0x00007e102e0d5d68 in duckdb::Value::Value(duckdb::string_t) ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#2  0x00007e102e0d5e0d in duckdb::Value duckdb::Value::CreateValue<duckdb::string_t>(duckdb::string_t) ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#3  0x00007e102e1c473d in ?? () from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#4  0x00007e102e1c7211 in duckdb::HivePartitionedColumnData::ComputePartitionIndices(duckdb::PartitionedColumnDataAppendState&, duckdb::DataChunk&) ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#5  0x00007e102e0bf1ad in duckdb::PartitionedColumnData::Append(duckdb::PartitionedColumnDataAppendState&, duckdb::DataChunk&) ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#6  0x00007e102e3ace5a in duckdb::PhysicalCopyToFile::Sink(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSinkInput&) const ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#7  0x00007e102e575ee1 in duckdb::PipelineExecutor::ExecutePushInternal(duckdb::DataChunk&, duckdb::ExecutionBudget&, unsigned long) ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#8  0x00007e102e57882e in duckdb::PipelineExecutor::Execute(unsigned long) ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#9  0x00007e102e578b82 in duckdb::PipelineTask::ExecuteTask(duckdb::TaskExecutionMode) ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#10 0x00007e102e5713d6 in duckdb::ExecutorTask::Execute(duckdb::TaskExecutionMode) ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#11 0x00007e102e579ed4 in duckdb::Executor::ExecuteTask(bool) ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#12 0x00007e102e538af0 in duckdb::ClientContext::ExecuteTaskInternal(duckdb::ClientContextLock&, duckdb::BaseQueryResult&, bool) ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#13 0x00007e102e538cb3 in duckdb::PendingQueryResult::ExecuteTask() ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#14 0x00007e102dbf7fd2 in duckdb::DuckDBPyConnection::CompletePendingQuery(duckdb::PendingQueryResult&) ()
   from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#15 0x00007e102dc063bd in ?? () from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#16 0x00007e102dc0e029 in ?? () from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#17 0x00007e102dc2d3b8 in ?? () from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#18 0x00007e102db9fcd5 in ?? () from .venv/lib/python3.12/site-packages/_duckdb.cpython-312-x86_64-linux-gnu.so
#19 0x0000000001a05c7c in cfunction_call ()
#20 0x0000000001a1635c in _PyEval_EvalFrameDefault ()
#21 0x0000000001a8bba2 in PyEval_EvalCode ()
#22 0x0000000001ab4742 in run_mod.llvm ()
#23 0x0000000001bcc53f in pyrun_file ()
#24 0x0000000001bcd018 in _PyRun_SimpleFileObject ()
#25 0x0000000001bcced0 in _PyRun_AnyFileObject ()
#26 0x0000000001bccc0a in pymain_run_file_obj ()
#27 0x0000000001bccb1e in pymain_run_file ()
#28 0x0000000001b36d3b in Py_RunMain ()
#29 0x0000000001b54dfa in pymain_main.llvm ()
#30 0x0000000001b54bed in main ()

OS:

Debian 12, Ubuntu 22.04

DuckDB Package Version:

1.4.1

Python Version:

3.12

Full Name:

mx

Affiliation:

Gravity Team

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have
