
[Bug] Inserting data into Cloudberry concurrently via COPY from Spark: "could not find segment file to use" errors may occur randomly when the data volume is extremely large #1494

@cleverxiao001

Description

Apache Cloudberry version

apache-cloudberry-2.0.0-incubating

What happened

The database is configured with 1 coordinator node and 24 segment nodes, with no standby or mirror nodes deployed. It uses NVMe drives for storage, and both limits.conf and sysctl.conf have been modified in accordance with the documentation requirements.
We are now importing paper data into the database. Each paper has multiple authors and multiple references: one paper includes about 10 authors and 50 references, so with 10 million papers this amounts to roughly 100 million author rows and 500 million reference rows. The database consists of three tables: a basic paper information table, an author table, and a reference table. Each table contains about 30 fields of varchar, text, int, and text[] types, all stored append-optimized with column orientation and zstd compression.
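For reference, the reference table described above would look roughly like the following sketch. The column names, the distribution key, and the GP7-style appendoptimized spelling are assumptions for illustration; the actual schema is not included in this issue, and only a few of the ~30 fields are shown.

CREATE TABLE paper_reference (
    paper_id   varchar(64),
    ref_title  text,
    ref_order  int,
    ref_labels text[]
    -- remaining fields omitted
)
WITH (appendoptimized = true, orientation = column, compresstype = zstd)
DISTRIBUTED BY (paper_id);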
Data import is performed with Spark. After multiple tests, the basic paper data and author data are always imported successfully; the random errors occur only when loading the reference table.

[Screenshot: Spark-side error]
[Screenshot: database-side error]

What you think should happen instead

No response

How to reproduce

The copy function is:
import java.io.StringReader
import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.sql.DataFrame
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection

def apply(df: DataFrame, pgUrl: String, tableName: String, connectionProperties: Properties): Unit = {
  df.rdd.foreachPartition { iter =>
    // One JDBC connection and one COPY statement per Spark partition
    val conn = DriverManager.getConnection(pgUrl, connectionProperties)
    try {
      val copyManager = new CopyManager(conn.asInstanceOf[BaseConnection])
      val sql = s"COPY $tableName FROM STDIN WITH (FORMAT csv, NULL '\\N')"
      // Buffer the whole partition as CSV text, then stream it into COPY
      val sb = new StringBuilder
      iter.foreach { row => sb.append(rowToCsv(row)).append('\n') }
      copyManager.copyIn(sql, new StringReader(sb.toString))
    } finally {
      conn.close()
    }
  }
}
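The rowToCsv helper is not shown in the issue; a hypothetical sketch, assuming plain CSV quoting and \N for nulls to match the COPY options above, could look like this:

import org.apache.spark.sql.Row

// Hypothetical helper, not the original: render nulls as \N and quote fields
// that contain the delimiter, quotes, or newlines. text[] columns may need
// extra array formatting (e.g. {a,b}) that is omitted here.
def rowToCsv(row: Row): String =
  (0 until row.length).map { i =>
    if (row.isNullAt(i)) "\\N"
    else {
      val v = row.get(i).toString
      if (v.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
        "\"" + v.replace("\"", "\"\"") + "\""
      else v
    }
  }.mkString(",")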
This function is used to import roughly 1 billion rows concurrently.
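For context, a driver along these lines could call it. The object name CopyWriter, the partition count, the JDBC URL, and the table and path names are assumptions for illustration, not taken from the issue; each Spark partition opens its own connection and runs its own COPY, so 200 partitions means up to 200 concurrent COPY sessions.

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reference-import").getOrCreate()
val props = new Properties()
props.setProperty("user", "gpadmin")        // hypothetical credentials
props.setProperty("password", "changeme")

val refs = spark.read.parquet("/data/references")          // hypothetical source
CopyWriter.apply(refs.repartition(200),
  "jdbc:postgresql://coordinator:5432/paperdb",            // hypothetical URL
  "paper_reference", props)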

Operating System

Rocky Linux 9.7

Anything else

A temporary workaround is to submit the data in multiple smaller batches, which allows all of the data to be imported successfully.
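A minimal sketch of that batching workaround, assuming the rowToCsv helper above and a batch size chosen by trial (both are assumptions, not from the issue): instead of buffering an entire Spark partition, issue a separate COPY every batchSize rows.

import java.io.StringReader
import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.sql.DataFrame
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection

def applyBatched(df: DataFrame, pgUrl: String, tableName: String,
                 connectionProperties: Properties, batchSize: Int = 50000): Unit = {
  df.rdd.foreachPartition { iter =>
    val conn = DriverManager.getConnection(pgUrl, connectionProperties)
    try {
      val copyManager = new CopyManager(conn.asInstanceOf[BaseConnection])
      val sql = s"COPY $tableName FROM STDIN WITH (FORMAT csv, NULL '\\N')"
      // One COPY per batch instead of one COPY per partition
      iter.grouped(batchSize).foreach { batch =>
        val csv = batch.map(rowToCsv).mkString("", "\n", "\n")
        copyManager.copyIn(sql, new StringReader(csv))
      }
    } finally {
      conn.close()
    }
  }
}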

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

Code of Conduct
