Apache Cloudberry version
apache-cloudberry-2.0.0-incubating
What happened
The database is configured with 1 coordinator node and 24 segment nodes; no standby or mirror nodes are deployed. Storage is on NVMe drives, and both limits.conf and sysctl.conf have been modified in accordance with the documentation requirements.
We are now importing paper data into the database. Each paper has multiple authors and multiple references; specifically, each paper has 10 authors and 50 references, so 10 million papers amount to 100 million author rows and 500 million reference rows. The data is spread across three tables: a basic paper information table, an author table, and a reference table. Each table has 30 columns, including varchar, text, int, and text[] types, and all three tables are append-optimized, column-oriented, and zstd-compressed.
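For context, a minimal sketch of what the reference-table DDL presumably looks like. The table and column names are hypothetical and only 4 of the 30 columns are shown; the storage options match the description above, and the snippet reuses the pgUrl and connectionProperties from the copy function below.

import java.sql.DriverManager

// Hypothetical, trimmed-down DDL for the reference table (real table: 30 columns).
// Storage options match the report: append-optimized, column-oriented, zstd.
val ddl =
  """CREATE TABLE paper_reference (
    |  paper_id  varchar(64),
    |  ref_title text,
    |  ref_order int,
    |  ref_tags  text[]
    |) WITH (appendoptimized = true, orientation = column, compresstype = zstd)
    |DISTRIBUTED BY (paper_id)""".stripMargin

val conn = DriverManager.getConnection(pgUrl, connectionProperties)
try conn.createStatement().execute(ddl) finally conn.close()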
Data import is performed with Spark. After multiple tests, the basic paper data and the author data import successfully; however, random errors occur, and only in the reference table.

The Spark error looks like this:

The database error looks like this:
What you think should happen instead
No response
How to reproduce
The copy function is:
import java.io.StringReader
import java.sql.DriverManager
import java.util.Properties

import org.apache.spark.sql.{DataFrame, Row}
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection

def apply(df: DataFrame, pgUrl: String, tableName: String, connectionProperties: Properties): Unit = {
  df.rdd.foreachPartition { iter =>
    // One JDBC connection and one COPY session per Spark partition
    val conn = DriverManager.getConnection(pgUrl, connectionProperties)
    try {
      val copyManager = new CopyManager(conn.asInstanceOf[BaseConnection])
      // "\\N" keeps the backslash literal so COPY sees the null marker \N
      val sql = s"COPY $tableName FROM STDIN WITH (FORMAT csv, NULL '\\N')"
      // The whole partition is buffered in memory before being streamed to COPY
      val sb = new StringBuilder
      iter.foreach { row: Row =>
        sb.append(rowToCsv(row)).append('\n')
      }
      copyManager.copyIn(sql, new StringReader(sb.toString))
    } finally {
      conn.close()
    }
  }
}
This function is used to import roughly 1 billion rows concurrently.
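For completeness, a sketch of how such a call might be wired up on the driver side. The URL, credentials, table name, partition count, and referenceDf are all assumptions; the degree of concurrency equals the number of Spark partitions, since each partition opens its own COPY session.

import java.util.Properties

// Hypothetical connection settings; replace with the real coordinator address.
val props = new Properties()
props.setProperty("user", "gpadmin")
props.setProperty("password", "changeme")
val pgUrl = "jdbc:postgresql://coordinator:5432/papers"

// ~200 partitions => ~200 concurrent COPY sessions against the cluster.
apply(referenceDf.repartition(200), pgUrl, "paper_reference", props)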
Operating System
Rocky Linux 9.7
Anything else
A temporary workaround is to submit the data in multiple smaller batches; with batching, all the data imports successfully.
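A minimal sketch of that workaround, assuming the batching is done inside each partition: rows are flushed to a separate COPY statement every batchSize rows rather than buffering the whole partition into one statement (batchSize is an assumed value).

import java.io.StringReader
import java.sql.DriverManager
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection

df.rdd.foreachPartition { iter =>
  val conn = DriverManager.getConnection(pgUrl, connectionProperties)
  try {
    val copyManager = new CopyManager(conn.asInstanceOf[BaseConnection])
    val sql = s"COPY $tableName FROM STDIN WITH (FORMAT csv, NULL '\\N')"
    // Send at most batchSize rows per COPY statement; this keeps each COPY
    // small and bounds the memory buffered per executor task.
    val batchSize = 100000 // assumed value
    iter.grouped(batchSize).foreach { batch =>
      val csv = batch.map(rowToCsv).mkString("", "\n", "\n")
      copyManager.copyIn(sql, new StringReader(csv))
    }
  } finally {
    conn.close()
  }
}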
Are you willing to submit PR?
Code of Conduct