Question
Hi! I am trying to add 1 million existing Parquet files to an Iceberg table using the add_files
procedure. I am adding them in 1,000 batches of 1,000 files each. Every batch takes longer than the previous one, and at this point each batch takes around 1-2 minutes. This is much slower than Spark, which stays consistent throughout the entire insertion. How can I speed this up? The Parquet files already contain their own metadata, so perhaps that could be exploited somehow? Below is the code I am using:
import os
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

# args.file_dir and args.table come from argparse (not shown)
warehouse_path = "/warehouse"
catalog = load_catalog(
    "pyiceberg",
    **{
        "type": "sql",
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
catalog.create_namespace_if_not_exists("default")

# Load the batches of files to import
batches = os.listdir(args.file_dir)
first_file = os.path.join(
    args.file_dir, "batch_0", os.listdir(os.path.join(args.file_dir, "batch_0"))[0]
)

# Create the table using the schema of the first file
df = pq.read_table(first_file)
table = catalog.create_table_if_not_exists(
    f"default.{args.table}",
    schema=df.schema,
)

# Register each batch of Parquet files with the table
batch_idx = 1
for batch_dir in batches:
    print(f"Adding batch {batch_idx}")
    batch_dir = os.path.join(args.file_dir, batch_dir)
    file_paths = [os.path.join(batch_dir, s) for s in os.listdir(batch_dir)]
    table.add_files(file_paths=file_paths)
    batch_idx += 1
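For context, my current understanding (which may be wrong) is that every add_files call commits a new snapshot and rewrites the table metadata, so 1,000 separate commits would mean the metadata keeps growing and each commit gets slower. Below is a rough sketch of what I mean by collapsing everything into a single call, assuming the same batch_* directory layout as above (all_paths is just an illustrative name, and I have not tested this at the full 1 million file scale):

# Sketch: collect every Parquet path up front and register them in one
# add_files call, so only a single snapshot/commit is produced instead of 1,000.
all_paths = []
for batch_dir in sorted(os.listdir(args.file_dir)):
    full_dir = os.path.join(args.file_dir, batch_dir)
    all_paths.extend(os.path.join(full_dir, f) for f in os.listdir(full_dir))

table.add_files(file_paths=all_paths)

Would a single commit like this be the recommended approach, or is there a better way to handle this many files?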