
docs: clarify check_duplicate_files option in the add_files api docs #2132

@thijsheijden

Question

Hi! I am trying to add 1 million existing Parquet files to an Iceberg table using the add_files API. I am inserting them in 1,000 batches of 1,000 files each. Every batch takes longer than the previous one, and at this point each batch takes around 1-2 minutes. This is much, much slower than Spark, which stays consistent for the entire insertion. How can I speed this up? The Parquet files already contain metadata, so perhaps that could be exploited somehow? Below is the code I am using:

import os

import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

# args.file_dir and args.table come from argparse (parsing omitted here)

warehouse_path = "/warehouse"
catalog = load_catalog(
    "pyiceberg",
    **{
        "type": "sql",
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
catalog.create_namespace_if_not_exists("default")

# Load the batches of files to import
batches = os.listdir(args.file_dir)
first_file = os.path.join(args.file_dir, "batch_0", os.listdir(os.path.join(args.file_dir, "batch_0"))[0])

# Create the table using the schema of the first file
first_table = pq.read_table(first_file)
table = catalog.create_table_if_not_exists(
    f"default.{args.table}",
    schema=first_table.schema,
)

# Register each batch of Parquet files with the table
batch_idx = 1
for batch_dir in batches:
    print(f"Adding batch {batch_idx}")
    batch_dir = os.path.join(args.file_dir, batch_dir)
    file_paths = [os.path.join(batch_dir, name) for name in os.listdir(batch_dir)]
    table.add_files(file_paths=file_paths)
    batch_idx += 1
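
A likely cause of the per-batch slowdown, and the reason this issue asks for clearer docs, is that add_files by default checks each new path against the files already referenced by the table, and that set grows with every batch. If you can guarantee each file is added exactly once, the check_duplicate_files option named in the title can be turned off. Below is a minimal sketch of the loop above with the check disabled; it assumes add_files accepts the check_duplicate_files keyword referenced in this issue's title and reuses the catalog, table, batches, and args defined in the snippet above.

# Sketch: same loop as above, but with the duplicate-file check disabled.
# check_duplicate_files is the add_files option this issue asks to document;
# turning it off skips re-checking the table's existing data files on every
# call, so it is only safe if each Parquet file is added exactly once.
for batch_idx, batch_dir in enumerate(batches, start=1):
    print(f"Adding batch {batch_idx}")
    batch_path = os.path.join(args.file_dir, batch_dir)
    file_paths = [os.path.join(batch_path, name) for name in os.listdir(batch_path)]
    table.add_files(
        file_paths=file_paths,
        check_duplicate_files=False,  # caller guarantees the paths are unique
    )

As for exploiting the existing Parquet metadata: add_files is metadata-only and reads the Parquet footers to build the Iceberg data-file entries rather than rewriting the data, so that part is already handled; the duplicate check is the step most likely to grow with the size of the table.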
