Replies: 1 comment
-
The issue may exist because you are specifying two conditions in the match predicate. Try with a single condition (only |
-
Hi everyone,
I’d really appreciate your help with a duplication issue I’m hitting when using deltalake merges (Python).
Context
Delta Table Schema
Upsert approach (per chunk)
My predicates and mappings look like:
target.area_type_code = source.area_type_code AND target.map_code = source.map_code AND target.fuel = source.fuel AND target.datetime = source.datetime AND target.period_granularity = source.period_granularity
For the merge predicate I also tried constraining to the partitions found in the chunk, e.g.,
AND target.period_granularity IN ('hourly', 'daily') AND target.area_type_code IN ('BZN')
(the values come from the chunk's distinct values for those columns).
Match predicate:
target.power != source.power OR target.energy != source.energy
Update mapping:
{'power': 'source.power', 'energy': 'source.energy'}
Insert mapping:
{'period_name': 'source.period_name', 'period_granularity': 'source.period_granularity', 'area_type_code': 'source.area_type_code', 'energy': 'source.energy', 'power': 'source.power', 'map_code': 'source.map_code', 'datetime': 'source.datetime', 'fuel': 'source.fuel'}
My understanding is that the merge predicate determines whether a record in the source already exists in the target; the match predicate decides whether an already existing record needs to be updated; and the mappings specify which values from the source end up in which columns of the target.
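To make the setup concrete, here is a minimal sketch of what each chunk's upsert roughly looks like with the deltalake Python API (table_uri, chunk and upsert_chunk are placeholder names; the predicate and mappings are the ones listed above):

```python
from deltalake import DeltaTable

# Columns that uniquely identify a row (the "primary key" of the upsert).
KEY_COLUMNS = ["area_type_code", "map_code", "fuel", "datetime", "period_granularity"]
MERGE_PREDICATE = " AND ".join(f"target.{c} = source.{c}" for c in KEY_COLUMNS)

def upsert_chunk(table_uri: str, chunk) -> dict:
    """Merge one chunk (anything deltalake accepts as a merge source) into the table."""
    dt = DeltaTable(table_uri)
    return (
        dt.merge(
            source=chunk,
            predicate=MERGE_PREDICATE,
            source_alias="source",
            target_alias="target",
        )
        # Only rewrite rows whose measures actually changed.
        .when_matched_update(
            updates={"power": "source.power", "energy": "source.energy"},
            predicate="target.power != source.power OR target.energy != source.energy",
        )
        # Insert rows whose key is not yet present in the target.
        .when_not_matched_insert(
            updates={
                "period_name": "source.period_name",
                "period_granularity": "source.period_granularity",
                "area_type_code": "source.area_type_code",
                "energy": "source.energy",
                "power": "source.power",
                "map_code": "source.map_code",
                "datetime": "source.datetime",
                "fuel": "source.fuel",
            }
        )
        .execute()  # returns a dict of merge metrics (rows inserted, updated, ...)
    )
```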
Problem
I am running the upsert multiple times using the same source data. The first time, the Delta table is created and the total number of rows comes to 10,240,472, which matches the number of rows in the input dataframe. When I run it again with the same source data and no changes, I still see inserts in the metrics dictionary returned by the TableMerger execute method, and the row count I get after loading the Delta table back into a dataframe grows by the same amount.
I am making sure none of the columns used in the merge predicate (i.e. the primary-key columns, if you wish) contain NULL or NaN values. When I load the Delta table into a polars or pandas dataframe I can see the duplicates. I even went as far as querying the rows for a duplicated key and comparing the values column by column, row by row, and no differences are detected.
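For reference, this is roughly how I detect the duplicates after the second run (a minimal polars sketch; the table path is a placeholder and the key columns match the merge predicate above):

```python
import polars as pl

KEY_COLUMNS = ["area_type_code", "map_code", "fuel", "datetime", "period_granularity"]

# Load the Delta table and count rows per key; any count above 1 is a duplicate.
df = pl.read_delta("path/to/table")
dupes = (
    df.group_by(KEY_COLUMNS)
    .agg(pl.len().alias("n_rows"))
    .filter(pl.col("n_rows") > 1)
    .sort("n_rows", descending=True)
)
print(dupes)
```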
I also tried excluding the datetime column from the merge predicate and using the period_name column instead (essentially a string representation of the datetime column), but data still keeps being inserted.
My questions are:
Has anybody experienced something similar?
Any ideas on which additional checks I can perform to find out what may be going wrong?
Is there a recommended way to ensure deterministic matching on timestamp_ntz (e.g., normalization/precision) that I might be missing? (I sketch what I mean just after this list.)
Would a preliminary delete-then-merge be advisable here, or is there a more efficient safeguard?
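On the timestamp_ntz point, this is the kind of normalization I have in mind, applied to each chunk before the merge. It assumes that casting at the Arrow level to timezone-naive microsecond precision is the right way to line the source up with the target; the function name is just mine:

```python
import pyarrow as pa

def normalize_datetime(table: pa.Table, column: str = "datetime") -> pa.Table:
    """Cast the timestamp column to a fixed, timezone-naive microsecond precision
    so source and target values compare exactly in the merge predicate."""
    idx = table.schema.get_field_index(column)
    normalized = table.column(column).cast(pa.timestamp("us"))
    return table.set_column(idx, column, normalized)
```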
Thank you in advance for any pointers or checks I can run. Happy to provide more details.