Replies: 1 comment
-
The issue may exist because you are specifying two conditions in the match predicate. Try with a single condition (only |
-
Hi everyone,
I’d really appreciate your help with a duplication issue I’m hitting when using deltalake merges (Python).
Context
Delta Table Schema
Upsert approach (per chunk)
My predicates and mappings look like:
target.area_type_code = source.area_type_code AND target.map_code = source.map_code AND target.fuel = source.fuel AND target.datetime = source.datetime AND target.period_granularity = source.period_granularity
For the merge predicate I also tried constraining to the partitions found in the chunk, e.g.,
AND target.period_granularity IN ('hourly', 'daily') AND target.area_type_code IN ('BZN')
(the values come from the chunk's distinct values for those columns).
Match predicate:
target.power != source.power OR target.energy != source.energy
Update mapping:
{'power': 'source.power', 'energy': 'source.energy'}
Insert mapping:
{'period_name': 'source.period_name', 'period_granularity': 'source.period_granularity', 'area_type_code': 'source.area_type_code', 'energy': 'source.energy', 'power': 'source.power', 'map_code': 'source.map_code', 'datetime': 'source.datetime', 'fuel': 'source.fuel'}
My understanding is that the merge predicate determines whether a record in the source already exists in the target; the match predicate decides whether an already existing record needs to be updated; and the mappings specify which values from the source end up in which columns of the target.
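To make the setup concrete, here is a minimal sketch of what each chunk's upsert roughly looks like with the deltalake Python API (table_uri, chunk and upsert_chunk are placeholder names; the predicate and mappings are the ones listed above):

```python
from deltalake import DeltaTable

# Columns that uniquely identify a row (the "primary key" of the upsert).
KEY_COLUMNS = ["area_type_code", "map_code", "fuel", "datetime", "period_granularity"]
MERGE_PREDICATE = " AND ".join(f"target.{c} = source.{c}" for c in KEY_COLUMNS)

def upsert_chunk(table_uri: str, chunk) -> dict:
    """Merge one chunk (anything deltalake accepts as a merge source) into the table."""
    dt = DeltaTable(table_uri)
    return (
        dt.merge(
            source=chunk,
            predicate=MERGE_PREDICATE,
            source_alias="source",
            target_alias="target",
        )
        # Only rewrite rows whose measures actually changed.
        .when_matched_update(
            updates={"power": "source.power", "energy": "source.energy"},
            predicate="target.power != source.power OR target.energy != source.energy",
        )
        # Insert rows whose key is not yet present in the target.
        .when_not_matched_insert(
            updates={
                "period_name": "source.period_name",
                "period_granularity": "source.period_granularity",
                "area_type_code": "source.area_type_code",
                "energy": "source.energy",
                "power": "source.power",
                "map_code": "source.map_code",
                "datetime": "source.datetime",
                "fuel": "source.fuel",
            }
        )
        .execute()  # returns a dict of merge metrics (rows inserted, updated, ...)
    )
```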
Problem
I am running the upsert multiple times using the same source data. The first time, the Delta table is created and the total number of rows comes to 10,240,472, which matches the number of rows in the input dataframe. When I run it again with the same source data and no changes, I still see inserts in the metrics dictionary returned by the TableMerger execute method, and the row count I get after loading the Delta table back into a dataframe grows by the same amount.
I am making sure none of the columns used in the merge predicate (i.e. the primary-key columns, if you wish) contain NULL or NaN values. When I load the Delta table into a polars or pandas dataframe I can see the duplicates. I even went as far as querying the rows for a duplicated key and comparing the values column by column, row by row, and no differences are detected.
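For reference, this is roughly how I detect the duplicates after the second run (a minimal polars sketch; the table path is a placeholder and the key columns match the merge predicate above):

```python
import polars as pl

KEY_COLUMNS = ["area_type_code", "map_code", "fuel", "datetime", "period_granularity"]

# Load the Delta table and count rows per key; any count above 1 is a duplicate.
df = pl.read_delta("path/to/table")
dupes = (
    df.group_by(KEY_COLUMNS)
    .agg(pl.len().alias("n_rows"))
    .filter(pl.col("n_rows") > 1)
    .sort("n_rows", descending=True)
)
print(dupes)
```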
I also tried excluding the datetime column from the merge predicate and using the period_name column instead (essentially a string representation of the datetime column), but data still keeps being inserted.
My questions are:
Has anybody experienced something similar?
Any ideas on which additional checks I can perform to find out what may be going wrong?
Is there a recommended way to ensure deterministic matching on timestamp_ntz (e.g., normalization/precision) that I might be missing? (I sketch what I mean just after this list.)
Would a preliminary delete-then-merge be advisable here, or is there a more efficient safeguard?
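On the timestamp_ntz point, this is the kind of normalization I have in mind, applied to each chunk before the merge. It assumes that casting at the Arrow level to timezone-naive microsecond precision is the right way to line the source up with the target; the function name is just mine:

```python
import pyarrow as pa

def normalize_datetime(table: pa.Table, column: str = "datetime") -> pa.Table:
    """Cast the timestamp column to a fixed, timezone-naive microsecond precision
    so source and target values compare exactly in the merge predicate."""
    idx = table.schema.get_field_index(column)
    normalized = table.column(column).cast(pa.timestamp("us"))
    return table.set_column(idx, column, normalized)
```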
Thank you in advance for any pointers or checks I can run. Happy to provide more details.