Performance issue with UPDATE ... NOT EXISTS on large tables (Hash Anti Join, full scans) #91

@cobolbaby

Description

Hi,

I'm hitting serious performance issues when marking rows that exist in the source table but are missing from the target.

EXPLAIN
UPDATE pgcompare.dc_source s
SET compare_result = 'm'
WHERE s.tid = 282
  AND NOT EXISTS (
      SELECT 1
      FROM pgcompare.dc_target t
      WHERE t.tid = 282
        AND s.pk_hash = t.pk_hash
  );

"Update on dc_source s  (cost=1083743.09..5648966.56 rows=0 width=0)"
"  ->  Hash Anti Join  (cost=1083743.09..5648966.56 rows=41306268 width=20)"
"        Hash Cond: ((s.pk_hash)::text = (t.pk_hash)::text)"
"        ->  Seq Scan on dc_source s  (cost=0.00..2574867.50 rows=62358713 width=39)"
"              Filter: (tid = 282)"
"        ->  Hash  (cost=824329.60..824329.60 rows=12771079 width=39)"
"              ->  Seq Scan on dc_target t  (cost=0.00..824329.60 rows=12771079 width=39)"
"                    Filter: (tid = 282)"

I think it's hard to achieve meaningful optimization by tuning the SQL alone, so I have two questions:

  1. For tables with tens of millions of rows, is there a recommended batch size I could use as a reference? (A batching sketch I'm considering follows this list.)
  2. For a table this large, is it possible to avoid a full-table comparison and compare only a specific time range instead? (See the second sketch below.)
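
For question 1, here is the kind of batched variant I have in mind, as a minimal sketch only: it assumes dc_source has an indexed integer primary key named id, which is my assumption rather than something from the actual schema.

-- Hypothetical batching sketch: "id" is an assumed primary-key column,
-- and :last_id is a psql-style placeholder advanced after each pass.
-- Repeat until the UPDATE reports 0 rows affected.
UPDATE pgcompare.dc_source s
SET compare_result = 'm'
WHERE s.tid = 282
  AND s.id > :last_id
  AND s.id <= :last_id + 100000   -- batch size: a starting guess to tune
  AND NOT EXISTS (
      SELECT 1
      FROM pgcompare.dc_target t
      WHERE t.tid = 282
        AND s.pk_hash = t.pk_hash
  );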
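
For question 2, this is the shape of the time-range restriction I'm imagining, assuming both tables carry something like an updated_at timestamp column (again hypothetical):

-- Hypothetical time-window variant; updated_at is an assumed column.
UPDATE pgcompare.dc_source s
SET compare_result = 'm'
WHERE s.tid = 282
  AND s.updated_at >= now() - interval '1 day'   -- window to be decided
  AND NOT EXISTS (
      SELECT 1
      FROM pgcompare.dc_target t
      WHERE t.tid = 282
        AND s.pk_hash = t.pk_hash
  );

One concern with this: the NOT EXISTS still has to consider the full target set for tid = 282 unless the target is restricted the same way, so I'm not sure a time window alone avoids the expensive anti-join.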

Thanks for any advice.
