Performance issue with UPDATE ... NOT EXISTS on large tables (Hash Anti Join, full scans) #91

@cobolbaby

Description

Hi,

I'm hitting serious performance issues when marking rows that exist in the source table but are missing from the target.

EXPLAIN
UPDATE pgcompare.dc_source s
SET compare_result = 'm'
WHERE s.tid = 282
  AND NOT EXISTS (
      SELECT 1
      FROM pgcompare.dc_target t
      WHERE t.tid = 282
        AND s.pk_hash = t.pk_hash
  );

"Update on dc_source s  (cost=1083743.09..5648966.56 rows=0 width=0)"
"  ->  Hash Anti Join  (cost=1083743.09..5648966.56 rows=41306268 width=20)"
"        Hash Cond: ((s.pk_hash)::text = (t.pk_hash)::text)"
"        ->  Seq Scan on dc_source s  (cost=0.00..2574867.50 rows=62358713 width=39)"
"              Filter: (tid = 282)"
"        ->  Hash  (cost=824329.60..824329.60 rows=12771079 width=39)"
"              ->  Seq Scan on dc_target t  (cost=0.00..824329.60 rows=12771079 width=39)"
"                    Filter: (tid = 282)"

I think it's hard to achieve meaningful optimization by tuning the SQL alone, so I have two questions:

  1. For tables with tens of millions of rows, is there a recommended batch size I could use as a reference? (A batching sketch I'm considering follows this list.)
  2. For a table this large, is it possible to avoid a full-table comparison and compare only a specific time range instead? (See the second sketch below.)
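
For question 1, here is the kind of batched variant I have in mind, as a minimal sketch only: it assumes dc_source has an indexed integer primary key named id, which is my assumption rather than something from the actual schema.

-- Hypothetical batching sketch: "id" is an assumed primary-key column,
-- and :last_id is a psql-style placeholder advanced after each pass.
-- Repeat until the UPDATE reports 0 rows affected.
UPDATE pgcompare.dc_source s
SET compare_result = 'm'
WHERE s.tid = 282
  AND s.id > :last_id
  AND s.id <= :last_id + 100000   -- batch size: a starting guess to tune
  AND NOT EXISTS (
      SELECT 1
      FROM pgcompare.dc_target t
      WHERE t.tid = 282
        AND s.pk_hash = t.pk_hash
  );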
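
For question 2, this is the shape of the time-range restriction I'm imagining, assuming both tables carry something like an updated_at timestamp column (again hypothetical):

-- Hypothetical time-window variant; updated_at is an assumed column.
UPDATE pgcompare.dc_source s
SET compare_result = 'm'
WHERE s.tid = 282
  AND s.updated_at >= now() - interval '1 day'   -- window to be decided
  AND NOT EXISTS (
      SELECT 1
      FROM pgcompare.dc_target t
      WHERE t.tid = 282
        AND s.pk_hash = t.pk_hash
  );

One concern with this: the NOT EXISTS still has to consider the full target set for tid = 282 unless the target is restricted the same way, so I'm not sure a time window alone avoids the expensive anti-join.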

Thanks for any advice.
