@jiaqizho jiaqizho commented Aug 11, 2025

The autovacuum launcher process periodically launches workers to vacuum tables. During this process, the UDF `pg_catalog.gp_acquire_sample_rows` is called. The vacuum task is also frequently canceled by the launcher.

The plan of pg_catalog.gp_acquire_sample_rows is:

                                   QUERY PLAN
---------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..45.02 rows=3000 width=32)
   Output: (gp_acquire_sample_rows('17018'::oid, 1250, false))
   ->  ProjectSet  (cost=0.00..5.02 rows=1000 width=32)
         Output: gp_acquire_sample_rows('17018'::oid, 1250, false)
         ->  Result  (cost=0.00..0.01 rows=1 width=0)
 Optimizer: Postgres query optimizer
(6 rows)

In practice, we often see relcache leak warnings attributed to `pg_catalog.gp_acquire_sample_rows`. In fact, the warning is not caused by the UDF itself.

The complete steps to reproduce are as follows (the reproduction is not stable):

1. A user runs INSERT/UPDATE/DELETE statements with autovacuum enabled.
2. The autovacuum worker process calls `pg_catalog.gp_acquire_sample_rows`.
    2.1 The vacuum launcher on the master cancels the vacuum query.
    2.2 The vacuum worker on the master processes the interrupt in the interconnect,
        so the Gather Motion is aborted.
    2.3 The segment sends tuples in the motion (`doSendTuple`) but finds that the
        connection is no longer alive. It has not yet received SIGINT at this point,
        so it sets `StopRequested` to true and finishes the current motion.
        As a result, `pg_catalog.gp_acquire_sample_rows` in the ProjectSet node
        never gets the chance to call `table_close`.
    2.4 The segment calls `PortalDrop` to destroy the resource owner inside the current
        portal. The portal is not marked as failed, because the segment still
        has not received SIGINT.
        The resource owner finds the leaked relcache reference and logs the WARNING.
3. After step 2, the segments receive SIGINT, but there is nothing left to do.
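
The sequence above can be sketched with a small toy model. This is only an illustration, not Cloudberry code: `ResourceOwner`, `acquire_sample_rows`, `run_motion`, and their parameters are hypothetical names chosen to mirror the roles in the steps. It assumes that leftover references are reported only when the owner is released on the non-error path, which is the behavior step 2.4 describes.

```python
# Toy model only: every class and function name here is hypothetical and
# merely mirrors the roles described in the reproduction steps.

class ResourceOwner:
    """Tracks open relation references, loosely like a resource owner."""

    def __init__(self):
        self.open_relations = []
        self.warnings = []

    def table_open(self, relname):
        self.open_relations.append(relname)

    def table_close(self, relname):
        self.open_relations.remove(relname)

    def release(self, is_commit):
        # On the non-error path, leftover references are reported as leaks.
        # On the error path, they are cleaned up silently.
        for relname in self.open_relations:
            if is_commit:
                self.warnings.append(f"relcache reference leak: {relname}")
        self.open_relations.clear()


def acquire_sample_rows(owner, relname, nrows):
    """Set-returning function: closes the relation only after the last row."""
    owner.table_open(relname)
    for i in range(nrows):
        yield (relname, i)
    owner.table_close(relname)  # never reached if the caller stops early


def run_motion(owner, rows, connection_alive_for, cancel_received):
    """Send rows until the receiver disappears, then drop the portal."""
    sent = 0
    for _row in rows:
        if sent >= connection_alive_for:
            # Step 2.3: the receiver is gone, so stop quietly without an error.
            break
        sent += 1
    # Step 2.4: the portal counts as failed only if the cancel signal
    # has already arrived; otherwise the release takes the commit path.
    owner.release(is_commit=not cancel_received)
    return sent


owner = ResourceOwner()
rows = acquire_sample_rows(owner, "t1", 5)
run_motion(owner, rows, connection_alive_for=2, cancel_received=False)
print(owner.warnings)  # ['relcache reference leak: t1']
```

In this model the warning disappears either when the function runs to completion (so `table_close` is reached) or when the cancel has already been received (so the release takes the error path). This matches the claim that the UDF itself is not at fault.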

@my-ship-it my-ship-it force-pushed the fix-relcache-leak-in-auto-vac branch from 53479fd to 8719d2f Compare August 14, 2025 01:30
@my-ship-it my-ship-it merged commit f225eeb into apache:main Aug 14, 2025
27 checks passed