Add pg_vaccumen — proactive vacuum maintenance tool#15

Open
xtpclark wants to merge 24 commits into omniti-labs:master from xtpclark:pg_vaccumen

xtpclark commented Feb 5, 2026

Summary

  • Python replacement/companion to tools/manual_vacuum.sh for proactive vacuum maintenance
  • Vacuums tables before they hit autovacuum_freeze_max_age, spreading work over nightly runs to avoid emergency anti-wraparound vacuum spikes
  • Inspired by manual_vacuum.sh, modernized with Python/psycopg3

Features

  • Dynamic thresholds: Reads autovacuum_freeze_max_age at runtime, uses percentages so it adapts when the setting changes
  • Blocker detection: Long-running transactions, replication slots, prepared transactions
  • Autovacuum-aware: Skips tables already being vacuumed by autovacuum or other sessions
  • Parallel workers: --workers N with global concurrency control via advisory locks
  • Size filtering: --max-size <GB> to skip oversized tables, prioritize smaller ones
  • Transaction rate tracking: Estimates days until forced anti-wraparound autovacuum, using a stored baseline
  • Metrics collection: Stores vacuum duration/size history for trend analysis
  • Bloat analysis: --check-bloat for dead tuple ratio reporting via pg_stat_user_tables
  • Least-privilege support: Runs with pg_monitor + pg_maintain — no superuser needed
  • Jenkins-ready: Exit codes (0=OK, 1=error, 2=warning, 3=critical) with parameterized Jenkinsfile
  • Works on any PostgreSQL 9.4+ using --host (AWS Aurora integration is optional via --cluster)
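The dynamic-threshold idea can be sketched as below. This is a minimal illustration, not the code in pg_vaccumen.py: the function names and the `threshold_pct` parameter are hypothetical. It assumes the caller has already read `autovacuum_freeze_max_age` and each table's `age(relfrozenxid)` from the server.

```python
def pct_of_freeze_max(xid_age: int, freeze_max_age: int) -> float:
    """Return a table's XID age as a percentage of autovacuum_freeze_max_age."""
    if freeze_max_age <= 0:
        raise ValueError("freeze_max_age must be positive")
    return 100.0 * xid_age / freeze_max_age


def needs_proactive_vacuum(xid_age: int, freeze_max_age: int,
                           threshold_pct: float) -> bool:
    """True when the table has reached the (hypothetical) percentage threshold."""
    return pct_of_freeze_max(xid_age, freeze_max_age) >= threshold_pct
```

Because the threshold is a percentage rather than an absolute XID count, the same `threshold_pct` keeps its meaning if `autovacuum_freeze_max_age` is later raised or lowered.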

Files

  • tools/pg_vaccumen/pg_vaccumen.py — Main script
  • tools/pg_vaccumen/Jenkinsfile — Jenkins pipeline
  • tools/pg_vaccumen/requirements.txt — Dependencies (boto3, psycopg)
  • tools/pg_vaccumen/README.md — Full documentation

Production Tested

Running nightly against Aurora PostgreSQL 17 clusters with 750M autovacuum_freeze_max_age, 112M+ transactions/day, and 16TB tables. Successfully brought a cluster from 86.5% of freeze max down to 11% through proactive maintenance.
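For a sense of scale, the "days until autovacuum" estimate can be approximated with simple arithmetic. The sketch below is an assumption about how such an estimate could be computed, not the tool's actual formula; the function name is hypothetical, and it assumes a constant transaction rate.

```python
def days_until_forced_autovacuum(xid_age: int, freeze_max_age: int,
                                 xids_per_day: float) -> float:
    """Estimate days until a table reaches autovacuum_freeze_max_age.

    Assumes XIDs are consumed at a constant rate (xids_per_day).
    """
    if xids_per_day <= 0:
        raise ValueError("xids_per_day must be positive")
    remaining = max(freeze_max_age - xid_age, 0)
    return remaining / xids_per_day
```

With the figures above (86.5% of 750M, i.e. an age of 648.75M, at 112M transactions/day), this gives roughly 0.9 days of headroom.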

Skip tables larger than N GB with --max-size to focus on smaller tables
when a few monsters dominate the queue. Automatically detect and skip
tables where autovacuum is already running (opt out with
--no-skip-autovacuum). Dry-run output now shows table sizes and
[autovacuum running] annotations.
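A rough sketch of the selection step described above: drop oversized tables, drop tables that are already being vacuumed, and process the smallest first. The `Candidate` type and `select_tables` function are hypothetical names, and treating "GB" as GiB (1024^3 bytes) is an assumption.

```python
from typing import FrozenSet, List, NamedTuple, Optional

GIB = 1024 ** 3  # assumption: --max-size is interpreted in GiB


class Candidate(NamedTuple):
    name: str
    size_bytes: int


def select_tables(candidates: List[Candidate],
                  max_size_gb: Optional[float] = None,
                  being_vacuumed: FrozenSet[str] = frozenset()) -> List[Candidate]:
    """Filter out oversized or already-vacuuming tables, smallest first."""
    selected = [
        c for c in candidates
        if c.name not in being_vacuumed
        and (max_size_gb is None or c.size_bytes <= max_size_gb * GIB)
    ]
    return sorted(selected, key=lambda c: c.size_bytes)
```

Raising `max_size_gb` between runs gives the phased approach: small tables clear quickly, and the monsters are admitted once the queue is short.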

PostgreSQL may return timestamps with 1-5 fractional digits (e.g.
.44841), but datetime.fromisoformat in Python < 3.11 accepts only 0, 3,
or 6 fractional digits. Pad the fraction to 6 digits before parsing.
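The padding fix can be illustrated as follows. This is a sketch of the technique rather than the committed code; `parse_pg_timestamp` is a hypothetical name, and it only normalizes the fractional part (UTC-offset formats are not touched).

```python
import re
from datetime import datetime

# First "." followed by 1-6 digits: the fractional-seconds part.
_FRACTION = re.compile(r"\.(\d{1,6})")


def parse_pg_timestamp(text: str) -> datetime:
    """Parse an ISO-like timestamp, right-padding the fraction to 6 digits."""
    def pad(match):
        return "." + match.group(1).ljust(6, "0")

    return datetime.fromisoformat(_FRACTION.sub(pad, text, count=1))
```

For example, `parse_pg_timestamp("2026-02-05 03:14:15.44841")` is parsed as if the input had ended in `.448410`.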

Anonymized walkthrough: triage a 245-table backlog with monster tables,
lower threshold to find hidden work, vacuum small tables first with
--max-size, then increase size limit in phases.

VACUUM on large tables can run for hours. If the connection role has a
statement_timeout configured, it would cancel the vacuum mid-operation.
Always set statement_timeout explicitly at connection time (default 0 =
no timeout).

The rollback + autocommit toggle resets session state. Set
statement_timeout directly in autocommit mode right before VACUUM
to guarantee it takes effect.
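The ordering described here might look like the sketch below. It is an illustration under assumptions, not the script's code: `vacuum_table` and `quote_ident` are hypothetical helpers, a plain `VACUUM` is shown because the exact options are not stated, and `conn` is expected to behave like a psycopg 3 connection (`rollback()`, an `autocommit` attribute, and `execute()`).

```python
def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def vacuum_table(conn, schema: str, table: str, timeout_ms: int = 0) -> None:
    """Run VACUUM with statement_timeout set immediately beforehand."""
    # End any open transaction first: VACUUM cannot run inside one, and
    # autocommit cannot be switched on mid-transaction.
    conn.rollback()
    conn.autocommit = True
    # Set the timeout after the toggle so nothing can reset it before VACUUM.
    conn.execute(f"SET statement_timeout = {int(timeout_ms)}")
    conn.execute(f"VACUUM {quote_ident(schema)}.{quote_ident(table)}")
```

Issuing the `SET` as the last statement before `VACUUM` keeps the guarantee local: the timeout in effect does not depend on what earlier code did to the session.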

All Python CLI options now have corresponding Jenkins parameters.

Reorder sections so new users see installation, sample output, and the
real-world scenario before hitting the locking behavior and threshold
theory deep-dives. Update sample output to show new size column and
autovacuum annotations. Add all Jenkins parameters to pipeline table.

- --check-bloat / --bloat-pct: dead tuple analysis via pg_stat_user_tables,
  reports tables with a high dead/live ratio as pg_repack candidates
- --workers N: parallel vacuum via ThreadPoolExecutor with Queue-based
  connection pool, hard cap at 8, auto-reduced by validate_workers()
  checking maintenance_work_mem and max_connections headroom
- get_vacuum_activity(): detects both autovacuum workers AND manual
  VACUUM from other sessions, annotates dry-run output accordingly
- README: parallel workers warnings, killing a running vacuum guide,
  bloat analysis docs, tuning guidance additions
- Jenkinsfile: CHECK_BLOAT, BLOAT_PCT, WORKERS parameters
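
As an illustration of the dead/live ratio check, here is a small sketch. It assumes rows of `(relname, n_live_tup, n_dead_tup)` have already been fetched from `pg_stat_user_tables`; the function name and the handling of tables with zero live tuples are assumptions, not the tool's documented behavior.

```python
from typing import Iterable, List, Tuple


def bloat_candidates(rows: Iterable[Tuple[str, int, int]],
                     bloat_pct: float) -> List[Tuple[str, float]]:
    """Return (table, dead/live percent) pairs at or above bloat_pct, worst first."""
    found = []
    for name, live, dead in rows:
        if dead == 0:
            continue  # nothing to report
        # Assumption: dead tuples with no live tuples count as infinitely bloated.
        ratio = 100.0 * dead / live if live > 0 else float("inf")
        if ratio >= bloat_pct:
            found.append((name, ratio))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Note that these counters are statistics-collector estimates, so the result is a list of candidates to inspect, not a measurement of on-disk bloat.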

The previous Ctrl+C fix only covered the parallel (--workers) path.
Single-worker vacuum now catches KeyboardInterrupt and prints a
clean summary instead of a traceback.

Instead of relying on a stale snapshot taken before the vacuum loop,
query pg_stat_activity per-table just before issuing VACUUM. This
detects vacuums started by other pg_vaccumen instances or autovacuum
workers that began after our initial check.

Applies to both sequential and parallel (--workers) execution paths.
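
A per-table check of this kind could be sketched as below. This is a heuristic illustration, not the logic in get_vacuum_activity(): the query, the function name, and the text-matching approach are all assumptions. It relies on autovacuum workers reporting query text of the form `autovacuum: VACUUM schema.table`.

```python
import re

# Assumed query: fetch the text of every other active backend's statement.
ACTIVE_QUERIES_SQL = """
SELECT query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
"""

_VACUUM_PREFIX = re.compile(r"\s*(autovacuum:\s*)?vacuum\b", re.IGNORECASE)


def is_being_vacuumed(active_queries, schema: str, table: str) -> bool:
    """True if any active query looks like a VACUUM of schema.table."""
    target = re.compile(
        r'(?<![\w."])(?:"?%s"?\.)?"?%s"?(?![\w.])'
        % (re.escape(schema), re.escape(table)),
        re.IGNORECASE,
    )
    return any(
        _VACUUM_PREFIX.match(q) and target.search(q)
        for q in active_queries
    )
```

An unqualified `VACUUM orders` matches any schema. That errs on the side of skipping, which is the safe direction for a tool that can simply retry the table on the next run.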

Every VACUUM operation (sequential or parallel) must acquire a
numbered advisory lock slot before executing. All pg_vaccumen
instances on the same database share the lock namespace, so
--workers N becomes a global concurrency limit rather than
per-instance.

- VACUUM_LOCK_NAMESPACE (0x70675661) for advisory lock keys
- acquire_vacuum_slot(): blocks with polling until a slot is free
- release_vacuum_slot(): frees the slot after VACUUM completes
- get_vacuum_slots_in_use(): reports slot status before execution
- Locks auto-release on disconnect (crash-safe, no stale state)
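
A minimal sketch of the slot scheme, assuming the two-key form `pg_try_advisory_lock(key1, key2)` with the namespace as the first key and the slot number as the second. The function names mirror the list above, but the bodies, the slot numbering from 1, the `poll_seconds` default, and the `sleep` parameter are assumptions. `conn` is expected to behave like a psycopg 3 connection.

```python
import time

VACUUM_LOCK_NAMESPACE = 0x70675661  # value quoted in the commit message


def acquire_vacuum_slot(conn, max_slots: int, poll_seconds: float = 5.0,
                        sleep=time.sleep) -> int:
    """Block (by polling) until one of max_slots advisory locks is obtained."""
    while True:
        for slot in range(1, max_slots + 1):
            row = conn.execute(
                "SELECT pg_try_advisory_lock(%s, %s)",
                (VACUUM_LOCK_NAMESPACE, slot),
            ).fetchone()
            if row[0]:
                return slot
        sleep(poll_seconds)  # every slot is busy; wait and retry


def release_vacuum_slot(conn, slot: int) -> None:
    """Release a slot previously returned by acquire_vacuum_slot."""
    conn.execute(
        "SELECT pg_advisory_unlock(%s, %s)",
        (VACUUM_LOCK_NAMESPACE, slot),
    )
```

Session-level advisory locks belong to the connection that took them, so the slot has to be acquired on a connection that stays open for the duration of the VACUUM. That same property is what makes the scheme crash-safe: a dropped connection releases its slot.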

Update README to reflect that --workers is a global concurrency limit
enforced via PostgreSQL advisory locks across all pg_vaccumen
instances on the same database.

The ThreadPoolExecutor context manager's __exit__ calls
shutdown(wait=True) before the KeyboardInterrupt handler runs,
blocking on the first Ctrl+C. Fix by managing the executor
directly with shutdown(wait=False).

Also suppress spurious "FAILED: connection socket closed" messages
from workers whose connections are closed during shutdown, using
a threading.Event flag.
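
The pattern can be sketched as follows. This is an illustration of the technique, not the committed code: `run_parallel` and the `shutting_down` parameter are hypothetical names, and jobs are assumed to be zero-argument callables.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_parallel(jobs, workers: int, shutting_down: threading.Event):
    """Run jobs in a pool without letting Ctrl+C block on running workers."""
    # No "with" block: its __exit__ would call shutdown(wait=True) and
    # block before the except clause below gets a chance to run.
    executor = ThreadPoolExecutor(max_workers=workers)
    futures = [executor.submit(job) for job in jobs]
    results = []
    try:
        for future in as_completed(futures):
            results.append(future.result())
    except KeyboardInterrupt:
        # Tell workers to stay quiet about errors caused by the shutdown.
        shutting_down.set()
        executor.shutdown(wait=False)
        raise
    executor.shutdown(wait=True)
    return results
```

A worker can check `shutting_down.is_set()` in its exception handler and skip the "FAILED" message when the error is only a side effect of connections being closed during shutdown.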

Rewrite the backlog catch-up walkthrough based on actual production
experience: save baseline first, use parallel workers, and document
the vacuum_freeze_min_age floor that causes infinite re-vacuum at a
25% threshold.
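
The floor can be shown with a small worked check. It uses a simplified model, stated here as an assumption: after a non-freezing VACUUM, a table's age(relfrozenxid) can remain as high as vacuum_freeze_min_age, so a threshold at or below that value can never be cleared. The function name is hypothetical.

```python
def threshold_can_be_cleared(threshold_pct: float, freeze_max_age: int,
                             freeze_min_age: int) -> bool:
    """True if a vacuumed table can end up below the trigger threshold.

    Simplified model: post-vacuum XID age may be as high as freeze_min_age.
    """
    threshold_xids = freeze_max_age * threshold_pct / 100.0
    return threshold_xids > freeze_min_age
```

With PostgreSQL's default settings (autovacuum_freeze_max_age = 200M, vacuum_freeze_min_age = 50M), a 25% threshold is exactly 50M XIDs, so a freshly vacuumed table can still sit at the threshold and be selected again on every run.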

Document the minimum PostgreSQL grants (pg_monitor, pg_maintain,
schema CREATE/USAGE) needed for a dedicated pg_vaccumen service
account, with fallback notes for PostgreSQL < 16.