Skip to content

[DPE-9356] Update PostgreSQL mount paths to /var/lib/pg with 16/main subdirectory#1186

Merged
marceloneppel merged 40 commits into16/edgefrom
change-storage-mounts
Mar 5, 2026
Merged

[DPE-9356] Update PostgreSQL mount paths to /var/lib/pg with 16/main subdirectory#1186
marceloneppel merged 40 commits into16/edgefrom
change-storage-mounts

Conversation

@marceloneppel
Copy link
Member

@marceloneppel marceloneppel commented Dec 18, 2025

Issue

The database uses the mount points provided by Juju as the data directories. This fails the database bootstrap process when there is some content in those mount points, like lost+found.

Also, the storage mount paths use /var/lib/postgresql/*, which doesn't follow the 16/main subdirectory convention expected by PostgreSQL tooling.

Solution

Following the spec DA230, migrate storage mount paths from /var/lib/postgresql/* to /var/lib/pg/* with a 16/main subdirectory structure, and create a compatibility symlink from the standard PostgreSQL data path.

Path changes

Storage Old path New path
Archive /var/lib/postgresql/archive /var/lib/pg/archive/16/main
Data /var/lib/postgresql/data/pgdata /var/lib/pg/data/16/main
Logs (WAL) /var/lib/postgresql/logs /var/lib/pg/logs/16/main/pg_wal
Temp /var/lib/postgresql/temp /var/lib/pg/temp/16/main/pgsql_tmp

Key changes

  • Storage mounts (metadata.yaml): Update all four storage mount locations to /var/lib/pg/*
  • Named constants (constants.py): Extract storage paths into ARCHIVE_STORAGE_PATH, DATA_STORAGE_PATH, LOGS_STORAGE_PATH, TEMP_STORAGE_PATH and use them consistently across the codebase
  • Symlink compatibility: Create a symlink from /var/lib/postgresql/16/main/var/lib/pg/data/16/main so PostgreSQL and Patroni reference the canonical path; always recreate the symlink (removing any real directory the OCI image ships first)
  • Patroni & pgBackRest templates: Update patroni.yml.j2 and pgbackrest.conf.j2 with new data_dir, waldir, and temp tablespace paths
  • Backup/restore (backups.py): Clear all storage directories during restore preparation so pg_basebackup --waldir/--tablespace options work on replicas; use actual pgdata path (not symlink) for file operations
  • Async replication (async_replication.py): Prevent system ID mismatch by only clearing pgdata on non-leaders after the standby leader is confirmed running; check member_started before clearing pgdata on standby units; fix disk usage check to use the actual pgdata path
  • Replica startup (charm.py): Defer replica startup until its endpoint is present in the members list, preventing pg_basebackup rejections; clear stale WAL/temp directories repeatedly until pgdata is fully populated (PG_VERSION exists)
  • Security: Fix shell injection risk by passing command arguments as lists instead of split strings
  • Upgrade tests disabled: Temporarily disable upgrade-related spread tests since upgrades cannot be performed when storage definitions change
  • New integration test: Add test_symlinks.py to verify the data directory symlink works correctly

Checklist

  • I have added or updated any relevant documentation.
  • I have cleaned any remaining cloud resources from my accounts.

Fixes: #1009

Migrate storage mount paths from /var/lib/postgresql/* to /var/lib/pg/*
and configure PostgreSQL 16 data directories to use the 16/main
subfolder structure.

Path changes:
- Archive: /var/lib/postgresql/archive → /var/lib/pg/archive
- Data:    /var/lib/postgresql/data/pgdata → /var/lib/pg/data/16/main
- Logs:    /var/lib/postgresql/logs → /var/lib/pg/logs/16/main (WAL)
- Temp:    /var/lib/postgresql/temp → /var/lib/pg/temp/16/main

Key changes:
- Update metadata.yaml storage mount locations
- Update POSTGRESQL_DATA_PATH constant
- Configure pgdata_path, waldir, and temp tablespace paths
- Add make_parents=True to Pebble make_dir for nested directory creation
- Update pgbackrest.conf.j2 with new data path
- Update patroni.yml.j2 with new data_dir and waldir
- Update async_replication.py cleanup paths
- Update authorisation_rules_observer.py Patroni config path

Note: Archive path (/var/lib/pg/archive) intentionally does not use the
16/main subfolder as it is not used in Patroni or PostgreSQL configuration.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@github-actions github-actions bot added the Libraries: Out of sync The charm libs used are out-of-sync label Dec 18, 2025
@codecov
Copy link

codecov bot commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 53.01205% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.55%. Comparing base (ccb262c) to head (26f96b4).
⚠️ Report is 1 commits behind head on 16/edge.

Files with missing lines Patch % Lines
src/charm.py 54.05% 14 Missing and 3 partials ⚠️
src/patroni.py 14.28% 12 Missing ⚠️
src/relations/async_replication.py 46.66% 8 Missing ⚠️
src/backups.py 80.00% 0 Missing and 2 partials ⚠️

❌ Your project status has failed because the head coverage (68.55%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           16/edge    #1186      +/-   ##
===========================================
- Coverage    68.96%   68.55%   -0.42%     
===========================================
  Files           16       16              
  Lines         3809     3864      +55     
  Branches       574      588      +14     
===========================================
+ Hits          2627     2649      +22     
- Misses         980     1008      +28     
- Partials       202      207       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…ix_pod

Patch the OCI image health-check script (/scripts/self-signed-checker.py) to use the
charm's storage path for the peer CA file instead of the hardcoded
/var/lib/postgresql/data/peer_ca.pem. Make the operation idempotent and
non-blocking, and call it from pebble-ready and early pod-fix flows to avoid
race conditions with Pebble layer installation.

Add _fix_health_check_script(container) to read, replace, and write the script.
Log and handle pull/push errors; on push failure attempt an in-container sed
fallback (exec ["sh","-c","sed -i 's|OLD|NEW|g' ..."]).
Invoke the fixer from _on_postgresql_pebble_ready (after _create_pgdata) and
from _fix_pod (early) so the script is corrected as early as possible.
Update unit tests to cover the patching behaviour and the exec fallback.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
- Replace broad try/except/pass patterns with contextlib.suppress where appropriate
  (fixes SIM105 and S110 warnings).
- Ensure the _fix_health_check_script fallback exec wait is wrapped with
  contextlib.suppress(Exception) to avoid swallowing exceptions without intent.
- Use contextlib.suppress to defensively guard _fix_pod health-check fixes while
  still logging inner failures.
- Re-order and tidy imports to satisfy ruff import ordering rules.
- Apply ruff format across src/, tests/, and scripts/ to normalize formatting.

Notes:
- These are non-functional changes (linting / formatting); behavior unchanged.
- Formatting touched multiple test and helper files; no logic was altered.
- Tests were not executed as part of this change; please run the unit/integration
  suites locally (tox -e unit / tox -e integration) before pushing if desired.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
- Apply ruff/auto-formatting to integration test files to satisfy lint/format checks:
  - tests/integration/ha_tests/conftest.py
  - tests/integration/ha_tests/helpers.py
  - tests/integration/test_invalid_database.py
  - tests/integration/test_invalid_extra_user_roles.py

- Non-functional: only whitespace/line-wrapping/formatting changes were made.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Temporarily disable the upgrade-related integration tests by constraining
their spread task configurations so they will not be scheduled in CI.

Changes:
- Added a `systems:` block to the following spread task files to prevent
  these tasks from running across the broader CI fleet:
  - tests/spread/test_upgrade.py/task.yaml
  - tests/spread/test_upgrade_skip_pre_upgrade_check.py/task.yaml
  - tests/spread/test_upgrade_from_stable.py/task.yaml
  - tests/spread/test_async_replication_upgrade.py/task.yaml

Reason:
- Upgrades cannot be performed when storage definitions change. These
  tests fail in situations where storage definitions are modified, so
  they are being disabled until the upgrade path for storage changes is
  resolved.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
- Add PGDATA_PATH = f"{STORAGE_PATH}/16/main"
- Update tests/integration/test_storage.py and tests/integration/test_charm.py to use PGDATA_PATH
- Fix import ordering and lint issues

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
marceloneppel and others added 20 commits January 19, 2026 16:17
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
* Update canonical/data-platform-workflows action to v41 (#1211)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

* feat(storage): add symlink from default PostgreSQL data path

Create a symlink from /var/lib/postgresql/16/main to /var/lib/pg/data/16/main
to provide compatibility with the standard PostgreSQL data directory location.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

---------

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…ql-k8s-operator into change-storage-mounts

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Update POSTGRESQL_DATA_PATH and pgdata_path to use the symlink
path (/var/lib/postgresql/16/main) instead of the actual storage
path (/var/lib/pg/data/16/main) for Patroni, pgbackrest, and tests.

This ensures all components reference the canonical symlink path
that PostgreSQL and Patroni expect.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Non-leader units in a standby cluster now clear pgdata only after the
standby leader is confirmed running, preventing system ID mismatch errors
during pg_basebackup. Also fixes disk usage check to use actual pgdata
path and handles Patroni API unavailability gracefully during slot sync.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…erations

Update pgbackrest config template to use the proper POSTGRESQL_DATA_PATH
constant and fix _empty_data_files to remove the actual directory contents
rather than just the symlink, ensuring backup and restore operations work
correctly.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
- Update data directory paths to match current structure (/var/lib/pg/data/16/main, etc.)
- Replace rm -rf with find -delete to preserve mount point directories

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
… compatibility

Clear archive, data, logs, and temp directories during backup restore
preparation. This ensures pg_basebackup can use --waldir and --tablespace
options when new replicas join after a restore, as these options require
empty directories.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Remove the logic that cleared logs and temp directories when a non-leader
unit rejoined with an empty data directory. This cleanup was intended for
pg_basebackup during restore operations but is no longer needed.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Clear WAL and tablespace directories when a replica joins an initialized
cluster, as PersistentVolumes may retain data from previous pods and
pg_basebackup requires empty target directories.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…n standby

When a new unit joins an existing standby cluster, check if the database
is already running before clearing pgdata. Previously, _clear_pgdata()
was called unconditionally before the member_started check, which could
delete the data directory while PostgreSQL was running - causing it to
crash and leaving the unit stuck in "awaiting for member to start".

Now the order is:
1. If member is already started, return early (unit is working)
2. If member is not started, clear pgdata if needed and continue setup

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
The `_fix_health_check_script` method was a workaround for an incorrect
hardcoded certificate path in the OCI image. This is no longer needed
as the upstream image has been fixed. Update the OCI image reference
accordingly.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…s are cleared on retry

The OCI image ships /var/lib/postgresql/16/main as a real directory,
which prevents the symlink from being created on replicas. Always
recreate the symlink (removing the real directory first if needed)
instead of skipping when the path already exists.

Also change stale storage directory cleanup to repeat until pgdata is
fully populated (PG_VERSION exists), rather than running only once per
unit lifecycle. This ensures retries after a failed pg_basebackup find
clean directories.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
dragomirp
dragomirp previously approved these changes Feb 17, 2026
taurus-forever
taurus-forever previously approved these changes Mar 3, 2026
Copy link
Contributor

@taurus-forever taurus-forever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exceptionally well done! Nothing found after incentive testing. 🥳

One question: PR description says /var/lib/pg/archive/16/main
but the deployed system has no 16/main for archive:

root@postgresql-k8s-3:/var/lib/postgresql# ls -la /var/lib/pg/archive/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
total 24                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
drwxr-xr-x 3 postgres postgres  4096 Mar  2 23:31 .                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
drwxr-xr-x 6 root     root      4096 Mar  2 23:32 ..                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
drwx------ 2 root     root     16384 Mar  2 23:31 lost+found  

Outdated PR description? @marceloneppel

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@marceloneppel marceloneppel requested a review from a team as a code owner March 3, 2026 13:42
@marceloneppel marceloneppel requested review from carlcsaposs-canonical, dragomirp and taurus-forever and removed request for a team March 3, 2026 13:42
@github-actions github-actions bot added Libraries: Out of sync The charm libs used are out-of-sync and removed Libraries: OK The charm libs used are OK and in-sync labels Mar 3, 2026
taurus-forever
taurus-forever previously approved these changes Mar 3, 2026
Replace CHARM_BASE references with CHARM_BASE_NOBLE in HA self-healing
and new relations integration tests

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@marceloneppel
Copy link
Member Author

We can merge either this PR or #1325 first, then rebase, because they touch similar places regarding the base fix for the test app.

Exceptionally well done! Nothing found after incentive testing. 🥳

One question: PR description says /var/lib/pg/archive/16/main but the deployed system has no 16/main for archive:

root@postgresql-k8s-3:/var/lib/postgresql# ls -la /var/lib/pg/archive/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
total 24                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
drwxr-xr-x 3 postgres postgres  4096 Mar  2 23:31 .                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
drwxr-xr-x 6 root     root      4096 Mar  2 23:32 ..                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
drwx------ 2 root     root     16384 Mar  2 23:31 lost+found  

About that, I synced with @taurus-forever to create a follow-up PR to fix that.

@marceloneppel marceloneppel merged commit 40e678b into 16/edge Mar 5, 2026
150 of 154 checks passed
@marceloneppel marceloneppel deleted the change-storage-mounts branch March 5, 2026 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Libraries: Out of sync The charm libs used are out-of-sync

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PG16-K8s cannot be deployed to Canonical K8s. Error: calling self.fix_leader_annotation() (PG14-K8s works)

4 participants