Skip to content

[DPE-6762] Catch archive command timeout#1328

Merged
marceloneppel merged 5 commits intomainfrom
dpe-6762-catch-archive-timeout-error
Dec 16, 2025
Merged

[DPE-6762] Catch archive command timeout#1328
marceloneppel merged 5 commits intomainfrom
dpe-6762-catch-archive-timeout-error

Conversation

@marceloneppel
Copy link
Member

@marceloneppel marceloneppel commented Dec 2, 2025

Issue

If pgBackRest cannot archive a WAL file in less than 60 seconds, it will fail it's check command and the charm will be stuck in a blocked status.

Solution

In the stanza check step, handle the error code 82 (Archive operation timeout) in the same way we already do for error code 49 (Cannot connect to host) in the stanza initialisation step:

if return_code == 49:
. This way, the user can fix the problem and run juju resolve.

Fixes #784
Fixes #794

Tested manually by adding sleep 90 && before the actual archive command in templates/patroni.yml.j2 and deploying the charm + connecting it to Microceph's RadosGW.

pgBackRest error codes extracted from source code: error-codes.csv. I think we shouldn't turn the charm into an error state for every error code, so I'm limiting it to error code 82 for now. We may expand the list of error codes in which we do that in the charm in the future.

Also, sometimes the error message from pgBackRest is sent to stdout instead of stderr (and we don't log them, as seen in #1280), so it's important to check both. I ported the changes from #1320 to this PR to handle that situation.

Checklist

  • I have added or updated any relevant documentation.
  • I have cleaned any remaining cloud resources from my accounts.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@github-actions github-actions bot added the Libraries: Out of sync The charm libs used are out-of-sync label Dec 2, 2025
@codecov
Copy link

codecov bot commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 85.00000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.84%. Comparing base (5d96db0) to head (ad38836).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/backups.py 84.61% 5 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1328      +/-   ##
==========================================
+ Coverage   75.75%   75.84%   +0.08%     
==========================================
  Files          16       16              
  Lines        4187     4218      +31     
  Branches      633      640       +7     
==========================================
+ Hits         3172     3199      +27     
- Misses        793      796       +3     
- Partials      222      223       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@marceloneppel marceloneppel added the bug Something isn't working as expected label Dec 2, 2025
Sometimes the error message is sent to the stdout instead of the stderr, so it's important to check both

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@marceloneppel marceloneppel marked this pull request as ready for review December 3, 2025 13:23
taurus-forever
taurus-forever previously approved these changes Dec 9, 2025
Copy link
Contributor

@taurus-forever taurus-forever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Copy link
Contributor

@onkolahmet onkolahmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes Marcelo! Looks great overall, just had a few questions.

- Replace magic number 82 with PGBACKREST_ARCHIVE_TIMEOUT_ERROR_CODE constant
- Use semicolon separator for multiple error messages (improved readability)
- Document TimeoutError handling to preserve error state for juju resolve
- Update corresponding unit tests

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
marceloneppel added a commit that referenced this pull request Dec 10, 2025
Concatenate multiple ERROR/WARN lines with "; " instead of " " for
improved readability when debugging backup issues.
Updated unit test to match new separator format.
Cherry-picked from #1328 (commit 1f1b601).

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
marceloneppel added a commit that referenced this pull request Dec 12, 2025
Port archive timeout error handling (error code 82) from PR #1328 to
allow users to fix network issues and retry with `juju resolve`.
When archive operations timeout, the charm enters error state instead
of blocked state, enabling recovery via juju resolve.
Fixes #1346

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
marceloneppel added a commit that referenced this pull request Dec 12, 2025
Port archive timeout error handling (error code 82) from PR #1328 to
allow users to fix network issues and retry with `juju resolve`.
When archive operations timeout, the charm enters error state instead
of blocked state, enabling recovery via juju resolve.
Fixes #1346

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
taurus-forever
taurus-forever previously approved these changes Dec 15, 2025
Copy link
Contributor

@taurus-forever taurus-forever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

onkolahmet
onkolahmet previously approved these changes Dec 15, 2025
Copy link
Contributor

@onkolahmet onkolahmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks Marcelo!

marceloneppel added a commit that referenced this pull request Dec 15, 2025
Port archive timeout error handling (error code 82) from PR #1328 to
allow users to fix network issues and retry with `juju resolve`.
When archive operations timeout, the charm enters error state instead
of blocked state, enabling recovery via juju resolve.
Fixes #1346

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Co-authored-by: Carl Csaposs <carl.csaposs@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@gmail.com>
@marceloneppel marceloneppel merged commit 8d45417 into main Dec 16, 2025
464 of 478 checks passed
@marceloneppel marceloneppel deleted the dpe-6762-catch-archive-timeout-error branch December 16, 2025 11:04
marceloneppel added a commit that referenced this pull request Dec 16, 2025
* Extract warning and error messages from pgBackRest logs

Sometimes the error message is sent to the stdout instead of the stderr, so it's important to check both

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* Use semicolon separator for pgBackRest error messages

Concatenate multiple ERROR/WARN lines with "; " instead of " " for
improved readability when debugging backup issues.
Updated unit test to match new separator format.
Cherry-picked from #1328 (commit 1f1b601).

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* [DPE-9070] Handle pgBackRest archive timeout in check_stanza (#1350)

Port archive timeout error handling (error code 82) from PR #1328 to
allow users to fix network issues and retry with `juju resolve`.
When archive operations timeout, the charm enters error state instead
of blocked state, enabling recovery via juju resolve.
Fixes #1346

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* Standardize pgBackRest logging to use stderr consistently

Addresses code review feedback from PR #1320 (review #3577653265) about
potential log duplication when checking both stdout and stderr for errors.

Changes:
- Add PGBACKREST_LOG_LEVEL_STDERR constant (--log-level-stderr=warn) to
  ensure all pgBackRest commands output errors/warnings to stderr
- Update all 9 pgBackRest command invocations to include the new flag
- Simplify _extract_error_message() to only check stderr (removes stdout
  parameter entirely)
- Update all 8 call sites and 7 unit tests to use new single-parameter API
- Remove redundant test for stdout handling

This approach eliminates the risk of log duplication while maintaining
predictable error extraction. All errors and warnings now consistently
appear in stderr, making error handling more reliable.

Reference: https://pgbackrest.org/configuration.html#section-log
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* Revert "Standardize pgBackRest logging to use stderr consistently"

This reverts commit 13a9ac4.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* Reapply "Standardize pgBackRest logging to use stderr consistently"

This reverts commit 1c3eb04.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* Fix test_initialise_stanza and cleanup unused variables

- Add --log-level-stderr=warn to test_initialise_stanza expected command to
  match implementation from commit d3a46b6
- Replace unused stdout variables with '_' in check_stanza,
  _is_primary_pgbackrest_service_running, and _on_restore_action
- Add --log-level-console=debug to backup command for detailed output
Fixes assertion error in tests/unit/test_backups.py::test_initialise_stanza

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

---------

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
a-velasco pushed a commit that referenced this pull request Jan 15, 2026
* Extract warning and error messages from pgBackRest logs

Sometimes the error message is sent to the stdout instead of the stderr, so it's important to check both

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* Use semicolon separator for pgBackRest error messages

Concatenate multiple ERROR/WARN lines with "; " instead of " " for
improved readability when debugging backup issues.
Updated unit test to match new separator format.
Cherry-picked from #1328 (commit 1f1b601).

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* [DPE-9070] Handle pgBackRest archive timeout in check_stanza (#1350)

Port archive timeout error handling (error code 82) from PR #1328 to
allow users to fix network issues and retry with `juju resolve`.
When archive operations timeout, the charm enters error state instead
of blocked state, enabling recovery via juju resolve.
Fixes #1346

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* Standardize pgBackRest logging to use stderr consistently

Addresses code review feedback from PR #1320 (review #3577653265) about
potential log duplication when checking both stdout and stderr for errors.

Changes:
- Add PGBACKREST_LOG_LEVEL_STDERR constant (--log-level-stderr=warn) to
  ensure all pgBackRest commands output errors/warnings to stderr
- Update all 9 pgBackRest command invocations to include the new flag
- Simplify _extract_error_message() to only check stderr (removes stdout
  parameter entirely)
- Update all 8 call sites and 7 unit tests to use new single-parameter API
- Remove redundant test for stdout handling

This approach eliminates the risk of log duplication while maintaining
predictable error extraction. All errors and warnings now consistently
appear in stderr, making error handling more reliable.

Reference: https://pgbackrest.org/configuration.html#section-log
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* Revert "Standardize pgBackRest logging to use stderr consistently"

This reverts commit 13a9ac4.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* Reapply "Standardize pgBackRest logging to use stderr consistently"

This reverts commit 1c3eb04.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

* Fix test_initialise_stanza and cleanup unused variables

- Add --log-level-stderr=warn to test_initialise_stanza expected command to
  match implementation from commit d3a46b6
- Replace unused stdout variables with '_' in check_stanza,
  _is_primary_pgbackrest_service_running, and _on_restore_action
- Add --log-level-console=debug to backup command for detailed output
Fixes assertion error in tests/unit/test_backups.py::test_initialise_stanza

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

---------

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: andreia <andreia.velasco@canonical.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working as expected Libraries: Out of sync The charm libs used are out-of-sync

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Failures with create-backup doesn't surface the error Units stuck blocked with failed to initialize stanza, check your S3 settings

4 participants