
fix(api): rewrite rls_transaction to retry mid-query replica failures with primary fallback#10374

Closed
josema-xyz wants to merge 3 commits into master from
PROWLER-1225-fix-read-queries-on-the-read-replica-that-doesnt-use-the-write-replica-on-retries

Conversation

@josema-xyz
Contributor

@josema-xyz josema-xyz commented Mar 18, 2026

EDIT: Superseded by #10379. Closing this in favor of a simpler approach that fixes the same bug without changing any call sites. The new PR uses Django's execute_wrapper API inside rls_transaction to intercept mid-query OperationalError on the replica, retry with backoff, and fall back to primary. The with rls_transaction(...) interface is unchanged: ~80 lines changed in one file instead of 2,400 across 35.
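The retry-then-fall-back behavior of that wrapper can be sketched in plain Python. This is a toy model, not the replacement PR's code: TransientReplicaError stands in for django.db.OperationalError, and the constants and function names are illustrative; only the wrapper(execute, sql, ...) call shape mirrors Django's execute_wrapper API.

```python
import time

REPLICA_MAX_ATTEMPTS = 3          # per the semantics described above
REPLICA_RETRY_BASE_DELAY = 0.01   # illustrative value, in seconds

class TransientReplicaError(Exception):
    """Stand-in for django.db.OperationalError."""

def retrying_execute(execute, sql, params):
    """Retry transient replica failures with exponential backoff.

    Shaped like a Django execute_wrapper, which receives the inner
    execute callable plus the query being run.
    """
    for attempt in range(1, REPLICA_MAX_ATTEMPTS + 1):
        try:
            return execute(sql, params)
        except TransientReplicaError:
            if attempt == REPLICA_MAX_ATTEMPTS:
                raise  # replica tries exhausted; caller falls back to primary
            time.sleep(REPLICA_RETRY_BASE_DELAY * (2 ** (attempt - 1)))

# Demo: the query fails twice on the "replica", then succeeds on the third try.
calls = {"n": 0}
def flaky_execute(sql, params):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientReplicaError("replica connection lost")
    return [("row",)]

result = retrying_execute(flaky_execute, "SELECT 1", None)
```

Because the retry happens inside the wrapper, call sites keep their existing with rls_transaction(...) blocks untouched.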

Context

When the read replica dies mid-query, the retry and primary fallback logic in rls_transaction never executes. The function is a @contextmanager generator that can only yield once. After yielding the cursor to the caller, any error hits a guard that re-raises immediately — the retry loop is unreachable.
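A minimal demonstration (toy code, not the PR's) of why a generator-based @contextmanager cannot wrap its yield in a retry loop: even without the immediate re-raise guard mentioned above, an error raised in the caller's body is thrown into the generator at the yield point, and contextlib rejects any attempt to yield a second time.

```python
from contextlib import contextmanager

@contextmanager
def retrying_cm():
    for attempt in range(3):      # looks like a retry loop...
        try:
            yield attempt
            return
        except ValueError:
            continue              # ...but yielding a second time is illegal

# Catching the thrown-in error and looping back to another yield makes
# contextlib's __exit__ raise RuntimeError("generator didn't stop after throw()").
try:
    with retrying_cm():
        raise ValueError("mid-body replica failure")
    retried = True
except RuntimeError:
    retried = False
```

So a single-yield context manager can only ever run its body once; retrying the body requires a construct that produces a fresh context per attempt.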

Also, REPLICA_MAX_ATTEMPTS=3 only gives 2 replica tries because the primary fallback consumes one.

Description

Rewrites rls_transaction from a @contextmanager into a class implementing the iterator protocol. All call sites change from with rls_transaction(...) to for attempt in rls_transaction(...): with attempt:. The for loop re-executes the entire body on failure, so both connection-setup errors and mid-query errors are retried.
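The shape of that rewrite can be sketched as follows. This is a stripped-down illustration with hypothetical names (rls_transaction_sketch, _AttemptSketch), not the PR's implementation, and it omits RLS setup, backoff, and the primary fallback.

```python
class rls_transaction_sketch:
    """Iterable that yields one context manager per attempt."""
    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts

    def __iter__(self):
        for attempt in range(1, self.max_attempts + 1):
            cm = _AttemptSketch(self, attempt)
            yield cm
            if cm.succeeded:
                return  # body completed; stop producing attempts

class _AttemptSketch:
    def __init__(self, parent, attempt):
        self.parent, self.attempt, self.succeeded = parent, attempt, False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.succeeded = True
            return False
        # Swallow the error unless attempts are exhausted; the for loop
        # then yields the next attempt and the caller's body re-runs.
        return self.attempt < self.parent.max_attempts

# Usage mirrors the migrated call sites: the body runs once per attempt.
runs = []
for attempt in rls_transaction_sketch():
    with attempt:
        runs.append(attempt.attempt)
        if attempt.attempt < 2:
            raise ValueError("mid-body replica failure")
```

The key property is that the with block sits inside the for loop, so a mid-body exception caught in __exit__ lets the loop hand the caller a fresh attempt and re-execute the whole body.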

Changes:

  • rls_transaction is now a class that yields _RLSAttempt context managers
  • _RLSAttempt.__enter__ retries connection-setup failures inline
  • _RLSAttempt.__exit__ catches mid-query failures and lets the loop continue
  • REPLICA_MAX_ATTEMPTS=3 now means 3 replica tries + 1 primary fallback
  • Deleted the hand-rolled retry loop in integrations.py
  • Migrated all call sites and test mocks to the new for/with pattern, over 100 of each

Steps to review

Deep look at:

  • Core rewrite: api/src/backend/api/db_utils.py
  • Special cases: renderers.py (conditional), integrations.py (deleted manual retry), scan.py (deadlock loop)
  • New tests in test_db_utils.py for mid-body retry and max attempts semantics

Run tests and use the application locally.

Checklist

API

  • All issue/task requirements work as expected on the API
  • Endpoint response output (if applicable)
  • EXPLAIN ANALYZE output for new/modified queries or indexes (if applicable)
  • Performance test results (if applicable)
  • Any other relevant evidence of the implementation (if applicable)
  • Verify if API specs need to be regenerated.
  • Check if version updates are required (e.g., specs, Poetry, etc.).
  • Ensure new entries are added to CHANGELOG.md, if applicable.

License

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@josema-xyz josema-xyz requested a review from a team as a code owner March 18, 2026 13:01
Copilot AI review requested due to automatic review settings March 18, 2026 13:01
@github-actions github-actions bot added component/api review-django-migrations This PR contains changes in Django migrations labels Mar 18, 2026
@github-actions
Contributor

github-actions bot commented Mar 18, 2026

✅ All necessary CHANGELOG.md files have been updated.

@github-actions
Contributor

github-actions bot commented Mar 18, 2026

Conflict Markers Resolved

All conflict markers have been successfully resolved in this pull request.

def _handle_retry(self, error):
    try:
        connections[self._alias].close()
    except Exception:
        pass
Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.

Copilot Autofix

AI 5 days ago

In general, to fix an "empty except" issue, either narrow the exception type and handle it explicitly, or at least document and log why the exception is being ignored, so that failures are observable and justified. Avoid bare except Exception: with a pass, especially in infrastructure code like database utilities.

Here, the best minimal fix that doesn’t change existing external behavior is to keep swallowing the exception (so retries continue unimpeded) but add a log message in the except block explaining that closing the connection failed and that the error is being ignored. We already have a logger in this module, so no new imports are needed. Concretely, in api/src/backend/api/db_utils.py, inside the _RLSAttempt._handle_retry method, replace:

try:
    connections[self._alias].close()
except Exception:
    pass

with something like:

try:
    connections[self._alias].close()
except Exception as close_error:
    logger.warning(
        "Failed to close DB connection for alias %s during RLS retry; "
        "continuing with retry. Error: %r",
        self._alias,
        close_error,
    )

This preserves the semantics (no re-raise), adds a clear explanation for the ignored exception, and provides diagnostic information if connection closing starts failing.


Suggested changeset 1
api/src/backend/api/db_utils.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/api/src/backend/api/db_utils.py b/api/src/backend/api/db_utils.py
--- a/api/src/backend/api/db_utils.py
+++ b/api/src/backend/api/db_utils.py
@@ -234,8 +234,13 @@
     def _handle_retry(self, error):
         try:
             connections[self._alias].close()
-        except Exception:
-            pass
+        except Exception as close_error:
+            logger.warning(
+                "Failed to close DB connection for alias %s during RLS retry; "
+                "continuing with retry. Error: %r",
+                self._alias,
+                close_error,
+            )
         attempt = self._iterator._attempt
         max_att = self._iterator._max_attempts
         delay = REPLICA_RETRY_BASE_DELAY * (2 ** (attempt - 1))
EOF
@github-actions
Contributor

github-actions bot commented Mar 18, 2026

🔒 Container Security Scan

Image: prowler-api:3ba071a
Last scan: 2026-03-18 13:11:09 UTC

📊 Vulnerability Summary

Severity Count
🔴 Critical 4
Total 4

3 package(s) affected

⚠️ Action Required

Critical severity vulnerabilities detected. These should be addressed before merging:

  • Review the detailed scan results
  • Update affected packages to patched versions
  • Consider using a different base image if updates are unavailable


Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the API’s Postgres RLS transaction helper to correctly retry when a read-replica fails mid-query, including a primary DB fallback, and migrates the codebase to the new retryable for/with usage pattern.

Changes:

  • Replaces rls_transaction from a single-yield @contextmanager into an iterable that yields per-attempt context managers (replica retries + primary fallback).
  • Migrates production call sites to for attempt in rls_transaction(...): with attempt: so mid-body OperationalError can trigger a full re-execution.
  • Updates/adjusts unit + integration tests and removes the now-redundant hand-rolled retry logic in integrations.

Reviewed changes

Copilot reviewed 36 out of 36 changed files in this pull request and generated 4 comments.

File Description
api/src/backend/api/db_utils.py Core rls_transaction rewrite (iterator + retry/fallback semantics).
api/src/backend/tasks/jobs/integrations.py Removes manual replica retry loop; relies on rls_transaction retries.
api/src/backend/tasks/jobs/scan.py Migrates scan DB operations to new for/with retry pattern.
api/src/backend/tasks/tasks.py Migrates task DB access (scheduled scans, outputs, integrations) to new pattern.
api/src/backend/tasks/jobs/backfill.py Migrates backfill jobs to new pattern.
api/src/backend/tasks/jobs/deletion.py Migrates deletion workflows to new pattern.
api/src/backend/tasks/jobs/export.py Migrates export path DB access to new pattern.
api/src/backend/tasks/jobs/muting.py Migrates muting job DB access to new pattern.
api/src/backend/tasks/jobs/report.py Migrates report generation DB access to new pattern.
api/src/backend/tasks/jobs/reports/base.py Migrates report data loading to new pattern.
api/src/backend/tasks/jobs/threatscore.py Migrates threatscore DB access to new pattern.
api/src/backend/tasks/jobs/threatscore_utils.py Migrates threatscore data-loading/aggregation to new pattern.
api/src/backend/tasks/jobs/attack_paths/scan.py Migrates attack paths scan DB access to new pattern.
api/src/backend/tasks/jobs/attack_paths/findings.py Migrates attack paths findings fetch/enrichment to new pattern.
api/src/backend/tasks/jobs/attack_paths/db_utils.py Migrates attack paths DB helpers to new pattern.
api/src/backend/tasks/beat.py Migrates beat-scheduled scan creation to new pattern.
api/src/backend/config/celery.py Migrates task-result persistence to new pattern.
api/src/backend/api/base_views.py Migrates request initialization tenant/user RLS setup to new pattern.
api/src/backend/api/renderers.py Migrates conditional include-render RLS wrapping to new pattern.
api/src/backend/api/v1/views.py Migrates SAML domain/config lookups to new pattern.
api/src/backend/api/utils.py Migrates integration config updates to new pattern.
api/src/backend/api/decorators.py Migrates provider-deletion guard queries to new pattern.
api/src/backend/api/adapters.py Migrates social signup post-create writes to new pattern.
api/src/backend/api/migrations/0008_daily_scheduled_tasks_update.py Updates migration DB writes to new pattern.
api/src/backend/api/management/commands/findings.py Updates management command DB writes to new pattern.
api/src/backend/conftest.py Updates fixtures to new pattern.
api/src/backend/api/tests/test_db_utils.py Adds/updates unit tests for mid-body retry + attempt semantics.
api/src/backend/api/tests/integration/test_rls_transaction.py Updates integration tests to new for/with usage.
api/src/backend/api/tests/test_utils.py Updates mocks for iterable rls_transaction.
api/src/backend/api/tests/test_decorators.py Updates mocks for iterable rls_transaction.
api/src/backend/tasks/tests/test_export.py Updates mocks for iterable rls_transaction.
api/src/backend/tasks/tests/test_integrations.py Updates mocks; removes obsolete manual-retry test.
api/src/backend/tasks/tests/test_scan.py Updates mocks for iterable rls_transaction.
api/src/backend/tasks/tests/test_tasks.py Updates mocks for iterable rls_transaction.
api/src/backend/tasks/tests/test_attack_paths_scan.py Updates mocks for iterable rls_transaction.
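For the test-mock migrations listed above, a stand-in for rls_transaction must now be iterable rather than a plain context manager. A rough sketch of the minimal fake (illustrative, not the PR's fixtures; the real tests patch the actual module paths):

```python
from contextlib import nullcontext

def fake_rls_transaction(*args, **kwargs):
    """Iterable stand-in yielding one always-successful attempt.

    Suitable for use with unittest.mock.patch in place of the real
    rls_transaction when a test does not exercise retry behavior.
    """
    yield nullcontext()

# Call sites written in the new for/with style work against the fake unchanged:
bodies = 0
for attempt in fake_rls_transaction("tenant-id"):
    with attempt:
        bodies += 1
```

A MagicMock configured only with __enter__/__exit__ would fail here with "TypeError: object is not iterable", which is why every mock in the table above needed updating.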


else:
    tag_instance = tag_cache[tag_key]
tags.append(tag_instance)
resource_instance.upsert_or_delete_tags(tags=tags)
Comment on lines +128 to +138
# Fetch all compliance requirement overview rows for this scan
requirement_rows = ComplianceRequirementOverview.objects.filter(
    tenant_id=tenant_id, scan_id=scan_id
).values(
    "compliance_id",
    "requirement_id",
    "requirement_status",
)

# Group by (compliance_id, requirement_id) across regions
requirement_statuses = defaultdict(
    lambda: {"fail_count": 0, "pass_count": 0, "total_count": 0}
)
if not requirement_rows:
    return {"status": "no compliance data to backfill"}
Comment on lines +231 to +245
completed_scans = (
    Scan.objects.filter(**scan_filter)
    .order_by("provider_id", "-completed_at")
    .values("id", "provider_id", "completed_at")
)

if not completed_scans:
    return {"status": "no scans to backfill"}

# Keep only latest scan per provider/day
latest_scans_by_day = {}
for scan in completed_scans:
    key = (scan["provider_id"], scan["completed_at"].date())
    if key not in latest_scans_by_day:
        latest_scans_by_day[key] = scan
Comment on lines +604 to +618
completed_scans = (
    Scan.objects.filter(**scan_filter)
    .order_by("-completed_at")
    .values("id", "completed_at")
)

if not completed_scans:
    return {"status": "no scans to backfill"}

# Keep only latest scan per day
latest_scans_by_day = {}
for scan in completed_scans:
    key = scan["completed_at"].date()
    if key not in latest_scans_by_day:
        latest_scans_by_day[key] = scan
@codecov

codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 84.64223% with 176 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.32%. Comparing base (5a3475b) to head (f12c72f).

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #10374       +/-   ##
===========================================
+ Coverage   56.85%   93.32%   +36.47%     
===========================================
  Files          87      218      +131     
  Lines        2846    30567    +27721     
===========================================
+ Hits         1618    28528    +26910     
- Misses       1228     2039      +811     
Flag Coverage Δ
api 93.32% <84.64%> (?)
prowler-py3.10-oraclecloud ?
prowler-py3.11-oraclecloud ?
prowler-py3.12-oraclecloud ?
prowler-py3.9-oraclecloud ?

Flags with carried forward coverage won't be shown.

Components Coverage Δ
prowler ∅ <ø> (∅)
api 93.32% <84.64%> (∅)

@josema-xyz
Contributor Author

Closing this in favor of #10379. After reviewing the approach I realized the for/with rewrite at every call site is overkill for what's fundamentally a query-level retry problem.

The replacement PR keeps rls_transaction as a context manager and uses connection.execute_wrappers to catch OperationalError during cursor.execute() on the replica. It retries on the replica with backoff, then falls back to the primary, transparently and with no call-site changes. It also fixes the off-by-one in REPLICA_MAX_ATTEMPTS and closes stale connections between retries.

The .iterator() limitation (server-side cursor fetches via fetchmany()) exists in both approaches: neither can retry mid-iteration without risking duplicate rows.
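The duplicate-row risk with streamed results can be illustrated with a toy generator (not Prowler code): once some chunks of a server-side cursor have been consumed, re-running the query replays rows the caller already processed.

```python
def stream_rows(fail_after=None):
    """Toy stand-in for a server-side cursor streaming rows in chunks."""
    for i, row in enumerate([1, 2, 3, 4]):
        if fail_after is not None and i == fail_after:
            raise ConnectionError("replica lost mid-fetch")
        yield row

seen = []
try:
    for row in stream_rows(fail_after=2):
        seen.append(row)
except ConnectionError:
    # A naive retry restarts the query from the beginning,
    # duplicating rows 1 and 2 that were already delivered.
    for row in stream_rows():
        seen.append(row)
```

Avoiding the duplicates would require tracking a resume point (e.g. a keyset cursor on an ordered column), which neither approach attempts.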

@josema-xyz josema-xyz closed this Mar 18, 2026
