
fix(api): rewrite rls_transaction to retry mid-query replica failures with primary fallback#10374

Closed
josema-xyz wants to merge 3 commits into master from
PROWLER-1225-fix-read-queries-on-the-read-replica-that-doesnt-use-the-write-replica-on-retries

Conversation

@josema-xyz
Contributor

@josema-xyz josema-xyz commented Mar 18, 2026

EDIT: Superseded by #10379. Closing this in favor of a simpler approach that fixes the same bug without changing any call sites. The new PR uses Django's execute_wrapper API inside rls_transaction to intercept mid-query OperationalError on the replica, retry with backoff, and fall back to primary. The with rls_transaction(...) interface is unchanged: ~80 lines changed in one file instead of 2,400 across 35.
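The retry-then-fall-back behavior of that wrapper can be sketched in plain Python. This is a toy model, not the replacement PR's code: TransientReplicaError stands in for django.db.OperationalError, and the constants and function names are illustrative; only the wrapper(execute, sql, ...) call shape mirrors Django's execute_wrapper API.

```python
import time

REPLICA_MAX_ATTEMPTS = 3          # per the semantics described above
REPLICA_RETRY_BASE_DELAY = 0.01   # illustrative value, in seconds

class TransientReplicaError(Exception):
    """Stand-in for django.db.OperationalError."""

def retrying_execute(execute, sql, params):
    """Retry transient replica failures with exponential backoff.

    Shaped like a Django execute_wrapper, which receives the inner
    execute callable plus the query being run.
    """
    for attempt in range(1, REPLICA_MAX_ATTEMPTS + 1):
        try:
            return execute(sql, params)
        except TransientReplicaError:
            if attempt == REPLICA_MAX_ATTEMPTS:
                raise  # replica tries exhausted; caller falls back to primary
            time.sleep(REPLICA_RETRY_BASE_DELAY * (2 ** (attempt - 1)))

# Demo: the query fails twice on the "replica", then succeeds on the third try.
calls = {"n": 0}
def flaky_execute(sql, params):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientReplicaError("replica connection lost")
    return [("row",)]

result = retrying_execute(flaky_execute, "SELECT 1", None)
```

Because the retry happens inside the wrapper, call sites keep their existing with rls_transaction(...) blocks untouched.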

Context

When the read replica dies mid-query, the retry and primary fallback logic in rls_transaction never executes. The function is a @contextmanager generator that can only yield once. After yielding the cursor to the caller, any error hits a guard that re-raises immediately — the retry loop is unreachable.
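A minimal demonstration (toy code, not the PR's) of why a generator-based @contextmanager cannot wrap its yield in a retry loop: even without the immediate re-raise guard mentioned above, an error raised in the caller's body is thrown into the generator at the yield point, and contextlib rejects any attempt to yield a second time.

```python
from contextlib import contextmanager

@contextmanager
def retrying_cm():
    for attempt in range(3):      # looks like a retry loop...
        try:
            yield attempt
            return
        except ValueError:
            continue              # ...but yielding a second time is illegal

# Catching the thrown-in error and looping back to another yield makes
# contextlib's __exit__ raise RuntimeError("generator didn't stop after throw()").
try:
    with retrying_cm():
        raise ValueError("mid-body replica failure")
    retried = True
except RuntimeError:
    retried = False
```

So a single-yield context manager can only ever run its body once; retrying the body requires a construct that produces a fresh context per attempt.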

Also, REPLICA_MAX_ATTEMPTS=3 only gives 2 replica tries because the primary fallback consumes one.

Description

Rewrites rls_transaction from a @contextmanager into a class implementing the iterator protocol. All call sites change from with rls_transaction(...) to for attempt in rls_transaction(...): with attempt:. The for loop re-executes the entire body on failure, so both connection-setup errors and mid-query errors are retried.
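The shape of that rewrite can be sketched as follows. This is a stripped-down illustration with hypothetical names (rls_transaction_sketch, _AttemptSketch), not the PR's implementation, and it omits RLS setup, backoff, and the primary fallback.

```python
class rls_transaction_sketch:
    """Iterable that yields one context manager per attempt."""
    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts

    def __iter__(self):
        for attempt in range(1, self.max_attempts + 1):
            cm = _AttemptSketch(self, attempt)
            yield cm
            if cm.succeeded:
                return  # body completed; stop producing attempts

class _AttemptSketch:
    def __init__(self, parent, attempt):
        self.parent, self.attempt, self.succeeded = parent, attempt, False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.succeeded = True
            return False
        # Swallow the error unless attempts are exhausted; the for loop
        # then yields the next attempt and the caller's body re-runs.
        return self.attempt < self.parent.max_attempts

# Usage mirrors the migrated call sites: the body runs once per attempt.
runs = []
for attempt in rls_transaction_sketch():
    with attempt:
        runs.append(attempt.attempt)
        if attempt.attempt < 2:
            raise ValueError("mid-body replica failure")
```

The key property is that the with block sits inside the for loop, so a mid-body exception caught in __exit__ lets the loop hand the caller a fresh attempt and re-execute the whole body.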

Changes:

  • rls_transaction is now a class that yields _RLSAttempt context managers
  • _RLSAttempt.__enter__ retries connection-setup failures inline
  • _RLSAttempt.__exit__ catches mid-query failures and lets the loop continue
  • REPLICA_MAX_ATTEMPTS=3 now means 3 replica tries + 1 primary fallback
  • Deleted the hand-rolled retry loop in integrations.py
  • Migrated all call sites and test mocks to the new for/with pattern, over 100 of each

Steps to review

Deep look at:

  • Core rewrite: api/src/backend/api/db_utils.py
  • Special cases: renderers.py (conditional), integrations.py (deleted manual retry), scan.py (deadlock loop)
  • New tests in test_db_utils.py for mid-body retry and max attempts semantics

Run tests and use the application locally.

Checklist

API

  • All issue/task requirements work as expected on the API
  • Endpoint response output (if applicable)
  • EXPLAIN ANALYZE output for new/modified queries or indexes (if applicable)
  • Performance test results (if applicable)
  • Any other relevant evidence of the implementation (if applicable)
  • Verify if API specs need to be regenerated.
  • Check if version updates are required (e.g., specs, Poetry, etc.).
  • Ensure new entries are added to CHANGELOG.md, if applicable.

License

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@josema-xyz josema-xyz requested a review from a team as a code owner March 18, 2026 13:01
Copilot AI review requested due to automatic review settings March 18, 2026 13:01
@github-actions github-actions bot added component/api review-django-migrations This PR contains changes in Django migrations labels Mar 18, 2026
@github-actions
Contributor

github-actions bot commented Mar 18, 2026

✅ All necessary CHANGELOG.md files have been updated.

@github-actions
Contributor

github-actions bot commented Mar 18, 2026

Conflict Markers Resolved

All conflict markers have been successfully resolved in this pull request.

def _handle_retry(self, error):
    try:
        connections[self._alias].close()
    except Exception:
        pass
Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.

Copilot Autofix

AI 5 days ago

In general, to fix an "empty except" issue, either narrow the exception type and handle it explicitly, or at least document and log why the exception is being ignored, so that failures are observable and justified. Avoid bare except Exception: with a pass, especially in infrastructure code like database utilities.

Here, the best minimal fix that doesn’t change existing external behavior is to keep swallowing the exception (so retries continue unimpeded) but add a log message in the except block explaining that closing the connection failed and that the error is being ignored. We already have a logger in this module, so no new imports are needed. Concretely, in api/src/backend/api/db_utils.py, inside the _RLSAttempt._handle_retry method, replace:

try:
    connections[self._alias].close()
except Exception:
    pass

with something like:

try:
    connections[self._alias].close()
except Exception as close_error:
    logger.warning(
        "Failed to close DB connection for alias %s during RLS retry; "
        "continuing with retry. Error: %r",
        self._alias,
        close_error,
    )

This preserves the semantics (no re-raise), adds a clear explanation for the ignored exception, and provides diagnostic information if connection closing starts failing.


Suggested changeset 1
api/src/backend/api/db_utils.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/api/src/backend/api/db_utils.py b/api/src/backend/api/db_utils.py
--- a/api/src/backend/api/db_utils.py
+++ b/api/src/backend/api/db_utils.py
@@ -234,8 +234,13 @@
     def _handle_retry(self, error):
         try:
             connections[self._alias].close()
-        except Exception:
-            pass
+        except Exception as close_error:
+            logger.warning(
+                "Failed to close DB connection for alias %s during RLS retry; "
+                "continuing with retry. Error: %r",
+                self._alias,
+                close_error,
+            )
         attempt = self._iterator._attempt
         max_att = self._iterator._max_attempts
         delay = REPLICA_RETRY_BASE_DELAY * (2 ** (attempt - 1))
EOF
@github-actions
Contributor

github-actions bot commented Mar 18, 2026

🔒 Container Security Scan

Image: prowler-api:3ba071a
Last scan: 2026-03-18 13:11:09 UTC

📊 Vulnerability Summary

Severity Count
🔴 Critical 4
Total 4

3 package(s) affected

⚠️ Action Required

Critical severity vulnerabilities detected. These should be addressed before merging:

  • Review the detailed scan results
  • Update affected packages to patched versions
  • Consider using a different base image if updates are unavailable


Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the API’s Postgres RLS transaction helper to correctly retry when a read-replica fails mid-query, including a primary DB fallback, and migrates the codebase to the new retryable for/with usage pattern.

Changes:

  • Replaces rls_transaction from a single-yield @contextmanager into an iterable that yields per-attempt context managers (replica retries + primary fallback).
  • Migrates production call sites to for attempt in rls_transaction(...): with attempt: so mid-body OperationalError can trigger a full re-execution.
  • Updates/adjusts unit + integration tests and removes the now-redundant hand-rolled retry logic in integrations.

Reviewed changes

Copilot reviewed 36 out of 36 changed files in this pull request and generated 4 comments.

File Description
api/src/backend/api/db_utils.py Core rls_transaction rewrite (iterator + retry/fallback semantics).
api/src/backend/tasks/jobs/integrations.py Removes manual replica retry loop; relies on rls_transaction retries.
api/src/backend/tasks/jobs/scan.py Migrates scan DB operations to new for/with retry pattern.
api/src/backend/tasks/tasks.py Migrates task DB access (scheduled scans, outputs, integrations) to new pattern.
api/src/backend/tasks/jobs/backfill.py Migrates backfill jobs to new pattern.
api/src/backend/tasks/jobs/deletion.py Migrates deletion workflows to new pattern.
api/src/backend/tasks/jobs/export.py Migrates export path DB access to new pattern.
api/src/backend/tasks/jobs/muting.py Migrates muting job DB access to new pattern.
api/src/backend/tasks/jobs/report.py Migrates report generation DB access to new pattern.
api/src/backend/tasks/jobs/reports/base.py Migrates report data loading to new pattern.
api/src/backend/tasks/jobs/threatscore.py Migrates threatscore DB access to new pattern.
api/src/backend/tasks/jobs/threatscore_utils.py Migrates threatscore data-loading/aggregation to new pattern.
api/src/backend/tasks/jobs/attack_paths/scan.py Migrates attack paths scan DB access to new pattern.
api/src/backend/tasks/jobs/attack_paths/findings.py Migrates attack paths findings fetch/enrichment to new pattern.
api/src/backend/tasks/jobs/attack_paths/db_utils.py Migrates attack paths DB helpers to new pattern.
api/src/backend/tasks/beat.py Migrates beat-scheduled scan creation to new pattern.
api/src/backend/config/celery.py Migrates task-result persistence to new pattern.
api/src/backend/api/base_views.py Migrates request initialization tenant/user RLS setup to new pattern.
api/src/backend/api/renderers.py Migrates conditional include-render RLS wrapping to new pattern.
api/src/backend/api/v1/views.py Migrates SAML domain/config lookups to new pattern.
api/src/backend/api/utils.py Migrates integration config updates to new pattern.
api/src/backend/api/decorators.py Migrates provider-deletion guard queries to new pattern.
api/src/backend/api/adapters.py Migrates social signup post-create writes to new pattern.
api/src/backend/api/migrations/0008_daily_scheduled_tasks_update.py Updates migration DB writes to new pattern.
api/src/backend/api/management/commands/findings.py Updates management command DB writes to new pattern.
api/src/backend/conftest.py Updates fixtures to new pattern.
api/src/backend/api/tests/test_db_utils.py Adds/updates unit tests for mid-body retry + attempt semantics.
api/src/backend/api/tests/integration/test_rls_transaction.py Updates integration tests to new for/with usage.
api/src/backend/api/tests/test_utils.py Updates mocks for iterable rls_transaction.
api/src/backend/api/tests/test_decorators.py Updates mocks for iterable rls_transaction.
api/src/backend/tasks/tests/test_export.py Updates mocks for iterable rls_transaction.
api/src/backend/tasks/tests/test_integrations.py Updates mocks; removes obsolete manual-retry test.
api/src/backend/tasks/tests/test_scan.py Updates mocks for iterable rls_transaction.
api/src/backend/tasks/tests/test_tasks.py Updates mocks for iterable rls_transaction.
api/src/backend/tasks/tests/test_attack_paths_scan.py Updates mocks for iterable rls_transaction.
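For the test-mock migrations listed above, a stand-in for rls_transaction must now be iterable rather than a plain context manager. A rough sketch of the minimal fake (illustrative, not the PR's fixtures; the real tests patch the actual module paths):

```python
from contextlib import nullcontext

def fake_rls_transaction(*args, **kwargs):
    """Iterable stand-in yielding one always-successful attempt.

    Suitable for use with unittest.mock.patch in place of the real
    rls_transaction when a test does not exercise retry behavior.
    """
    yield nullcontext()

# Call sites written in the new for/with style work against the fake unchanged:
bodies = 0
for attempt in fake_rls_transaction("tenant-id"):
    with attempt:
        bodies += 1
```

A MagicMock configured only with __enter__/__exit__ would fail here with "TypeError: object is not iterable", which is why every mock in the table above needed updating.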


else:
    tag_instance = tag_cache[tag_key]
tags.append(tag_instance)
resource_instance.upsert_or_delete_tags(tags=tags)
Comment on lines +128 to +138
# Fetch all compliance requirement overview rows for this scan
requirement_rows = ComplianceRequirementOverview.objects.filter(
    tenant_id=tenant_id, scan_id=scan_id
).values(
    "compliance_id",
    "requirement_id",
    "requirement_status",
)

# Group by (compliance_id, requirement_id) across regions
requirement_statuses = defaultdict(
    lambda: {"fail_count": 0, "pass_count": 0, "total_count": 0}
)
if not requirement_rows:
    return {"status": "no compliance data to backfill"}
Comment on lines +231 to +245
completed_scans = (
    Scan.objects.filter(**scan_filter)
    .order_by("provider_id", "-completed_at")
    .values("id", "provider_id", "completed_at")
)

if not completed_scans:
    return {"status": "no scans to backfill"}

# Keep only latest scan per provider/day
latest_scans_by_day = {}
for scan in completed_scans:
    key = (scan["provider_id"], scan["completed_at"].date())
    if key not in latest_scans_by_day:
        latest_scans_by_day[key] = scan
Comment on lines +604 to +618
completed_scans = (
    Scan.objects.filter(**scan_filter)
    .order_by("-completed_at")
    .values("id", "completed_at")
)

if not completed_scans:
    return {"status": "no scans to backfill"}

# Keep only latest scan per day
latest_scans_by_day = {}
for scan in completed_scans:
    key = scan["completed_at"].date()
    if key not in latest_scans_by_day:
        latest_scans_by_day[key] = scan
@codecov

codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 84.64223% with 176 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.32%. Comparing base (5a3475b) to head (f12c72f).

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #10374       +/-   ##
===========================================
+ Coverage   56.85%   93.32%   +36.47%     
===========================================
  Files          87      218      +131     
  Lines        2846    30567    +27721     
===========================================
+ Hits         1618    28528    +26910     
- Misses       1228     2039      +811     
Flag Coverage Δ
api 93.32% <84.64%> (?)
prowler-py3.10-oraclecloud ?
prowler-py3.11-oraclecloud ?
prowler-py3.12-oraclecloud ?
prowler-py3.9-oraclecloud ?

Flags with carried forward coverage won't be shown.

Components Coverage Δ
prowler ∅ <ø> (∅)
api 93.32% <84.64%> (∅)

@josema-xyz
Contributor Author

Closing this in favor of #10379. After reviewing the approach I realized the for/with rewrite at every call site is overkill for what's fundamentally a query-level retry problem.

The replacement PR keeps rls_transaction as a context manager and uses connection.execute_wrappers to catch OperationalError during cursor.execute() on the replica. It retries on the replica with backoff, then falls back to the primary, transparently and with no call-site changes. It also fixes the off-by-one in REPLICA_MAX_ATTEMPTS and closes stale connections between retries.

The .iterator() limitation (server-side cursor fetches via fetchmany()) exists in both approaches: neither can retry mid-iteration without risking duplicate rows.
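The duplicate-row risk with streamed results can be illustrated with a toy generator (not Prowler code): once some chunks of a server-side cursor have been consumed, re-running the query replays rows the caller already processed.

```python
def stream_rows(fail_after=None):
    """Toy stand-in for a server-side cursor streaming rows in chunks."""
    for i, row in enumerate([1, 2, 3, 4]):
        if fail_after is not None and i == fail_after:
            raise ConnectionError("replica lost mid-fetch")
        yield row

seen = []
try:
    for row in stream_rows(fail_after=2):
        seen.append(row)
except ConnectionError:
    # A naive retry restarts the query from the beginning,
    # duplicating rows 1 and 2 that were already delivered.
    for row in stream_rows():
        seen.append(row)
```

Avoiding the duplicates would require tracking a resume point (e.g. a keyset cursor on an ordered column), which neither approach attempts.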

@josema-xyz josema-xyz closed this Mar 18, 2026
