✨(storage) implement tiered storage #486
base: main
Conversation
This allows using S3-compatible object storage to offload blobs, making Postgres much lighter. We design for storing ~1B emails on a single instance.
📝 Walkthrough
Implements tiered blob storage with optional encryption and offload to object storage. Adds Blob storage metadata, a TieredStorageService, Celery offload tasks, a verification/rotation management command, settings and env defaults, tests, and Docker Compose bucket initialization.
Sequence Diagram
sequenceDiagram
actor App as Application
participant Blob
participant Tiered as TieredStorageService
participant Object as Object Storage
participant DB as PostgreSQL
rect rgba(76, 175, 80, 0.5)
Note over App,DB: Blob creation (encrypt & store)
App->>Blob: create_blob(compressed_content)
Blob->>Tiered: encrypt(data)
Tiered-->>Blob: encrypted_bytes, key_id
Blob->>DB: save(encrypted_bytes, key_id, storage=POSTGRES)
DB-->>Blob: blob_id
end
rect rgba(33, 150, 243, 0.5)
Note over App,Object: Offload to object storage
App->>Tiered: upload_blob(blob)
Tiered->>Object: PUT /blobs/... (encrypted)
Object-->>Tiered: success
Tiered->>DB: update(blob.storage_location=OBJECT_STORAGE, raw_content=NULL)
DB-->>Tiered: ✓
end
rect rgba(244, 67, 54, 0.5)
Note over App,DB: Retrieval (transparent)
App->>Blob: get_content()
alt storage == POSTGRES
Blob->>Tiered: decrypt(raw_content, key_id)
Tiered-->>Blob: decrypted_bytes
else storage == OBJECT_STORAGE
Blob->>Object: GET /blobs/...
Object-->>Blob: encrypted_bytes
Blob->>Tiered: decrypt(encrypted_bytes, key_id)
Tiered-->>Blob: decrypted_bytes
end
Blob->>Blob: decompress(...)
Blob-->>App: original content
end
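For orientation, the retrieval branch in the diagram can be sketched as below. This is a hedged illustration, not the PR's actual Blob.get_content(): the service and enum names come from elsewhere in this review, while the ZSTD member name and the download_blob signature are assumptions.

```python
import pyzstd

from core.enums import BlobStorageLocationChoices, CompressionTypeChoices
from core.services.tiered_storage import TieredStorageService


def get_content(blob):
    """Return the original (decrypted, decompressed) content of a blob."""
    service = TieredStorageService()
    if blob.storage_location == BlobStorageLocationChoices.POSTGRES:
        payload = bytes(blob.raw_content)      # encrypted bytes kept in Postgres
    else:  # OBJECT_STORAGE
        payload = service.download_blob(blob)  # fetched from the msg-blobs bucket
    data = service.decrypt(payload, blob.encryption_key_id)
    if blob.compression == CompressionTypeChoices.ZSTD:  # member name assumed
        data = pyzstd.decompress(data)
    return data
```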
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: 5 passed ✅
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/backend/core/management/commands/verify_tiered_storage.py`:
- Around line 466-495: The current flow writes the newly encrypted object via
self.service.storage.save(storage_key, ...) before updating
blob.encryption_key_id inside transaction.atomic(), risking storage/DB
inconsistency if the DB update fails; instead, write the new encrypted bytes to
a temporary object (e.g. derive a temp key from storage_key and new_key_id)
using self.service.storage.save(temp_key, ContentFile(encrypted)), then perform
the DB update inside transaction.atomic() (update blob.encryption_key_id and
save), and only after the transaction succeeds atomically remove/rename the temp
object to the final storage_key (or copy temp→final and delete temp) so storage
and DB remain consistent; reference symbols: self.service.storage.save,
storage_key, temp_key (create), self.service.encrypt, blob.encryption_key_id,
transaction.atomic.
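A hedged sketch of the write-temp-then-commit ordering described in that prompt; the temp_key naming and the promote step are illustrative, not the PR's actual implementation, and assume the Django storage API used elsewhere in the command.

```python
from django.core.files.base import ContentFile
from django.db import transaction


def rotate_object_storage_blob(service, blob, storage_key, encrypted, new_key_id):
    """Re-encrypt an offloaded blob without risking storage/DB inconsistency."""
    temp_key = f"{storage_key}.rotating-{new_key_id}"        # hypothetical temp name
    service.storage.save(temp_key, ContentFile(encrypted))   # 1. write temp object first
    try:
        with transaction.atomic():                           # 2. update DB metadata
            blob.encryption_key_id = new_key_id
            blob.save(update_fields=["encryption_key_id"])
    except Exception:
        service.storage.delete(temp_key)                     # DB failed: discard temp
        raise
    # 3. Promote temp -> final only after the DB update succeeded.
    with service.storage.open(temp_key, "rb") as handle:
        service.storage.delete(storage_key)
        service.storage.save(storage_key, handle)
    service.storage.delete(temp_key)
```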
In `@src/backend/core/services/tiered_storage.py`:
- Around line 31-43: In __init__, the enabled gate currently checks for an
OPTIONS.endpoint_url which wrongly disables valid S3 setups; instead set
self.enabled based on presence of the "message-blobs" storage config itself
(e.g. check that settings.STORAGES contains a non-empty "message-blobs" entry).
Update the assignment to self.enabled to use
settings.STORAGES.get("message-blobs") (or "message-blobs" in settings.STORAGES
and truthy) rather than digging for OPTIONS.endpoint_url so AWS S3 configs
without endpoint_url remain enabled.
🧹 Nitpick comments (3)
src/backend/core/services/tiered_storage_tasks.py (1)
68-133: Consider adding retry for transient failures.
The task handles lock contention gracefully by returning a "locked" status, but transient failures (network issues, temporary S3 unavailability) at line 131 are logged and returned as errors without retry. The periodic offload_blobs_task will eventually re-queue these blobs, but adding explicit retry behavior for transient exceptions (e.g., ConnectionError, Timeout) could improve reliability.
💡 Optional: Add retry for transient failures
-@celery_app.task(bind=True)
+@celery_app.task(bind=True, autoretry_for=(ConnectionError, TimeoutError), retry_backoff=True, max_retries=3)
 def offload_single_blob_task(self, blob_id: str) -> Dict[str, Any]:
src/backend/core/models.py (1)
1536-1557: Enforce storage_location/raw_content invariants at the DB layer.
With raw_content now nullable, inconsistent states (e.g., OBJECT_STORAGE + non-null content) become possible and will surface as runtime errors in get_content. A check constraint makes the invariant explicit and avoids silent drift. This will require a migration.
♻️ Proposed constraint
 constraints = [
     models.CheckConstraint(
         check=(
             models.Q(mailbox__isnull=False)
             | models.Q(maildomain__isnull=False)
         ),
         name="blob_has_owner",
     ),
+    models.CheckConstraint(
+        check=(
+            models.Q(
+                storage_location=BlobStorageLocationChoices.POSTGRES,
+                raw_content__isnull=False,
+            )
+            | models.Q(
+                storage_location=BlobStorageLocationChoices.OBJECT_STORAGE,
+                raw_content__isnull=True,
+            )
+        ),
+        name="blob_storage_location_matches_content",
+    ),
 ]
As per coding guidelines, enforce data integrity with model constraints.
Also applies to: 1583-1589
src/backend/core/services/tiered_storage.py (1)
244-281: Guard against orphan-delete races and capture delete errors.
There's a TOCTOU window between the reference count (lines 259-263) and deletion (lines 274-275); a concurrent offload could add a reference after the count and still have its object deleted. Consider an advisory lock keyed by SHA256 or a transactional guard around the check+delete. Also, capture the storage deletion exception to Sentry so cleanup failures are observable.
♻️ Suggested Sentry capture
 from cryptography.fernet import Fernet
+from sentry_sdk import capture_exception
@@
-        except Exception as e:  # pylint: disable=broad-except
-            logger.warning("Failed to delete blob from storage %s: %s", key, e)
+        except Exception as exc:  # pylint: disable=broad-except
+            capture_exception(exc)
+            logger.warning("Failed to delete blob from storage %s: %s", key, exc)
             return False
As per coding guidelines, capture and report exceptions to Sentry.
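Complementing the Sentry note, one possible shape for the advisory-lock guard mentioned above is sketched here. It is an assumption-laden illustration: the pg_advisory_xact_lock keying, the compute_storage_key call, and the filter fields are inferred from this review rather than taken from the service.

```python
from django.db import connection, transaction

from core.enums import BlobStorageLocationChoices
from core.models import Blob


def delete_if_orphaned_guarded(service, sha256: bytes) -> bool:
    """Delete the stored object only if no blob still references this content."""
    # Postgres advisory locks take a signed 64-bit key; derive one from the hash.
    lock_key = int.from_bytes(sha256[:8], "big", signed=True)
    with transaction.atomic():
        with connection.cursor() as cursor:
            # Held until the transaction ends, serializing check + delete
            # against a concurrent offload of the same content.
            cursor.execute("SELECT pg_advisory_xact_lock(%s)", [lock_key])
        if Blob.objects.filter(
            sha256=sha256,
            storage_location=BlobStorageLocationChoices.OBJECT_STORAGE,
        ).exists():
            return False  # still referenced, keep the object
        service.storage.delete(service.compute_storage_key(sha256))
        return True
```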
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
- compose.yaml
- env.d/development/backend.defaults
- src/backend/core/api/viewsets/config.py
- src/backend/core/enums.py
- src/backend/core/management/commands/verify_tiered_storage.py
- src/backend/core/migrations/0014_blob_encryption_key_id_blob_storage_location_and_more.py
- src/backend/core/models.py
- src/backend/core/services/search/search.py
- src/backend/core/services/tiered_storage.py
- src/backend/core/services/tiered_storage_tasks.py
- src/backend/core/signals.py
- src/backend/core/tests/commands/__init__.py
- src/backend/core/tests/commands/test_verify_tiered_storage.py
- src/backend/core/tests/conftest.py
- src/backend/core/tests/services/__init__.py
- src/backend/core/tests/services/test_tiered_storage.py
- src/backend/core/tests/tasks/__init__.py
- src/backend/core/tests/tasks/test_task_send_message.py
- src/backend/core/tests/tasks/test_tiered_storage_tasks.py
- src/backend/core/utils.py
- src/backend/messages/celery_app.py
- src/backend/messages/settings.py
🧰 Additional context used
📓 Path-based instructions (6)
src/backend/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/django-python.mdc)
src/backend/**/*.py: Follow Django/PEP 8 style with a 100-character line limit
Use descriptive, snake_case names for variables and functions
Use Django ORM for database access; avoid raw SQL unless necessary for performance
Use Django’s built-in user model and authentication framework
Prefer try-except blocks to handle exceptions in business logic and views
Log expected and unexpected actions with appropriate log levels
Capture and report exceptions to Sentry; use capture_exception() for custom errors
Do not log sensitive information (tokens, passwords, financial/health data, PII)
Files:
- src/backend/messages/celery_app.py
- src/backend/core/tests/tasks/__init__.py
- src/backend/core/api/viewsets/config.py
- src/backend/core/tests/commands/test_verify_tiered_storage.py
- src/backend/core/services/search/search.py
- src/backend/core/management/commands/verify_tiered_storage.py
- src/backend/core/utils.py
- src/backend/core/signals.py
- src/backend/core/services/tiered_storage_tasks.py
- src/backend/core/tests/services/test_tiered_storage.py
- src/backend/core/tests/tasks/test_task_send_message.py
- src/backend/messages/settings.py
- src/backend/core/tests/conftest.py
- src/backend/core/models.py
- src/backend/core/tests/services/__init__.py
- src/backend/core/enums.py
- src/backend/core/services/tiered_storage.py
- src/backend/core/tests/commands/__init__.py
- src/backend/core/tests/tasks/test_tiered_storage_tasks.py
- src/backend/core/migrations/0014_blob_encryption_key_id_blob_storage_location_and_more.py
src/backend/**/{tests.py,tests/**/*.py}
📄 CodeRabbit inference engine (.cursor/rules/django-python.mdc)
src/backend/**/{tests.py,tests/**/*.py}: Use Django’s testing tools (pytest-django) to ensure code quality and reliability
Unit tests should focus on a single use case, keep assertions minimal, and cover all possible cases
Files:
- src/backend/core/tests/tasks/__init__.py
- src/backend/core/tests/commands/test_verify_tiered_storage.py
- src/backend/core/tests/services/test_tiered_storage.py
- src/backend/core/tests/tasks/test_task_send_message.py
- src/backend/core/tests/conftest.py
- src/backend/core/tests/services/__init__.py
- src/backend/core/tests/commands/__init__.py
- src/backend/core/tests/tasks/test_tiered_storage_tasks.py
src/backend/**/{settings.py,middleware.py}
📄 CodeRabbit inference engine (.cursor/rules/django-python.mdc)
Use middleware judiciously for cross-cutting concerns (authentication, logging, caching)
Files:
src/backend/messages/settings.py
src/backend/**/settings.py
📄 CodeRabbit inference engine (.cursor/rules/django-python.mdc)
src/backend/**/settings.py: Leverage Django’s caching framework (e.g., Redis/Memcached) where appropriate
Use Django’s cache framework with a backend like Redis or Memcached to reduce DB load
Optimize static file handling using Django’s staticfiles pipeline (e.g., WhiteNoise)
Files:
src/backend/messages/settings.py
src/backend/**/{models.py,forms.py,views.py}
📄 CodeRabbit inference engine (.cursor/rules/django-python.mdc)
Keep business logic in models and forms; keep views thin and focused on request handling
Files:
src/backend/core/models.py
src/backend/**/{models.py,migrations/**/*.py}
📄 CodeRabbit inference engine (.cursor/rules/django-python.mdc)
Implement database indexing and query optimization (Model Meta indexes, constraints)
Files:
- src/backend/core/models.py
- src/backend/core/migrations/0014_blob_encryption_key_id_blob_storage_location_and_more.py
🧠 Learnings (1)
📚 Learning: 2025-09-02T10:12:12.835Z
Learnt from: CR
Repo: suitenumerique/messages PR: 0
File: .cursor/rules/django-python.mdc:0-0
Timestamp: 2025-09-02T10:12:12.835Z
Learning: Applies to src/backend/**/{tests.py,tests/**/*.py} : Use Django’s testing tools (pytest-django) to ensure code quality and reliability
Applied to files:
- src/backend/core/tests/conftest.py
- src/backend/core/tests/commands/__init__.py
🧬 Code graph analysis (3)
src/backend/messages/settings.py (1)
src/backend/core/utils.py (1)
JSONValue(8-22)
src/backend/core/tests/conftest.py (1)
src/backend/core/services/tiered_storage.py (1)
storage(45-49)
src/backend/core/models.py (2)
src/backend/core/enums.py (2)
- BlobStorageLocationChoices (61-65)
- CompressionTypeChoices (54-58)
src/backend/core/services/tiered_storage.py (6)
- TieredStorageService (28-296)
- encrypt (68-91)
- compute_storage_key (52-66)
- decrypt (93-117)
- download_blob (208-242)
- delete_if_orphaned (244-280)
🪛 Ruff (0.14.11)
src/backend/core/tests/commands/test_verify_tiered_storage.py
31-31: import should be at the top-level of a file
(PLC0415)
180-180: import should be at the top-level of a file
(PLC0415)
208-208: import should be at the top-level of a file
(PLC0415)
275-275: import should be at the top-level of a file
(PLC0415)
357-357: import should be at the top-level of a file
(PLC0415)
397-397: import should be at the top-level of a file
(PLC0415)
415-415: import should be at the top-level of a file
(PLC0415)
442-442: import should be at the top-level of a file
(PLC0415)
470-470: import should be at the top-level of a file
(PLC0415)
506-506: import should be at the top-level of a file
(PLC0415)
552-552: import should be at the top-level of a file
(PLC0415)
596-596: import should be at the top-level of a file
(PLC0415)
630-630: import should be at the top-level of a file
(PLC0415)
src/backend/core/management/commands/verify_tiered_storage.py
280-280: import should be at the top-level of a file
(PLC0415)
282-282: import should be at the top-level of a file
(PLC0415)
src/backend/core/tests/services/test_tiered_storage.py
114-114: import should be at the top-level of a file
(PLC0415)
129-129: import should be at the top-level of a file
(PLC0415)
271-271: import should be at the top-level of a file
(PLC0415)
296-296: import should be at the top-level of a file
(PLC0415)
362-362: import should be at the top-level of a file
(PLC0415)
396-396: import should be at the top-level of a file
(PLC0415)
571-571: import should be at the top-level of a file
(PLC0415)
614-614: import should be at the top-level of a file
(PLC0415)
654-654: import should be at the top-level of a file
(PLC0415)
src/backend/core/tests/conftest.py
22-22: import should be at the top-level of a file
(PLC0415)
src/backend/core/models.py
1474-1474: import should be at the top-level of a file
(PLC0415)
1625-1625: import should be at the top-level of a file
(PLC0415)
1656-1656: import should be at the top-level of a file
(PLC0415)
src/backend/core/services/tiered_storage.py
132-132: import should be at the top-level of a file
(PLC0415)
149-149: import should be at the top-level of a file
(PLC0415)
257-257: import should be at the top-level of a file
(PLC0415)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: test-e2e
- GitHub Check: lint-back
- GitHub Check: build-front
- GitHub Check: test-front
- GitHub Check: check-api-state
- GitHub Check: lint-front
- GitHub Check: test-back
🔇 Additional comments (52)
src/backend/core/tests/tasks/__init__.py (1)
1-1: LGTM! Clean package initializer for the tasks test module.
src/backend/core/tests/commands/__init__.py (1)
1-1: LGTM! Appropriate package initializer for management command tests.
src/backend/core/tests/services/__init__.py (1)
1-1: LGTM! Appropriate package initializer for service tests.
src/backend/core/tests/tasks/test_task_send_message.py (3)
3-3: LGTM! Appropriate pylint disables for test file patterns: no-value-for-parameter for Celery task invocations without bound arguments, and unused-argument for pytest fixtures that establish state but aren't directly referenced.
45-45: Good hygiene. Using the _mailbox prefix correctly signals the variable is intentionally unused while still unpacking the fixture tuple.
99-99: Consistent with the pattern above.
src/backend/core/api/viewsets/config.py (1)
139-140: LGTM! Using getattr(settings, setting, None) ensures a consistent API response schema where all keys are always present, aligning with the OpenAPI specification that marks these fields as required. This is cleaner than conditional inclusion and provides predictable behavior for frontend consumers.
src/backend/core/utils.py (1)
14-22: LGTM! Returning None for empty/whitespace strings allows django-configurations to fall back to default values, which is appropriate for optional JSON configuration like encryption keys.
env.d/development/backend.defaults (1)
54-66: LGTM! The development defaults are well-documented with clear comments. Tiered storage is configured with a 3-day offload threshold, and credentials follow the existing pattern for msg-imports.
compose.yaml (1)
84-84: LGTM! The msg-blobs bucket creation follows the existing pattern. It correctly omits the ILM expiration rules since blobs are intended for long-term storage, unlike temporary imports.
src/backend/messages/celery_app.py (1)
48-52: LGTM! The new beat schedule entry follows the existing pattern, and an hourly interval is appropriate for blob offloading. The core.services.tiered_storage_tasks.offload_blobs_task is properly defined with the @celery_app.task(bind=True) decorator and has comprehensive test coverage.
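For reference, the hourly beat entry being praised likely has roughly the following shape; the schedule key name and crontab below are illustrative, not copied from celery_app.py.

```python
from celery import Celery
from celery.schedules import crontab

celery_app = Celery("messages")
celery_app.conf.beat_schedule = {
    "offload-blobs-to-object-storage": {  # illustrative key name
        "task": "core.services.tiered_storage_tasks.offload_blobs_task",
        "schedule": crontab(minute=0),    # run at the top of every hour
    },
}
```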
src/backend/core/services/search/search.py (1)
40-42: No action required; direct access to settings.OPENSEARCH_INDEX_THREADS is safe. The setting is properly defined in src/backend/messages/settings.py as a BooleanValue with a default of True. It will always exist at runtime, and direct access does not risk AttributeError. This pattern is already used consistently throughout the codebase in multiple other files (src/backend/core/signals.py, src/backend/core/services/search/tasks.py).
Likely an incorrect or invalid review comment.
src/backend/core/enums.py (1)
61-66: LGTM! The new BlobStorageLocationChoices enum follows the existing patterns in this file with appropriate integer values and descriptive labels. Good placement in the logical order of the file.
src/backend/core/tests/conftest.py (1)
14-45: LGTM! The session-scoped fixture properly defers Django imports and handles missing storage configuration gracefully. The broad exception handling is appropriate here to prevent test setup failures when object storage isn't configured. The pylint: disable comment at line 3 correctly covers the in-function import pattern.
src/backend/core/services/tiered_storage_tasks.py (1)
26-65: LGTM! The task efficiently streams eligible blob IDs using values_list with iterator() to avoid memory pressure. The filtering criteria (age threshold + minimum size) appropriately limit the scope of each run.
src/backend/core/tests/commands/test_verify_tiered_storage.py (3)
1-7: LGTM! Comprehensive test coverage for the verify_tiered_storage management command. The tests appropriately cover disabled states, E2E verification modes, hash verification with corruption detection, and re-encryption workflows. The in-function imports are intentionally used for test isolation, and the pylint: disable comment at line 7 correctly covers this pattern.
78-80: Good cleanup pattern. Consistent use of try/finally with existence checks ensures test isolation and prevents storage pollution across test runs.
468-534: Thorough E2E test for object storage re-encryption. This test covers the complete workflow: encrypting with the old key, uploading to storage, rotating keys, re-encrypting, and verifying content integrity via download and decompression.
src/backend/core/signals.py (1)
51-52: The concern is unfounded. OPENSEARCH_INDEX_THREADS is always defined in settings with a default value (True in src/backend/messages/settings.py:83-84 and False as a fallback at line 1090). Direct attribute access to settings.OPENSEARCH_INDEX_THREADS is safe and will not raise AttributeError. The change from getattr() to direct access is a valid simplification that removes unnecessary defensive programming.
Likely an incorrect or invalid review comment.
src/backend/core/migrations/0014_blob_encryption_key_id_blob_storage_location_and_more.py (1)
12-27: Migration structure looks correct. The migration properly:
- Adds storage_location with db_index=True for efficient filtering
- Makes raw_content nullable to support object storage blobs
- Uses hardcoded choices in the migration (correct Django practice)

One consideration: if key rotation queries will frequently filter by encryption_key_id (e.g., finding all blobs encrypted with a specific key), adding an index on that field could be beneficial. However, this can be deferred based on actual query patterns; a possible follow-up is sketched below.
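If that index ever becomes necessary, a follow-up migration could look like this sketch. The field type is an assumption (a small positive integer defaulting to 0, per the key_id=0 convention); check the actual model definition before reusing it.

```python
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [
        ("core", "0014_blob_encryption_key_id_blob_storage_location_and_more"),
    ]

    operations = [
        migrations.AlterField(
            model_name="blob",
            name="encryption_key_id",
            # Same default as today, just adding the index for rotation queries.
            field=models.PositiveIntegerField(default=0, db_index=True),
        ),
    ]
```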
src/backend/core/tests/services/test_tiered_storage.py (5)
24-181: Comprehensive unit test coverage for encryption/decryption. The unit tests thoroughly cover:
- Storage key computation with different SHA256 prefixes
- Encryption passthrough when disabled (key_id=0)
- Proper error handling for invalid/missing keys and corrupted data
- Key rotation scenarios maintaining backward compatibility

Good separation of concerns with no DB or storage dependencies.
183-263: Good database-level test coverage. The tests properly validate:
- Default storage location behavior
- Content retrieval from PostgreSQL
- Error handling when content is missing
- SHA256-based storage key derivation
- Deduplication detection via check_already_uploaded
389-420: Critical regression test for the double-encryption bug. This test (test_offload_with_encryption_roundtrip) is valuable as it explicitly guards against the double-encryption bug mentioned in the docstring. The test verifies that encrypted blobs can be offloaded and read back correctly, which is a common failure point.
494-562: Important deduplication behavior documented in a test. The test correctly validates that when two blobs with identical content are encrypted with different keys, deduplication uses the first blob's encryption key_id. The inline comment at line 554 clarifies this is expected behavior until key rotation is complete.
565-667: Key rotation tests cover both storage locations. The tests properly validate the re-encryption workflow for:
- PostgreSQL-stored blobs (decrypt with old key, encrypt with new key, update in place)
- Object storage blobs (download, decrypt, re-encrypt, upload, update metadata)
Both tests verify content integrity after rotation, which is critical.
src/backend/messages/settings.py (2)
234-259: New storage configuration follows existing patterns.The
message-blobsstorage configuration mirrors the existingmessage-importspattern. The default bucket namemsg-blobsis provided.One difference:
endpoint_urlhas no default value, whereas some configurations might expect a default for local development. Verify this is intentional and that the development environment properly setsSTORAGE_MESSAGE_BLOBS_ENDPOINT_URL.
376-398: Well-documented encryption and offload configuration.The configuration properly:
- Documents the key format and key_id=0 convention
- Uses appropriate types (
JSONValuefor dict,PositiveIntegerValuefor IDs)- Defaults to encryption disabled (key_id=0), requiring explicit opt-in
- Allows fine-grained control over offload timing and size thresholds
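To make those conventions concrete, a development-style configuration might look like the sketch below. The placeholder keys and the exact setting names are assumptions based on this review, not values copied from settings.py.

```python
# Fernet keys are 32 url-safe base64-encoded bytes, e.g. Fernet.generate_key().decode()
MESSAGES_BLOB_ENCRYPTION_KEYS = {
    "1": "<old Fernet key>",   # kept so existing blobs can still be decrypted
    "2": "<new Fernet key>",   # added during rotation
}
MESSAGES_BLOB_ENCRYPTION_ACTIVE_KEY_ID = 2   # 0 would mean "store plaintext"
TIERED_STORAGE_OFFLOAD_AFTER_DAYS = 3        # matches the development default above
TIERED_STORAGE_OFFLOAD_MIN_SIZE = 16 * 1024  # example threshold; skip tiny blobs
```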
src/backend/core/tests/tasks/test_tiered_storage_tasks.py (5)
32-48: Disabled state test properly mocks settings. The test correctly mocks the storage configuration to verify the task gracefully handles the disabled state.
55-120: Good coverage of blob eligibility criteria. The tests properly validate:
- Age-based filtering using TIERED_STORAGE_OFFLOAD_AFTER_DAYS
- Size-based filtering using TIERED_STORAGE_OFFLOAD_MIN_SIZE
- The use of Blob.objects.filter().update() correctly bypasses any auto-update timestamps

The conditional assertion at line 119 (if settings.TIERED_STORAGE_OFFLOAD_MIN_SIZE > 0) adapts to the test environment settings, which is acceptable.
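A sketch of the eligibility filter these tests exercise, assuming conventional field names (created_at, size_compressed); the real task may differ in ordering or chunking.

```python
from datetime import timedelta

from django.conf import settings
from django.utils import timezone

from core.enums import BlobStorageLocationChoices
from core.models import Blob


def eligible_blob_ids():
    """Yield IDs of blobs old and large enough to offload to object storage."""
    cutoff = timezone.now() - timedelta(days=settings.TIERED_STORAGE_OFFLOAD_AFTER_DAYS)
    return (
        Blob.objects.filter(
            storage_location=BlobStorageLocationChoices.POSTGRES,
            created_at__lt=cutoff,
            size_compressed__gte=settings.TIERED_STORAGE_OFFLOAD_MIN_SIZE,
        )
        .values_list("id", flat=True)
        .iterator()  # stream IDs instead of materializing the whole list
    )
```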
156-171: Consistent disabled state handling test. Follows the same mocking pattern as the batch task test.
242-264: Good error handling test with rollback verification. The test test_handles_upload_error properly verifies that:
- Upload errors are caught and reported
- The blob state is preserved (transaction rolled back)
- Storage location remains POSTGRES and raw_content is intact

This ensures data integrity during upload failures.
335-352: Idempotency test validates concurrent safety. The test_concurrent_offload_idempotent test verifies that repeated offload attempts for the same blob are handled gracefully, returning already_offloaded on subsequent calls. This is important for Celery task retry scenarios.
src/backend/core/management/commands/verify_tiered_storage.py (6)
24-88: Well-designed CLI interface with clear modes. The command provides:
- Multiple verification modes (db-to-storage, storage-to-db, full)
- Safety features (--dry-run, --limit)
- Key rotation capability (--re-encrypt)
- Proper early exit when storage is not configured
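Example invocations of the command described above, written with call_command; the option spellings are inferred from the flags listed in this review and may not match the parser exactly.

```python
from django.core.management import call_command

# Sample 1,000 blobs and confirm each offloaded blob has an object in storage.
call_command("verify_tiered_storage", mode="db-to-storage", limit=1000)

# Scan the bucket for orphaned objects and delete them.
call_command("verify_tiered_storage", mode="storage-to-db", fix=True)

# Preview a key rotation without writing anything.
call_command("verify_tiered_storage", re_encrypt=True, dry_run=True)
```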
89-134: DB-to-storage verification handles scale well. The method:
- Uses iterator(chunk_size=1000) to avoid memory issues with large datasets
- Reports missing blobs to stderr (appropriate severity for potential data loss)
- Respects the --limit option for sampling
136-218: Storage-to-DB verification with orphan cleanup. The method properly:
- Validates storage path format before processing
- Detects orphans (objects without DB references)
- Optionally deletes orphans with --fix
- Optionally verifies hashes (wisely marked as "slow")

Progress is reported every 100 objects (line 201), which provides good feedback during long operations.
220-254: Storage listing handles multiple backends. The method:
- Prefers direct boto3 access with pagination for efficiency
- Falls back to Django's listdir for compatibility
- Raises a clear error when listing is not supported
256-305: Hash verification covers the complete pipeline. The method correctly:
- Downloads encrypted content from storage
- Decrypts using the blob's encryption_key_id
- Decompresses based on the compression type
- Computes and compares the SHA256 hash

The broad exception handling is acceptable here since the goal is to report issues, not crash.
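That per-blob check can be pictured as the sketch below, assuming zstd compression and the service API names used elsewhere in this review (download_blob, decrypt); it is not the command's literal code.

```python
import hashlib

import pyzstd

from core.enums import CompressionTypeChoices


def blob_hash_matches(service, blob) -> bool:
    """Download, decrypt, decompress, and compare against the stored SHA-256."""
    encrypted = service.download_blob(blob)
    data = service.decrypt(encrypted, blob.encryption_key_id)
    if blob.compression == CompressionTypeChoices.ZSTD:  # member name assumed
        data = pyzstd.decompress(data)
    return hashlib.sha256(data).digest() == bytes(blob.sha256)
```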
307-402: Re-encryption workflow with proper validation. The method:
- Validates encryption configuration before starting
- Uses a smaller chunk size (100) appropriate for the heavier re-encryption operation
- Supports --dry-run for safe preview
- Provides clear progress reporting and a final summary
src/backend/core/models.py (6)
31-33: No review needed for this import change.
1472-1487: Encryption is integrated cleanly into blob creation.
Storing encrypted bytes and the key id alongside compression keeps retrieval consistent.
1595-1599: Conditional size_compressed update makes sense.
This avoids clobbering the stored size after offload clears raw_content.
1601-1609: No issues spotted.
1624-1643: Content retrieval flow reads cleanly.
The PostgreSQL vs object-storage branches and decryption/decompression steps are clear.
1645-1660: Deletion flow aligns with object-storage cleanup.
src/backend/core/services/tiered_storage.py (8)
1-24: No review needed for the module header/imports.
44-49: Lazy storage initialization looks good.
51-66: Storage key sharding is straightforward.
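For readers unfamiliar with the pattern, prefix sharding of storage keys typically looks like the sketch below; the real compute_storage_key may use a different prefix depth or path layout.

```python
def compute_storage_key(sha256: bytes) -> str:
    """Derive an object-storage key from a blob's SHA-256 digest."""
    digest = sha256.hex()
    # Two 2-character prefixes spread ~1B objects across 65,536 "directories",
    # keeping any single listing small.
    return f"{digest[:2]}/{digest[2:4]}/{digest}"
```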
68-117: Encryption/decryption flow is consistent with key_id semantics.
Please double-check that configured key formats align with Fernet requirements.
119-155: DB-backed dedup lookups are clear.
208-243: Download path is clear and well-scoped.
282-296: Existence check is straightforward.
157-207: Return value is correctly handled in production code. The call site in src/backend/core/services/tiered_storage_tasks.py (line 113) properly captures the returned key_id and updates blob.encryption_key_id accordingly. Dedup-specific tests also confirm this pattern works correctly when the same content is uploaded with different encryption keys.
    def __init__(self):
        """Initialize the service, checking if object storage is configured."""
        # Check if message-blobs storage has endpoint_url configured
        self._storage = None
        self.enabled = bool(
            settings.STORAGES.get("message-blobs", {})
            .get("OPTIONS", {})
            .get("endpoint_url")
        )
        # encryption_keys is a dict: {"1": "key1", "2": "key2"}
        self.encryption_keys = settings.MESSAGES_BLOB_ENCRYPTION_KEYS or {}
        self.active_key_id = settings.MESSAGES_BLOB_ENCRYPTION_ACTIVE_KEY_ID
🧩 Analysis chain
🏁 Scripts executed:
- rg -n "message-blobs" -C 5 src/backend/
- find src/backend -name "settings.py" -o -name "settings*.py" | head -20
- rg -n "STORAGES\s*=" -C 3 src/backend/
- rg -n "endpoint_url" -C 3 src/backend/
- cat -n src/backend/core/services/tiered_storage.py | head -80
- cat -n src/backend/core/management/commands/verify_tiered_storage.py | head -85
Fix enabled gate to check storage config presence, not endpoint_url.
Lines 35-39 gate on endpoint_url presence, which silently disables valid AWS S3 configurations (AWS S3 doesn't use endpoint_url; that field is optional for custom endpoints like MinIO). The condition should check for the presence of the message-blobs storage config itself.
Suggested fix
- self.enabled = bool(
- settings.STORAGES.get("message-blobs", {})
- .get("OPTIONS", {})
- .get("endpoint_url")
- )
+ storage_cfg = settings.STORAGES.get("message-blobs")
+ self.enabled = bool(storage_cfg)
🤖 Prompt for AI Agents
In `@src/backend/core/services/tiered_storage.py` around lines 31 - 43, In
__init__, the enabled gate currently checks for an OPTIONS.endpoint_url which
wrongly disables valid S3 setups; instead set self.enabled based on presence of
the "message-blobs" storage config itself (e.g. check that settings.STORAGES
contains a non-empty "message-blobs" entry). Update the assignment to
self.enabled to use settings.STORAGES.get("message-blobs") (or "message-blobs"
in settings.STORAGES and truthy) rather than digging for OPTIONS.endpoint_url so
AWS S3 configs without endpoint_url remain enabled.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@src/backend/core/management/commands/verify_tiered_storage.py`:
- Around line 449-455: When re-encrypting Postgres blobs in
verify_tiered_storage.py (using self.service.decrypt and self.service.encrypt)
ensure you persist the recomputed size_compressed: after setting
blob.raw_content = encrypted and blob.encryption_key_id = new_key_id, call
blob.save(update_fields=["raw_content", "encryption_key_id", "size_compressed"])
so the recalculated size_compressed is written to the DB (match the approach
used in 0007_blob_size_compressed.py).
- Around line 280-282: Move the local imports of pyzstd and
CompressionTypeChoices out of the function and place them at the top of the
module alongside the existing BlobStorageLocationChoices import so they are
module-level imports; update the import section to include "import pyzstd" and
"from core.enums import CompressionTypeChoices" and remove the in-function
imports where pyzstd and CompressionTypeChoices are currently referenced in
verify_tiered_storage logic.
In `@src/backend/core/signals.py`:
- Around line 109-119: The post_delete signal handler cleanup_blob_storage
currently calls TieredStorageService().delete_if_orphaned() immediately, which
can run inside a transaction that may roll back; wrap the deletion call in
transaction.on_commit to defer it until after successful commit: import
django.db.transaction if needed, capture the instance.sha256 (or
bytes(instance.sha256)) and any required storage_location check inside
cleanup_blob_storage, create a small closure or lambda that calls
TieredStorageService().delete_if_orphaned(...) and register it with
transaction.on_commit, and ensure the existing guard checks
(BlobStorageLocationChoices.OBJECT_STORAGE and service.enabled) are preserved
before scheduling the on_commit callback.
In `@src/backend/core/tests/services/test_tiered_storage.py`:
- Line 114: Several intentional local test imports (e.g. "from
cryptography.fernet import InvalidToken" and the other test-only imports flagged
at the comment's listed locations) are meant to remain inside test functions;
add a trailing "# noqa: PLC0415" to each of those local import lines instead of
moving them to module level so Ruff/Pylint stops reporting
import-outside-toplevel while keeping the imports local to their tests. Ensure
you update each flagged import line (the ones referenced in the review) by
appending the exact comment "# noqa: PLC0415".
🧹 Nitpick comments (2)
src/backend/core/tests/services/test_tiered_storage.py (1)
265-563: Consider skipping E2E tests when object storage isn't configured.
The docstring says "when available," but these tests hard-fail if storage isn't configured. A skipif makes local/dev runs more resilient.
🔧 Suggested refactor
+# At module level
+_STORAGE_ENABLED = TieredStorageService().enabled
@@
-@pytest.mark.django_db
-class TestTieredStorageE2E:
+@pytest.mark.django_db
+@pytest.mark.skipif(
+    not _STORAGE_ENABLED, reason="Object storage not configured"
+)
+class TestTieredStorageE2E:
src/backend/core/tests/tasks/test_tiered_storage_tasks.py (1)
60-194: Consider skipping E2E task tests when object storage isn't configured.
These E2E tests will fail in environments without MinIO. A skipif keeps dev/test runs resilient while still exercising coverage in CI.
🔧 Suggested refactor
+# At module level
+_STORAGE_ENABLED = TieredStorageService().enabled
@@
-@pytest.mark.django_db
-class TestOffloadBlobsTaskE2E:
+@pytest.mark.django_db
+@pytest.mark.skipif(
+    not _STORAGE_ENABLED, reason="Object storage not configured"
+)
+class TestOffloadBlobsTaskE2E:
@@
-@pytest.mark.django_db
-class TestOffloadSingleBlobTaskE2E:
+@pytest.mark.django_db
+@pytest.mark.skipif(
+    not _STORAGE_ENABLED, reason="Object storage not configured"
+)
+class TestOffloadSingleBlobTaskE2E:
Also applies to: 221-399
        import pyzstd

        from core.enums import CompressionTypeChoices
🧩 Analysis chain
🏁 Scripts executed:
- cat -n src/backend/core/management/commands/verify_tiered_storage.py | sed -n '270,295p'
- head -n 50 src/backend/core/management/commands/verify_tiered_storage.py | cat -n
- rg "pyzstd|CompressionTypeChoices" src/backend/core/management/commands/verify_tiered_storage.py -n
- cat -n src/backend/core/management/commands/verify_tiered_storage.py | sed -n '250,310p'
- rg "pyzstd" src/backend/core/management/commands/verify_tiered_storage.py -B 3 -A 3
- rg "from core.enums import" src/backend/core/management/commands/verify_tiered_storage.py
- fd "requirements" src/backend/ | head -20
- find . -maxdepth 3 -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "poetry.lock" -o -name "Pipfile" \) 2>/dev/null | head -20
- cat src/backend/pyproject.toml | grep -A 100 "dependencies\|pyzstd"
- rg "CompressionTypeChoices" src/backend/core/management/commands/verify_tiered_storage.py -n
- cat -n src/backend/core/management/commands/verify_tiered_storage.py | sed -n '1,30p'
Move pyzstd and CompressionTypeChoices imports to module level.
Both imports are required dependencies and should follow PEP 8 conventions at the module level. CompressionTypeChoices already shares the same source as BlobStorageLocationChoices (which is imported at module level), and pyzstd is a direct project dependency. Keeping them local to the method is unnecessary and inconsistent with the codebase's import pattern.
🧰 Tools
🪛 Ruff (0.14.14)
280-280: import should be at the top-level of a file
(PLC0415)
282-282: import should be at the top-level of a file
(PLC0415)
🤖 Prompt for AI Agents
In `@src/backend/core/management/commands/verify_tiered_storage.py` around lines
280 - 282, Move the local imports of pyzstd and CompressionTypeChoices out of
the function and place them at the top of the module alongside the existing
BlobStorageLocationChoices import so they are module-level imports; update the
import section to include "import pyzstd" and "from core.enums import
CompressionTypeChoices" and remove the in-function imports where pyzstd and
CompressionTypeChoices are currently referenced in verify_tiered_storage logic.
decrypted = self.service.decrypt(bytes(blob.raw_content), old_key_id)
encrypted, new_key_id = self.service.encrypt(decrypted)

blob.raw_content = encrypted
blob.encryption_key_id = new_key_id
blob.save(update_fields=["raw_content", "encryption_key_id"])
🧩 Analysis chain
🏁 Scripts executed:
- find . -name "*.py" -type f | xargs grep -l "class Blob" | head -20
- cat -n src/backend/core/models.py | grep -A 50 "class Blob"
- grep -n "def save" src/backend/core/models.py | grep -A 20 "class Blob" | head -50
- sed -n '1501,1700p' src/backend/core/models.py | grep -n "def save"
- sed -n '1590,1650p' src/backend/core/models.py
- sed -n '440,465p' src/backend/core/management/commands/verify_tiered_storage.py
- rg "post_save.*Blob|receiver.*Blob" src/backend/core/ -A 5
- rg "blob\.save\(update_fields" src/backend/ -B 3 -A 1
- rg "def encrypt|def decrypt" src/backend/core/services/ -A 10 | head -80
Include size_compressed in the update set when re-encrypting Postgres blobs.
When blob.raw_content is changed to newly encrypted content, Blob.save() recalculates size_compressed from the new content length. However, using update_fields without size_compressed prevents this recalculated value from being persisted to the database, leaving the field stale. This is demonstrated in migration 0007_blob_size_compressed.py, which explicitly includes size_compressed in update_fields when modifying raw content.
🔧 Suggested fix
- blob.save(update_fields=["raw_content", "encryption_key_id"])
+ blob.save(
+ update_fields=["raw_content", "encryption_key_id", "size_compressed"]
+ )
🤖 Prompt for AI Agents
In `@src/backend/core/management/commands/verify_tiered_storage.py` around lines
449 - 455, When re-encrypting Postgres blobs in verify_tiered_storage.py (using
self.service.decrypt and self.service.encrypt) ensure you persist the recomputed
size_compressed: after setting blob.raw_content = encrypted and
blob.encryption_key_id = new_key_id, call
blob.save(update_fields=["raw_content", "encryption_key_id", "size_compressed"])
so the recalculated size_compressed is written to the DB (match the approach
used in 0007_blob_size_compressed.py).
@receiver(post_delete, sender=models.Blob)
def cleanup_blob_storage(sender, instance, **kwargs):
    """Clean up object storage when a blob is deleted.

    Uses a post_delete signal instead of a model delete() override to ensure
    cleanup also runs during CASCADE and QuerySet bulk deletes.
    """
    if instance.storage_location == BlobStorageLocationChoices.OBJECT_STORAGE:
        service = TieredStorageService()
        if service.enabled:
            service.delete_if_orphaned(bytes(instance.sha256))
🧩 Analysis chain
🏁 Scripts executed:
- find src/backend -name "signals.py" -type f
- cat -n src/backend/core/signals.py | head -130
- rg "from django.db import" src/backend/core/signals.py
- rg "transaction\.on_commit" src/backend --type py
- rg "post_delete" src/backend --type py -A 10
🌐 Web query:
Django post_delete signal transaction context behavior
💡 Result:
- When post_delete fires: Django sends post_delete "at the end" of a model instance's delete() and a queryset's delete() — i.e., after the SQL DELETE for that object has been executed (within Django's deletion flow). The instance is no longer in the DB at that point. [1]
- Transaction context: post_delete is sent inside the same DB transaction used for the deletion (including the implicit atomic() Django uses around collected deletes). This is intentional so any DB writes done by signal receivers participate in the same transaction and will be committed/rolled back together. If the surrounding transaction later rolls back, the signal has already run, but all DB changes (the delete and receiver writes) roll back too. [2]
- If you need "only after commit" behavior: for non-DB side effects (send email, enqueue Celery task, invalidate cache, call external API), don't do them directly in post_delete. Register them with transaction.on_commit(...) so they run only if the transaction actually commits; otherwise they're discarded. [2][3]
- Bulk / cascades: QuerySet.delete() does not call each model's delete() method, but it does emit pre_delete/post_delete for each deleted object, including cascaded deletions. [1][4]
Sources: [1] Django signals docs (post_delete) https://docs.djangoproject.com/en/2.0/ref/signals/ ; [2] Django ticket discussion explaining post_delete is inside the delete transaction and recommending on_commit https://code.djangoproject.com/ticket/30835 ; [3] Django transactions docs (transaction.on_commit) https://docs.djangoproject.com/en/3.2/topics/db/transactions/ ; [4] StackOverflow quoting Django docs on bulk delete and signals https://stackoverflow.com/questions/28896237/override-djangos-model-delete-method-for-bulk-deletion
Defer object-storage deletion until the DB commit.
post_delete runs inside the surrounding transaction. If it rolls back, the S3 object deletion has already completed, leaving an orphaned object while the DB row is restored. Use transaction.on_commit() to ensure cleanup only happens after a successful commit.
🔧 Suggested fix
-from django.db.models.signals import post_delete, post_save, pre_delete
+from django.db import transaction
+from django.db.models.signals import post_delete, post_save, pre_delete
@@
def cleanup_blob_storage(sender, instance, **kwargs):
@@
- if instance.storage_location == BlobStorageLocationChoices.OBJECT_STORAGE:
- service = TieredStorageService()
- if service.enabled:
- service.delete_if_orphaned(bytes(instance.sha256))
+ if instance.storage_location == BlobStorageLocationChoices.OBJECT_STORAGE:
+ service = TieredStorageService()
+ if service.enabled:
+ sha256 = bytes(instance.sha256)
+ transaction.on_commit(
+ lambda: service.delete_if_orphaned(sha256)
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from django.db import transaction
from django.db.models.signals import post_delete, post_save, pre_delete


@receiver(post_delete, sender=models.Blob)
def cleanup_blob_storage(sender, instance, **kwargs):
    """Clean up object storage when a blob is deleted.

    Uses a post_delete signal instead of a model delete() override to ensure
    cleanup also runs during CASCADE and QuerySet bulk deletes.
    """
    if instance.storage_location == BlobStorageLocationChoices.OBJECT_STORAGE:
        service = TieredStorageService()
        if service.enabled:
            sha256 = bytes(instance.sha256)
            transaction.on_commit(
                lambda: service.delete_if_orphaned(sha256)
            )
🤖 Prompt for AI Agents
In `@src/backend/core/signals.py` around lines 109 - 119, The post_delete signal
handler cleanup_blob_storage currently calls
TieredStorageService().delete_if_orphaned() immediately, which can run inside a
transaction that may roll back; wrap the deletion call in transaction.on_commit
to defer it until after successful commit: import django.db.transaction if
needed, capture the instance.sha256 (or bytes(instance.sha256)) and any required
storage_location check inside cleanup_blob_storage, create a small closure or
lambda that calls TieredStorageService().delete_if_orphaned(...) and register it
with transaction.on_commit, and ensure the existing guard checks
(BlobStorageLocationChoices.OBJECT_STORAGE and service.enabled) are preserved
before scheduling the on_commit callback.
    def test_decrypt_with_corrupted_data(self):
        """Test that decrypt fails gracefully with corrupted Fernet data."""
        from cryptography.fernet import InvalidToken
🧩 Analysis chain
🏁 Scripts executed:
- find . -name "test_tiered_storage.py" | head -5
- wc -l src/backend/core/tests/services/test_tiered_storage.py
- head -130 src/backend/core/tests/services/test_tiered_storage.py | tail -30
- sed -n over each flagged range ('110,120p' through '709,719p') of src/backend/core/tests/services/test_tiered_storage.py
- head -50 src/backend/core/tests/services/test_tiered_storage.py
Add # noqa: PLC0415 markers to intentional local imports instead of moving them to module level.
The file already disables import-outside-toplevel via pylint, indicating these are intentional. In test files, local imports are a standard practice to keep test methods focused on their specific dependencies and avoid polluting the module namespace. Rather than moving these to the top-level, add # noqa: PLC0415 to each flagged import (lines 114, 129, 271, 296, 362, 396, 596, 630, 673, 713) to suppress the Ruff warning while preserving the intended structure.
🧰 Tools
🪛 Ruff (0.14.14)
114-114: import should be at the top-level of a file
(PLC0415)
🤖 Prompt for AI Agents
In `@src/backend/core/tests/services/test_tiered_storage.py` at line 114, Several
intentional local test imports (e.g. "from cryptography.fernet import
InvalidToken" and the other test-only imports flagged at the comment's listed
locations) are meant to remain inside test functions; add a trailing "# noqa:
PLC0415" to each of those local import lines instead of moving them to module
level so Ruff/Pylint stops reporting import-outside-toplevel while keeping the
imports local to their tests. Ensure you update each flagged import line (the
ones referenced in the review) by appending the exact comment "# noqa: PLC0415".
This allows using S3-compatible object storage to offload blobs, making Postgres much lighter. We design for storing ~1B emails on a single instance.
Fixes #185.