Data Archival

There are often records that no longer need to be stored for most use cases but that we want to keep around just in case (or, as with Calendar, for cool historical purposes). We can store these records in S3 instead of DynamoDB, where storage is cheaper (albeit harder to query). Our archival system uses a serverless pipeline to move expired DynamoDB records to S3 for cost-effective, long-term storage.

Since we operate at a small scale, we considered leaving the data in hot storage until expiry (i.e., 5 years). However, at least for a system like Events, where full table scans are common, this would significantly reduce performance.

Specific Lambda function names and retention values can be found in terraform/modules/archival/, and the module parameters in the Terraform envs.


Reading Archived Data

See this Python notebook for an example of how to read the data into Pandas.
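
If you just need a quick look at the data outside the notebook, the sketch below shows one way to do it. It assumes (not confirmed by this page) that each object in the ddb-archive bucket is GZIP-compressed, newline-delimited JSON; the partition prefix used here is a hypothetical example, and the notebook remains the authoritative reference.

```python
"""Minimal sketch of reading archived records back into pandas.

Assumptions: objects under ddb-archive are GZIP-compressed, newline-delimited
JSON, laid out under the resource=/year= partition prefix described below.
"""
import gzip
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "ddb-archive"
PREFIX = "resource=example-table/year=2025/"  # hypothetical partition path

frames = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        # Firehose wrote GZIP-compressed batches; decompress and parse as JSON lines.
        text = gzip.decompress(body).decode("utf-8")
        frames.append(pd.read_json(io.StringIO(text), lines=True))

df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(df.head())
```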


1. Trigger and Capture

The process starts when a record's Time To Live (TTL) attribute expires in a monitored DynamoDB table. DynamoDB automatically deletes the item, which generates a REMOVE event in the table's DynamoDB Stream. This event contains the full data of the deleted record. For examples of how TTL values are generated for DynamoDB records, see src/common/constants.ts.
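
As a hedged illustration (the real TTL generation lives in src/common/constants.ts, which is TypeScript), the Python sketch below shows what a TTL attribute value is and the rough shape of the TTL-driven REMOVE stream record the pipeline reacts to; the attribute names, key names, and retention period are hypothetical.

```python
"""Illustration of the two pieces this step relies on.

(1) A TTL attribute is simply a Unix-epoch-seconds number stored on the item.
(2) When DynamoDB deletes an expired item, the stream record carries the full
    item in OldImage and a userIdentity marking a service-initiated deletion.
"""
import time

FIVE_YEARS_SECONDS = 5 * 365 * 24 * 60 * 60  # hypothetical retention period
ttl_value = int(time.time()) + FIVE_YEARS_SECONDS  # stored as a Number attribute

example_stream_record = {
    "eventName": "REMOVE",
    "userIdentity": {
        "type": "Service",
        "principalId": "dynamodb.amazonaws.com",  # marks a TTL-driven deletion
    },
    "dynamodb": {
        "Keys": {"id": {"S": "example-id"}},
        "OldImage": {  # full deleted record, in DynamoDB attribute-value JSON
            "id": {"S": "example-id"},
            "expireAt": {"N": str(ttl_value)},
        },
    },
}
```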


2. Processing and Ingestion

A Lambda function processes the above stream events.

  • Filtered Invocation: The aws_lambda_event_source_mapping uses a filter_criteria to ensure the Lambda only runs for TTL-driven deletions (principalId = "dynamodb.amazonaws.com"), ignoring other types of deletes.

  • Data Enrichment & Forwarding: The Lambda script deserializes the record data and adds two metadata fields: __infra_archive_resource (the table name) and __infra_archive_timestamp. In the Lambda code, you can specify parsing functions to derive the timestamp from the data itself, instead of the archival time, making it easier to query.

The Lambda then sends the enriched records in batches to the Kinesis Firehose delivery stream, roughly as sketched below.
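
This sketch assumes a Python runtime and a hypothetical environment variable for the Firehose stream name; the actual handler (including its timestamp-parsing hooks) lives alongside the module in terraform/modules/archival/.

```python
"""Minimal sketch of the enrichment Lambda, not the module's actual code."""
import json
import os
import time

import boto3
from boto3.dynamodb.types import TypeDeserializer

firehose = boto3.client("firehose")
deserializer = TypeDeserializer()
STREAM_NAME = os.environ["ARCHIVE_FIREHOSE_NAME"]  # hypothetical variable name


def handler(event, context):
    records = []
    for rec in event["Records"]:
        # Convert the DynamoDB attribute-value map into a plain dict.
        item = {
            k: deserializer.deserialize(v)
            for k, v in rec["dynamodb"]["OldImage"].items()
        }
        # Enrichment metadata used later for Firehose dynamic partitioning.
        item["__infra_archive_resource"] = rec["eventSourceARN"].split("/")[1]
        item["__infra_archive_timestamp"] = int(time.time())
        records.append({"Data": (json.dumps(item, default=str) + "\n").encode()})

    # Firehose accepts at most 500 records per PutRecordBatch call.
    for i in range(0, len(records), 500):
        firehose.put_record_batch(
            DeliveryStreamName=STREAM_NAME, Records=records[i : i + 500]
        )
```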


3. S3 Storage and Organization

Kinesis Firehose manages the data delivery to the S3 bucket, ddb-archive.

  • Dynamic Partitioning: Firehose uses its built-in processing_configuration with JQ to extract the metadata fields from the JSON payload.

  • Organized Storage: It uses this extracted data for dynamic partitioning, organizing files into a logical path structure defined by the prefix parameter: resource=!{partitionKeyFromQuery:resource}/year=!{partitionKeyFromQuery:year}/....

  • Efficiency: Before writing to S3, Firehose automatically compresses the batched data to GZIP format.
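
The partitioned layout can be explored directly with the S3 API. The sketch below lists the resource= and year= prefixes in the bucket; deeper partition keys are elided in the prefix shown above, so they are not guessed here.

```python
"""Sketch: discovering the partition layout Firehose writes to ddb-archive."""
import boto3

s3 = boto3.client("s3")
BUCKET = "ddb-archive"

# Top level: one "folder" per archived table.
resources = s3.list_objects_v2(Bucket=BUCKET, Delimiter="/")
for cp in resources.get("CommonPrefixes", []):
    resource_prefix = cp["Prefix"]  # e.g. "resource=<table-name>/"
    years = s3.list_objects_v2(Bucket=BUCKET, Prefix=resource_prefix, Delimiter="/")
    for ycp in years.get("CommonPrefixes", []):
        print(ycp["Prefix"])        # e.g. "resource=<table-name>/year=2025/"
```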


4. Storage Lifecycle Management

The S3 bucket uses lifecycle rules to optimize storage costs over time.

  • Intelligent-Tiering: The bucket lifecycle configuration resource transitions all objects to the INTELLIGENT_TIERING storage class after 1 day.

  • Archive Access: The bucket lifecycle configuration resource further moves data to the low-cost ARCHIVE_ACCESS tier after 180 days of no access.

  • Permanent Deletion: The module can optionally be provided the TableDeletionDays map to dynamically create lifecycle rules that permanently delete archived data from S3 on a per-table basis after a specified number of days.
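
For illustration only, the snippet below expresses roughly equivalent settings through boto3; the real rules are defined in the Terraform module, and the rule IDs and per-table retention shown here are assumptions. Note that in the S3 API, the Archive Access tier is configured through the bucket's Intelligent-Tiering configuration rather than a lifecycle transition.

```python
"""Illustration only: the real lifecycle rules live in terraform/modules/archival/."""
import boto3

s3 = boto3.client("s3")
BUCKET = "ddb-archive"

# Transition everything to Intelligent-Tiering after 1 day; expire one
# (hypothetical) table's partition after 365 days, mirroring TableDeletionDays.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "to-intelligent-tiering",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Transitions": [{"Days": 1, "StorageClass": "INTELLIGENT_TIERING"}],
            },
            {
                "ID": "expire-example-table",
                "Filter": {"Prefix": "resource=example-table/"},
                "Status": "Enabled",
                "Expiration": {"Days": 365},
            },
        ]
    },
)

# The Archive Access tier (after 180 days without access) is part of the
# bucket's Intelligent-Tiering configuration.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket=BUCKET,
    Id="archive-access",
    IntelligentTieringConfiguration={
        "Id": "archive-access",
        "Status": "Enabled",
        "Tierings": [{"Days": 180, "AccessTier": "ARCHIVE_ACCESS"}],
    },
)
```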
