Data Archival
There are often records that no longer need to be stored for most use cases, but that we want to keep around just in case (or, for things like Calendar, for cool historical purposes). We can store these records in S3 instead of DynamoDB, where storage is cheaper (albeit harder to query). Our archival system uses a serverless pipeline to move expired DynamoDB records to S3 for cost-effective, long-term storage.
Since we operate at a small scale, we considered leaving the data in hot storage until expiry (i.e. 5 years). However, at least for a system like Events, where full table scans are common, this would significantly reduce performance.
Specific Lambda function names and retention values can be found in `terraform/modules/archival/`, and the module parameters in the Terraform envs.
See this Python notebook for an example of how to read the data into Pandas.
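The notebook is the canonical reference, but as a rough sketch, reading one partition of the archive into Pandas can look like the following. It assumes the objects are GZIP-compressed, newline-delimited JSON (see the Firehose section below), and the partition prefix used here is made up:

```python
# Sketch: read archived records from S3 into a Pandas DataFrame.
# Assumes GZIP-compressed, newline-delimited JSON objects; the exact
# partition layout and record format may differ from this example.
import gzip
import json

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket = "ddb-archive"
prefix = "resource=events/year=2023/"  # hypothetical partition prefix

records = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        for line in gzip.decompress(body).decode("utf-8").splitlines():
            if line.strip():
                records.append(json.loads(line))

df = pd.DataFrame(records)
print(df.head())
```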
The process starts when a record's Time To Live (TTL) attribute expires in a monitored DynamoDB table. DynamoDB automatically deletes the item, which generates a `REMOVE` event in the table's DynamoDB Stream. This event contains the full data of the deleted record. For examples of how TTL is generated for DynamoDB records, see `src/common/constants.ts`.
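The TypeScript helpers in `src/common/constants.ts` are the source of truth; the Python sketch below only illustrates the general idea (an epoch-seconds attribute set to the write time plus the retention period). The attribute name, table name, and item shape are assumptions:

```python
# Illustrative only: how a TTL attribute is typically attached to a record.
# The real logic lives in src/common/constants.ts; the attribute name ("ttl"),
# table name, and item fields here are assumptions for the example.
import time

import boto3

RETENTION_SECONDS = 5 * 365 * 24 * 60 * 60  # roughly 5 years

table = boto3.resource("dynamodb").Table("events")  # hypothetical table
table.put_item(
    Item={
        "pk": "event#123",
        "title": "Example event",
        # DynamoDB TTL expects an epoch timestamp in seconds. Once it passes,
        # DynamoDB deletes the item and emits a REMOVE event on the stream.
        "ttl": int(time.time()) + RETENTION_SECONDS,
    }
)
```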
A Lambda function processes the above stream events.
- **Filtered Invocation:** The `aws_lambda_event_source_mapping` uses a `filter_criteria` to ensure the Lambda only runs for TTL-driven deletions (`principalId = "dynamodb.amazonaws.com"`), ignoring other types of deletes.
- **Data Enrichment & Forwarding:** The Lambda script deserializes the record data and adds two metadata fields: `__infra_archive_resource` (the table name) and `__infra_archive_timestamp`. In the Lambda code, you can specify parsing functions to derive the timestamp from the data itself instead of the archival time, making it easier to query. It then sends these enriched records in batches to a Kinesis Firehose delivery stream (see the sketch after this list).
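A minimal sketch of the Lambda's job, assuming a Python handler; the real script, its timestamp-parsing hooks, environment variable names, batching, and error handling live in the repo and may differ:

```python
# Sketch of the archival Lambda: deserialize REMOVE stream records,
# add the two archive metadata fields, and forward them to Firehose.
# Stream name, timestamp derivation, and batching details are assumptions.
import json
import os
from datetime import datetime, timezone

import boto3
from boto3.dynamodb.types import TypeDeserializer

firehose = boto3.client("firehose")
deserializer = TypeDeserializer()
STREAM_NAME = os.environ.get("FIREHOSE_STREAM_NAME", "ddb-archive")  # hypothetical


def handler(event, context):
    records = []
    for record in event.get("Records", []):
        # The event source mapping filter should already limit us to
        # TTL-driven REMOVE events (principalId = "dynamodb.amazonaws.com").
        old_image = record["dynamodb"].get("OldImage", {})
        item = {k: deserializer.deserialize(v) for k, v in old_image.items()}

        # Enrich with the metadata that Firehose partitions on.
        table_name = record["eventSourceARN"].split("/")[1]
        item["__infra_archive_resource"] = table_name
        # The real code can instead derive this from a field on the record itself.
        item["__infra_archive_timestamp"] = datetime.now(timezone.utc).isoformat()

        # Newline-delimited JSON so downstream readers can split records.
        records.append({"Data": (json.dumps(item, default=str) + "\n").encode()})

    if records:
        # Real code should respect the 500-record batch limit and retry failures.
        firehose.put_record_batch(DeliveryStreamName=STREAM_NAME, Records=records)
    return {"forwarded": len(records)}
```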
Kinesis Firehose manages the data delivery to the S3 bucket, `ddb-archive`.
- **Dynamic Partitioning:** Firehose uses its built-in `processing_configuration` with JQ to extract the metadata fields from the JSON payload.
- **Organized Storage:** It uses this extracted data for dynamic partitioning, organizing files into a logical path structure defined by the `prefix` parameter: `resource=!{partitionKeyFromQuery:resource}/year=!{partitionKeyFromQuery:year}/...`.
- **Efficiency:** Before writing to S3, Firehose automatically compresses the batched data with GZIP.
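For intuition, the partition keys derived from the metadata fields map each enriched record onto an S3 prefix roughly as shown below. The exact JQ queries and any further partition levels live in the Terraform module; this Python mock is only illustrative:

```python
# Sketch of how dynamic partitioning maps an enriched record to an S3 prefix.
# The actual JQ expressions live in the Terraform module; deriving "resource"
# and "year" from the two metadata fields is an assumption here.
from datetime import datetime


def partition_prefix(record: dict) -> str:
    resource = record["__infra_archive_resource"]
    year = datetime.fromisoformat(record["__infra_archive_timestamp"]).year
    # Mirrors the Firehose `prefix` parameter:
    # resource=!{partitionKeyFromQuery:resource}/year=!{partitionKeyFromQuery:year}/...
    return f"resource={resource}/year={year}/"


print(partition_prefix({
    "__infra_archive_resource": "events",
    "__infra_archive_timestamp": "2023-06-01T12:00:00+00:00",
}))
# -> resource=events/year=2023/
```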
The S3 bucket uses lifecycle rules to optimize storage costs over time.
- **Intelligent-Tiering:** The bucket lifecycle configuration transitions all objects to the `INTELLIGENT_TIERING` storage class after 1 day.
- **Archive Access:** Within Intelligent-Tiering, objects that go 180 days without being accessed are further moved to the low-cost `ARCHIVE_ACCESS` tier.
- **Permanent Deletion:** The module can optionally be provided a `TableDeletionDays` map to dynamically create lifecycle rules that permanently delete archived data from S3 on a per-table basis after a specified number of days.
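These rules are managed by the Terraform module; purely as an illustration of the resulting bucket configuration, an equivalent boto3 setup might look like the sketch below. The per-table prefix and the 1825-day expiration are made-up examples, and the Archive Access behaviour is shown as an Intelligent-Tiering configuration:

```python
# Illustrative equivalents of the bucket rules described above; the real
# resources are created by the Terraform module, not by this script.
import boto3

s3 = boto3.client("s3")

# Transition everything to INTELLIGENT_TIERING after 1 day, plus one
# hypothetical per-table expiration rule (prefix and days are examples).
s3.put_bucket_lifecycle_configuration(
    Bucket="ddb-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "transition-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 1, "StorageClass": "INTELLIGENT_TIERING"}],
            },
            {
                "ID": "expire-events-archive",
                "Status": "Enabled",
                "Filter": {"Prefix": "resource=events/"},
                "Expiration": {"Days": 1825},
            },
        ]
    },
)

# Move objects that go 180 days without access into the Archive Access tier.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="ddb-archive",
    Id="archive-access",
    IntelligentTieringConfiguration={
        "Id": "archive-access",
        "Status": "Enabled",
        "Tierings": [{"Days": 180, "AccessTier": "ARCHIVE_ACCESS"}],
    },
)
```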