172 changes: 172 additions & 0 deletions rfc/refactor-space-diff-table.md
@@ -0,0 +1,172 @@
# RFC: Space Diff Refactoring

## Authors

- [Natalie Bravo](https://github.com/bravonatalie), [Storacha Network](https://storacha.network/)

## Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119).

## Introduction

The current `space-diff` table has accumulated structural and operational issues that affect billing, usage calculation, and system reliability. This RFC proposes structural changes that make usage calculation efficient, prevent duplicate diffs, and simplify long-term maintenance.

## Problem Statement

### 1. Duplicate space diffs

Past bugs caused multiple diffs to be written for the same cause (e.g. failed uploads). The resulting duplicates inflate usage, slow down queries, and create “ghost” usage for spaces that should be empty after deletion.

This behavior should be **structurally impossible** going forward.

### 2. Usage calculation timeouts

A single space can generate a very large number of diff entries within the current month. When this happens, usage record calculation often times out because the system needs to aggregate too many records.

**Current mitigation (temporary):**

* A *space diff compaction* script that:
  * Aggregates many diffs into a single “summary” diff.
  * Archives the original diffs into a separate table.

This is an ad-hoc workaround and not a long-term solution.

## Current `space-diff` usage model

The `space-diff` table is the single **source of truth** for billing. It is written to by different sources depending on the protocol.

### Source A: Modern Blob Protocol

```
blob/accept OR blob/remove → blob-registry.register()
  1. allocation table entry (legacy compatibility)
  → TransactWrite {
       2. blob-registry table entry (primary storage)
       3. space-diff table entry (billing)
     }
```

- **Location**: `upload-api/stores/blob-registry.js`

### Source B: Legacy Store Protocol

Deprecated, but still operational for existing clients.

```
store/add OR store/remove receipt → UCAN stream → ucan-stream-handler → space-diff table

```

- **Location**: `billing/functions/ucan-stream.js`

### How usage is calculated today

This flow is used during billing runs for each space:

**Initial state**

* Load the space snapshot from `space-snapshot` for the `from` date
* If no snapshot exists, assume the space was empty (`size = 0`)

**Usage calculation**

* Base usage = `initialSize × periodDurationMs`
* Fetch all space diffs for the billing period
* Iterate diffs in chronological order:
  * `size += diff.delta`
  * `usage += size × timeSinceLastChange`
    * where `timeSinceLastChange = diff.receiptAt - lastReceiptAt`

**Storage**

* Store final space size in `space-snapshot` with `recordedAt = to`
* Store total usage in `usage` (byte-milliseconds)
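
The flow above can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual billing code: the function name and the shapes of `snapshot` and `diffs` are hypothetical. It weights each delta by the time remaining in the period, which yields the same total as integrating the running size between changes, while the base term counts the initial size exactly once.

```javascript
// Sketch only: `snapshot`, `diffs`, `from`, and `to` are hypothetical inputs.
// BigInt is used because byte-milliseconds can exceed Number.MAX_SAFE_INTEGER.
function calculatePeriodUsageSketch({ snapshot, diffs, from, to }) {
  // No snapshot means the space is assumed empty at the start of the period.
  let size = snapshot ? BigInt(snapshot.size) : 0n
  const periodEnd = BigInt(to.getTime())

  // Base usage: the initial size held for the whole period.
  let usage = size * (periodEnd - BigInt(from.getTime()))

  for (const diff of diffs) {
    const delta = BigInt(diff.delta)
    size += delta
    // A diff only affects usage from its receipt time to the end of the period.
    usage += delta * (periodEnd - BigInt(diff.receiptAt.getTime()))
  }

  // `size` would be stored in space-snapshot (recordedAt = to), `usage` in usage.
  return { size, usage }
}
```

For a space that starts at 10 bytes, grows by 5 bytes at t=400 ms and shrinks by 15 bytes at t=700 ms in a 1000 ms period, the sketch gives 10×400 + 15×300 + 0×300 = 8500 byte-milliseconds.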

## Proposal

### Fix for problem 1: Duplicate diffs

To guarantee uniqueness and prevent future duplication:

* Use **`cause` as the sort key (SK)** of the `space-diff` table
* This makes it impossible to insert two diffs for the same `(space, cause)` pair
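
A write against such a table could look like the sketch below. The attribute names `pk` and `sk`, the `provider#space` partition key layout, and the helper name are assumptions for illustration; the object has the shape of a DynamoDB `PutItem` input as used with the document client. Without the condition, a second write with the same key would silently overwrite the first item; with it, the second write is rejected with a `ConditionalCheckFailedException`, which the writer can treat as "already recorded".

```javascript
// Sketch only: key layout and attribute names are assumptions, not the final schema.
function buildSpaceDiffPut(diff) {
  return {
    TableName: 'space-diff',
    Item: {
      pk: `${diff.provider}#${diff.space}`,
      sk: diff.cause, // one item per (space, cause)
      delta: diff.delta,
      receiptAt: diff.receiptAt
    },
    // Reject the write if an item with this pk + sk already exists.
    ConditionExpression: 'attribute_not_exists(pk)'
  }
}
```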

#### Open design concern

Using `cause` as the SK removes natural chronological ordering.

**Proposed solution**

* Add a **GSI with a timestamp-based sort key**

Review comment:

🎉


This enables:

* Efficient chronological queries
* Time-based pagination
* Retention policies (e.g. deleting data older than 1 year)

The additional cost is acceptable, especially since older diffs can be safely deleted after the retention window.
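
A chronological, paginated read through the GSI might be built like this. It is a sketch under assumptions: the index name `receiptAt-index`, the attribute names, and ISO 8601 strings for `receiptAt` are all hypothetical. Note that `BETWEEN` is inclusive at both ends, so a caller that wants a half-open billing period needs to adjust the upper bound.

```javascript
// Sketch only: index and attribute names are assumptions for illustration.
function buildDiffRangeQuery({ provider, space, from, to, cursor }) {
  return {
    TableName: 'space-diff',
    IndexName: 'receiptAt-index',
    KeyConditionExpression: 'pk = :pk AND receiptAt BETWEEN :from AND :to',
    ExpressionAttributeValues: {
      ':pk': `${provider}#${space}`,
      // UTC ISO 8601 strings sort lexicographically in chronological order.
      ':from': from.toISOString(),
      ':to': to.toISOString()
    },
    ScanIndexForward: true, // ascending receiptAt
    // Pass the previous page's LastEvaluatedKey to continue; omit on the first page.
    ...(cursor ? { ExclusiveStartKey: cursor } : {})
  }
}
```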

#### Migration plan (high level)

1. Create a **new `space-diff` table** with:
   * Correct PK design
   * `cause` as SK
   * GSI for timestamp-based queries
2. Export data from the existing table

Review comment:

I think we might even be able to just skip importing the old data - we can spin this system up alongside the old one, run both in parallel until February, and then cut over to the new table for usage reporting and billing and leave the old diffs in the old table

3. Deduplicate and transform records
4. Import data into the new table
5. Update application code to use the new schema
6. Decommission the old table after validation is complete
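
The deduplication step (step 3) could be as simple as the sketch below. The record shape is hypothetical, and keeping the earliest diff per `(pk, cause)` is an assumption; the RFC does not say which duplicate should survive.

```javascript
// Sketch only: assumes exported records carry `pk`, `cause`, and an ISO 8601 `receiptAt`.
function dedupeDiffs(records) {
  const byKey = new Map()
  for (const rec of records) {
    const key = `${rec.pk}|${rec.cause}`
    const kept = byKey.get(key)
    // Keep the earliest record for each (pk, cause) pair.
    if (!kept || rec.receiptAt < kept.receiptAt) byKey.set(key, rec)
  }
  return [...byKey.values()]
}
```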

### Fix for problem 2: Usage calculation timeouts

Introduce a new table (e.g. `space-usage-month`) keyed by `provider#space#YYYY-MM` that is updated atomically on each diff write, making billing reads **O(1)**.
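
As an illustration of the key format, a diff could be routed to its monthly record like this. The helper name is hypothetical, and treating the billing month as a UTC calendar month is an assumption.

```javascript
// Sketch only: assumes billing months are UTC calendar months.
function usageMonthKey(provider, space, receiptAt) {
  const month = receiptAt.toISOString().slice(0, 7) // "YYYY-MM"
  return `${provider}#${space}#${month}`
}
```
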

Review comment (Member):

This is basically the space metrics table - could you use that instead?

Review comment (Contributor Author):

So yes, we could combine the space-metrics and snapshots with more frequent runs. However, we would still need to query the full list of customers and spaces every time to iterate through the diffs. This would reduce the time spent calculating usage and billing, but the proposed new table would consolidate currently dispersed data into a single place (snapshots, usage, and size) while also providing near real-time visibility into each space. This feels like a better approach for scalability to me, but I'd like to know if you have a different thought since you have a better view of the system. @travis might also have good input here.

@alanshaw I have a question about space-metrics: is it still actively used? I only see writes to it and no reads, does it still make sense to keep it around?

Review comment (Member):

Can you speak to "updated atomically on each diff write" - how is this applied? Is it lambda triggered on an insert event to the space diff table?

Review comment (Member):

How do we transition from one month to the next, retaining the current total, so that concurrent updates succeed and are all applied as expected?

Review comment (Member):

> but the proposed new table would consolidate currently dispersed data into a single place (snapshots, usage, and size)

Yes, but at the expense of another dynamo table (storage, writes) and lambda invocation per space diff insert. It's just more infra to run, maintain and (potentially) go wrong, whereas upping the cron frequency for calculating snapshots basically gets you everything you need...unless I'm missing something?

> while also providing near real-time visibility into each space

We already have this though...right? Latest snapshot + diffs since is the real time.

Review comment:

> Latest snapshot + diffs since is the real time.

yea I think the issue here is that not all spaces can calculate this in a reasonable amount of time - if there are 200k diffs to add to a space snapshot it generally just doesn't work


#### Core idea

Maintain a running usage accumulator instead of scanning historical diffs.

**Algorithm**

1. Track `lastSize` and `lastChangeAt` per `(provider, space, month)`
2. On each incoming diff:
   * `usage += lastSize × (receiptAt - lastChangeAt)`
   * `lastSize += delta`
   * `lastChangeAt = receiptAt`
3. At end-of-month billing:
   * `usage += lastSize × (periodEnd - lastChangeAt)`
   * Finalize and snapshot
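
The three steps can be sketched as pure functions over an in-memory state. This is only an illustration: the function names are hypothetical, persistence and concurrency control are left out, and diffs are assumed to arrive in ascending `receiptAt` order.

```javascript
// Sketch only: state is { usage, lastSize, lastChangeAt }, all BigInt (bytes, ms).
function applyDiff(state, diff) {
  const receiptAt = BigInt(diff.receiptAt.getTime())
  return {
    // Charge the previous size for the time it was held.
    usage: state.usage + state.lastSize * (receiptAt - state.lastChangeAt),
    lastSize: state.lastSize + BigInt(diff.delta),
    lastChangeAt: receiptAt
  }
}

// Close the month by charging the final size up to the period end.
function finalizeMonth(state, periodEnd) {
  const end = BigInt(periodEnd.getTime())
  return {
    ...state,
    usage: state.usage + state.lastSize * (end - state.lastChangeAt),
    lastChangeAt: end
  }
}
```

Starting from `{ usage: 0n, lastSize: 10n, lastChangeAt: 0n }`, applying +5 bytes at t=400 ms and -15 bytes at t=700 ms, then finalizing at t=1000 ms, yields 8500 byte-milliseconds with a final size of 0.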

**Additional fields**

* `sizeStart`
* `sizeEnd`
* `lastReceiptAt`
* `subscription`

**Behavior**

* `space-diff` remains for audit and idempotency
* Billing reads exclusively from `space-usage-month`
* `calculatePeriodUsage`:
  * First tries the aggregator
  * Falls back to a GSI scan if missing
* Aggregator becomes the canonical source for the billing month
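
The read path described above might look like the following sketch. `usageMonthStore.get` and `diffStore.listByTime` are hypothetical interfaces, not existing APIs, and the aggregator record is assumed to be already finalized for the period.

```javascript
// Sketch only: both stores are hypothetical; `listByTime` yields diffs via the time GSI.
async function calculatePeriodUsageWithFallback({ usageMonthStore, diffStore, key, from, to, initialSize }) {
  // Preferred path: a single read of the monthly aggregate.
  const record = await usageMonthStore.get(key)
  if (record) {
    return { usage: record.usage, size: record.sizeEnd, source: 'aggregator' }
  }

  // Fallback: recompute from the diffs in the period.
  let size = BigInt(initialSize)
  const periodEnd = BigInt(to.getTime())
  let usage = size * (periodEnd - BigInt(from.getTime()))
  for await (const diff of diffStore.listByTime(key, from, to)) {
    const delta = BigInt(diff.delta)
    size += delta
    usage += delta * (periodEnd - BigInt(diff.receiptAt.getTime()))
  }
  return { usage, size, source: 'diffs' }
}
```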

**Retention**

* Keep `space-diff` entries for N months using TTL
* Archive older diffs to S3 (TBD)
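
DynamoDB TTL reads a numeric attribute holding a Unix timestamp in seconds, so each diff would need such a value at write time. The sketch below computes one; the helper name and the retention length are assumptions. TTL deletion is asynchronous, so readers should not rely on items disappearing at the exact expiry time.

```javascript
// Sketch only: returns the expiry as epoch seconds, N calendar months after receiptAt.
function diffExpiry(receiptAt, retentionMonths) {
  const expires = new Date(receiptAt.getTime())
  // Days that do not exist in the target month roll over into the next one.
  expires.setUTCMonth(expires.getUTCMonth() + retentionMonths)
  return Math.floor(expires.getTime() / 1000)
}
```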

**Considerations**

Review comment:

my only major concern here is the potential race condition if two diffs for the same space read the current total at the same time - we could mitigate this by having a single queue reader processing the diffs - it should go pretty fast and uploads aren't THAT high frequency so having a single queue reader process diffs and update the accumulator should be plenty - this would solve race conditions pretty conclusively I think

Review comment (Member):

> my only major concern here is the potential race condition if two diffs for the same space read the current total at the same time

For incrementing? You'd use an update command with ADD which would consistently increment.

Review comment (Member):

I am unsure how this works for the transition from one month to the next though...


- The accumulator MUST process diffs for a space in **ascending** `receiptAt` order. If the write path can deliver out-of-order events and strict ordering cannot be guaranteed, this solution SHOULD be revisited. Pragmatic mitigations include:

Review comment:

is this strictly necessary? I think we could just choose the later of "current accumulator date" or "new diff receiptAt" - this will ensure we always have the "latest" date for a particular accumulator, even in the case where we process diffs out of order

Review comment (Contributor Author):

I reviewed this again later and don’t think it’s a problem, especially since receiptAt is generated server-side when we add the diff entry to the table.

  - Buffer within a small window and sort incoming diffs.
  - Recompute a localized suffix by reading recent diffs via the time GSI and re-applying from the last stable checkpoint.

- Alternative when strict ordering is infeasible:
  - Use time-bucketed diffs (hour/day): persist per-bucket, order-independent aggregates (e.g., Σdelta and Σ(delta × (bucketEnd − receiptAt))). At billing time, iterate buckets in chronological order to compute exact monthly usage, with no event sorting required.
  - Maintain a size-only monthly state (track `lastSize` and `lastChangeAt`) to speed up space usage reports. Note: this does NOT remove the need to iterate diffs for the billing run.
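
The time-bucketed alternative can be sketched as below. Names and the hourly bucket size are assumptions, and the sketch assumes the billing period starts and ends on bucket boundaries. Both aggregates are plain sums, so diffs can be folded in any order (in DynamoDB this maps naturally onto an `ADD` update), and the billing run only sorts buckets, never individual diffs.

```javascript
const HOUR_MS = 3600000n

// Fold one diff into its bucket. Additions commute, so arrival order is irrelevant.
function addToBucket(buckets, diff, bucketMs = HOUR_MS) {
  const t = BigInt(diff.receiptAt.getTime())
  const start = (t / bucketMs) * bucketMs
  const end = start + bucketMs
  const bucket = buckets.get(start) ?? { start, end, sumDelta: 0n, weighted: 0n }
  const delta = BigInt(diff.delta)
  bucket.sumDelta += delta // Σdelta
  bucket.weighted += delta * (end - t) // Σ(delta × (bucketEnd − receiptAt))
  buckets.set(start, bucket)
  return buckets
}

// Exact usage for [from, to): walk buckets in time order, carrying the running size.
function usageFromBuckets(buckets, initialSize, from, to) {
  let size = BigInt(initialSize)
  let cursor = BigInt(from.getTime())
  let usage = 0n
  const ordered = [...buckets.values()].sort((a, b) =>
    a.start < b.start ? -1 : a.start > b.start ? 1 : 0
  )
  for (const b of ordered) {
    // Size before the bucket is held until the bucket ends; in-bucket deltas add
    // their own contribution from receipt time to bucket end.
    usage += size * (b.end - cursor) + b.weighted
    size += b.sumDelta
    cursor = b.end
  }
  usage += size * (BigInt(to.getTime()) - cursor)
  return { size, usage }
}
```

With an initial size of 10 bytes over a 3-hour period, +5 bytes at 00:30 and -15 bytes at 02:15, both the bucketed computation and a direct integration give 112,500,000 byte-milliseconds, regardless of the order in which the two diffs are folded in.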