Commit 29db332
rfc: initial proposal for refactoring the space-diff table
1 parent ec8f6a1 commit 29db332
1 file changed

+172
-0
lines changed
rfc/refactor-space-diff-table.md

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
# RFC: Space Diff Refactoring

## Authors

- [Natalie Bravo](https://github.com/bravonatalie), [Storacha Network](https://storacha.network/)

## Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119).

## Introduction

The current `space-diff` table has accumulated structural and operational issues that affect billing, usage calculation, and system reliability. This RFC proposes structural changes to make usage calculation efficient, prevent duplicate diffs, and simplify long-term maintenance.

### Problem Statement

#### 1. Duplicate space diffs

Past bugs caused multiple diffs to be written for the same cause (e.g. failed uploads). These duplicated diffs inflate usage, slow down queries, and create “ghost” usage for spaces that should be empty after deletion.

This behavior should be **structurally impossible** going forward.

#### 2. Usage calculation timeouts

A single space can generate a very large number of diff entries within the current month. When this happens, usage record calculation often times out because the system has to aggregate too many records.

**Current mitigation (temporary):**

* A *space diff compaction* script that:
  * Aggregates many diffs into a single “summary” diff.
  * Archives the original diffs into a separate table.

This is an ad-hoc workaround, not a long-term solution.

## Current `space-diff` usage model

The `space-diff` table is the single **source of truth** for billing. Different sources write to it depending on the protocol.

### Source A: Modern Blob Protocol

```
blob/accept OR blob/remove → blob-registry.register()
  1. allocation table entry (legacy compatibility)
  → TransactWrite {
      2. blob-registry table entry (primary storage)
      3. space-diff table entry (billing)
    }
```

- **Location**: `upload-api/stores/blob-registry.js`

### Source B: Legacy Store Protocol

Deprecated, but still operational for existing clients.

```
store/add OR store/remove receipt → UCAN stream → ucan-stream-handler → space-diff table
```

- **Location**: `billing/functions/ucan-stream.js`

### How usage is calculated today

This flow runs for each space during a billing run:

**Initial state**

* Load the space snapshot from `space-snapshot` for the `from` date
* If no snapshot exists, assume the space was empty (`size = 0`)

**Usage calculation**

* Base usage = `initialSize × periodDurationMs`, where `periodDurationMs = to - from`
* Fetch all space diffs for the billing period
* Iterate diffs in chronological order:
  * `size += diff.delta`
  * `usage += diff.delta × (to - diff.receiptAt)`
  * Each diff adjusts the base usage for the remainder of the period, from `diff.receiptAt` to `to`

**Storage**

* Store final space size in `space-snapshot` with `recordedAt = to`
* Store total usage in `usage` (byte-milliseconds)

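The calculation above can be sketched roughly as follows. This is an illustration, not the code in the repository: it assumes each diff contributes `delta × (to − receiptAt)` on top of the base usage, which is the form consistent with the base-usage term. The input shape (`snapshotSize`, a `diffs` array of `{ delta, receiptAt }`) is hypothetical. `BigInt` is used because byte-milliseconds can exceed the safe integer range for large spaces.

```javascript
// Sketch of the per-space billing calculation (shapes are assumptions).
// snapshotSize: size from `space-snapshot` at `from` (0n if no snapshot exists)
// diffs: [{ delta: bigint, receiptAt: Date }] restricted to the billing period
function calculatePeriodUsage({ snapshotSize = 0n, diffs, from, to }) {
  let size = snapshotSize
  // Base usage: the initial size held for the whole period.
  let usage = size * BigInt(to.getTime() - from.getTime())
  for (const diff of diffs) {
    size += diff.delta
    // A diff only affects usage from its receipt time to the period end.
    usage += diff.delta * BigInt(to.getTime() - diff.receiptAt.getTime())
  }
  // `size` goes to `space-snapshot`, `usage` (byte-milliseconds) to `usage`.
  return { size, usage }
}
```

In this form the result does not depend on the order of the diffs, but every diff in the period still has to be read, which is where the timeouts come from.
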
## Proposal

### Fix for problem 1: Duplicate diffs

To guarantee uniqueness and prevent future duplication:

* Use **`cause` as the sort key (SK)** of the `space-diff` table
* This makes it impossible to insert two diffs for the same `(space, cause)` pair

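As a sketch of what the write could look like, the function below builds a DynamoDB `PutItem` input in the low-level attribute-value format. The attribute names (`pk`, `delta`, `receiptAt`) and the table name are assumptions, since the RFC does not fix the final schema. With `cause` as SK the table can never hold two items for the same key, but an unconditional put would silently overwrite the first diff; the `attribute_not_exists` condition keeps the first write and rejects the retry.

```javascript
// Builds a PutItem input that refuses to overwrite an existing diff.
// All attribute names except `cause` are hypothetical.
function buildPutDiffInput({ tableName, pk, cause, delta, receiptAt }) {
  return {
    TableName: tableName,
    Item: {
      pk: { S: pk },
      cause: { S: cause }, // sort key
      delta: { N: String(delta) },
      receiptAt: { S: receiptAt.toISOString() },
    },
    // The put fails with ConditionalCheckFailedException when an item with
    // the same (pk, cause) key already exists.
    ConditionExpression: 'attribute_not_exists(pk)',
  }
}
```

The caller would treat a failed condition as "already recorded" rather than as an error.
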
#### Open design concern

Using `cause` as the SK removes natural chronological ordering.

**Proposed solution**

* Add a **GSI with a timestamp-based sort key**

This enables:

* Efficient chronological queries
* Time-based pagination
* Retention policies (e.g. deleting data older than 1 year)

The additional storage and write cost of the GSI is acceptable, especially since older diffs can be safely deleted after the retention window.

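A chronological read over such an index could be expressed as the following `Query` input. The index name, the `pk` attribute, and the use of ISO 8601 strings for `receiptAt` are assumptions made for illustration. ISO 8601 UTC strings produced by `toISOString()` sort lexicographically in time order, which is what makes them usable as a sort key.

```javascript
// Builds a Query input for the hypothetical time-ordered GSI.
function buildPeriodQueryInput({ tableName, indexName, pk, from, to }) {
  return {
    TableName: tableName,
    IndexName: indexName,
    // BETWEEN is inclusive on both ends; a caller that needs a half-open
    // period [from, to) has to drop items with receiptAt equal to `to`.
    KeyConditionExpression: 'pk = :pk AND receiptAt BETWEEN :from AND :to',
    ExpressionAttributeValues: {
      ':pk': { S: pk },
      ':from': { S: from.toISOString() },
      ':to': { S: to.toISOString() },
    },
    ScanIndexForward: true, // ascending receiptAt
  }
}
```

Unlike the table's primary key, GSI keys need not be unique, so two diffs with the same timestamp can coexist in the index.
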
#### Migration plan (high level)

1. Create a **new `space-diff` table** with:
   * Correct PK design
   * `cause` as SK
   * GSI for timestamp-based queries
2. Export data from the existing table
3. Deduplicate and transform records
4. Import data into the new table
5. Update application code to use the new schema
6. Decommission the old table after validation is complete

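Step 3 could look something like the sketch below. The record shape (`space`, `cause`, `receiptAt` as an ISO string) is assumed, and so is the tie-breaking rule: the RFC does not say which duplicate to keep, so this version keeps the earliest `receiptAt` for each `(space, cause)` pair.

```javascript
// Keeps one record per (space, cause); the earliest receiptAt wins.
// The "earliest wins" rule is an assumption, not part of the RFC.
function dedupeDiffs(records) {
  const byKey = new Map()
  for (const record of records) {
    const key = `${record.space}#${record.cause}`
    const kept = byKey.get(key)
    if (!kept || record.receiptAt < kept.receiptAt) {
      byKey.set(key, record)
    }
  }
  return [...byKey.values()]
}
```
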
### Fix for problem 2: Usage calculation timeouts

Introduce a new table (e.g. `space-usage-month`) keyed by `provider#space#YYYY-MM` that is updated atomically on each diff write, so that a billing read becomes a single-item lookup (**O(1)**) instead of an aggregation over all diffs.

#### Core idea

Maintain a running usage accumulator instead of scanning historical diffs.

**Algorithm**

1. Track `lastSize` and `lastChangeAt` per `(provider, space, month)`
2. On each incoming diff:
   * `usage += lastSize × (receiptAt - lastChangeAt)`
   * `lastSize += delta`
   * `lastChangeAt = receiptAt`
3. At end-of-month billing:
   * `usage += lastSize × (periodEnd - lastChangeAt)`
   * Finalize and snapshot

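The algorithm can be sketched in memory as below. Timestamps are plain millisecond numbers and the state object is hypothetical; in the real system the update would have to be applied atomically in the same write as the diff, which this sketch does not attempt to show.

```javascript
// In-memory sketch of the running accumulator (state shape is an assumption).
// state: { usage: bigint, lastSize: bigint, lastChangeAt: number (ms) }
function applyDiff(state, diff) {
  if (diff.receiptAt < state.lastChangeAt) {
    // The accumulator is only correct for ascending receiptAt.
    throw new Error('out-of-order diff')
  }
  return {
    // Charge the previous size for the time it was held.
    usage: state.usage + state.lastSize * BigInt(diff.receiptAt - state.lastChangeAt),
    lastSize: state.lastSize + diff.delta,
    lastChangeAt: diff.receiptAt,
  }
}

// End-of-month step: charge the final size up to the period end.
function finalize(state, periodEnd) {
  return {
    usage: state.usage + state.lastSize * BigInt(periodEnd - state.lastChangeAt),
    sizeEnd: state.lastSize,
  }
}
```

For example, starting from `{ usage: 0n, lastSize: 100n, lastChangeAt: 0 }`, applying `+50` at `t = 400` and `-100` at `t = 700`, then finalizing at `t = 1000`, gives `100×400 + 150×300 + 50×300 = 100000` byte-milliseconds.
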
**Additional fields**

* `sizeStart`
* `sizeEnd`
* `lastReceiptAt`
* `subscription`

**Behavior**

* `space-diff` remains for audit and idempotency
* Billing reads from `space-usage-month`
* `calculatePeriodUsage`:
  * First tries the aggregator
  * Falls back to reading diffs via the time GSI if no aggregator record exists
* Aggregator becomes the canonical source for the billing month

**Retention**

* Keep `space-diff` entries for N months using TTL
* Archive older diffs to S3 (TBD)

**Considerations**

- The accumulator MUST process diffs for a space in **ascending** `receiptAt` order. If the write path can deliver out-of-order events and strict ordering cannot be guaranteed, this solution SHOULD be revisited. Pragmatic mitigations include:
  - Buffer incoming diffs within a small window and sort them before applying.
  - Recompute a localized suffix by reading recent diffs via the time GSI and re-applying them from the last stable checkpoint.

- Alternative when strict ordering is infeasible:
  - Use time-bucketed diffs (hour/day): persist per-bucket, order-independent aggregates (e.g., Σdelta and Σ(delta × (bucketEnd − receiptAt))). At billing time, iterate buckets in chronological order to compute exact monthly usage, with no event sorting required.
  - Maintain a size-only monthly state (track `lastSize` and `lastChangeAt`) to speed up space usage reports. Note: this does NOT remove the need to iterate diffs for the billing run.

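To make the time-bucketed alternative concrete, here is a minimal sketch. It assumes millisecond timestamps, a billing period aligned to bucket boundaries, and an in-memory `Map` standing in for the per-bucket records; all names are hypothetical. Within a bucket that starts at size `S`, usage is `S × bucketMs + Σ(delta × (bucketEnd − receiptAt))`, and the size carried into the next bucket is `S + Σdelta`. Both sums are additive, so diffs can arrive in any order.

```javascript
// Order-independent per-bucket aggregates (sketch; names are assumptions).
function addToBucket(buckets, diff, bucketMs) {
  const start = Math.floor(diff.receiptAt / bucketMs) * bucketMs
  const end = start + bucketMs
  const bucket = buckets.get(start) ?? { sumDelta: 0n, weighted: 0n }
  bucket.sumDelta += diff.delta
  // Each delta is charged from its receipt time to the end of its bucket.
  bucket.weighted += diff.delta * BigInt(end - diff.receiptAt)
  buckets.set(start, bucket)
}

// Walks the buckets in chronological order; `from` and `to` are assumed to
// fall on bucket boundaries.
function usageFromBuckets(buckets, { initialSize, from, to, bucketMs }) {
  let size = initialSize
  let usage = 0n
  for (let start = from; start < to; start += bucketMs) {
    const bucket = buckets.get(start) ?? { sumDelta: 0n, weighted: 0n }
    usage += size * BigInt(bucketMs) + bucket.weighted
    size += bucket.sumDelta
  }
  return { size, usage }
}
```

The billing run then touches one record per bucket (at most roughly 744 hourly buckets in a month) regardless of how many diffs the space produced.
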