Commit 24041b2

louisliu2048, louis.liu, and zjg555543 authored
feat: implement data pruning mechanism (okx#700)
Co-authored-by: louis.liu <louis.liu@okg.com> Co-authored-by: Barry <zjg555543@163.com>
1 parent ecb60fb commit 24041b2

33 files changed: +7940 -28 lines changed

.github/workflows/ci_zkevm.yml

Lines changed: 24 additions & 0 deletions
@@ -243,6 +243,30 @@ jobs:
         run: sudo -E make test-e2e
         working-directory: test

+  test-e2e-prune:
+    strategy:
+      fail-fast: false
+    runs-on: ubuntu-latest
+    steps:
+      - name: Install Foundry
+        uses: foundry-rs/foundry-toolchain@v1
+      - name: Make Foundry available to sudo
+        run: |
+          FOUNDRY_DIR=$(dirname $(which cast))
+          sudo ln -sf "$FOUNDRY_DIR/cast" /usr/local/bin/cast
+          sudo ln -sf "$FOUNDRY_DIR/forge" /usr/local/bin/forge
+          sudo ln -sf "$FOUNDRY_DIR/anvil" /usr/local/bin/anvil
+          sudo cast --version
+      - name: Checkout code
+        uses: actions/checkout@v3
+
+      - name: Build Docker
+        run: make build-docker
+
+      - name: Test
+        run: sudo -E make test-e2e-prune
+        working-directory: test
+
   # For X Layer, split db and ac
   test-data-loss:
     strategy:

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -132,3 +132,7 @@ bridge-config-artifact
 kurtosis-cli.tar.gz
 zk/debug_tools/rpc-cache/cache.db
 zk/debug_tools/rpc-cache/rpc-cache
+cmd/prune-mdbx-data/compact-db-tool
+cmd/prune-mdbx-data/prune-chaindata-tool
+cmd/prune-mdbx-data/list-tables-tool
+cmd/prune-mdbx-data/prune-tool

Dockerfile.local

Lines changed: 17 additions & 0 deletions
@@ -40,6 +40,19 @@ RUN --mount=type=cache,target=/root/.cache \
     --mount=type=cache,target=/go/pkg/mod \
     make db-tools

+# Add complete project source for building prune-tool
+ADD . .
+
+# Build prune-tool and all sub-command tools using Makefile
+RUN --mount=type=cache,target=/root/.cache \
+    --mount=type=cache,target=/tmp/go-build \
+    --mount=type=cache,target=/go/pkg/mod \
+    cd cmd/prune-mdbx-data && make build && \
+    cp prune-tool /app/build/bin/prune-tool && \
+    cp list-tables-tool /app/build/bin/list-tables-tool && \
+    cp prune-chaindata-tool /app/build/bin/prune-chaindata-tool && \
+    cp compact-db-tool /app/build/bin/compact-db-tool
+
 # Install dlv (Delve debugger)
 RUN --mount=type=cache,target=/root/.cache \
     --mount=type=cache,target=/tmp/go-build \
@@ -65,6 +78,10 @@ WORKDIR /home/erigon
 ## then give each binary its own layer
 COPY --from=tools-builder /app/build/bin/mdbx_copy /usr/local/bin/mdbx_copy
 COPY --from=tools-builder /app/build/bin/dlv /usr/local/bin/dlv
+COPY --from=tools-builder /app/build/bin/prune-tool /usr/local/bin/prune-tool
+COPY --from=tools-builder /app/build/bin/list-tables-tool /usr/local/bin/list-tables-tool
+COPY --from=tools-builder /app/build/bin/prune-chaindata-tool /usr/local/bin/prune-chaindata-tool
+COPY --from=tools-builder /app/build/bin/compact-db-tool /usr/local/bin/compact-db-tool
 COPY --from=builder /app/build/bin/cdk-erigon /usr/local/bin/cdk-erigon
 COPY --from=builder /app/build/bin/smt-db-split /usr/local/bin/smt-db-split
 COPY --from=builder /app/cmd/smt-db-split/prepare-db-split.sh /usr/local/bin/prepare-db-split.sh
Lines changed: 239 additions & 0 deletions
@@ -0,0 +1,239 @@
# Active Tables Analysis for X Layer zkEVM

This document analyzes all tables with non-zero size in the X Layer zkEVM database and explains how each is handled during the pruning process.

## Database Overview

Based on real mainnet data analysis:
- **Chaindata Database**: 70.0 GB table data, 75.7 GB actual size (5.7 GB difference)
- **SMT Database**: 57.0 GB table data, 105.5 GB actual size (48.5 GB difference)

## Chaindata Database - Active Tables Analysis

**Column Legend:**
- **Moderate**: ✅ = Fully deleted, 🔄 = Batch-based partial pruning, 🛡️ = Protected
- **Aggressive**: ✅ = Fully deleted, 🔄 = Batch-based partial pruning, ✅* = Historical data only, 🛡️ = Protected

**🔄 Important Note**: Batch-based pruning (🔄) means the table is NOT fully deleted, but rather **partially cleaned** by removing old batch data while preserving recent batches.

**💡 Safe Alternative**: For users seeking zero-risk cleanup, use `compact-db -in-place`, which provides 30-35% space savings without any data deletion.

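The batch-based pruning (🔄) described in the note above can be pictured as choosing a cutoff batch and removing data at or below it. The sketch below is illustrative only: `prune_cutoff` is a hypothetical helper, not part of `prune-tool`, and it assumes batch numbers increase by one and that the most recent N batches (including the latest) are kept.

```bash
# Hypothetical helper (not part of prune-tool): given the latest batch number
# and how many recent batches to keep, print the highest batch number that
# would be eligible for pruning. Prints 0 when nothing is old enough.
prune_cutoff() {
  local latest_batch=$1
  local keep_recent=${2:-10}   # assumed default, mirroring --keep-recent-batches=10
  local cutoff=$((latest_batch - keep_recent))
  if [ "$cutoff" -lt 0 ]; then
    cutoff=0
  fi
  echo "$cutoff"
}

prune_cutoff 1000 10   # batches 991-1000 are kept, so this prints 990
```

Whether the real tool treats the boundary batch as kept or pruned is not stated here, so the off-by-one choice above is an assumption.
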
### 🔥 Large Tables (>1GB) - Primary Targets

| Table Name | Size | Description | Moderate | Aggressive |
|------------|------|-------------|----------|------------|
| **Header** | 17.1 GB | Block headers with full metadata | 🛡️ | 🔄 |
| **StorageChangeSet** | 10.0 GB | Historical storage state changes | 🛡️ | ✅* |
| **StorageHistory** | 6.6 GB | Storage change history index | 🛡️ | 🛡️ |
| **BlockTransaction** | 5.2 GB | Complete transaction RLP data | ✅ | ✅ |
| **TransactionLog** | 4.8 GB | Transaction execution logs | 🔄 | 🔄 |
| **PlainState** | 4.4 GB | Current account/storage state | 🛡️ | 🛡️ |
| **HashedStorage** | 4.0 GB | Hashed storage keys/values | 🛡️ | 🛡️ |
| **AccountChangeSet** | 2.5 GB | Historical account state changes | 🛡️ | ✅* |
| **HeaderNumber** | 2.1 GB | Block number to header hash mapping | 🛡️ | 🔄 |
| **hermez_intermediate_tx_stateRoots** | 1.7 GB | zkEVM intermediate state roots | 🔄 | 🔄 |
| **BlockBody** | 1.7 GB | Block body data (uncle hashes, etc) | 🛡️ | 🔄 |
| **TxSender** | 1.7 GB | Transaction sender addresses | 🔄 | 🔄 |
| **block_info_roots** | 1.5 GB | zkEVM block info roots | 🔄 | 🔄 |
| **CanonicalHeader** | 1.5 GB | Canonical chain headers | 🛡️ | 🔄 |

### 🟡 Medium Tables (100MB-1GB) - Secondary Targets

| Table Name | Size | Description | Moderate | Aggressive |
|------------|------|-------------|----------|------------|
| **BlockTransactionLookup** | 896.0 MB | Transaction hash to block mapping | ✅ | ✅ |
| **hermez_txPricePercentage** | 854.5 MB | Transaction price percentages | ✅ | ✅ |
| **hermez_blockBatches** | 779.6 MB | Block to batch mappings | 🛡️ | 🛡️ |
| **Receipt** | 711.0 MB | Transaction receipts | 🔄 | 🔄 |
| **LogTopicIndex** | 600.6 MB | Event log topic index | ✅ | ✅ |
| **batch_blocks** | 303.6 MB | Batch to block relationships | 🛡️ | 🛡️ |
| **hermez_stateRoots** | 252.8 MB | zkEVM state roots | 🛡️ | 🛡️ |
| **AccountHistory** | 155.8 MB | Account change history | ✅ | ✅ |
| **CallFromIndex** | 142.8 MB | Contract call source index | ✅ | ✅ |
| **CallToIndex** | 130.4 MB | Contract call destination index | ✅ | ✅ |

### 🟢 Small Tables (1MB-100MB) - Utility Data

| Table Name | Size | Description | Moderate | Aggressive |
|------------|------|-------------|----------|------------|
| **Code** | 75.2 MB | Smart contract bytecode | 🛡️ | 🛡️ |
| **CallTraceSet** | 60.6 MB | Contract call traces | ✅ | ✅ |
| **HashedAccount** | 48.2 MB | Hashed account addresses | 🛡️ | 🛡️ |
| **LogAddressIndex** | 32.6 MB | Event log address index | ✅ | ✅ |
| **l1_info_tree_updates_by_ger** | 18.5 MB | L1 info tree updates by GER | 🛡️ | 🛡️ |
| **HashedCodeHash** | 11.3 MB | Hashed contract code hashes | 🛡️ | 🛡️ |
| **l1_info_tree_updates** | 11.0 MB | L1 info tree updates | 🛡️ | 🛡️ |
| **PlainCodeHash** | 9.3 MB | Plain contract code hashes | 🛡️ | 🛡️ |
| **hermez_forkIds** | 6.4 MB | zkEVM fork identifiers | 🛡️ | 🛡️ |
| **l1_info_roots** | 4.8 MB | L1 information roots | 🛡️ | 🛡️ |
| **l1_info_leaves** | 3.2 MB | L1 information leaves | 🛡️ | 🛡️ |
| **batch_ends** | 3.2 MB | Batch ending markers | 🛡️ | 🛡️ |
| **hermez_globalExitRootsSaved** | 3.0 MB | Saved global exit roots | 🛡️ | 🛡️ |
| **InnerTx** | 2.9 MB | Inner transaction data | 🛡️ | 🛡️ |
| **HermezSmtLastRoot** | 2.9 MB | Last SMT root hashes | 🛡️ | 🛡️ |
| **hermez_l1Sequences** | 2.6 MB | L1 sequence data | 🛡️ | 🛡️ |
| **block_l1_block_hashes** | 2.3 MB | L1 block hash references | 🛡️ | 🛡️ |
| **hermez_globalExitRoots** | 2.3 MB | Global exit roots | 🛡️ | 🛡️ |
| **latest_used_ger** | 2.3 MB | Latest used global exit roots | 🛡️ | 🛡️ |
| **hermez_l1Verifications** | 2.0 MB | L1 verification data | 🛡️ | 🛡️ |

### 🔧 System Tables (<1MB) - Configuration & Metadata

| Table Name | Size | Description | Moderate | Aggressive |
|------------|------|-------------|----------|------------|
| **block_l1_info_tree_index** | 1.2 MB | L1 info tree index | 🛡️ | 🛡️ |
| **Config** | 8.0 KB | Node configuration | 🛡️ | 🛡️ |
| **DbInfo** | 8.0 KB | Database metadata | 🛡️ | 🛡️ |
| **SyncStage** | 8.0 KB | Synchronization stages | 🛡️ | 🛡️ |
| **plain_state_version** | 8.0 KB | State version tracking | 🛡️ | 🛡️ |
| **smt_depths** | 8.0 KB | SMT tree depth info | 🛡️ | 🛡️ |
| **HeadersTotalDifficulty** | 8.0 KB | Chain total difficulty | 🛡️ | 🛡️ |
| **IncarnationMap** | 8.0 KB | Account incarnation mapping | 🛡️ | 🛡️ |
| **LastBlock** | 8.0 KB | Last processed block info | 🛡️ | 🛡️ |
| **LastHeader** | 8.0 KB | Last header info | 🛡️ | 🛡️ |
| **MaxTxNum** | 8.0 KB | Maximum transaction number | 🛡️ | 🛡️ |
| **Sequence** | 8.0 KB | Database sequence numbers | 🛡️ | 🛡️ |
| **Migration** | 8.0 KB | Database migration info | 🛡️ | 🛡️ |
| **Issuance** | 8.0 KB | Token issuance tracking | 🛡️ | 🛡️ |

## SMT Database - Active Tables Analysis

### 🔥 SMT Core Tables (Critical for zkEVM)

| Table Name | Size | Description | Pruning Strategy |
|------------|------|-------------|------------------|
| **HermezSmt** | 44.6 GB | Main SMT tree nodes | 🛡️ **Never Touched** - Core zkEVM proof data |
| **HermezSmtMetadata** | 6.8 GB | SMT node metadata | 🛡️ **Never Touched** - SMT structure info |
| **HermezSmtHashKey** | 5.2 GB | SMT hash to key mapping | 🛡️ **Never Touched** - SMT indexing |
| **HermezSmtAccountValues** | 446.4 MB | Account values in SMT | 🛡️ **Never Touched** - Current state in SMT |
| **HermezSmtStats** | 8.0 KB | SMT statistics | 🛡️ **Never Touched** - SMT performance data |

## Pruning Mode Comparison

### Moderate Mode (~18-20GB savings)
**Strategy**: Conservative cleanup with maximum stability protection
- ✅ Removes: History tables, index tables (9 tables, ~8.5GB)
- 🔄 Batch-based pruning: Receipt, TxSender, TransactionLog, etc. (5 tables, ~10.4GB)
- 🛡️ Protects: Header, CanonicalHeader, BlockBody, hermez_blockBatches (avoid dependency issues)
- ✅ Protects: All SMT data, current state, essential indexes
- **Safe for**: Production sequencer nodes, maximum stability required

### Aggressive Mode (~53-55GB savings)
**Strategy**: Enhanced cleanup including header ecosystem (with stability protection)
- ✅ Removes: All Moderate targets PLUS header ecosystem cleanup
- 🔄 Header ecosystem: Header, CanonicalHeader, HeaderNumber, BlockBody (enhanced batch-based pruning)
- ✅ Historical cleanup: AccountChangeSet + StorageChangeSet history (~12.5GB)
- 🛡️ Preserves: hermez_blockBatches for node stability
- ⚠️ **Trade-off**: Some historical RPC queries may fail
- **Best for**: Space-constrained environments, non-archival nodes

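As a quick sanity check, the per-category figures quoted in this document can be summed to confirm that they fall inside the stated savings ranges. This is plain arithmetic on the document's own numbers (the ~22.8 GB header-ecosystem figure comes from the summary statistics); `sum_gb` is a hypothetical helper name, not a `prune-tool` command.

```bash
# Hypothetical helper: add up sizes given in GB and print the total
# with one decimal place.
sum_gb() {
  printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.1f\n", total }'
}

# Moderate: direct deletions + batch-based pruning
sum_gb 8.5 10.4             # prints 18.9, inside the ~18-20 GB range

# Aggressive: moderate targets + header ecosystem + historical ChangeSets
sum_gb 8.5 10.4 22.8 12.5   # prints 54.2, inside the ~53-55 GB range
```
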
## Critical Protection Rules

### Always Protected Tables
1. **Current State**: `PlainState`, `HashedAccount`, `HashedStorage`
2. **zkEVM Core**: All `hermez_*` configuration and bridge tables
3. **SMT Data**: All `HermezSmt*` tables
4. **Node Operation**: `Config`, `DbInfo`, `SyncStage`, `LastBlock`
5. **Small Tables**: 5 tables with minimal data (user-specified protection)
   - `block_l1_info_tree_index`, `plain_state_version`, `smt_depths`
   - `HeadersTotalDifficulty`, `MaxTxNum`
   - **Rationale**: Data size < 10MB each, cleanup benefit negligible, safer to preserve

### Header Table Consistency Strategy
**Important Fix**: Header-related tables use **mode-specific protection strategies**

**Moderate Mode Strategy** (Conservative Approach):
- 🛡️ `Header` - fully protected (avoid dependency issues)
- 🛡️ `CanonicalHeader` - fully protected (avoid dependency issues)
- 🛡️ `HeaderNumber` - fully protected (maintain compatibility)
- 🛡️ `BlockBody` - fully protected (avoid dependency issues)

**Aggressive Mode Strategy** (Header Ecosystem Cleanup):
- 🔄 `Header` - enhanced batch-based pruning
- 🔄 `CanonicalHeader` - enhanced batch-based pruning
- 🔄 `HeaderNumber` - batch-based pruning (only in Aggressive mode)
- 🔄 `BlockBody` - enhanced batch-based pruning

**Rationale**: Moderate mode protects header tables to avoid MDBX_EKEYMISMATCH errors with AccountChangeSet/StorageChangeSet dependencies.

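The mode-specific handling of the header tables above amounts to a small lookup. The following is a minimal sketch of that decision, assuming only the two modes and four tables named in this section; `header_table_action` and the `protect` / `batch-prune` labels are hypothetical and are not taken from the `prune-tool` source.

```bash
# Hypothetical lookup mirroring the header-table strategy described above.
# Usage: header_table_action <moderate|aggressive> <table-name>
header_table_action() {
  local mode=$1 table=$2
  case "$table" in
    Header|CanonicalHeader|HeaderNumber|BlockBody) ;;
    *) echo "not a header-ecosystem table: $table" >&2; return 1 ;;
  esac
  case "$mode" in
    moderate)   echo "protect" ;;      # fully protected
    aggressive) echo "batch-prune" ;;  # batch-based pruning, recent batches kept
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

header_table_action moderate Header           # prints protect
header_table_action aggressive HeaderNumber   # prints batch-prune
```
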
## Pruning Mode Summary Statistics

### Moderate Mode (Default Recommended)
- **Tables Deleted**:
  - ✅ **Direct Deletion** (9 tables, ~8.5 GB): BlockTransaction, BlockTransactionLookup, hermez_txPricePercentage, LogTopicIndex, AccountHistory, CallFromIndex, CallToIndex, CallTraceSet, LogAddressIndex
  - 🔄 **Batch-Based Pruning** (5 tables, ~10.4 GB): Receipt, TxSender, TransactionLog, hermez_intermediate_tx_stateRoots, block_info_roots
  - 🛡️ **Large Tables Protected** (4 tables, ~22.8 GB): Header, CanonicalHeader, hermez_blockBatches, BlockBody
  - 🛡️ **Small Tables Protected** (5 tables, <10MB): block_l1_info_tree_index, plain_state_version, smt_depths, HeadersTotalDifficulty, MaxTxNum
- **Space Saved**: ~18-20 GB
- **Strategy**: Conservative cleanup, maximum stability for production sequencer nodes

### Aggressive Mode
- **Tables Deleted**: All Moderate mode deletions PLUS:
  - 🔄 **Header Ecosystem Cleanup** (4 additional tables, ~22.8 GB): Header, CanonicalHeader, HeaderNumber, BlockBody (enhanced batch-based pruning)
  - ✅* **Historical State Cleanup** (2 tables, ~12.5 GB): AccountChangeSet, StorageChangeSet (historical data beyond recent batches)
  - 🛡️ **Preserved for Stability** (1 table, ~779.6 MB): hermez_blockBatches (critical for node operation)
- **Space Saved**: ~53-55 GB
- **Strategy**: Enhanced cleanup including header ecosystem, some historical queries may fail

### Conditionally Pruned Tables
1. **Large Block Data**: Pruned by batch, keeping recent data (moderate+)
2. **Historical Changes**: Removed in aggressive mode only
3. **Indexes**: Removed in all pruning modes
4. **Logs**: Pruned by batch in all modes

## Compaction Potential Analysis

### Chaindata Database
- **Current Difference**: 5.7 GB (7.5%)
- **Compaction Potential**: ~4-5 GB recovery
- **Recommended**: After pruning for maximum efficiency

### SMT Database
- **Current Difference**: 48.5 GB (46.0%!)
- **Analysis**: Extremely high fragmentation, likely from:
  - SMT tree node updates and reorganization
  - Batch processing creating many temporary entries
  - Historical SMT operations leaving large freelist
- **Compaction Potential**: ~30-45 GB recovery
- **Highly Recommended**: SMT database is prime candidate for compaction

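The "Current Difference" figures above follow directly from the sizes in the Database Overview: the gap is the actual on-disk size minus the table data, and the percentage is that gap relative to the actual size. A minimal sketch of the arithmetic, where `fragmentation` is a hypothetical helper name rather than a `prune-tool` command:

```bash
# Hypothetical helper: gap between on-disk size and live table data.
# Usage: fragmentation <table_data_gb> <actual_size_gb>
fragmentation() {
  awk -v data="$1" -v actual="$2" 'BEGIN {
    gap = actual - data
    printf "%.1f GB (%.1f%%)\n", gap, gap / actual * 100
  }'
}

fragmentation 70.0 75.7    # chaindata: prints 5.7 GB (7.5%)
fragmentation 57.0 105.5   # SMT: prints 48.5 GB (46.0%)
```

Not all of this gap is necessarily reclaimable, which is why the recovery estimates above are given as ranges below the full difference.
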
## Optimal Operation Sequence

```bash
# 1. Analyze current state
./prune-tool list-tables /path/to/datadir

# 2. Prune unnecessary data first
./prune-tool prune-chaindata /path/to/datadir moderate --keep-recent-batches=10

# 3. Compact both databases for maximum space recovery (in-place mode)
./prune-tool compact-db -source /path/to/datadir/chaindata -in-place
./prune-tool compact-db -source /path/to/datadir/smt -in-place

# Expected total savings with aggressive mode: ~108-110 GB (53-55GB from pruning + 55GB from compaction)
# With moderate mode, as shown in step 2, pruning contributes ~18-20GB instead
```

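Before running the sequence above on a real node, it can help to print the exact commands for a given data directory and review them. The sketch below is a dry-run wrapper only: `plan_prune` is a hypothetical function, it executes nothing, and it assumes that `aggressive` is accepted in the same argument position as `moderate` (the sequence above only shows `moderate`).

```bash
# Hypothetical dry-run helper: prints the commands from the sequence above
# for a given data directory without executing them.
# Usage: plan_prune <datadir> [moderate|aggressive] [keep-recent-batches]
plan_prune() {
  local datadir=$1
  local mode=${2:-moderate}
  local keep=${3:-10}
  case "$mode" in
    moderate|aggressive) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  echo "./prune-tool list-tables $datadir"
  echo "./prune-tool prune-chaindata $datadir $mode --keep-recent-batches=$keep"
  echo "./prune-tool compact-db -source $datadir/chaindata -in-place"
  echo "./prune-tool compact-db -source $datadir/smt -in-place"
}

plan_prune /path/to/datadir moderate 10
```
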
## Key Insights

1. **SMT Database** has massive compaction potential (46% fragmentation)
2. **Moderate mode** is very conservative - only ~18-20GB savings to ensure maximum stability
3. **Header tables** (17.1GB) are protected in Moderate mode but prunable in Aggressive mode
4. **Historical ChangeSets** (12.5GB total) can be safely removed in aggressive mode
5. **Combined approach** (aggressive prune + compact) can save 108-110GB from original 182GB (~59-60%)
6. **zkEVM tables** require special protection but many are small

## Quick Reference Table

| Mode | Direct Deletions | Batch-Based Pruning (🔄) | Historical Cleanup | Total Space Saved |
|------|------------------|--------------------------|-------------------|-------------------|
| **Moderate (Recommended)** | 9 tables (~8.5 GB) | 5 tables (~10.4 GB) | None | ~18-20 GB |
| **Aggressive** | 9 tables (~8.5 GB) | 9 tables (~33.2 GB) | 2 tables* (~12.5 GB) | ~53-55 GB |

**Notes:**
- *Historical cleanup = only removes historical data beyond recent batches
- Batch-based pruning preserves recent N batches (default: 10)
- All modes preserve SMT data and critical zkEVM operational tables
- **⚠️ Important**: Table deletion does NOT immediately reduce file size - requires `compact-db` to reclaim space
- **Real-world impact** (relative to the ~182GB total): Moderate alone = ~10-11% savings, Aggressive alone = ~29-30% savings, Aggressive + Compaction = ~59-60% savings

This analysis enables targeted, safe database optimization while preserving zkEVM functionality.

cmd/prune-mdbx-data/Makefile

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
.PHONY: build clean build-tools build-main list-tables-tool prune-chaindata-tool compact-db-tool

build: build-tools build-main

build-main:
	go build -o prune-tool main.go

build-tools: list-tables-tool prune-chaindata-tool compact-db-tool

list-tables-tool:
	cd cmd/list-tables && go build -o ../../list-tables-tool .

prune-chaindata-tool:
	cd cmd/prune-chaindata && go build -o ../../prune-chaindata-tool .

compact-db-tool:
	cd cmd/compact-db && go build -o ../../compact-db-tool .

clean:
	rm -f prune-tool list-tables-tool prune-chaindata-tool compact-db-tool
