Conversation

bhartnett
Contributor

This change appears to improve performance by a few percent.

I ran block imports over the first 3 million blocks and here are the results:

baseline.csv vs delete-all.csv
                       bps_x     bps_y      tps_x      tps_y  time_x time_y    bpsd    tpsd   timed
block_number                                                                                       
(499713, 777522]    8,865.48  9,614.48  14,682.15  15,934.45     43s    38s  19.58%  19.58%  -6.28%
(777522, 1055332]   5,903.32  6,185.96  16,865.01  17,907.18     54s    48s  16.38%  16.38%  -1.28%
(1055332, 1333142]  5,053.96  5,622.88  25,792.41  28,775.77     57s    51s  13.32%  13.32%  -9.20%
(1333142, 1610952]  4,413.58  4,895.69  28,765.84  32,123.93   1m16s   1m1s  29.46%  29.46%  -7.68%
(1610952, 1888761]  3,657.44  4,053.78  25,510.86  28,090.79   1m56s  1m43s  22.56%  22.56%  -7.77%
(1888761, 2166571]  4,393.58  4,397.60  32,989.25  33,007.81    1m8s  1m14s   1.30%   1.30%   9.25%
(2166571, 2444381]  2,009.52  2,368.80  15,735.87  18,544.95  13m40s  13m6s  14.76%  14.76%  -8.79%
(2444381, 2722191]  3,229.39  3,008.85  22,540.26  20,985.28   8m19s  8m28s   0.99%   0.99%   8.00%
(2722191, 3000001]  4,370.49  4,319.19  30,905.05  30,375.02   1m11s  1m11s   9.10%   9.10%   8.76%

blocks: 2492096, baseline: 30m7s, contender: 29m3s
Time (total): -1m3s, -3.54%

bpsd = blocks per sec diff (+), tpsd = txs per sec diff, timed = time to process diff (-)
+ = more is better, - = less is better

@bhartnett bhartnett requested a review from arnetheduck August 7, 2025 03:00
@arnetheduck
Member

arnetheduck commented Aug 11, 2025

I ran block imports over the first 3 million blocks

In general, the first 3 million blocks tend to be not representative for lru sizing since the database is still small in general and everything just fits - ie at block height 3m, the whole database is just 1.3gb total. Also, in the sample we can see another effect: the change is efficient in some block ranges but not all - in particular, it may be inefficient past the gas repricing following the shanghai dos attack, as can be seen in the last timing bracket .. for this kind of tests, it's usually better to run them at some more recent block range.

running on a recent block range comes with its own difficulties - in particular, block time is no longer a good measure because it's too easily influenced by OS caching of the database - instead, it's better to look at "number of database lookups" - ie the hit / miss rate of the cache, more or less, and see which one is better - the test has to be long enough to get past the "warm-up" period.
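
To make that concrete, here's a minimal sketch of the kind of comparison meant here (purely illustrative counts, not taken from any run): the misses are, more or less, the lookups that fall through to the database, so the run with the higher hit rate / fewer misses wins regardless of wall time.

def hit_rate(hits: int, misses: int) -> float:
    total = hits + misses
    return hits / total if total else 0.0

# Purely illustrative counts, collected after the warm-up period has passed.
runs = {
    "baseline":  {"hits": 900_000_000, "misses": 100_000_000},
    "contender": {"hits": 920_000_000, "misses":  80_000_000},
}

for name, counts in runs.items():
    print(f"{name}: hit rate {hit_rate(**counts):.2%}, "
          f"database lookups ~{counts['misses']:,}")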

@bhartnett
Contributor Author

> In general, the first 3 million blocks tend to be not representative for lru sizing since the database is still small in general and everything just fits - ie at block height 3m, the whole database is just 1.3gb total. Also, in the sample we can see another effect: the change is efficient in some block ranges but not all - in particular, it may be inefficient past the gas repricing following the shanghai dos attack, as can be seen in the last timing bracket .. for this kind of tests, it's usually better to run them at some more recent block range.
>
> running on a recent block range comes with its own difficulties - in particular, block time is no longer a good measure because it's too easily influenced by OS caching of the database - instead, it's better to look at "number of database lookups" - ie the hit / miss rate of the cache, more or less, and see which one is better - the test has to be long enough to get past the "warm-up" period.

In that case, I'll run a larger test over a more recent block range and use metrics to count the cache hit / miss rates.

@arnetheduck
Member

> use metrics

see also --debug-rdb-print-stats

@bhartnett
Contributor Author

I ran another longer test over approx 4 million blocks starting from just after the merge and here are the results:

python scripts/block-import-stats.py ~/Downloads/master.csv ~/Downloads/delete-from-caches.csv
master.csv vs delete-from-caches.csv
                      bps_x  bps_y     tps_x     tps_y    time_x    time_y    bpsd    tpsd    timed
block_number                                                                                       
(15537394, 15981838]  31.25  29.09  4,861.74  4,537.47  3h57m32s  4h16m21s  -6.62%  -6.62%    8.29%
(15981838, 16426282]  28.16  26.49  4,009.14  3,781.07  4h23m14s  4h48m15s  -5.44%  -5.44%   10.34%
(16426282, 16870727]  27.95  26.03  4,109.17  3,825.78  4h24m25s  4h46m30s  -6.85%  -6.85%    8.39%
(16870727, 17315171]  26.20  29.22  3,900.46  4,347.84  4h52m33s  4h19m42s  12.54%  12.54%  -10.41%
(17315171, 17759616]  23.29  26.15  3,409.88  3,836.29  5h23m55s  4h44m34s  14.57%  14.57%  -10.54%
(17759616, 18204060]  22.15  21.81  3,180.56  3,133.36  5h37m12s  5h43m37s  -0.98%  -0.98%    2.53%
(18204060, 18648505]  23.24  21.66  3,386.10  3,157.34  5h19m18s  5h46m37s  -6.74%  -6.74%    8.67%
(18648505, 19092949]  20.52  20.59  3,308.84  3,323.32  6h14m35s  6h12m50s   0.59%   0.59%   -0.26%
(19092949, 19537394]  22.51  23.52  3,744.44  3,901.93  5h43m16s  5h18m34s   6.13%   6.08%   -5.70%

blocks: 3991808, baseline: 45h56m5s, contender: 45h57m5s
Time (total): 59s, 0.04%

bpsd = blocks per sec diff (+), tpsd = txs per sec diff, timed = time to process diff (-)
+ = more is better, - = less is better

In summary, no improvement in run time overall.

I also collected the cache hit/miss stats for the delete-from-caches run, but lost them for the master run because my computer restarted before I could save them:

NTC 2025-08-23 08:48:15.877+08:00 Import complete                            blockNumber=19537394 slot=8738654 blocks=4000000 txs=602958139 mgas=60575476
vtxLru(4971026)
   state    vtype       miss        hit      total hitrate
 Account    Empty    2992425          0    2992425   0.00%
 Account     Leaf  453503411  391126517  844629928  46.31%
 Account   Branch          0          0          0   -nan%
 Account ExtBranch    9718066   14976638   24694704  60.65%
   World    Empty   36094657          0   36094657   0.00%
   World     Leaf  148597607 1472415091 1621012698  90.83%
   World   Branch          0          0          0   -nan%
   World ExtBranch    3450311   10594154   14044465  75.43%
     all      all  654356477 1889112400 2543468877  74.27%
keyLru(0) 
   state       miss        hit      total hitrate
 Account          0          0          0  -nan%
   World          0          0          0  -nan%
     all          0          0          0  -nan%
branchLru(67108864) 
   state       miss        hit      total hitrate
 Account  170067193 2599398052 2769465245 93.86%
   World   71736101  857820940  929557041 92.28%
     all  241803294 3457218992 3699022286 93.46%
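
For reference, the hitrate column is hit / (miss + hit); for example the vtxLru "all" row above is 1,889,112,400 / 2,543,468,877 ≈ 74.27%, and the misses are roughly the lookups that fall through to the database.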

@bhartnett bhartnett marked this pull request as draft August 25, 2025 11:59
@arnetheduck
Member

This result makes me think that we should have a "refresh" operation in minilru that updates an existing value without updating its position in the lru - this is a reasonable middle ground actually where reads still determine the eviction order but we don't lose the item due to a premature and potentially unnecessary delete.
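
To illustrate the idea, here is a minimal Python sketch of such a cache (illustrative only, not the nim-minilru API): get/put promote entries as usual, while refresh overwrites the value in place so the entry keeps its current position in the eviction order.

from collections import OrderedDict

class LruSketch:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items: OrderedDict = OrderedDict()

    def get(self, key):
        # A hit promotes the entry to most-recently-used.
        if key not in self.items:
            return None
        self.items.move_to_end(key)
        return self.items[key]

    def put(self, key, value):
        # Insert or overwrite, promote, and evict the oldest entry if full.
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)

    def refresh(self, key, value) -> bool:
        # Update an existing value *without* changing its LRU position:
        # reads still decide the eviction order, but the entry is neither
        # dropped nor artificially promoted just because it was rewritten.
        if key not in self.items:
            return False
        self.items[key] = value  # assignment keeps the key's position
        return True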

@arnetheduck
Member

https://github.com/status-im/nim-minilru/pull/5/files

@bhartnett
Contributor Author

> https://github.com/status-im/nim-minilru/pull/5/files

Ok, I'll try that. Will run another test using this refresh operation instead.

@bhartnett
Contributor Author

I've completed another test run using the LRUCache refresh operation and here are the results:

python scripts/block-import-stats.py ~/Downloads/master.csv ~/Downloads/refresh.csv 
master.csv vs refresh.csv
                      bps_x  bps_y     tps_x     tps_y    time_x    time_y    bpsd    tpsd   timed
block_number                                                                                      
(15537394, 15981838]  31.25  30.06  4,861.74  4,679.47  3h57m32s   4h7m21s  -3.60%  -3.60%   4.36%
(15981838, 16426282]  28.16  27.27  4,009.14  3,876.72  4h23m14s   4h33m5s  -3.16%  -3.16%   3.76%
(16426282, 16870727]  27.95  27.26  4,109.17  4,009.70  4h24m25s  4h34m10s  -2.39%  -2.39%   3.77%
(16870727, 17315171]  26.20  29.02  3,900.46  4,319.19  4h52m33s  4h21m24s  11.72%  11.72%  -9.84%
(17315171, 17759616]  23.29  25.51  3,409.88  3,742.77  5h23m55s  4h52m30s  11.75%  11.75%  -8.06%
(17759616, 18204060]  22.15  22.10  3,180.56  3,173.78  5h37m12s  5h38m23s   0.38%   0.38%   1.02%
(18204060, 18648505]  23.24  21.47  3,386.10  3,121.70  5h19m18s  5h46m58s  -7.49%  -7.49%   8.84%
(18648505, 19092949]  20.52  20.90  3,308.84  3,371.25  6h14m35s   6h7m30s   2.18%   2.18%  -1.58%
(19092949, 19537394]  22.51  23.94  3,744.44  3,974.13  5h43m16s   5h13m1s   7.89%   7.84%  -7.41%

blocks: 3991808, baseline: 45h56m5s, contender: 45h14m25s
Time (total): -41m40s, -1.51%

bpsd = blocks per sec diff (+), tpsd = txs per sec diff, timed = time to process diff (-)
+ = more is better, - = less is better

Cache hit stats for the run using refresh:

vtxLru(4971026)
   state    vtype       miss        hit      total hitrate
 Account    Empty    2992425          0    2992425   0.00%
 Account     Leaf  450651331  386659911  837311242  46.18%
 Account   Branch          0          0          0   -nan%
 Account ExtBranch    9696247   14855094   24551341  60.51%
   World    Empty   29894033          0   29894033   0.00%
   World     Leaf  146729708 1408961189 1555690897  90.57%
   World   Branch          0          0          0   -nan%
   World ExtBranch    3486146    8982541   12468687  72.04%
     all      all  643449890 1819458735 2462908625  73.87%
keyLru(0) 
   state       miss        hit      total hitrate
 Account          0          0          0  -nan%
   World          0          0          0  -nan%
     all          0          0          0  -nan%
branchLru(67108864) 
   state       miss        hit      total hitrate
 Account  169910194 2586950957 2756861151 93.84%
   World   71963139  828348511  900311650 92.01%
     all  241873333 3415299468 3657172801 93.39%

@arnetheduck
Member

> (18204060, 18648505] 23.24 21.47 3,386.10 3,121.70 5h19m18s 5h46m58s -7.49% -7.49% 8.84%

hmm .. what's going on here? this is oddly consistent between the versions also, would it make sense to rerun master to see that it's not a fluke? ie master vs master run should be near-identical if the benchmarking setup is good.

@bhartnett
Contributor Author

> (18204060, 18648505] 23.24 21.47 3,386.10 3,121.70 5h19m18s 5h46m58s -7.49% -7.49% 8.84%
>
> hmm .. what's going on here? this is oddly consistent between the versions also, would it make sense to rerun master to see that it's not a fluke? ie master vs master run should be near-identical if the benchmarking setup is good.

Not sure about that. Sure, I'll run master again.

@bhartnett
Contributor Author

After creating a new baseline from the latest master branch, here are the results:

python scripts/block-import-stats.py /mnt/5aa6b1af-8122-4eec-b52f-bdfb121e74de/master2.csv /mnt/5aa6b1af-8122-4eec-b52f-bdfb121e74de/refresh.csv
master2.csv vs refresh.csv
                      bps_x  bps_y     tps_x     tps_y    time_x    time_y    bpsd    tpsd   timed
block_number                                                                                      
(15537394, 15981838]  30.28  30.06  4,725.96  4,679.47   4h6m30s   4h7m21s   0.51%   0.51%   1.56%
(15981838, 16426282]  29.06  27.27  4,139.39  3,876.72  4h14m50s   4h33m5s  -6.17%  -6.17%   7.15%
(16426282, 16870727]  29.54  27.26  4,342.01  4,009.70   4h10m8s  4h34m10s  -7.69%  -7.69%   9.64%
(16870727, 17315171]  28.90  29.02  4,302.98  4,319.19  4h22m28s  4h21m24s   0.41%   0.41%  -0.40%
(17315171, 17759616]  26.08  25.51  3,821.36  3,742.77  4h45m20s  4h52m30s  -2.16%  -2.16%   2.50%
(17759616, 18204060]  22.30  22.10  3,203.27  3,173.78   5h34m8s  5h38m23s  -0.86%  -0.86%   1.35%
(18204060, 18648505]  22.11  21.47  3,223.00  3,121.70   5h37m7s  5h46m58s  -2.37%  -2.37%   3.47%
(18648505, 19092949]  20.07  20.90  3,235.50  3,371.25  6h21m52s   6h7m30s   4.46%   4.46%  -3.44%
(19092949, 19537394]  21.85  23.94  3,625.09  3,974.13  5h43m36s   5h13m1s   9.83%   9.83%  -8.59%

blocks: 3991808, baseline: 44h56m3s, contender: 45h14m25s
Time (total): 18m22s, 0.68%

bpsd = blocks per sec diff (+), tpsd = txs per sec diff, timed = time to process diff (-)
+ = more is better, - = less is better

Here are the cache hit stats from the latest baseline run:

vtxLru(4971026)
   state    vtype       miss        hit      total hitrate
 Account    Empty    2992425          0    2992425   0.00%
 Account     Leaf  449910636  388293144  838203780  46.32%
 Account   Branch          0          0          0   -nan%
 Account ExtBranch    9691208   14885717   24576925  60.57%
   World    Empty   29880538          0   29880538   0.00%
   World     Leaf  146804391 1408526686 1555331077  90.56%
   World   Branch          0          0          0   -nan%
   World ExtBranch    3488650    8970524   12459174  72.00%
     all      all  642767848 1820676071 2463443919  73.91%
keyLru(0) 
   state       miss        hit      total hitrate
 Account          0          0          0  -nan%
   World          0          0          0  -nan%
     all          0          0          0  -nan%
branchLru(67108864) 
   state       miss        hit      total hitrate
 Account  169861952 2594261252 2764123204 93.85%
   World   71961489  828173509  900134998 92.01%
     all  241823441 3422434761 3664258202 93.40%
