Commit a241050

Beacon sync update multi exe heads aware (#2861)
* Log/trace cancellation events in scheduler
* Provide `clear()` functions for explicitly flushing data objects
* Rename header cache functions
  why: More systematic, all functions start with the prefix `dbHeader`
* Remove `danglingParent` from layout
  why: Already provided by header cache
* Remove `couplerHash` and `headHash` from layout
  why: No need to cache; `headHash` is unused and `couplerHash` is typically used only once.
* Remove `lastLayout` from sync descriptor
  why: No need to compare changes; saving is always triggered after actively changing the sync layout state
* Early reject unsuitable head + finalised header from CL
  why: The finalised header is only passed by its hash, so the header must be fetched somewhere, e.g. from a peer via eth/xx. Also, finalised headers earlier than the `base` from `FC` cannot be handled due to the `Aristo` single state database architecture. Luckily, on a full node, the complete block history is available, so unsuitable finalised headers are already stored there, which is exploited here to avoid unnecessary network traffic.
* Code cosmetics, remove cruft, prettify logging, remove `final` metrics
  detail: The `final` layout parameter will be deprecated and later removed
* Update/re-calibrate syncer logic documentation
  why: The current implementation breaks down if the `FC` module changes the canonical branch in the middle of completing a header chain (due to concurrent updates by the `newPayload()` logic.)
* Implement according to the re-calibrated syncer documentation
  details: The implementation employs the notion of named layout states (see `SyncLayoutState` in `worker_desc.nim`) which are derived from the state parameter triple `(C,D,H)` as described in `README.md`.
1 parent c525590 commit a241050
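The early-reject rule for finalised headers can be illustrated by a small Nim sketch. It assumes a hypothetical in-memory table from header hash to block number standing in for the local block history of a full node, plus a `base` block number taken from `FC`; the names `LocalChain`, `FinalisedCheck` and `checkFinalised` are made up for illustration and are not the code of this commit.

```nim
import std/tables

type
  Hash32 = string          # hypothetical stand-in for a 32-byte hash
  BlockNumber = uint64

  LocalChain = object
    ## Hypothetical view of the local block history on a full node
    numberByHash: Table[Hash32, BlockNumber]
    base: BlockNumber      # `base` block number as reported by `FC`

  FinalisedCheck = enum
    fcReject               # stored locally but older than `base`: cannot be handled
    fcUsable               # stored locally at or above `base`
    fcFetch                # unknown locally: must be fetched, e.g. from a peer

proc checkFinalised(db: LocalChain; finHash: Hash32): FinalisedCheck =
  ## Classify a finalised header that the CL passed by its hash only
  if finHash notin db.numberByHash:
    return fcFetch         # no local copy, network traffic is unavoidable
  if db.numberByHash[finHash] < db.base:
    return fcReject        # early reject without asking any peer
  return fcUsable

when isMainModule:
  var db = LocalChain(base: 100)
  db.numberByHash["old"] = 50
  db.numberByHash["new"] = 150
  echo db.checkFinalised("old")      # fcReject
  echo db.checkFinalised("new")      # fcUsable
  echo db.checkFinalised("unknown")  # fcFetch
```

Checking the local table first is what saves the network round trip: a finalised header older than `base` is already stored on a full node, so it can be turned down without an eth/xx request.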

16 files changed, 490 additions & 343 deletions

nimbus/sync/beacon/README.md

Lines changed: 78 additions & 64 deletions
@@ -66,58 +66,67 @@ Implementation, The Gory Details
 The following diagram depicts a most general state view of the sync and the
 *FC* modules and at a given point of time

-0 C L (5)
-o------------o-------o
+0 L (5)
+o--------------------o
 | <--- imported ---> |
-Y D H
+C D H
 o---------------------o----------------o
 | <-- unprocessed --> | <-- linked --> |

 where

-* *C* -- coupler, cached **base** entity of the **FC** module, reported at
-the time when *H* was set. This determines the maximal way back length
-of the *linked* ancestor chain starting at *H*.
-
-* *Y* -- has the same block number as *C* and is often, but not necessarily
-equal to *C* (for notation *C~Y* see clause *(6)* below.)
+* *C* -- coupler, parent of the left endpoint of the chain of headers or blocks
+to be fetched and imported.

-* *L* -- **latest**, current value of this entity of the **FC** module (i.e.
-now, when looked up)
+* *L* -- **latest**, current value of this entity (with the same name) of the
+**FC** module (i.e. the current value when looked up.) *L* need not
+be a parent of any header of the linked chain `(C,H]` (see below for
+notation). Both *L* and *H* might be heads of different forked chains.

-* *D* -- dangling, least block number of the linked chain in progress ending
-at *H*. This variable is used to record the download state eventually
-reaching *Y* (for notation *D<<H* see clause *(6)* below.)
+* *D* -- dangling, header with the least block number of the linked chain in
+progress ending at *H*. This variable is used to record the download
+state eventually reaching *Y* (for notation *D<<H* see clause *(6)*
+below.)

-* *H* -- head, sync target which typically is the value of a *sync to new head*
-request (via RPC)
+* *H* -- head, sync target header which typically was the value of a *sync to
+new head* request (via RPC)

-The internal sync state (as opposed to the general state also including **FC**)
-is defined by the triple *(C,D,H)*. Other parameters *L* and *Y* mentioned in
-*(5)* are considered ephemeral to the sync state. They are always used by its
-latest value and are not cached by the syncer.
+The internal sync state (as opposed to the general state also including the
+state of **FC**) is defined by the triple *(C,D,H)*. Other parameters like *L*
+mentioned in *(5)* are considered ephemeral to the sync state. They are always
+seen by their latest values and not cached by the syncer.

 There are two order relations and some derivatives used to describe relations
-beween headers or blocks.
+between headers or blocks.

 For blocks or headers A and B, A is said less or equal B if the (6)
 block numbers are less or equal. Notation: A <= B.

-For blocks or headers A and B, A is said ancestor of, or equal to
-B if B is linked to A following up the lineage of parentHash fields
-of the block headers. Notation: A << B.
+The notation A ~ B stands for A <= B <= A which makes <= an order
+relation (relative to ~ rather than ==). If A ~ B does not hold
+then the notation A !~ B is used.
+
+The notation A < B stands for A <= B and A !~ B.
+
+The notation B-1 stands for any block or header with block number of
+B less one.
+

-The relate notation A ~ B stands for A <= B <= A which is posh for
-saying that A and B have the same block numer.
+For blocks or headers A and B, writing A <- B stands for the block
+A being the parent of B (there can only be one parent of B.)
+
+For blocks or headers A and B, A is said ancestor of, or equal to B
+if A == B or there is a non-empty parent lineage A <- X <- Y <-..<- B.
+Notation: A << B (note that << is a partial order.)

 The compact interval notation [A,B] stands for the set {X|A<<X<<B}
 and the half open interval notation (A,B] stands for [A,B]-{A} (i.e. the
 interval without the left end point.)
-
+

 Note that *A<<B* implies *A<=B*. Boundary conditions that hold for the
 clause *(5)* diagram are

-C ~ Y, C in [0,L], D in [Y,H] (7)
+there is a Z in [0,L] with C ~ Z, D is in [C,H] (7)


 ### Sync Processing
@@ -134,64 +143,70 @@ parameters *C* and *D* are irrelevant here.
 Following, there will be a request to advance *H* to a new position as
 indicated in the diagram below

-0 C (9)
+0 B (9)
 o------------o-------o
 | <--- imported ---> | D
-Y H
+C H
 o--------------------------------------o
 | <----------- unprocessed ----------> |

-with a new sync state *(C,D,H)*. The parameter *C* in clause *(9)* is set
-as the **base** entity of the **FC** module. *Y* is only known by its block
-number, *Y~C*. The parameter *D* is set to the download start position *H*.
+with a new sync state *(C,H,H)*. The parameter *B* is the **base** entity
+of the **FC** module. The parameter *C* is a placeholder with *C ~ B*. The
+parameter *D* is set to the download start position *H*.

-The syncer then fetches the header chain *(Y,H]* from the network. For the
-syncer state *(C,D,H)*, while iteratively fetching headers, only the parameter
-*D* will change each time a new header was fetched.
+The syncer then fetches the header chain *(C,H]* from the network. While
+iteratively fetching headers, the syncer state *(C,D,H)* will only change in
+its second position *D*, each time after a new header was fetched.

-Having finished dowlnoading *(Y,H]* one might end up with a situation
+Having finished downloading, *C~D-1* holds and the sync state is *(D-1,D,H)*.
+One will end up with a situation like

-0 B Z L (10)
-o-------------o--o---o
+0 Y L (10)
+o---------------o----o
 | <--- imported ---> |
-Y Z H
-o---o----------------------------------o
+C Z H
+o----o---------------------------------o
 | <-------------- linked ------------> |

-where *Z* is in the intersection of *[B,L]\*(Y,H]* with *B* the current
-**base** entity of the **FC** logic. It is only known that *0<<B<<L*
-although in many cases *B==C* holds.
+for some *Y* in *[0,L]* and *Z* in *(C,H]* where *Y<<Z* with *L* the **latest**
+entity of the **FC** logic.

-If there is no such *Z* then *(Y,H]* is discarded and sync processing restarts
-at clause *(8)* by resetting the sync state (e.g. to *(0,0,0)*.)
+If there are no such *Y* and *Z*, then *(C,H]* is discarded and sync processing
+restarts at clause *(8)* by resetting the sync state (e.g. to *(0,0,0)*.)

-Otherwise assume *Z* is the one with the largest block number of the
-intersection *[B,L]\*(Y,H]*. Then the headers *(Z,H]* will be completed to
-a lineage of blocks by downloading block bodies.
+Otherwise choose *Y* and *Z* with maximal block number of *Y* so that *Y<-Z*.
+Then complete *(Y,H]==[Z,H]* to a lineage of blocks by downloading missing
+block bodies.

-0 Z (11)
-o----------------o---o
+Having finished with block bodies, the sync state will be expressed as
+*(Y,Y,H)*. Choosing the first two entries equal indicates that
+the lineage *(Y,H]* is fully populated with blocks.
+
+0 Y (11)
+o---------------o----o
 | <--- imported ---> |
-Z H
-o----------------------------------o
-| <------------ blocks ----------> |
+Y H
+o-----------------------------------o
+| <------------ blocks -----------> |

-The blocks *(Z,H]* will then be imported. While this happens, the internal
-state of the **FC** might change/reset so that further import becomes
-impossible. Even when starting import, the block *Z* might not be in *[0,L]*
+The blocks *(Y,H]* will then be imported and executed. While this happens, the
+internal state of the **FC** might change/reset so that further import becomes
+impossible. Even when starting import, the block *Y* might not be in *[0,L]*
 anymore due to some internal reset of the **FC** logic. In any of those
 cases, sync processing restarts at clause *(8)* by resetting the sync state.

-Otherwise the block import will end up at
+In case all blocks can be imported, one will end up at

-0 Z H L (12)
-o----------------o----------------------------------o---o
+0 Y H L (12)
+o-----------------o---------------------------------o---o
 | <--- imported --------------------------------------> |

 with *H<<L* for *L* the current value of the **latest** entity of the **FC**
-module. In many cases, *H==L* but there are other actors running that might
-import blocks quickly after importing *H* so that *H* is seen as ancestor,
-different from *L* when this stage is formally done with.
+module.
+
+In many cases, *H==L* but there are other actors which also might import blocks
+quickly after finishing import of *H* before formally committing this task. So
+*H* can become an ancestor of *L*.

 Now clause *(12)* is equivalent to clause *(8)*.

@@ -295,7 +310,6 @@ be available if *nimbus* is compiled with the additional make flags
 | beacon_latest | block height | **L**, *increasing* |
 | beacon_coupler | block height | **C**, *increasing* |
 | beacon_dangling | block height | **D** |
-| beacon_final | block height | **F**, *increasing* |
 | beacon_head | block height | **H**, *increasing* |
 | beacon_target | block height | **T**, *increasing* |
 | | | |
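The order relations and the *(C,D,H)* bookkeeping described in the README above can be sketched in Nim as a toy model. It rests on stated assumptions: headers live in a hypothetical hash-keyed table (`HeaderDb`), hashes are plain strings, and `SyncState` only records the block numbers of *C*, *D* and *H*. None of the names (`HeaderDb`, `isAncestorOrEqual`, `halfOpen`, `headerFetched`, `store`) come from the *nimbus* sources.

```nim
import std/tables

type
  BlockNumber = uint64
  Hash32 = string                  # hypothetical stand-in for a 32-byte hash

  Header = object
    number: BlockNumber
    hash, parentHash: Hash32

  HeaderDb = Table[Hash32, Header] # hypothetical header store, keyed by hash

  SyncState = tuple[c, d, h: BlockNumber]  # block numbers of (C,D,H) only

proc `<=`(a, b: Header): bool =
  ## A <= B: compare block numbers only
  a.number <= b.number

proc `~`(a, b: Header): bool =
  ## A ~ B iff A <= B <= A, i.e. same block number (hashes may differ)
  a <= b and b <= a

proc isAncestorOrEqual(db: HeaderDb; a, b: Header): bool =
  ## A << B: A == B, or A <- X <- .. <- B following `parentHash` links
  var cur = b
  while true:
    if cur.hash == a.hash:
      return true
    if cur.number <= a.number or cur.parentHash notin db:
      return false                 # walked past A, or lineage not stored
    cur = db[cur.parentHash]

proc halfOpen(db: HeaderDb; a, b: Header): seq[Header] =
  ## (A,B] = [A,B] - {A}, listed from B downwards; empty unless A << B
  if not db.isAncestorOrEqual(a, b):
    return
  var cur = b
  while cur.hash != a.hash:
    result.add cur
    cur = db[cur.parentHash]

proc headerFetched(s: var SyncState) =
  ## While downloading (C,H] only D moves down; done once C ~ D-1
  if s.c + 1 < s.d:
    s.d = s.d - 1

proc store(db: var HeaderDb; n: BlockNumber; hash, parent: Hash32): Header =
  ## Helper: save a header and return it
  result = Header(number: n, hash: hash, parentHash: parent)
  db[hash] = result

when isMainModule:
  var db: HeaderDb
  let g  = db.store(0, "g", "")
  let a1 = db.store(1, "a1", "g")
  let a2 = db.store(2, "a2", "a1")
  let b2 = db.store(2, "b2", "a1")    # fork sibling of a2
  let a3 = db.store(3, "a3", "a2")
  echo a2 ~ b2                        # true: same number, different hash
  echo db.isAncestorOrEqual(g, a3)    # true
  echo db.isAncestorOrEqual(b2, a3)   # false: b2 is on another branch
  echo db.halfOpen(a1, a3).len        # 2, i.e. a3 and a2
  var s: SyncState = (c: 10'u64, d: 20'u64, h: 20'u64)  # (C,H,H) after a new head
  while s.c + 1 < s.d: s.headerFetched()
  echo s.d                            # 11, so the state is (D-1,D,H)
```

Since `~` only compares block numbers, fork siblings such as `a2` and `b2` are `~`-equal while neither is an ancestor of the other, mirroring the README remark that *L* and *H* might be heads of different forked chains.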

nimbus/sync/beacon/TODO.md

Lines changed: 14 additions & 6 deletions
@@ -1,9 +1,3 @@
-## Update sync state management to what is described in *README.md*
-
-1. For the moment, the events in *update.nim* need to be adjusted. This will fix an error where the CL forces the EL to fork internally by sending different head request headers with the same bock number.
-
-2. General scenario update. This is mostly error handling.
-
 ## General TODO items

 * Update/resolve code fragments which are tagged FIXME
@@ -25,3 +19,17 @@ which happened on several `holesky` tests immediately after logging something like
 or from another machine with literally the same exception text (but the stack-trace differs)

 NTC 2024-10-31 21:58:07.616 Finalized blocks persisted file=forked_chain.nim:231 numberOfBlocks=129 last=9cbcc52953a8 baseNumber=2646857 baseHash=9db5c2ac537b
+
+### 3. Mem overflow possible on low-memory systems
+
+Running the exe client, a 1.5G response message was observed (on my 8G test system this kills the program as it already has 80% mem load. It happens while syncing holesky at around block #184160 and is reproducible on the 8G system but not yet on an 80G system.)
+
+[..]
+DBG 2024-11-20 16:16:18.871+00:00 Processing JSON-RPC request file=router.nim:135 id=178 name=eth_getLogs
+DBG 2024-11-20 16:16:18.915+00:00 Returning JSON-RPC response file=router.nim:137 id=178 name=eth_getLogs len=201631
+TRC 2024-11-20 16:16:18.951+00:00 <<< find_node from topics="eth p2p discovery" file=discovery.nim:248 node=Node[94.16.123.192:30303]
+TRC 2024-11-20 16:16:18.951+00:00 Neighbours to topics="eth p2p discovery" file=discovery.nim:161 node=Node[94.16.123.192:30303] nodes=[..]
+TRC 2024-11-20 16:16:18.951+00:00 Neighbours to topics="eth p2p discovery" file=discovery.nim:161 node=Node[94.16.123.192:30303] nodes=[..]
+DBG 2024-11-20 16:16:19.027+00:00 Received JSON-RPC request topics="JSONRPC-HTTP-SERVER" file=httpserver.nim:52 address=127.0.0.1:49746 len=239
+DBG 2024-11-20 16:16:19.027+00:00 Processing JSON-RPC request file=router.nim:135 id=179 name=eth_getLogs
+DBG 2024-11-20 16:20:23.664+00:00 Returning JSON-RPC response file=router.nim:137 id=179 name=eth_getLogs len=1630240149

nimbus/sync/beacon/worker.nim

Lines changed: 2 additions & 4 deletions
@@ -54,7 +54,7 @@ proc napUnlessSomethingToFetch(

 proc setup*(ctx: BeaconCtxRef; info: static[string]): bool =
   ## Global set up
-  ctx.setupRpcMagic()
+  ctx.setupRpcMagic info

   # Load initial state from database if there is any
   ctx.setupDatabase info
@@ -109,12 +109,10 @@ proc runDaemon*(
   ## first usable request from the CL (via RPC) stumbles in.
   ##
   # Check for a possible header layout and body request changes
-  ctx.updateSyncStateLayout info
+  ctx.updateSyncState info
   if ctx.hibernate:
     return

-  ctx.updateBlockRequests info
-
   # Execute staged block records.
   if ctx.blocksStagedCanImportOk():

nimbus/sync/beacon/worker/blocks_staged.nim

Lines changed: 3 additions & 3 deletions
@@ -57,13 +57,13 @@ proc fetchAndCheck(
   blk.blocks.setLen(offset + ivReq.len)
   var blockHash = newSeq[Hash32](ivReq.len)
   for n in 1u ..< ivReq.len:
-    let header = ctx.dbPeekHeader(ivReq.minPt + n).valueOr:
+    let header = ctx.dbHeaderPeek(ivReq.minPt + n).valueOr:
       # There is nothing one can do here
       raiseAssert info & " stashed header missing: n=" & $n &
         " ivReq=" & $ivReq & " nth=" & (ivReq.minPt + n).bnStr
     blockHash[n - 1] = header.parentHash
     blk.blocks[offset + n].header = header
-  blk.blocks[offset].header = ctx.dbPeekHeader(ivReq.minPt).valueOr:
+  blk.blocks[offset].header = ctx.dbHeaderPeek(ivReq.minPt).valueOr:
     # There is nothing one can do here
     raiseAssert info & " stashed header missing: n=0" &
       " ivReq=" & $ivReq & " nth=" & ivReq.minPt.bnStr
@@ -325,7 +325,7 @@ proc blocksStagedImport*(

   # Remove stashed headers for imported blocks
   for bn in iv.minPt .. maxImport:
-    ctx.dbUnstashHeader bn
+    ctx.dbHeaderUnstash bn

   # Update, so it can be followed nicely
   ctx.updateMetrics()
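The loop in `fetchAndCheck` never hashes a header to learn a block hash: for a contiguous range, the hash of block `n-1` is simply the `parentHash` field of block `n`. A stripped-down Nim sketch of that idea follows. The `Header` type with string hashes and the `hashesFromChildren` helper are hypothetical, and how the hash of the topmost header is obtained lies outside the hunk shown, so the sketch leaves that entry empty.

```nim
type
  Hash32 = string              # hypothetical stand-in for a 32-byte hash
  Header = object
    number: uint64
    parentHash: Hash32

proc hashesFromChildren(headers: openArray[Header]): seq[Hash32] =
  ## For a contiguous, ascending header range, the hash of headers[n-1] is
  ## read off headers[n].parentHash, so no hashing is needed. The last entry
  ## stays empty: the top header has no child inside the range.
  result = newSeq[Hash32](headers.len)
  for n in 1 ..< headers.len:
    doAssert headers[n].number == headers[n-1].number + 1, "gap in header range"
    result[n - 1] = headers[n].parentHash

when isMainModule:
  let hs = @[Header(number: 7, parentHash: "h6"),
             Header(number: 8, parentHash: "h7"),
             Header(number: 9, parentHash: "h8")]
  echo hashesFromChildren(hs)    # @["h7", "h8", ""]
```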

nimbus/sync/beacon/worker/blocks_staged/staged_queue.nim

Lines changed: 4 additions & 0 deletions
@@ -31,6 +31,10 @@ func blocksStagedQueueIsEmpty*(ctx: BeaconCtxRef): bool =

 # ----------------

+func blocksStagedQueueClear*(ctx: BeaconCtxRef) =
+  ## Clear queue
+  ctx.blk.staged.clear
+
 func blocksStagedQueueInit*(ctx: BeaconCtxRef) =
   ## Constructor
   ctx.blk.staged = StagedBlocksQueue.init()

nimbus/sync/beacon/worker/blocks_unproc.nim

Lines changed: 2 additions & 7 deletions
@@ -83,11 +83,6 @@ proc blocksUnprocCovered*(ctx: BeaconCtxRef; pt: BlockNumber): bool =
   ctx.blk.unprocessed.covered(pt, pt) == 1


-proc blocksUnprocTop*(ctx: BeaconCtxRef): BlockNumber =
-  let iv = ctx.blk.unprocessed.le().valueOr:
-    return BlockNumber(0)
-  iv.maxPt
-
 proc blocksUnprocBottom*(ctx: BeaconCtxRef): BlockNumber =
   let iv = ctx.blk.unprocessed.ge().valueOr:
     return high(BlockNumber)
@@ -112,14 +107,14 @@ proc blocksUnprocInit*(ctx: BeaconCtxRef) =
   ## Constructor
   ctx.blk.unprocessed = BnRangeSet.init()

-proc blocksUnprocSet*(ctx: BeaconCtxRef) =
+proc blocksUnprocClear*(ctx: BeaconCtxRef) =
   ## Clear
   ctx.blk.unprocessed.clear()
   ctx.blk.borrowed = 0u

 proc blocksUnprocSet*(ctx: BeaconCtxRef; minPt, maxPt: BlockNumber) =
   ## Set up new unprocessed range
-  ctx.blocksUnprocSet()
+  ctx.blocksUnprocClear()
   # Argument `maxPt` would be internally adjusted to `max(minPt,maxPt)`
   if minPt <= maxPt:
     discard ctx.blk.unprocessed.merge(minPt, maxPt)
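The semantics of the renamed `blocksUnprocClear`, of `blocksUnprocSet(minPt, maxPt)` (clear first, then merge only a non-inverted range) and of the `high(BlockNumber)` sentinel returned by `blocksUnprocBottom` can be mimicked with a minimal Nim sketch. It uses a hypothetical `UnprocRanges` object holding a plain `seq` of closed intervals instead of the real `BnRangeSet`, so it does not merge or split overlapping intervals; the proc names `clear`, `setRange`, `bottom` and `covered` are illustrative only.

```nim
type
  BlockNumber = uint64
  UnprocRanges = object
    ## Hypothetical stand-in for a `BnRangeSet`: closed intervals, no merging
    ivs: seq[(BlockNumber, BlockNumber)]
    borrowed: uint64

proc clear(u: var UnprocRanges) =
  ## Counterpart of `blocksUnprocClear`: drop all ranges and reset `borrowed`
  u.ivs.setLen(0)
  u.borrowed = 0

proc setRange(u: var UnprocRanges; minPt, maxPt: BlockNumber) =
  ## Start afresh; an inverted range (maxPt < minPt) leaves the set empty
  u.clear()
  if minPt <= maxPt:
    u.ivs.add((minPt, maxPt))

proc bottom(u: UnprocRanges): BlockNumber =
  ## Least unprocessed block number, `high(BlockNumber)` if nothing is left
  result = high(BlockNumber)
  for iv in u.ivs:
    result = min(result, iv[0])

proc covered(u: UnprocRanges; pt: BlockNumber): bool =
  ## True if `pt` lies inside one of the stored closed intervals
  for iv in u.ivs:
    if iv[0] <= pt and pt <= iv[1]:
      return true

when isMainModule:
  var u: UnprocRanges
  u.setRange(10, 20)
  echo u.bottom, " ", u.covered(15), " ", u.covered(21)  # 10 true false
  u.setRange(30, 25)                   # inverted range: nothing unprocessed
  echo u.bottom == high(BlockNumber)   # true
```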
