Make PruneExecutionDB not depend on consensusDB #4217
base: master
Conversation
Codecov Report

❌ Patch coverage report; additional details and impacted files:

@@            Coverage Diff             @@
##           master    #4217      +/-   ##
==========================================
- Coverage   32.90%   32.87%    -0.03%
==========================================
  Files         471      476        +5
  Lines       56576    56781      +205
==========================================
+ Hits        18614    18666       +52
- Misses      34707    34849      +142
- Partials     3255     3266       +11
❌ 7 Tests Failed:
execution/gethexec/sync_monitor.go (outdated)

if executionDB != nil && finalizedBlockHeader != nil {
	finalizedBlockHash := finalizedBlockHeader.Hash()
	err := executionDB.Put(ValidatedBlockHashKey, finalizedBlockHash.Bytes())
that's a bit confusing: shouldn't we use something like FinalizedBlockHashKey instead of ValidatedBlockHashKey?
Good catch, we should be using validatedBlockHeader. I probably got confused with the names. So to get validatedBlockHeader we need something like getFinalityBlockHeader but for validatedBlockHeader. My tentative approach is:
func (s *SyncMonitor) getValidatedBlockHeader(
	validatedFinalityData *arbutil.FinalityData,
) (*types.Header, error) {
	if validatedFinalityData == nil {
		return nil, nil
	}
	validatedMsgIdx := validatedFinalityData.MsgIdx
	validatedBlockHash := validatedFinalityData.BlockHash
	validatedBlockNumber := s.exec.MessageIndexToBlockNumber(validatedMsgIdx)
	validatedBlock := s.exec.bc.GetBlockByNumber(validatedBlockNumber)
	if validatedBlock == nil {
		log.Debug("Validated block not found", "blockNumber", validatedBlockNumber)
		return nil, nil
	}
	if validatedBlock.Hash() != validatedBlockHash {
		errorMsg := fmt.Sprintf(
			"finality block hash mismatch, blockNumber=%v, block hash provided by consensus=%v, block hash from execution=%v",
			validatedBlockNumber,
			validatedBlockHash,
			validatedBlock.Hash(),
		)
		return nil, errors.New(errorMsg)
	}
	return validatedBlock.Header(), nil
}

then we can get it with:

validatedBlockHeader, err := s.getValidatedBlockHeader(
	validatedFinalityData,
)

and use that instead of finalizedBlockHeader. I'm just curious if there's a better way of fetching validatedBlockHeader?
sounds reasonable, but I think we should summon @diegoximenes here for his expertise
agreed
For context: comparing the block hash provided by Consensus through the SetFinality call with the block hash stored in the Execution DB is a procedure Tsahi suggested, to make the node a bit more resistant to some reorg scenarios.
It is OK to have a function like the one you mentioned, so that all cases (safe/finalized/validated) go through this block hash matching procedure and we reuse code.
Something like the following; variables should be renamed, and the func can have another name.
func checkBlockHashAndGetHeader(....) ... {
	finalityBlockNumber := s.exec.MessageIndexToBlockNumber(finalityMsgIdx)
	finalityBlock := s.exec.bc.GetBlockByNumber(finalityBlockNumber)
	if finalityBlock == nil {
		log.Debug("Finality block not found", "blockNumber", finalityBlockNumber)
		return nil, nil
	}
	if finalityBlock.Hash() != finalityBlockHash {
		errorMsg := fmt.Sprintf(
			"finality block hash mismatch, blockNumber=%v, block hash provided by consensus=%v, block hash from execution=%v",
			finalityBlockNumber,
			finalityBlockHash,
			finalityBlock.Hash(),
		)
		return nil, errors.New(errorMsg)
	}
	return finalityBlock.Header(), nil
}
This func can then be called by the existing getFinalityBlockHeader, and when storing the validated block hash in the DB.
WDYT?
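A minimal sketch of how the two call sites could share that helper, assuming checkBlockHashAndGetHeader ends up as a SyncMonitor method taking the message index and the consensus-provided block hash, and reusing the getFinalityBlockHeader / ValidatedBlockHashKey names from the snippets above (the real signatures in the PR may differ):

// Sketch only: checkBlockHashAndGetHeader is assumed to have the body shown above.
func (s *SyncMonitor) getFinalityBlockHeader(finalityData *arbutil.FinalityData) (*types.Header, error) {
	if finalityData == nil {
		return nil, nil
	}
	return s.checkBlockHashAndGetHeader(finalityData.MsgIdx, finalityData.BlockHash)
}

// Hypothetical second call site: resolve the validated header the same way before
// persisting its hash to the execution DB.
func (s *SyncMonitor) storeValidatedBlockHash(executionDB ethdb.KeyValueWriter, validatedFinalityData *arbutil.FinalityData) error {
	if executionDB == nil || validatedFinalityData == nil {
		return nil
	}
	validatedHeader, err := s.checkBlockHashAndGetHeader(validatedFinalityData.MsgIdx, validatedFinalityData.BlockHash)
	if err != nil || validatedHeader == nil {
		return err
	}
	return executionDB.Put(ValidatedBlockHashKey, validatedHeader.Hash().Bytes())
}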
yep agreed, combining them into one function to reuse some code
if chainConfig == nil {
	return nil, errors.New("database doesn't have a chain config (was this node initialized?)")
}
consensusDB, err := stack.OpenDatabaseWithOptions("arbitrumdata", node.DatabaseOptions{MetricsNamespace: "arbitrumdata/", ReadOnly: true, PebbleExtraOptions: persistentConfig.Pebble.ExtraOptions("arbitrumdata"), NoFreezer: true})
nitpick: stack is an argument of findImportantRoots, and it is not used anymore, so it can be removed
removing that along with cacheConfig and persistentConfig since none are used
execution/gethexec/sync_monitor.go (outdated)

if executionDB != nil && finalizedBlockHeader != nil {
	finalizedBlockHash := finalizedBlockHeader.Hash()
	err := executionDB.Put(ValidatedBlockHashKey, finalizedBlockHash.Bytes())
Some tests here should be extended to check that validated block hash is properly stored in the DB.
Adding checks for both validated and finalized blocks
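For illustration, a hedged sketch of what such checks could look like in the test; ValidatedBlockHashKey and the expected* variables are assumptions, and the finalized hash is read here via go-ethereum's rawdb accessor, which may not match the exact mechanism used in the PR:

// Sketch: assert that both finality hashes were persisted to the execution DB.
storedValidated, err := executionDB.Get(ValidatedBlockHashKey)
Require(t, err)
if common.BytesToHash(storedValidated) != expectedValidatedBlockHash {
	t.Fatal("validated block hash not stored correctly in execution DB")
}
if rawdb.ReadFinalizedBlockHash(executionDB) != expectedFinalizedBlockHash {
	t.Fatal("finalized block hash not stored correctly in execution DB")
}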
system_tests/pruning_test.go (outdated)

// PathScheme prunes the state trie by itself, so only HashScheme should be tested
builder.RequireScheme(t, rawdb.HashScheme)

// Needed to create safe blocks; hence forcing SetFinalityData call
nitpick: safe blocks are not used by pruning today, only validated and finalized.
system_tests/pruning_test.go (outdated)

defer executionDB.Close()
executionDBEntriesBeforePruning := countStateEntries(executionDB)

// Since we're dealing with a new executionDB we store both validatedBlockHash and
I don't get this.
builder.L2.cleanup() should guarantee that the validated and finalized block info is persisted to disk.
That info should already be available when opening the DB again here.
Have you noticed that this is not the case?
Since finality data is pushed to Execution asynchronously, you can check that the validated/finalized info is non-empty before calling builder.L2.cleanup().
And then check that the validated/finalized info, retrieved through this DB here, is equal to, or even more up to date than, the info you retrieved before calling builder.L2.cleanup().
Validated/finalized info can be more up to date since, between its retrieval and calling builder.L2.cleanup(), SyncMonitor.SetFinalityData could have been called.
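A rough sketch of that before/after check, assuming a hypothetical readFinalityBlockNumbers helper that reads the persisted validated/finalized block numbers back from an execution DB handle:

// Before cleanup: finality data is pushed asynchronously, so first make sure it is non-empty.
validatedBefore, finalizedBefore := readFinalityBlockNumbers(t, builder.L2.ExecNode.ExecutionDB) // hypothetical helper
if validatedBefore == 0 || finalizedBefore == 0 {
	t.Fatal("finality info not yet pushed to execution")
}

builder.L2.cleanup()

// After reopening the DB: the persisted info must be equal to, or more recent than,
// what was observed before cleanup, since SetFinalityData may have run again before shutdown.
validatedAfter, finalizedAfter := readFinalityBlockNumbers(t, reopenedExecutionDB) // hypothetical helper
if validatedAfter < validatedBefore || finalizedAfter < finalizedBefore {
	t.Fatal("persisted finality info is older than what was observed before cleanup")
}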
I did observe something finicky but most likely because I was checking the wrong executionDB or something. I'll add the check to make sure data in the recently opened executionDB matches the one from builder.
cmd/pruning/pruning.go (outdated)

@@ -179,56 +165,21 @@ func findImportantRoots(ctx context.Context, executionDB ethdb.Database, stack *
	return nil, fmt.Errorf("unknown pruning mode: \"%v\"", initConfig.Prune)
}
if initConfig.Prune != "minimal" && l1Client != nil {
l1Client check not needed anymore here.
Something important is not being tested today.
This is more a missing test that could have been introduced before, but this PR changes some code that is not really being tested.
When using a pruning mode that is not "minimal" and not "validator", the trie nodes related to the finalized block and genesis should be kept on disk, and the rest of the trie nodes should be pruned.
We can possibly test that by (see the sketch after this list):
1. Transferring some eth to an account; this will happen in block B1.
2. Transferring some eth to the same account; this will happen in block B2.
3. Guaranteeing that block BF, where BF >= B2, is the latest finalized. There can be other blocks after BF.
4. Guaranteeing that we are able to retrieve the account's balance at B1 and BF.
5. Shutting down the node.
6. Pruning.
7. Restarting the node.
8. Guaranteeing that we are able to retrieve the account's balance at BF, but that it fails when retrieving it at B1.
The same reasoning can apply to the "validator" pruning mode, whose related code this PR also modified.
Does this test make sense?
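A compressed sketch of steps 4-8, to make the expected outcome concrete; accountAddr, b1 and bf are placeholders here (the account address plus B1's and BF's numbers as *big.Int), and the shutdown/prune/restart steps are elided:

// 4. Balances must be readable at B1 and BF before shutdown.
_, err = builder.L2.Client.BalanceAt(ctx, accountAddr, b1)
Require(t, err)
_, err = builder.L2.Client.BalanceAt(ctx, accountAddr, bf)
Require(t, err)

// 5-7. Shut down the node, prune, restart (elided).

// 8. After pruning, only state reachable from BF's root (and genesis) should survive.
_, err = builder.L2.Client.BalanceAt(ctx, accountAddr, bf)
Require(t, err) // still available
_, err = builder.L2.Client.BalanceAt(ctx, accountAddr, b1)
if err == nil {
	t.Fatal("expected state at B1 to have been pruned")
}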
I was trying to add a test for that but I'm struggling with step 8. What kind of information should be missing for blocks older than BF? For instance, after restarting the node I'm not able to call builder.L2.Client.BalanceAt for any block, including the finalized block; on the other hand, I'm still able to retrieve block hashes through rawdb.ReadCanonicalHash(builder.L2.ExecNode.ExecutionDB, blockNum) from block 1 up to BF and subsequent blocks. What information would be missing for those B1, B2 blocks?
Pushed the changes that add checks to TestFinalityDataWaitForBlockValidator and clean up TestPruning. So this seems to be the only missing piece.
Being able to read block hashes is expected.
Pruning only removes MPT trie nodes, and block hashes are not stored in trie nodes.
The BalanceAt issue is unexpected.
Which error messages are you seeing?
Can you check running this test without the changes introduced in this PR?
And also add a commit with the implemented test, so I can play with it too?
A brief explanation of why it is unexpected.
With hash scheme, to retrieve a trie node, geth needs to retrieve an entry from the DB, in which the key is the hash of the trie node.
Account's balances are stored in the state trie.
Simplifying a lot here, and making some intentional "mistakes" to make it easier to explain, to retrieve "alice" account's balance:
t1. Retrieve the node for hash(blockHash(B1).stateRootTrie) from the DB; let's call it node(t1).
t2. Get the child hash related to "a" from node(t1); let's call it hash(node(t1)->a).
t3. Retrieve the node for hash(node(t1)->a) from the DB; let's call it node(t3).
t4. Get the child hash related to "l" from node(t3); let's call it hash(node(t3)->l).
t5. etc
Let S1 be the set of observed node hashes when retrieving "alice" account balance in B1, and SF the set related to BF.
It is quite likely that the intersection of S1 and SF is empty.
Also, it is quite likely that S1 nodes will not be referenced by the tree rooted at blockHash(BF).stateRootTrie.
During pruning we will add blockHash(BF).stateRootTrie as an important root, meaning that nodes reached by it shouldn't be removed.
The other nodes, the ones that are reached by blockHash(B1).stateRootTrie and not by blockHash(BF).stateRootTrie, will be removed.
So we should be able to retrieve "alice" account's balance in BF but not in B1.
Makes sense?
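To make the hash scheme picture concrete, here is a small illustrative sketch (not part of the PR) of how one could check directly whether a block's state root node survived pruning, relying on the fact that hash scheme stores each trie node under its hash as the DB key:

// stateRootAvailable reports whether the trie node for a block's state root is
// still present in the key-value store; a pruned root simply disappears.
func stateRootAvailable(db ethdb.KeyValueReader, header *types.Header) bool {
	has, err := db.Has(header.Root.Bytes())
	return err == nil && has
}

In the example above, with S1 and SF disjoint, this would return true for BF's header and false for B1's header after pruning.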
I see, so say for some weird reason there was an intersection between S1 and SF, would that mean the pruning procedure would prune all those nodes, or would it leave them untouched given that blockHash(BF).stateRootTrie is an important root?
The error I'm getting is:

pruning_test.go:346: [] missing trie node 449a81a33e76a7e1d94904d16948071799bde0a23ca1d4968ef838ac5c97fae2 (path ) state 0x449a81a33e76a7e1d94904d16948071799bde0a23ca1d4968ef838ac5c97fae2 is not available, not found

I did test with current master and I'm getting the same behaviour. For instance, I'm only able to fetch the balance for the latest block, and sometimes for latest block - 1 (in some runs the pruner deletes the second oldest block and in some it doesn't). For any older blocks I'm not able to retrieve balances. That behaviour is the same for the validator, minimal and full modes. The only mode where I was able to find balances for older blocks was the empty mode "", but I guess in that case we're not pruning the nodes?
Pushed the WIP test TestStateAfterPruning so you can take a look at what I'm testing. Wondering if it's related to how I'm setting the finality block through:

err = builder.L2.ExecNode.SyncMonitor.SetFinalityData(builder.L2.ExecNode.ExecutionDB, &safeFinalityData, &finalizedFinalityData, &validatedFinalityData)

I also tried higher values for the finality block in:

finalizedMsgIdx := arbutil.MessageIndex(96) // <<<<<<<<<<<<<
finalizedMsgResult, err := builder.L2.ExecNode.ResultAtMessageIndex(finalizedMsgIdx).Await(ctx)
// ...
err = builder.L2.ExecNode.SyncMonitor.SetFinalityData(builder.L2.ExecNode.ExecutionDB, &safeFinalityData, &finalizedFinalityData, &validatedFinalityData)

but no dice.
> I am not completely sure if opening a full node with a DB created by an archive node will work though. Maybe even restarting as an archive node is already enough, but not sure if something will fail while starting the node in this case. But this seems the most straightforward approach to try first.

yeah that part worked fine, the problem is that archival mode keeps every state. More info below.

There's an interesting behaviour when running the node in archive mode. Even though I see the expected blocks being added as important roots as part of the pruning procedure, BalanceAt calls still succeed for most of the blocks. This is what I'm doing:

1. Start the node in archive mode by setting builder.execConfig.Caching.Archive to true
2. Generate 5,000 blocks
3. Set the last finalized block to 4850 and the last validated block to 2550 (they're > 2000 blocks apart as per minRootDistance)
4. Stop the node
5. Call pruning.PruneExecutionDB(...) with the validator pruning mode
   a. I observe that blocks 0, 2550 and 4850 have been added as important roots
6. Start another node with the same executionDB (no archival mode)
7. Get the L2 chain tip, call it l2_head, which is at around block 9000
8. Call BalanceAt for every block between 2000 and 9000 and check which succeed and which don't

Say l2_head is 9072:
- Blocks between [7484, 8945] failed with state recreation l2 gas depth limit exceeded
- Blocks between [2550, 7483] succeeded
- Blocks between [2521, 2549] failed with state recreation l2 gas depth limit exceeded
- Blocks between [2000, 2520] succeeded

So it doesn't seem like the pruning procedure is able to prune blocks for a node in archive mode? I would expect some blocks to return the error missing trie node. To test that theory, I conducted the same experiment but instead of running the node in archival mode, I ran a regular node and added the following lines to commit blocks [2000, 5000] to trieDB just before stopping the node:

bc := builder.L2.ExecNode.Backend.ArbInterface().BlockChain()
triedb := bc.StateCache().TrieDB()
for i := 2000; i < 5000; i++ {
	header := bc.GetHeaderByNumber(uint64(i))
	err = triedb.Commit(header.Root, false)
	Require(t, err)
}

And that behaved as we expected, meaning the pruning procedure pruned all trie nodes between blocks [2000, 5000] except for the finalized block 4850 and the last validated block 2550. So something about spinning up a node in archival mode prevents the pruning procedure from pruning trieDB nodes the way we expect. With that in mind I tried to dig a bit deeper into how BalanceAt looks for such information:

- when using a node in archival mode:
  - BalanceAt eventually calls StateAndHeaderFromHeader here.
  - the interesting thing is how BalanceAt is able to find the state for the blocks I'm calling. For instance, for the blocks we mark as important roots, which are the last validated and last finalized blocks, we are able to find them here [1]:

        statedb, err := state.New(header.Root, db)
        ...
        if header.Root != (common.Hash{}) {
            headerRoot := header.Root
            return statedb, func() { db.TrieDB().Dereference(headerRoot) }, nil // <<<<< find state header here
        }

  - and they are successfully returned here:

        liveState, liveStateRelease, err := stateFor(bc.StateCache(), bc.Snapshots())(header)
        if err == nil {
            ...
            return liveState, header, nil // <<<<< return state header here
        }

  - now, the other blocks for archival mode are found through a different code path. First they call this line:

        lastState, lastHeader, lastStateRelease, err := FindLastAvailableState(ctx, bc, stateFor(ephemeral, nil), header, nil, maxRecreateStateDepth)

  - that line calls stateFor multiple times in an attempt to find the last available state, calling stateFor with header - 1 until it returns something, which it eventually does. Meaning, BalanceAt is successfully recreating the state; probably because we're dealing with archival mode, enough state remains that BalanceAt is able to recreate it.
- when using a node in regular mode:
  - BalanceAt eventually calls StateAndHeaderFromHeader, that doesn't change.
  - for the blocks we mark as important roots, which are the last validated and last finalized blocks, BalanceAt finds the state through the same code path as above [1].
  - but for the other blocks we fail to find the header root here, which eventually calls newTrieReader:

        statedb, err := state.New(header.Root, db)
        if err != nil {
            return nil, nil, err
        }

  - to eventually return the error in this line. And unlike in archival mode, in regular mode we are never able to recreate the state for those other blocks, so they fail with missing trie node.
- So it seems we can't use a node in archival mode, unfortunately, if we're using BalanceAt to test, since it tries to recreate those states and for the most part it's successful.

> Do what you are already doing, but instead of using BF here, after restarting the node, you use the latest available block available in the execution DB, that is older or equal to BF...just matching the addHeader behavior here.

- we could do that, meaning, we can have a test running a node that's not in archival mode, where we don't commit the state manually and we check the blocks that get committed as part of the cleanup, which are HEAD, HEAD - 1, etc. Here we can't test whether the last validated and last finalized blocks are pruned, due to the committing issue.
- or we could do that but also manually commit a few blocks (the last validated and last finalized blocks); that way we can check that only the genesis, last validated and last finalized blocks have not been pruned by the pruning procedure.

Let me know what you think.
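If the second option is taken, the post-prune assertion could look roughly like the sketch below, which reuses the triedb.Commit idea from above; lastValidatedBlock, lastFinalizedBlock, bc and executionDB are placeholders and the real test may differ:

// Sketch: after pruning, only genesis, last validated and last finalized state roots
// should still resolve; the other manually committed roots should be gone.
important := map[uint64]bool{0: true, lastValidatedBlock: true, lastFinalizedBlock: true}
for num := range important {
	root := bc.GetHeaderByNumber(num).Root
	has, err := executionDB.Has(root.Bytes())
	Require(t, err)
	if !has {
		t.Fatalf("state root of important block %d was pruned", num)
	}
}
for num := uint64(2000); num < 5000; num++ {
	if important[num] {
		continue
	}
	root := bc.GetHeaderByNumber(num).Root
	has, err := executionDB.Has(root.Bytes())
	Require(t, err)
	if has {
		t.Fatalf("state root of block %d should have been pruned", num)
	}
}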
> So it seems we can't use a node in archival mode, unfortunately, if we're using BalanceAt to test, since it tries to recreate those states and for the most part it's successful.

Makes sense

> or we could do that but also manually commit a few blocks (the last validated and last finalized blocks); that way we can check that only the genesis, last validated and last finalized blocks have not been pruned by the pruning procedure.

This approach is good 🙂
I've added the test under TestStateAfterPruningValidator. My original idea was to call testStateAfterPruning for different modes, which all works locally, similar to TestPruning. But something about the number of blocks in this test is preventing me from having more than one mode at the same time. I think it would be best to create a ticket to investigate how to add different modes for this test if we still want that going forward. Let me know what you think.
Hmm, even with just one test this CI workflow seems to be timing out. Trying once more, but if that is a problem then I think it would be best to bring the number of blocks down from 5000 to 3000. That means we won't be able to test the pruning procedure adding both the last validated and last finalized blocks, since they would be less than 2000 blocks apart; but that's probably better than nothing?
It passed on the 3rd try; it could also be that the workflow is flaky, as I've had to restart it on different occasions in other PRs.
Refactor PruneExecutionDB, more specifically findImportantRoots, so it does not depend on consensusDB but only on executionDB
New testStateAfterPruning test performs a more complete check post pruning
Refactor PruneExecutionDB, more specifically findImportantRoots, so it does not depend on consensusDB but only on executionDB.

The former PruneExecutionDB used to depend on consensusDB for 2 things:
- Prune mode "validator": staker.ReadLastValidatedInfo
- Prune mode "minimal"

For the replacements we use:
- executionDB to store such validated block, using key LastValidatedBlockHashKey
- executionDB.Put(validatedBlockHashKey, ... added to SetFinalityData to store the latest validated block
- rawdb.ReadFinalizedBlockHash and rawdb.ReadHeaderNumber to get the finalized block number and hash

Closes NIT-4266
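As a rough illustration of the replacement lookups listed above, a minimal sketch; the key value for ValidatedBlockHashKey and the readFinalityBlocksFromExecutionDB helper are assumptions for illustration, not the PR's actual constants or call sites:

package pruning

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/rawdb"
	"github.com/ethereum/go-ethereum/ethdb"
)

// ValidatedBlockHashKey is the executionDB key assumed to hold the last validated block hash.
var ValidatedBlockHashKey = []byte("_validatedBlockHash")

// readFinalityBlocksFromExecutionDB resolves the last finalized and last validated
// block numbers and hashes using only the execution database.
func readFinalityBlocksFromExecutionDB(db ethdb.Database) (finalizedNum *uint64, finalizedHash common.Hash, validatedNum *uint64, validatedHash common.Hash) {
	// Finalized block: go-ethereum persists its hash in the chain DB.
	finalizedHash = rawdb.ReadFinalizedBlockHash(db)
	if finalizedHash != (common.Hash{}) {
		finalizedNum = rawdb.ReadHeaderNumber(db, finalizedHash)
	}
	// Validated block: stored by SetFinalityData under ValidatedBlockHashKey (per this PR).
	raw, err := db.Get(ValidatedBlockHashKey)
	if err == nil && len(raw) == common.HashLength {
		validatedHash = common.BytesToHash(raw)
		validatedNum = rawdb.ReadHeaderNumber(db, validatedHash)
	}
	return finalizedNum, finalizedHash, validatedNum, validatedHash
}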