-
Notifications
You must be signed in to change notification settings - Fork 312
snapshots: lthash #5954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
snapshots: lthash #5954
Conversation
1962ac8
to
40a602b
Compare
cf0495e
to
fd5559d
Compare
a634319
to
766d347
Compare
e0b3efe
to
2b8601c
Compare
2b8601c
to
878a1ac
Compare
# How many snapshot hash tiles to run. The snapshot hash tiles are | ||
# responsible for verifying the hash of all accounts in the loaded | ||
# snapshot via lthash (lattice hashing). Currently, set to 0 by | ||
# default because it is too slow to run in the full client. TODO: | ||
# enable snapshot hash tiles in the full client and update this | ||
# comment. | ||
hash_tile_count = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make it clear that if the hash tile count is 0, no hashing will be done? Wondering if we should instead have a snapshots.verify
flag instead to enable/disable the hashing, instead of hash_tile_count = 0 => no hashing.
/* TODO: cache the previous lthash somewhere in funk to avoid | ||
recalculating hash. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We considered doing this... it would mean an additional 2048 bytes per account which is probably not ideal.. this storage-compute tradeoff is something we're exploring for transaction execution also and we should revisit this after the hashing changes and database changes are done.
91db563
to
40bb2e2
Compare
40bb2e2
to
5f5ce80
Compare
@@ -698,6 +698,14 @@ user = "" | |||
# requests. | |||
sign_tile_count = 2 | |||
|
|||
# How many snapshot hash tiles to run. The snapshot hash tiles are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things,
hash_tile_count
is very vague and could refer anything. Be more specific, in this case the convention is<tilename>_tile_count
sosnaplt_tile_count
?- Allowing someone to turn this off is probably not desirable, and we should just make this config production ready now. I'd suggest something like,
# How many snapshot hash tiles to run. The snapshot hash tiles are | |
[layout] | |
# How many snapshot lthash tiles to run. Snapshot hash tiles verify | |
# the contents of accounts in the loaded snapshot, in parallel. Too | |
# few and snapshot loading might be delayed. Once the hash of the | |
# accounts in a snapshot is calculated, if it does not match the | |
# validator will abort with an error. | |
snaplt_tile_count = 1 | |
[development] | |
[snapshots] | |
# Set to true to disable verification of the lthash in the | |
# loaded snapshot. This is not safe or supported in production | |
# and should only be used for development and testing purposes. | |
# | |
# If lthash verification is disabled, the validator will not | |
# start any lthash tiles and the value of `snaplt_tile_count` | |
# in the layout will be ignored. | |
disable_lthash_verification = false |
Once you have this, you can:
- Check that lthash_tile_count > 0
- Check that the development option is not enabled when running against a production cluster (same as some other options, e.g. we check this for --no-sandbox).
@@ -201,6 +202,7 @@ fd_topo_initialize( config_t * config ) { | |||
ulong bank_tile_cnt = config->layout.bank_tile_count; | |||
ulong exec_tile_cnt = config->firedancer.layout.exec_tile_count; | |||
ulong writer_tile_cnt = config->firedancer.layout.writer_tile_count; | |||
ulong hash_tile_cnt = config->firedancer.layout.hash_tile_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ulong hash_tile_cnt = config->firedancer.layout.hash_tile_count; | |
ulong snaplt_tile_cnt = config->firedancer.layout.snaplt_tile_count; |
} | ||
case FD_SNAPSHOT_HASH_MSG_ACCOUNT_HDR: { | ||
FD_TEST( ctx->state==FD_SNAPHS_STATE_HASHING ); | ||
fd_snapshot_account_t * account = (fd_snapshot_account_t *)fd_chunk_to_laddr_const( ctx->in.wksp, chunk ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be useful to verify
FD_TEST( ctx->acc_data_sz==0UL );
} | ||
case FD_SNAPSHOT_HASH_MSG_ACCOUNT_DATA: { | ||
FD_TEST( ctx->state==FD_SNAPHS_STATE_HASHING ); | ||
fd_memcpy( ctx->acc_data + ctx->acc_data_sz, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the lthash
calculation is just a series of blake3 appends, so you can do it unbuffered / without memcpy by just streaming the account data into blake3 here?
uchar * manifest_lthash = ctx->full ? ctx->hash_info.full_lthash.bytes : ctx->hash_info.incremental_lthash.bytes; | ||
fd_memcpy( manifest_lthash, manifest->accounts_lthash, sizeof(fd_lthash_value_t) ); | ||
} else { | ||
/* disable lthash verification if there's no lthash in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should probably just abort with an error here, or trigger a malformed snapshot.
@@ -106,6 +106,7 @@ extern fd_topo_run_tile_t fd_tile_shredcap; | |||
extern fd_topo_run_tile_t fd_tile_snaprd; | |||
extern fd_topo_run_tile_t fd_tile_snapdc; | |||
extern fd_topo_run_tile_t fd_tile_snapin; | |||
extern fd_topo_run_tile_t fd_tile_snaphs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add to main.c
in firedancer
hash from the running lthash. */ | ||
if( FD_LIKELY( ctx->hash_info.enabled ) ) { | ||
/* Send the existing account to the hash tile */ | ||
/* TODO: cache the previous lthash somewhere in funk to avoid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is extremely rare and not worth the memory. I'd remove the TODO, what you have is good
fd_topo_obj_t * replay_manifest_dcache = fd_topob_obj( topo, "dcache", "replay_manif" ); | ||
fd_pod_insertf_ulong( topo->props, 2UL << 30UL, "obj.%lu.data_sz", replay_manifest_dcache->id ); | ||
fd_pod_insert_ulong( topo->props, "manifest_dcache", replay_manifest_dcache->id ); | ||
FOR(hash_tile_cnt) fd_topob_link( topo, "snapin_hsh", "snapin_hsh", 128UL, sizeof(fd_snapshot_existing_account_t), 1UL ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convention for this kind of message passing is a single snapin_snaph
link (not one per tile). You publish all frags on the link, and the snaphs tiles select which ones they want to process in the before_frag
callback.
|
||
if( FD_LIKELY( ctx->hash_info.enabled ) ) { | ||
for( ulong i=0UL; i<ctx->hash_info.num_hash_tiles; i++ ) { | ||
fd_stem_publish( ctx->stem, FD_SNAPIN_HSH_IDX + i, FD_SNAPSHOT_HASH_MSG_RESET, 0UL, 0UL, 0UL, 0UL, 0UL ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment in topology, you just need one link here
|
||
fd_lthash_zero( &ctx->hash_info.full_lthash ); | ||
fd_lthash_zero( &ctx->hash_info.incremental_lthash ); | ||
fd_lthash_zero( &ctx->hash_info.result_lthash ); | ||
} | ||
|
||
#define STEM_BURST 2UL /* For control fragments, one acknowledgement, and one malformed message */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer correct?
ctx->funk_txn = fd_funk_txn_prepare( ctx->funk, ctx->funk_txn, &incremental_xid, 0 ); | ||
ctx->full = 0; | ||
ctx->state = FD_SNAPIN_STATE_LOADING; | ||
ctx->funk_txn = fd_funk_txn_prepare( ctx->funk, ctx->funk_txn, &incremental_xid, 0 ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra whitespace?
/* A pending ack is waiting for lthash results to come back from | ||
the hashing tiles. Once these hashes come back, the ack can be | ||
sent. */ | ||
if( FD_LIKELY( ctx->hash_info.received_lthashes==ctx->hash_info.num_hash_tiles ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why in after credit, and not in after_frag when you receive the lthash?
|
||
fd_lthash_value_t * ref_lthash = ctx->full ? &ctx->hash_info.full_lthash : &ctx->hash_info.incremental_lthash; | ||
if( FD_UNLIKELY( memcmp( ref_lthash, &ctx->hash_info.result_lthash, sizeof(fd_lthash_value_t) ) ) ) { | ||
FD_LOG_WARNING(( "calculated accounts lthash %s does not match accounts lthash %s in snapshot manifest", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have to exit with an error here, because we already told replay the snapshot was done loading. Otherwise you need to design a reset mechanism for replay and all other tiles as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
Adds snapshot hashing tiles. They are disabled by default but enabled in CI to verify the snapshot lthash in ledgers. We plan to enable them by default once their performance can keep up with download speed.